Free tool

What GPU Do I Need?

Pick the open LLM you want to run and we’ll show the minimum GPU or Mac that runs it comfortably, plus every card that fits, with the cheapest and most accessible options first.

Needs ≈ 15.8 GB VRAM

Runs on RTX 4060 Ti 16GB (tight) · comfortable from AMD RX 7900 XT (20 GB).

  • Apple unified 16GB · 11 GB usable of 16 GB
    Too small
  • RTX 3060 12GB · 12 GB · budget
    Too small
  • RTX 4070 Ti (12 GB) · 12 GB · mid
    Too small
  • RTX 4060 Ti 16GB · 16 GB · budget (slow bus)
    Tight
  • RTX 4080 / Super (16 GB) · 16 GB · high
    Tight
  • RTX 5070 Ti (16 GB) · 16 GB · mid
    Tight
  • RTX 5080 (16 GB) · 16 GB · high
    Tight
  • Apple unified 24GB · 17 GB usable of 24 GB
    Tight
  • AMD RX 7900 XT (20 GB) · 20 GB · ROCm/Vulkan
    Runs well
  • AMD RX 7900 XTX (24 GB) · 24 GB · ROCm/Vulkan
    Runs well
  • RTX 3090 / Ti (24 GB) · 24 GB · used-value 24GB
    Runs well
  • RTX 4090 (24 GB) · 24 GB · flagship
    Runs well
  • Apple unified 36GB · 26 GB usable of 36 GB
    Runs well
  • RTX 5090 (32 GB) · 32 GB · flagship
    Runs well
  • Apple unified 48GB · 35 GB usable of 48 GB
    Runs well
  • Apple unified 64GB · 48 GB usable of 64 GB
    Runs well
  • Apple unified 128GB · 96 GB usable of 128 GB
    Runs well

“Comfortable” means at least ~15% headroom over the estimated need (params × bytes-per-weight × a context factor), with estimates anchored to real GGUF file sizes. Apple usable memory ≈ 72% of total. Bandwidth matters for speed: a 4060 Ti 16GB fits the same models as a 4080 but runs them slower. Verify current prices in the guide before buying.
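
The sizing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, and the `bytes_per_weight` and `context_factor` values in the example call are placeholder assumptions. Only the ~15% headroom, the 15.8 GB need, and the memory figures in the loop come from this page.

```python
HEADROOM = 0.15  # "comfortable" = ~15% above the estimated need (from this page)


def estimate_need_gb(params_billion: float, bytes_per_weight: float,
                     context_factor: float) -> float:
    """Estimated memory need in GB: params x bytes-per-weight x context factor."""
    return params_billion * bytes_per_weight * context_factor


def classify(usable_gb: float, need_gb: float, headroom: float = HEADROOM) -> str:
    """Map usable memory to the three labels used in the list above."""
    if usable_gb < need_gb:
        return "Too small"
    if usable_gb < need_gb * (1 + headroom):
        return "Tight"
    return "Runs well"


# Hypothetical inputs, for illustration only: an 8B-parameter model at an
# assumed 0.6 bytes per weight with an assumed 1.1 context factor.
example_need = estimate_need_gb(8, 0.6, 1.1)

# The need shown on this page, checked against a few entries from the list.
need = 15.8
labels = {
    name: classify(usable, need)
    for name, usable in [
        ("RTX 3060 12GB", 12),
        ("RTX 4060 Ti 16GB", 16),
        ("Apple unified 24GB", 17),   # 17 GB usable of 24 GB
        ("AMD RX 7900 XT", 20),
    ]
}
for name, label in labels.items():
    print(f"{name}: {label}")
```

With a 15.8 GB need, the comfortable threshold works out to about 18.2 GB, which is why the 16 GB cards and the 24 GB Mac (17 GB usable) land in “Tight” while the 20 GB card is the first to run well.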

Frequently asked

What GPU do I need to run Llama 3.3 70B locally?
At 4-bit (Q4_K_M), Llama 3.3 70B needs roughly 42–46 GB including context, so no single consumer card runs it comfortably. You'll want two 24 GB cards (dual RTX 3090 or 4090) or an Apple machine with 64 GB or more of unified memory (~48 GB usable). For a single card, drop to a 32B model (Qwen2.5 32B or DeepSeek-R1 32B), which fits a 24 GB 4090 at 4-bit.
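
The arithmetic behind that answer can be checked with a short sketch. It assumes a runtime that can split a model across GPUs and that memory across cards simply adds up; in practice each device carries some overhead, so treat the result as optimistic. The names `configs` and `fits` are hypothetical; the memory figures are the ones quoted on this page.

```python
NEED_HIGH_GB = 46  # upper end of the 42-46 GB range quoted above

# Memory pools per configuration, in GB (Apple entry uses the usable figure).
configs = {
    "RTX 4090 (single)": [24],
    "RTX 5090 (single)": [32],
    "2x RTX 3090": [24, 24],
    "Apple unified 64GB": [48],
}


def fits(memory_pools_gb, need_gb=NEED_HIGH_GB):
    """True if the summed memory covers the need (simplifying assumption)."""
    return sum(memory_pools_gb) >= need_gb


results = {name: fits(pools) for name, pools in configs.items()}
for name, ok in results.items():
    print(f"{name}: {'fits' if ok else 'does not fit'}")
```
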
What's the best value GPU for running local LLMs?
For the most VRAM per dollar, a used RTX 3090 (24 GB) is the enthusiast favorite: it runs 32B models at 4-bit comfortably, and two of them can handle 70B. On a budget, the RTX 3060 12GB handles 7B–8B models well. Apple Silicon is excellent value if you already have a Mac with lots of unified memory, since the GPU shares the whole pool.
Is more VRAM or more speed more important?
VRAM decides whether a model runs at all; memory bandwidth decides how fast. Get enough VRAM for the model plus context first, then prioritize bandwidth. A 4060 Ti 16GB fits the same models as a 4080 but has a much narrower memory bus, so it generates tokens noticeably more slowly on larger models.
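
A rough way to see the bandwidth effect: during generation, each new token requires reading approximately all of the model's weights once, so memory bandwidth divided by model size gives a loose upper bound on tokens per second. The sketch below applies that rule of thumb. The bandwidth figures are published spec values that do not appear on this page (288 GB/s for the RTX 4060 Ti 16GB, 716.8 GB/s for the RTX 4080), the 10 GB model size is a hypothetical placeholder, and real throughput will be lower than the bound.

```python
# Published memory bandwidth in GB/s (spec-sheet values, not from this page).
BANDWIDTH_GBPS = {
    "RTX 4060 Ti 16GB": 288.0,
    "RTX 4080": 716.8,
}

MODEL_SIZE_GB = 10.0  # hypothetical model size, for illustration only


def max_tokens_per_second(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Loose upper bound: one full pass over the weights per generated token."""
    return bandwidth_gbps / model_size_gb


bounds = {
    card: max_tokens_per_second(bw, MODEL_SIZE_GB)
    for card, bw in BANDWIDTH_GBPS.items()
}
ratio = bounds["RTX 4080"] / bounds["RTX 4060 Ti 16GB"]

for card, tps in bounds.items():
    print(f"{card}: at most ~{tps:.0f} tokens/s")
print(f"Bandwidth-bound speed ratio: ~{ratio:.1f}x")
```

Because the model size cancels out, the ratio between the two cards depends only on bandwidth, which is why the gap persists across every model that fits on both.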
Can I run local LLMs on an AMD GPU or a Mac?
Yes. The AMD RX 7900 XT and XTX (20–24 GB) run local LLMs via ROCm or Vulkan, though the software ecosystem is less mature than NVIDIA's CUDA. Apple Silicon runs them very well via MLX or llama.cpp, and its large unified memory lets even a laptop hold models that would otherwise need multiple discrete GPUs. Just budget ~72% of total memory as usable.