Free tool

Local-LLM GPU Checker

Pick your GPU (or type your VRAM) and see exactly which open LLMs you can run locally — at which quantization, with a fit verdict and a rough speed feel. Built on vendor-verified VRAM and real GGUF sizes.

RTX 4090 (24 GB)

You can comfortably run up to Gemma 3 27B at 4-bit.

  • Qwen2.5 32B · 32B · Qwen
    Tight fit
    4-bit: 21.1 GB · 8-bit: 38.4 GB · FP16: 76.8 GB · ~20 tok/s (usable)
  • DeepSeek-R1 Distill 32B · 32B · DeepSeek
    Tight fit
    4-bit: 21.1 GB · 8-bit: 38.4 GB · FP16: 76.8 GB · ~20 tok/s (usable)
  • Gemma 3 27B · 27B · Gemma
    Runs well
    4-bit: 17.8 GB · 8-bit: 32.4 GB · FP16: 64.8 GB · ~23 tok/s (usable)
  • Mistral Small 3 24B · 24B · Mistral
    Runs well
    4-bit: 15.8 GB · 8-bit: 28.8 GB · FP16: 57.6 GB · ~26 tok/s (usable)
  • Qwen2.5 14B · 14B · Qwen
    Runs well
    4-bit: 9.2 GB · 8-bit: 16.8 GB · FP16: 33.6 GB · ~45 tok/s (fast)
  • Phi-4 14B · 14B · Phi
    Runs well
    4-bit: 9.2 GB · 8-bit: 16.8 GB · FP16: 33.6 GB · ~45 tok/s (fast)
  • DeepSeek-R1 Distill 14B · 14B · DeepSeek
    Runs well
    4-bit: 9.2 GB · 8-bit: 16.8 GB · FP16: 33.6 GB · ~45 tok/s (fast)
  • Gemma 3 12B · 12B · Gemma
    Runs well
    4-bit: 7.9 GB · 8-bit: 14.4 GB · FP16: 28.8 GB · ~53 tok/s (fast)
  • Gemma 2 9B · 9B · Gemma
    Runs well
    4-bit: 5.9 GB · 8-bit: 10.8 GB · FP16: 21.6 GB · ~70 tok/s (fast)
  • Llama 3.1 8B · 8B · Llama
    Runs well
    4-bit: 5.3 GB · 8-bit: 9.6 GB · FP16: 19.2 GB · ~79 tok/s (fast)
  • Mistral 7B · 7.3B · Mistral
    Runs well
    4-bit: 4.8 GB · 8-bit: 8.8 GB · FP16: 17.5 GB · ~86 tok/s (fast)
  • Qwen2.5 7B · 7B · Qwen
    Runs well
    4-bit: 4.6 GB · 8-bit: 8.4 GB · FP16: 16.8 GB · ~90 tok/s (fast)
  • DeepSeek-R1 Distill 7B · 7B · DeepSeek
    Runs well
    4-bit: 4.6 GB · 8-bit: 8.4 GB · FP16: 16.8 GB · ~90 tok/s (fast)
  • Qwen2.5 72B · 72B · Qwen
    Won't fit
    4-bit: 47.5 GB · 8-bit: 86.4 GB · FP16: 172.8 GB
  • Llama 3.3 70B · 70B · Llama
    Won't fit
    4-bit: 46.2 GB · 8-bit: 84 GB · FP16: 168 GB
  • DeepSeek-R1 Distill 70B · 70B · DeepSeek
    Won't fit
    4-bit: 46.2 GB · 8-bit: 84 GB · FP16: 168 GB

Estimates are for dense models, anchored to real GGUF sizes: ≈ parameters (in billions) × (0.5 GB at 4-bit, 1 GB at 8-bit, 2 GB at FP16), plus ~12–60% for the KV cache depending on your context length. Tokens/sec are coarse bandwidth-class “feel” buckets (single user, short context), not benchmarks; real speed depends on your engine (llama.cpp / Ollama / vLLM / MLX). Apple usable memory ≈ 72% of total. Re-verify before buying.

Frequently asked

How much VRAM do I need to run a model locally?
Rule of thumb: multiply the model's parameter count (in billions) by the memory per billion parameters for your quantization (2 GB for FP16, 1 GB for 8-bit, ~0.5 GB for 4-bit GGUF Q4_K_M), then add about 20% for the KV cache, activations and framework overhead. So an 8B model at 4-bit needs roughly 8 × 0.5 × 1.2 ≈ 4.8 GB, and a 70B model at 4-bit needs about 42–46 GB.
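The rule of thumb above can be sketched in a few lines of Python. This is an illustrative sketch, not the checker's actual code: the function name `estimate_vram_gb` and the dictionary `GB_PER_BILLION` are made up for this example, and the flat 20% overhead is the approximation quoted in the answer.

```python
# Memory per billion parameters, as quoted in the rule of thumb (assumed values).
GB_PER_BILLION = {"fp16": 2.0, "8bit": 1.0, "4bit": 0.5}


def estimate_vram_gb(params_billion: float, quant: str, overhead: float = 0.20) -> float:
    """Weights plus a flat overhead fraction for KV cache, activations and framework."""
    weights_gb = params_billion * GB_PER_BILLION[quant]
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    for size in (8, 32, 70):
        print(f"{size}B at 4-bit: ~{estimate_vram_gb(size, '4bit'):.1f} GB")
```

With the default 20% overhead this gives about 4.8 GB for an 8B model and 42 GB for a 70B model at 4-bit, the low end of the 42–46 GB range. Real Q4_K_M files tend to come out somewhat larger than 0.5 GB per billion parameters, so treat the 4-bit result as a lower bound.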
Can I run a 70B model on an RTX 4090 (24 GB)?
Not on a single card at usable quality. Llama 3.3 70B at 4-bit (Q4_K_M) is about 39 GB of weights and ~42–46 GB with context, so it needs roughly 48 GB: typically two 24 GB cards (dual 3090/4090) or an Apple machine with 64 GB+ unified memory. On one 4090, run a 27B model such as Gemma 3 27B at 4-bit, which fits comfortably, or a 32B model (Qwen2.5 32B or DeepSeek-R1 Distill 32B) at 4-bit, which is a tight fit at about 21 GB and leaves little room for long context.
What's the best LLM for a 16 GB GPU (RTX 4080 / 5080 / 4060 Ti 16GB)?
At 4-bit you comfortably fit 7B–14B models (Llama 3.1 8B, Qwen2.5 14B, Phi-4 14B, Gemma 3 12B) with room for long context. A 24B (Mistral Small 3, about 15.8 GB at 4-bit) is tight but workable at short context; a 27B (Gemma 3) is borderline at best, since its 4-bit estimate of about 17.8 GB already exceeds 16 GB. Note that the 4060 Ti 16GB has a narrow memory bus, so it fits the same models but generates noticeably slower than a 4080/5080.
Does Apple unified memory count as VRAM for LLMs?
Yes, and it's one of Apple Silicon's biggest advantages — the GPU shares the whole unified memory pool. But macOS reserves part of it for the system, so the practical budget for model weights is roughly 70–75% of the total. A 64 GB Mac gives ~48 GB usable (enough for a 70B at 4-bit), and 128 GB gives ~96 GB usable. You can raise the GPU memory cap with a sysctl tweak if you need more.
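As a rough illustration of the 70–75% budget described above, the sketch below computes the usable range for a given amount of unified memory. The function name `usable_unified_memory_gb` and the default fractions are assumptions for this example; the actual limit varies by macOS version and machine.

```python
def usable_unified_memory_gb(total_gb: float,
                             low_fraction: float = 0.70,
                             high_fraction: float = 0.75) -> tuple[float, float]:
    """Return the (low, high) GPU-usable memory estimate in GB.

    The 70-75% fractions are the approximate figures quoted in the text,
    not values read from the operating system.
    """
    return total_gb * low_fraction, total_gb * high_fraction


if __name__ == "__main__":
    for total in (64, 128):
        low, high = usable_unified_memory_gb(total)
        print(f"{total} GB Mac: roughly {low:.0f}-{high:.0f} GB usable for models")
```

The ~48 GB and ~96 GB figures quoted above correspond to the upper (75%) end of this range.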
FP16 vs 8-bit vs 4-bit — which quantization should I use?
For local chat, 4-bit (Q4_K_M) is what almost everyone runs: it cuts VRAM ~75% versus FP16 while losing only ~1.5–2 points on benchmarks. 8-bit is near-lossless (under ~1 point) but needs roughly twice the memory of 4-bit. FP16 is reference quality but rarely worth it locally; it's mostly for training/fine-tuning, not inference.
Why does longer context need more VRAM?
Beyond the fixed weight memory, the model stores a KV cache for every token in the conversation, and that grows with context length. A short 4K chat adds only ~12% overhead, but a full 128K-token context can add 50–60% on top of the weights — which can push a model that 'fits' at 8K into 'won't fit' at 128K.
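To make the "fits at short context, won't fit at long context" effect concrete, here is a small sketch that applies the overhead percentages quoted above to a 32B model at 4-bit on a 24 GB card. The function name `required_gb` is hypothetical, and the overhead fractions are this page's coarse approximations; the real KV cache size depends on the model's layer count, number of KV heads, and cache precision.

```python
def required_gb(params_billion: float, kv_overhead: float,
                gb_per_billion: float = 0.5) -> float:
    """Weight memory (0.5 GB per billion parameters at 4-bit) plus a KV-cache fraction."""
    return params_billion * gb_per_billion * (1.0 + kv_overhead)


if __name__ == "__main__":
    vram_gb = 24.0  # e.g. an RTX 4090
    # Overhead fractions quoted in the text: ~12% at 4K, up to ~60% at 128K.
    for label, overhead in [("4K", 0.12), ("128K", 0.60)]:
        need = required_gb(32, overhead)
        verdict = "fits" if need <= vram_gb else "won't fit"
        print(f"32B at 4-bit, {label} context: ~{need:.1f} GB -> {verdict}")
```

Under these assumptions the same model needs about 17.9 GB at 4K context and about 25.6 GB at 128K, which crosses the 24 GB limit.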