Free tool
Local-LLM GPU Checker
Pick your GPU (or type your VRAM) and see exactly which open LLMs you can run locally — at which quantization, with a fit verdict and a rough speed feel. Built on vendor-verified VRAM and real GGUF sizes.
RTX 4090 (24 GB)
You can comfortably run up to Gemma 3 27B at 4-bit.
- Qwen2.5 32B32B · QwenTight fit4-bit ≈ 21.1 GB8-bit ≈ 38.4 GB ✕FP16 ≈ 76.8 GB ✕· ~20 tok/s (usable)
- DeepSeek-R1 Distill 32B32B · DeepSeekTight fit4-bit ≈ 21.1 GB8-bit ≈ 38.4 GB ✕FP16 ≈ 76.8 GB ✕· ~20 tok/s (usable)
- Gemma 3 27B27B · GemmaRuns well4-bit ≈ 17.8 GB8-bit ≈ 32.4 GB ✕FP16 ≈ 64.8 GB ✕· ~23 tok/s (usable)
- Mistral Small 3 24B24B · MistralRuns well4-bit ≈ 15.8 GB8-bit ≈ 28.8 GB ✕FP16 ≈ 57.6 GB ✕· ~26 tok/s (usable)
- Qwen2.5 14B14B · QwenRuns well4-bit ≈ 9.2 GB8-bit ≈ 16.8 GBFP16 ≈ 33.6 GB ✕· ~45 tok/s (fast)
- Phi-4 14B14B · PhiRuns well4-bit ≈ 9.2 GB8-bit ≈ 16.8 GBFP16 ≈ 33.6 GB ✕· ~45 tok/s (fast)
- DeepSeek-R1 Distill 14B14B · DeepSeekRuns well4-bit ≈ 9.2 GB8-bit ≈ 16.8 GBFP16 ≈ 33.6 GB ✕· ~45 tok/s (fast)
- Gemma 3 12B12B · GemmaRuns well4-bit ≈ 7.9 GB8-bit ≈ 14.4 GBFP16 ≈ 28.8 GB ✕· ~53 tok/s (fast)
- Gemma 2 9B9B · GemmaRuns well4-bit ≈ 5.9 GB8-bit ≈ 10.8 GBFP16 ≈ 21.6 GB· ~70 tok/s (fast)
- Llama 3.1 8B8B · LlamaRuns well4-bit ≈ 5.3 GB8-bit ≈ 9.6 GBFP16 ≈ 19.2 GB· ~79 tok/s (fast)
- Mistral 7B7.3B · MistralRuns well4-bit ≈ 4.8 GB8-bit ≈ 8.8 GBFP16 ≈ 17.5 GB· ~86 tok/s (fast)
- Qwen2.5 7B7B · QwenRuns well4-bit ≈ 4.6 GB8-bit ≈ 8.4 GBFP16 ≈ 16.8 GB· ~90 tok/s (fast)
- DeepSeek-R1 Distill 7B7B · DeepSeekRuns well4-bit ≈ 4.6 GB8-bit ≈ 8.4 GBFP16 ≈ 16.8 GB· ~90 tok/s (fast)
- Qwen2.5 72B72B · QwenWon't fit4-bit ≈ 47.5 GB ✕8-bit ≈ 86.4 GB ✕FP16 ≈ 172.8 GB ✕
- Llama 3.3 70B70B · LlamaWon't fit4-bit ≈ 46.2 GB ✕8-bit ≈ 84 GB ✕FP16 ≈ 168 GB ✕
- DeepSeek-R1 Distill 70B70B · DeepSeekWon't fit4-bit ≈ 46.2 GB ✕8-bit ≈ 84 GB ✕FP16 ≈ 168 GB ✕
Shopping for a card to run bigger models? See our GPU + mini-PC guides.
Estimates for DENSE models, anchored to real GGUF sizes: ≈ params × (0.5 GB at 4-bit, 1 GB at 8-bit, 2 GB at FP16) + ~12–60% for the KV cache at your context length. Tokens/sec are coarse bandwidth-class “feel” buckets (single user, short context) — not benchmarks; real speed depends on your engine (llama.cpp / Ollama / vLLM / MLX). Apple usable memory ≈ 72% of total. Re-verify before buying.
Frequently asked
- How much VRAM do I need to run a model locally?
- Rule of thumb: multiply the model's parameter count (in billions) by the bytes per weight for your quantization — 2 GB/B for FP16, 1 GB/B for 8-bit, ~0.5 GB/B for 4-bit (GGUF Q4_K_M) — then add about 20% for the KV cache, activations and framework overhead. So an 8B model at 4-bit needs roughly 8 × 0.5 × 1.2 ≈ 5 GB, and a 70B model at 4-bit needs about 42–46 GB.
- Can I run a 70B model on an RTX 4090 (24 GB)?
- Not on a single card at usable quality. Llama 3.3 70B at 4-bit (Q4_K_M) is about 39 GB of weights and ~42–46 GB with context, so it needs roughly 48 GB — typically two 24 GB cards (dual 3090/4090) or an Apple machine with 64 GB+ unified memory. On one 4090, run a 32B model (Qwen2.5 32B or DeepSeek-R1 32B) at 4-bit instead — that fits comfortably.
- What's the best LLM for a 16 GB GPU (RTX 4080 / 5080 / 4060 Ti 16GB)?
- At 4-bit you comfortably fit 7B–14B models — Llama 3.1 8B, Qwen2.5 14B, Phi-4 14B, Gemma 3 12B — with room for long context. A 24B (Mistral Small 3) is tight but workable at short context; 27B (Gemma) is borderline. Note the 4060 Ti 16GB has a narrow memory bus, so it fits the same models but generates noticeably slower than a 4080/5080.
- Does Apple unified memory count as VRAM for LLMs?
- Yes, and it's one of Apple Silicon's biggest advantages — the GPU shares the whole unified memory pool. But macOS reserves part of it for the system, so the practical budget for model weights is roughly 70–75% of the total. A 64 GB Mac gives ~48 GB usable (enough for a 70B at 4-bit), and 128 GB gives ~96 GB usable. You can raise the GPU memory cap with a sysctl tweak if you need more.
- FP16 vs 8-bit vs 4-bit — which quantization should I use?
- For local chat, 4-bit (Q4_K_M) is what almost everyone runs: it cuts VRAM ~75% versus FP16 while losing only ~1.5–2 points on benchmarks. 8-bit is near-lossless (under ~1 point) but doubles the memory of 4-bit. FP16 is reference quality but rarely worth it locally — it's mostly for training/fine-tuning, not inference.
- Why does longer context need more VRAM?
- Beyond the fixed weight memory, the model stores a KV cache for every token in the conversation, and that grows with context length. A short 4K chat adds only ~12% overhead, but a full 128K-token context can add 50–60% on top of the weights — which can push a model that 'fits' at 8K into 'won't fit' at 128K.