Question 1

How much VRAM do I need to run a model locally?

Accepted Answer

Rule of thumb: multiply the model's parameter count (in billions) by the bytes per weight for your quantization (2 for FP16, 1 for 8-bit, ~0.5 for 4-bit such as GGUF Q4_K_M) to get the weight size in GB, then add about 20% for the KV cache, activations, and framework overhead. So an 8B model at 4-bit needs roughly 8 × 0.5 × 1.2 ≈ 5 GB, and a 70B model at 4-bit needs about 42–46 GB (70 × 0.5 × 1.2 = 42 GB at the low end, more with longer context).
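
The rule of thumb above can be sketched in a few lines of Python. The function name, the quantization labels, and the fixed 20% overhead are illustrative assumptions taken from this answer, not values reported by any particular runtime.

```python
# Approximate bytes per weight for each format (from the rule of thumb above).
# With parameters counted in billions, this is also GB per billion parameters.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def estimate_vram_gb(params_billion: float, quant: str = "q4",
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weight size plus a flat overhead fraction.

    `overhead` stands in for KV cache, activations, and framework overhead;
    0.20 is the assumed default from the rule of thumb, not a measured value.
    """
    weights_gb = params_billion * BYTES_PER_WEIGHT[quant]
    return weights_gb * (1.0 + overhead)


for size in (8, 70):
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size):.1f} GB")
```

Treat the result as a lower bound for short contexts; real GGUF files and long contexts tend to land somewhat higher.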

Question 2

Can I run a 70B model on an RTX 4090 (24 GB)?

Accepted Answer

Not on a single card at usable quality. Llama 3.3 70B at 4-bit (Q4_K_M) is about 39 GB of weights and ~42–46 GB with context, so it needs roughly 48 GB of memory: typically two 24 GB cards (dual 3090/4090) or an Apple Silicon Mac with 64 GB+ unified memory. On one 4090, run a 32B model at 4-bit instead (Qwen2.5 32B, or the 32B DeepSeek-R1 distill, which is a distilled Qwen-based model rather than the full R1). That fits comfortably.
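
As a quick sanity check on the numbers above, here is a minimal sketch that compares an estimated requirement against the pooled memory of one or more cards. It assumes 0.5 GB per billion parameters at 4-bit plus a flat 20% overhead, and it ignores the extra cost of splitting a model across GPUs; `fits_on_gpus` is a hypothetical helper, not part of any library.

```python
def fits_on_gpus(params_billion: float, vram_gb_per_gpu: float,
                 num_gpus: int = 1, gb_per_billion: float = 0.5,
                 overhead: float = 0.20) -> tuple[bool, float]:
    """Return (fits, needed_gb) under the simple rule-of-thumb estimate."""
    needed_gb = params_billion * gb_per_billion * (1.0 + overhead)
    total_gb = vram_gb_per_gpu * num_gpus  # assumes memory pools perfectly
    return needed_gb <= total_gb, needed_gb


for label, params, gpus in [("70B on one 24 GB card", 70, 1),
                            ("70B on two 24 GB cards", 70, 2),
                            ("32B on one 24 GB card", 32, 1)]:
    ok, needed = fits_on_gpus(params, 24.0, gpus)
    print(f"{label}: need ~{needed:.1f} GB -> {'fits' if ok else 'does not fit'}")
```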

Question 3

What's the best LLM for a 16 GB GPU (RTX 4080 / 5080 / 4060 Ti 16GB)?

Accepted Answer

At 4-bit you can comfortably fit 7B–14B models (Llama 3.1 8B, Qwen2.5 14B, Phi-4 14B, Gemma 3 12B) with room for long context. A 24B (Mistral Small 3) is tight but workable at short context; a 27B (Gemma) is borderline. Note that the 4060 Ti 16GB has a narrow 128-bit memory bus, so it fits the same models but generates tokens noticeably more slowly than a 4080/5080.
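
To see why 24B is tight and 27B is borderline, this sketch computes how much of a 16 GB card is left over after the 4-bit weights alone. It assumes exactly 0.5 GB per billion parameters, which is optimistic for real Q4_K_M files; the model list and helper name are illustrative only.

```python
CARD_GB = 16.0
GB_PER_BILLION_Q4 = 0.5  # assumed 4-bit figure from the rule of thumb

# Parameter counts in billions for the model sizes mentioned above.
MODELS = {
    "Llama 3.1 8B": 8,
    "Qwen2.5 14B": 14,
    "Mistral Small 3 24B": 24,
    "Gemma 27B": 27,
}


def weight_headroom_gb(params_billion: float, card_gb: float = CARD_GB) -> float:
    """GB left for KV cache and overhead after loading 4-bit weights."""
    return card_gb - params_billion * GB_PER_BILLION_Q4


for name, size in MODELS.items():
    weights = size * GB_PER_BILLION_Q4
    print(f"{name}: weights ~{weights:.1f} GB, "
          f"headroom ~{weight_headroom_gb(size):.1f} GB")
```

The smaller the headroom, the shorter the context you can afford before the runtime spills to system memory or refuses to load.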

Question 4

Does Apple unified memory count as VRAM for LLMs?

Accepted Answer

Yes, and it's one of Apple Silicon's biggest advantages — the GPU shares the whole unified memory pool. But macOS reserves part of it for the system, so the practical budget for model weights is roughly 70–75% of the total. A 64 GB Mac gives ~48 GB usable (enough for a 70B at 4-bit), and 128 GB gives ~96 GB usable. You can raise the GPU memory cap with a sysctl tweak if you need more.
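
A minimal sketch of that budget calculation, assuming the GPU can use a fixed 75% of unified memory. The fraction is the upper end of the range quoted above, and the function name is hypothetical; the real default limit varies by machine and macOS version.

```python
def usable_unified_memory_gb(total_gb: float, gpu_fraction: float = 0.75) -> float:
    """Approximate share of unified memory available to the GPU.

    gpu_fraction is an assumption (roughly 0.70 to 0.75 by default).
    """
    return total_gb * gpu_fraction


for total in (64, 128):
    print(f"{total} GB Mac: ~{usable_unified_memory_gb(total):.0f} GB usable")

# A 70B model at 4-bit needs about 42 GB by the rule of thumb (70 x 0.5 x 1.2),
# which fits inside the ~48 GB budget of a 64 GB machine.
print(70 * 0.5 * 1.2 <= usable_unified_memory_gb(64))
```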

Question 5

FP16 vs 8-bit vs 4-bit — which quantization should I use?

Accepted Answer

For local chat, 4-bit (Q4_K_M) is what almost everyone runs: it cuts VRAM ~75% versus FP16 while losing only ~1.5–2 points on benchmarks. 8-bit is near-lossless (under ~1 point) but doubles the memory of 4-bit. FP16 is reference quality but rarely worth it locally — it's mostly for training/fine-tuning, not inference.
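
The memory side of that trade-off is simple arithmetic. The sketch below uses nominal bytes per weight (2, 1, and 0.5) and a 14B model as an arbitrary example; real quantized files carry some extra metadata, so treat these as approximations.

```python
# Nominal bytes per weight for each format (approximate).
BYTES_PER_WEIGHT = {"FP16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def savings_vs_fp16(fmt: str) -> float:
    """Fraction of weight memory saved relative to FP16."""
    return 1.0 - BYTES_PER_WEIGHT[fmt] / BYTES_PER_WEIGHT["FP16"]


def weights_gb(params_billion: float, fmt: str) -> float:
    """Weight size in GB for a model with the given parameter count."""
    return params_billion * BYTES_PER_WEIGHT[fmt]


for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: 14B weights ~{weights_gb(14, fmt):.0f} GB, "
          f"saves {savings_vs_fp16(fmt):.0%} vs FP16")
```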

Question 6

Why does longer context need more VRAM?

Accepted Answer

Beyond the fixed weight memory, the model stores a KV cache for every token in the conversation, and that grows with context length. A short 4K chat adds only ~12% overhead, but a full 128K-token context can add 50–60% on top of the weights — which can push a model that 'fits' at 8K into 'won't fit' at 128K.
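
The linear growth comes straight from how a standard transformer KV cache is sized: one key and one value vector per layer, per KV head, per token. The sketch below uses that formula with illustrative architecture values (32 layers, 8 KV heads, head dimension 128, FP16 cache), which are assumptions resembling an 8B-class model with grouped-query attention. Runtimes that quantize the cache or use other layouts will show different absolute numbers, so the percentages quoted above will not match every model.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a plain KV cache: keys plus values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_tokens


# Assumed, illustrative architecture values (not read from any model file).
ARCH = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 131072):
    gib = kv_cache_bytes(ctx, **ARCH) / 1024**3
    print(f"{ctx:>7} tokens: ~{gib:.1f} GiB of KV cache")
```

Because the cache scales with token count while the weights stay fixed, going from 4K to 128K multiplies the cache by 32 under these assumptions.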
