Question 1

What GPU do I need to run Llama 3.3 70B locally?

Accepted Answer

At 4-bit (Q4_K_M), Llama 3.3 70B needs roughly 42–46 GB including context, so no single consumer card runs it comfortably. You'll want two 24 GB cards (dual RTX 3090/4090) or an Apple machine with 64 GB+ unified memory (~48 GB usable). For a single card, drop to a 32B model (Qwen2.5 32B or the DeepSeek-R1 32B distill), which fits a 24 GB 4090 at 4-bit.
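The 42–46 GB figure can be sanity-checked with a rough back-of-the-envelope sketch: quantized weights plus the KV cache for the context. The inputs below are assumptions, not figures from the answer: about 70.6B parameters, about 4.85 effective bits per weight for Q4_K_M, the published Llama 3 70B layout (80 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and an 8192-token context. The function names are hypothetical, and runtime overhead (compute buffers, driver allocations) is not counted.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Assumed inputs (see the note above); adjust them for your model and context.
weights = weights_gb(params_billion=70.6, bits_per_weight=4.85)
cache = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=8192)
total = weights + cache

print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, total ~{total:.1f} GB")
```

Under these assumptions the total lands at roughly 45 GB, inside the range quoted above. A longer context or a higher-precision quant pushes it up quickly.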

Question 2

What's the best value GPU for running local LLMs?

Accepted Answer

For the most VRAM per dollar, a used RTX 3090 (24 GB) is the enthusiast favorite: it runs 32B models at 4-bit comfortably and pairs up for 70B. On a budget, the RTX 3060 12GB handles 7B–8B models well. Apple Silicon is excellent value if you already have a Mac with lots of unified memory, since the GPU can use most of that pool.

Question 3

Is more VRAM or more speed more important?

Accepted Answer

VRAM decides whether a model runs at all; memory bandwidth decides how fast. Get enough VRAM for the model + context first, then prioritize bandwidth. A 4060 Ti 16GB fits the same models as a 4080 but has a much narrower memory bus (128-bit at about 288 GB/s versus 256-bit at about 717 GB/s), so it generates noticeably slower on larger models.
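To see why bandwidth sets the speed, here is a minimal sketch of the usual rule of thumb: generating one token reads every weight once, so tokens per second cannot exceed memory bandwidth divided by model size. The bandwidth figures are spec-sheet values (288 GB/s for the 4060 Ti 16GB, 716.8 GB/s for the 4080), and the 9 GB model size is a hypothetical stand-in for a roughly 14B model at 4-bit. Real throughput is lower than this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 9.0  # hypothetical: roughly a 14B model at 4-bit

# Spec-sheet memory bandwidth in GB/s (assumed, not measured).
cards = {"RTX 4060 Ti 16GB": 288.0, "RTX 4080": 716.8}

for name, bandwidth in cards.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Both cards hold the same model, but the ceiling on the 4080 is about 2.5 times higher, which is the gap the answer describes.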

Question 4

Can I run local LLMs on an AMD GPU or a Mac?

Accepted Answer

Yes. AMD RX 7900 XT/XTX (20–24 GB) run local LLMs via ROCm or Vulkan, though the software ecosystem is less mature than NVIDIA's CUDA. Apple Silicon runs them very well via MLX/llama.cpp, and its large unified memory lets even a laptop hold models that would otherwise need multiple discrete GPUs. Just budget ~72% of total memory as usable.
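The ~72% rule of thumb can be turned into a quick fit check. This is only a sketch: the function names are hypothetical, the fraction is the answer's estimate rather than a fixed system limit, and the 44 GB requirement is an example value for a 70B model at 4-bit with context.

```python
USABLE_FRACTION = 0.72  # rule of thumb from the answer above, not a hard limit


def usable_unified_memory_gb(total_gb: float, fraction: float = USABLE_FRACTION) -> float:
    """Estimate how much unified memory the GPU can realistically use."""
    return total_gb * fraction


def fits_on_mac(total_gb: float, required_gb: float) -> bool:
    """True if the model plus context fits in the usable share of memory."""
    return required_gb <= usable_unified_memory_gb(total_gb)


REQUIRED_GB = 44.0  # example: 70B at 4-bit plus context

for total in (32, 64, 96):
    usable = usable_unified_memory_gb(total)
    verdict = "fits" if fits_on_mac(total, REQUIRED_GB) else "does not fit"
    print(f"{total} GB Mac: ~{usable:.0f} GB usable, {REQUIRED_GB:.0f} GB model {verdict}")
```

On these numbers a 64 GB machine clears the bar only narrowly, so a long context may still push it over.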

What GPU Do I Need?

Frequently asked