Question 1

How much VRAM do I need to fine-tune a 7B model?

Accepted Answer

It depends entirely on the method. A full fine-tune of a 7B model (mixed-precision Adam) needs roughly 18 GB per billion parameters: about 16 GB for weights, gradients, optimizer states, and an fp32 master copy, plus an activation allowance. That comes to about 125 GB, so multiple data-center GPUs. LoRA drops that to ~17 GB because the base model stays frozen in 16-bit and you train only small adapters, so it fits on a single 24 GB card like an RTX 4090. QLoRA quantizes the frozen base to 4-bit and needs only ~6 GB, so a 7B fine-tune runs on an 8–12 GB consumer GPU.
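The arithmetic behind those three figures can be sketched in a few lines. This is only a back-of-the-envelope helper: the per-billion constants are back-derived from the numbers quoted in this answer, and linear scaling with model size is an assumption; `estimate_vram_gb` is a hypothetical name.

```python
# Rough GB of VRAM per billion parameters, back-derived from the figures
# quoted in the answer (assumption: cost scales linearly with model size).
GB_PER_BILLION = {
    "full": 18.0,   # ~125 GB for 7B
    "lora": 2.4,    # ~17 GB for 7B
    "qlora": 0.85,  # ~6 GB for 7B
}


def estimate_vram_gb(params_billion: float, method: str) -> float:
    """Return a rough VRAM estimate in GB for the given fine-tuning method."""
    return params_billion * GB_PER_BILLION[method]


if __name__ == "__main__":
    for method in GB_PER_BILLION:
        print(f"{method:>5}: ~{estimate_vram_gb(7, method):.0f} GB")
```

Real usage will move with batch size, sequence length, and adapter rank, so treat the output as a starting point rather than a guarantee.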

Question 2

Why does fine-tuning need so much more memory than inference?

Accepted Answer

Inference stores only the weights plus a small KV cache. Training adds three big costs on top: gradients (same size as the weights), optimizer state (Adam keeps two fp32 moments per parameter, which is 8 bytes/param alone, and mixed precision adds an fp32 master copy of the weights), and activations saved for backpropagation. For a full fine-tune that's roughly 16–18 bytes per parameter versus ~2 for 16-bit inference, about 8–9× more.
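The per-parameter tally can be written out explicitly. The sketch below assumes the common mixed-precision Adam layout (16-bit weights and gradients, an fp32 master copy of the weights, and two fp32 moments) and leaves activations out, since they depend on batch size and sequence length rather than parameter count.

```python
# Bytes per parameter held during a mixed-precision Adam full fine-tune
# (assumed layout; activations are excluded because they scale with
# batch size and sequence length, not with parameter count).
TRAIN_BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}
INFERENCE_BYTES_PER_PARAM = 2  # 16-bit weights only


def training_bytes_per_param() -> int:
    """Sum the per-parameter training costs listed above."""
    return sum(TRAIN_BYTES_PER_PARAM.values())


def training_to_inference_ratio() -> float:
    """How many times more memory per parameter training needs."""
    return training_bytes_per_param() / INFERENCE_BYTES_PER_PARAM


if __name__ == "__main__":
    print(training_bytes_per_param(), "bytes/param for training")
    print(f"{training_to_inference_ratio():.0f}x the 16-bit inference cost")
```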

Question 3

What's the difference between LoRA and QLoRA for VRAM?

Accepted Answer

Both freeze the base model and train small low-rank adapters, so the optimizer cost is tiny. The difference is how the frozen base is stored: LoRA keeps it in 16-bit (~2 bytes/param), QLoRA quantizes it to 4-bit (~0.5 bytes/param). That 4× cut on the largest memory component is why QLoRA can fine-tune a 70B model on a single 80 GB GPU, where LoRA would need two or three.
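To see why the adapters are cheap and the frozen base dominates, here is a small sketch. It uses the standard LoRA parameter count (a rank-r update to a d_out × d_in matrix adds r × (d_in + d_out) parameters); the 4096 × 4096 layer shape and rank 16 are illustrative assumptions, and both function names are hypothetical.

```python
def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with A of shape (rank, d_in) and B of shape
    (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def frozen_base_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the frozen base weights alone, in GB."""
    return params_billion * bytes_per_param


if __name__ == "__main__":
    # Illustrative layer shape and rank (assumptions, not from the answer).
    full = 4096 * 4096
    added = lora_adapter_params(4096, 4096, rank=16)
    print(f"adapter adds {added:,} params ({added / full:.2%} of the layer)")

    # Frozen base storage for a 70B model: 16-bit (LoRA) vs 4-bit (QLoRA).
    print("LoRA base: ", frozen_base_gb(70, 2.0), "GB")
    print("QLoRA base:", frozen_base_gb(70, 0.5), "GB")
```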

Question 4

Can I fine-tune a 70B model on a single GPU?

Accepted Answer

Not with full fine-tuning, which needs well over a terabyte of VRAM (many 80 GB GPUs). But QLoRA makes it possible: a 70B model in 4-bit plus adapters lands around 60 GB, which fits on a single 80 GB A100 or H100 (the original QLoRA paper's headline result was similar: a 65B model fine-tuned on a single 48 GB GPU). On consumer hardware you'd step down to a 32B or smaller model.
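Turning a memory estimate into a GPU count is a ceiling division. This is a minimal sketch, assuming the memory splits evenly across cards and that each card's full capacity is usable; real multi-GPU setups lose some memory to framework and communication overhead, so read the result as a lower bound. `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(required_gb: float, gpu_gb: float) -> int:
    """Lower bound on GPU count, assuming perfectly even sharding."""
    return math.ceil(required_gb / gpu_gb)


if __name__ == "__main__":
    full_70b_gb = 70 * 18   # ~18 GB per billion params for a full fine-tune
    qlora_70b_gb = 60       # figure quoted in the answer
    print("full fine-tune on 80 GB cards:", gpus_needed(full_70b_gb, 80))
    print("QLoRA on 80 GB cards:         ", gpus_needed(qlora_70b_gb, 80))
    print("QLoRA on 24 GB cards:         ", gpus_needed(qlora_70b_gb, 24))
```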

Question 5

Does gradient checkpointing reduce the VRAM needed?

Accepted Answer

Yes — gradient (activation) checkpointing recomputes activations during the backward pass instead of storing them, which can cut activation memory substantially at the cost of ~20–30% more compute time. The estimates here assume a modest single-batch activation allowance without aggressive checkpointing; turning it on, and lowering batch size or sequence length, all reduce the real requirement.
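The effect of checkpointing, batch size, and sequence length can be illustrated with a deliberately crude model. It assumes every saved tensor has shape batch × seq_len × hidden in 16-bit, and that checkpointing keeps one input per layer plus one layer's intermediates while that layer is recomputed. The `tensors_per_layer` value is a hypothetical placeholder, not a measured figure, and the 32-layer, 4096-hidden shape is just an assumed 7B-class configuration.

```python
def activation_gb(
    batch: int,
    seq_len: int,
    hidden: int,
    layers: int,
    tensors_per_layer: int,
    checkpointing: bool,
    bytes_per_value: int = 2,
) -> float:
    """Crude activation-memory estimate in GB (illustrative only)."""
    one_tensor = batch * seq_len * hidden * bytes_per_value
    if checkpointing:
        # Keep one input per layer, plus one layer's intermediates
        # while that layer is being recomputed in the backward pass.
        count = layers + tensors_per_layer
    else:
        # Keep every intermediate tensor of every layer.
        count = layers * tensors_per_layer
    return one_tensor * count / 1e9


if __name__ == "__main__":
    # Assumed 7B-class shape; tensors_per_layer is a hypothetical value.
    shape = dict(batch=1, seq_len=2048, hidden=4096, layers=32,
                 tensors_per_layer=16)
    print(f"no checkpointing: {activation_gb(**shape, checkpointing=False):.2f} GB")
    print(f"checkpointing:    {activation_gb(**shape, checkpointing=True):.2f} GB")
```

Because the estimate is linear in both `batch` and `seq_len`, halving either one halves the activation memory in this model.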
