Free tool

Fine-Tuning VRAM Calculator

How much GPU memory does it take to fine-tune an LLM? Pick a model size and see full fine-tuning, LoRA, and QLoRA side by side — with the minimum GPU each one needs. Training memory is a different world from inference, and the method you pick is the whole story.

Fine-tuning a 7B model

QLoRA fits on a single consumer GPU (~6 GB) — full fine-tuning needs 2× 80 GB (A100 / H100).

  • Full fine-tune
    126 GB

    Every weight trained (Adam, mixed-precision). Best quality, brutally heavy — usually multi-GPU.

    Minimum: 2× 80 GB (A100 / H100)

  • LoRA
    16.8 GB

    Base frozen in 16-bit; train small adapters. About 7–8× lighter than full by the estimates here (16.8 GB vs 126 GB), with near-full quality for most tasks.

    Minimum: 24 GB GPU (RTX 4090 / 3090)

  • QLoRA
    6 GB

    Base quantized to 4-bit + adapters. The consumer-GPU sweet spot — fine-tune big models on one card.

    Minimum: 8 GB GPU (RTX 4060 / 3070)

Just want to run a model, not train it?

Estimates in bytes-per-parameter, mixed-precision: full ≈ 18 (weights + grads + Adam m/v + fp32 master + overhead), LoRA ≈ 2.4 (frozen 16-bit base dominates), QLoRA ≈ 0.85 (4-bit base + paged adapter optimizer). These include a modest single-batch activation allowance — real VRAM moves with batch size, sequence length and gradient checkpointing (which can cut activation memory a lot at some speed cost). Dense models; not a benchmark. Re-verify against your framework before committing hardware.
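
As a minimal sketch, the bytes-per-parameter figures above can be turned into a rough estimator. The function name and dictionary below are illustrative, not part of any framework. The sketch assumes 1 GB = 1e9 bytes and a dense model, and it inherits every caveat above about batch size, sequence length and checkpointing.

```python
# Bytes per parameter, copied from the estimates quoted above.
# These are rule-of-thumb values, not measurements.
BYTES_PER_PARAM = {
    "full": 18.0,   # weights + grads + Adam m/v + fp32 master + overhead
    "lora": 2.4,    # frozen 16-bit base dominates
    "qlora": 0.85,  # 4-bit base + paged adapter optimizer
}


def estimate_vram_gb(params_billions: float, method: str) -> float:
    """Rough training VRAM in GB (1 GB = 1e9 bytes) for a dense model."""
    if method not in BYTES_PER_PARAM:
        raise ValueError(f"unknown method: {method!r}")
    # 1e9 params * N bytes/param = N GB, so the billions cancel out.
    return params_billions * BYTES_PER_PARAM[method]


# Usage: the three cards shown for a 7B model.
for name in BYTES_PER_PARAM:
    print(f"{name:>5}: {estimate_vram_gb(7, name):.2f} GB")
```

For a 7B model this reproduces the 126 GB, 16.8 GB and roughly 6 GB figures shown on the cards.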

Frequently asked

How much VRAM do I need to fine-tune a 7B model?
It depends entirely on the method. A full fine-tune of a 7B model (mixed-precision Adam) needs roughly 18 GB per billion parameters for weights, gradients, optimizer states and a master copy: about 126 GB, so multiple data-center GPUs. LoRA drops that to ~17 GB because the base model stays frozen in 16-bit and you only train small adapters, so it fits on a single 24 GB card like an RTX 4090. QLoRA quantizes the frozen base to 4-bit and needs only ~6 GB, so a 7B fine-tune runs on an 8–12 GB consumer GPU.
Why does fine-tuning need so much more memory than inference?
Inference stores only the weights plus a small KV cache. Training adds three big costs on top: gradients (same size as the weights), optimizer state (Adam keeps two fp32 moments per parameter, which is 8 bytes/param alone), and activations saved for backpropagation. For a full fine-tune that's roughly 16–18 bytes per parameter versus ~2 for 16-bit inference, about 8–9× more.
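
The per-parameter arithmetic behind that ratio can be sketched as below. The component split is an assumption based on the usual mixed-precision Adam accounting (16-bit weights and gradients, fp32 moments, fp32 master copy). Activations and framework overhead are left out, which is what lifts the total from 16 toward 18 bytes.

```python
# Per-parameter cost of a mixed-precision Adam full fine-tune, in bytes.
# Assumed split; activations and framework overhead are not included.
TRAINING_BYTES = {
    "16-bit weights": 2,
    "16-bit gradients": 2,          # same size as the weights
    "Adam moments (2 x fp32)": 8,   # first and second moment, 4 bytes each
    "fp32 master weights": 4,
}
INFERENCE_BYTES = 2  # 16-bit weights only; KV cache ignored

training_total = sum(TRAINING_BYTES.values())
ratio = training_total / INFERENCE_BYTES

for component, size in TRAINING_BYTES.items():
    print(f"{component:<26} {size} bytes/param")
print(f"training total: {training_total} bytes/param, {ratio:.0f}x inference")
```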
What's the difference between LoRA and QLoRA for VRAM?
Both freeze the base model and train small low-rank adapters, so the optimizer cost is tiny. The difference is how the frozen base is stored: LoRA keeps it in 16-bit (~2 bytes/param), QLoRA quantizes it to 4-bit (~0.5 bytes/param). That 4× cut on the largest memory component is why QLoRA can fine-tune a 70B model on a single 80 GB GPU, where LoRA (about 168 GB at 2.4 bytes/param) would need three.
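
To make the storage difference concrete, here is a small sketch of the frozen-base footprint alone. The helper name is hypothetical. It ignores adapters, activations and the small amount of extra memory that quantization metadata adds in practice.

```python
def frozen_base_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the frozen base weights only, in GB (1 GB = 1e9 bytes)."""
    # bits / 8 gives bytes per weight; billions of params map directly to GB.
    return params_billions * bits_per_weight / 8


lora_base = frozen_base_gb(70, 16)   # LoRA: 16-bit base
qlora_base = frozen_base_gb(70, 4)   # QLoRA: 4-bit base
print(f"LoRA base:  {lora_base:.0f} GB")
print(f"QLoRA base: {qlora_base:.0f} GB")
print(f"reduction:  {lora_base / qlora_base:.0f}x")
```

For a 70B model the base alone is 140 GB in 16-bit and 35 GB in 4-bit, before any training overhead is added.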
Can I fine-tune a 70B model on a single GPU?
Not with full fine-tuning: at roughly 18 bytes per parameter that needs well over a terabyte of VRAM (many 80 GB GPUs). But QLoRA makes it possible: a 70B model in 4-bit plus adapters lands around 60 GB, which fits on a single 80 GB A100 or H100 (the original QLoRA paper reported fine-tuning a 65B model on a single 48 GB GPU). On consumer hardware you'd step down to a 32B or smaller model.
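
A rough way to turn these totals into a GPU count is sketched below. The helper is hypothetical and gives a lower bound only: it assumes memory splits perfectly across cards and ignores sharding and communication overhead, so real multi-GPU setups may need more.

```python
import math


def gpus_needed(params_billions: float, bytes_per_param: float,
                gpu_gb: float = 80.0) -> int:
    """Lower bound on GPU count, assuming memory splits evenly across cards."""
    total_gb = params_billions * bytes_per_param
    return math.ceil(total_gb / gpu_gb)


# 70B model on 80 GB cards, using the bytes-per-parameter estimates above.
for method, bpp in [("full", 18.0), ("lora", 2.4), ("qlora", 0.85)]:
    print(f"{method:>5}: at least {gpus_needed(70, bpp)} x 80 GB")
```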
Does gradient checkpointing reduce the VRAM needed?
Yes — gradient (activation) checkpointing recomputes activations during the backward pass instead of storing them, which can cut activation memory substantially at the cost of ~20–30% more compute time. The estimates here assume a modest single-batch activation allowance without aggressive checkpointing; turning it on, and lowering batch size or sequence length, all reduce the real requirement.