
How to choose the right quantization for a local LLM

Signal Desk, May 24, 2026 (updated Jul 29, 2026)

Decode the Q4, Q5, and Q8 labels on model files, understand what bits-per-weight actually costs you, and pick a quantization that fits your RAM without wrecking quality.



By the end of this tutorial you will be able to look at a list of model files like Q4_K_M, Q5_K_M, and Q8_0 and pick the one that fits your hardware and your quality bar on purpose, not by guessing. You need no special tools, only a model that ships in quantized form (most models distributed as GGUF or in similar formats do) and a rough idea of how much RAM or VRAM you can spare.

Step 1: Understand what quantization is

A model's weights are numbers. By default they are stored as 16-bit or 32-bit floating-point values. Quantization stores those same weights using fewer bits each: 8 bits, 5 bits, 4 bits, sometimes fewer. Lower precision means each weight takes less memory and moves through the processor faster, which is why a quantized model loads on a laptop that could never hold the full-precision version.

The tradeoff is rounding error. Squeezing a weight into 4 bits throws away information, and that error accumulates across billions of parameters. Done well, the loss is small and barely shows up in output quality. Done too aggressively, the model starts producing weaker, less coherent text.
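To make the rounding error concrete, here is a minimal sketch of block-wise round-to-nearest quantization in plain Python. It is a generic illustration, not the algorithm any particular file format uses; the block size of 32, the Gaussian test weights, and all function names are assumptions chosen for the example.

```python
import random

def quantize_block(weights, bits):
    """Symmetric round-to-nearest quantization of one block of floats."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    # Each weight becomes a small integer; the block keeps one float scale.
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # illustrative weights
for bits in (8, 5, 4, 3):
    print(f"{bits}-bit mean abs error: {mean_abs_error(block, bits):.6f}")
```

Running it shows the error growing as the bit count drops, which is the tradeoff described above in miniature.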

Step 2: Read the labels

Quantized files carry a naming convention worth learning. Take Q4_K_M:

Q4 means roughly 4 bits per weight.
K means a k-quant scheme, which assigns precision per block of weights rather than uniformly. K-quants generally beat the older uniform methods at the same bit budget.
M is the size variant within that scheme: S for small, M for medium, L for large. Larger variants keep more precision on the weights that matter most.

So Q4_K_M is a 4-bit k-quant, medium variant. Q5_K_M is the 5-bit equivalent, larger on disk and a touch more faithful. Q8_0 is 8-bit, close to full precision and rarely worth it over the smaller k-quants for most uses.
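The naming convention is regular enough to parse mechanically. The sketch below splits a label into the parts described above; `parse_quant_label` is a hypothetical helper, and it only handles the two label shapes this tutorial mentions (a k-quant with an optional S/M/L suffix, and a plain label such as Q8_0).

```python
import re

VARIANTS = {"S": "small", "M": "medium", "L": "large"}

# Two label shapes covered here: Q<bits>_K[_<variant>] and Q<bits>_<digit>.
K_QUANT = re.compile(r"Q(\d+)_K(?:_([SML]))?")
PLAIN = re.compile(r"Q(\d+)_(\d+)")

def parse_quant_label(label):
    """Split a quantization label into nominal bits, scheme, and size variant."""
    match = K_QUANT.fullmatch(label)
    if match:
        return {
            "bits": int(match.group(1)),
            "scheme": "k-quant",
            "variant": VARIANTS.get(match.group(2)),  # None if no suffix
        }
    match = PLAIN.fullmatch(label)
    if match:
        return {"bits": int(match.group(1)), "scheme": "non-k-quant", "variant": None}
    raise ValueError(f"unrecognized quantization label: {label!r}")

for name in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(name, parse_quant_label(name))
```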

Step 3: Estimate the memory cost

You can predict a quantized file's footprint with simple arithmetic. Multiply the parameter count by the bits per weight, then divide by eight to get bytes:

bytes = parameters x bits_per_weight / 8

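An 8B-parameter model at 4 bits is roughly 8e9 x 4 / 8 = 4 GB of weights. At 8 bits it is about 8 GB. Treat these as lower bounds: real files usually come out somewhat larger than the nominal figure, because the format also stores per-block scaling data and often keeps some tensors at higher precision. On top of the weights you need headroom for the context window (the KV cache), which grows with how many tokens you feed in. Budget at least a couple of extra gigabytes beyond the raw weight size, more for long contexts.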
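The arithmetic is easy to wrap in a small helper. This sketch uses decimal gigabytes (1 GB = 1e9 bytes) to match the figures above; the function names and the default 2 GB headroom are assumptions taken from the "couple of extra gigabytes" rule of thumb, not measured values.

```python
def weight_bytes(parameters, bits_per_weight):
    """Nominal size of the weights: parameters x bits_per_weight / 8."""
    return parameters * bits_per_weight / 8

def estimate_total_gb(parameters, bits_per_weight, headroom_gb=2.0):
    """Weights plus a flat headroom allowance for the KV cache, in GB."""
    return weight_bytes(parameters, bits_per_weight) / 1e9 + headroom_gb

for bits in (4, 5, 8):
    weights_gb = weight_bytes(8e9, bits) / 1e9
    total_gb = estimate_total_gb(8e9, bits)
    print(f"8B at {bits} bits: {weights_gb:.1f} GB weights, ~{total_gb:.1f} GB with headroom")
```

Raise `headroom_gb` if you expect long prompts, since the flat allowance is only a starting guess.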

Step 4: Pick a starting point

For most people on most hardware, start at Q4_K_M. It is the widely used sweet spot: a 4-bit model is small enough to run on modest machines while keeping quality close to the full-precision original. From there, adjust:

If you have RAM or VRAM to spare and want the last few percent of quality, step up to Q5_K_M.
If the model barely fits or generation is too slow, step down to Q4_K_S or a smaller parameter count.
Reach for Q8_0 only when you are comparing against full precision or doing work where small quality differences matter a lot.

The single biggest quality lever is usually parameter count, not the last bit of quantization. A Q4_K_M 8B model will typically beat a Q8_0 3B model. Spend your memory on more parameters first, then on more bits.
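The "parameters first, then bits" rule can be sketched as a small selection routine. Everything here is illustrative: `pick_model` is a hypothetical helper, the bit counts are nominal rather than measured file sizes, the 2 GB headroom is the rough allowance from Step 3, and the routine skips the Q4_K_S step-down for brevity.

```python
# Candidate quantizations in ascending order of nominal bits per weight.
QUANTS = [("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8)]

def pick_model(param_counts, budget_gb, headroom_gb=2.0):
    """Prefer the largest parameter count that fits, then the most bits that fit."""
    for params in sorted(param_counts, reverse=True):
        fitting = [
            label
            for label, bits in QUANTS
            if params * bits / 8 / 1e9 + headroom_gb <= budget_gb
        ]
        if fitting:
            return params, fitting[-1]  # highest-bit option that still fits
    return None  # nothing fits; consider a smaller model or more memory

print(pick_model([3e9, 8e9], budget_gb=8.0))
print(pick_model([3e9, 8e9], budget_gb=4.5))
```

With an 8 GB budget the routine chooses the 8B model over a higher-bit 3B model, which mirrors the advice above.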

Step 5: Verify the choice in practice

Numbers on paper are a starting estimate. Confirm the fit by loading the model and watching three things: does it load without paging weights to disk, does generation run at a usable speed, and does the output quality hold up on your own prompts. If the model loads but answers feel degraded, move up one quantization step before blaming the model itself. If it will not fit, move down a step or shrink the parameter count.
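The speed check and the step-up or step-down decision can be sketched as follows. The `generate` argument is a hypothetical callable standing in for whatever runtime you use (it is assumed to return a list of tokens), and `fake_generate` exists only so the example runs on its own; judging output quality is still left to you.

```python
import time

def tokens_per_second(generate, prompt, max_tokens):
    """Time one call to `generate` and return the observed token rate."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")

def next_move(fits_in_memory, fast_enough, quality_ok):
    """Encode the adjustment rule from this step as a simple decision."""
    if not fits_in_memory or not fast_enough:
        return "step down a quantization level or pick a smaller parameter count"
    if not quality_ok:
        return "step up one quantization level"
    return "keep this quantization"

def fake_generate(prompt, max_tokens):
    """Stand-in backend: pretends to produce max_tokens tokens."""
    time.sleep(0.01)
    return ["tok"] * max_tokens

rate = tokens_per_second(fake_generate, "hello", 20)
print(f"{rate:.0f} tokens/s (stand-in backend)")
print(next_move(fits_in_memory=True, fast_enough=True, quality_ok=False))
```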

Where this breaks

The most common mistake is chasing higher bit counts when the real problem is too few parameters. Eight bits of a tiny model is still a tiny model. Decide on parameter count first, then choose the largest quantization that fits.

The second pitfall is forgetting the context window. A long prompt or a big retrieval payload inflates the KV cache, and a model that fit comfortably with short prompts can run out of memory once you push the context. If you plan to feed in large documents, leave more headroom than the weight math alone suggests.
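A rough way to see how the KV cache scales is the standard transformer estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture numbers below (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are illustrative assumptions, not the specs of any particular model; substitute the values from your model's configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_element=2):
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element

# Hypothetical architecture, chosen only to show how the cache grows with context.
for n_tokens in (2048, 8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=n_tokens)
    print(f"{n_tokens:>6} tokens: {size / 2**30:.2f} GiB")
```

Because the cache grows linearly with token count, quadrupling the context quadruples this figure, which is why a model that fits with short prompts can run out of memory on long ones.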

Finally, do not assume a fixed quality penalty for 4-bit across all model sizes. Larger models tend to tolerate aggressive quantization better than small ones, because they have more redundancy to spare. Test on your actual workload rather than trusting a single rule of thumb.

Frequently asked questions

What does Q4_K_M mean on a model file?

Q4 means roughly 4 bits per weight, K means a k-quant scheme that assigns precision per block of weights, and M is the medium size variant (S for small, M for medium, L for large). So Q4_K_M is a 4-bit k-quant, medium variant.

Which quantization should I start with?

For most people on most hardware, start at Q4_K_M. It is the widely used sweet spot: small enough to run on modest machines while keeping quality close to the full-precision original. Step up to Q5_K_M if you have memory to spare, or down to Q4_K_S if it barely fits.

How do I estimate how much memory a quantized model needs?

Multiply the parameter count by bits per weight, then divide by eight to get bytes. An 8B model at 4 bits is about 4 GB of weights; at 8 bits about 8 GB. Add at least a couple of extra gigabytes for the context window (KV cache), more for long contexts.

Is Q8_0 worth it over smaller k-quants?

Rarely. Q8_0 is 8-bit and close to full precision, but for most uses it is not worth it over the smaller k-quants. Reach for it only when comparing against full precision or doing work where small quality differences matter a lot.

What matters more, parameter count or quantization bits?

Parameter count is usually the bigger quality lever. A Q4_K_M 8B model will typically beat a Q8_0 3B model, so spend your memory on more parameters first, then on more bits.