Decode the Q4, Q5, and Q8 labels on model files, understand what bits-per-weight actually costs you, and pick a quantization that fits your RAM without wrecking quality.
A deep read — the full picture, with the receipts.
By the end of this tutorial you will be able to look at a list of model files like Q4_K_M, Q5_K_M, and Q8_0 and pick the one that fits your hardware and your quality bar on purpose, not by guessing. You need no special tools, only a model that ships in quantized form (most GGUF and similar formats do) and a rough idea of how much RAM or VRAM you can spare.
A model's weights are numbers. By default they are stored as 16-bit or 32-bit floating-point values. Quantization stores those same weights using fewer bits each: 8 bits, 5 bits, 4 bits, sometimes fewer. Lower precision means each weight takes less memory and moves through the processor faster, which is why a quantized model loads on a laptop that could never hold the full-precision version.
The tradeoff is rounding error. Squeezing a weight into 4 bits throws away information, and that error accumulates across billions of parameters. Done well, the loss is small and barely shows up in output quality. Done too aggressively, the model starts producing weaker, less coherent text.
Quantized files carry a naming convention worth learning. Take Q4_K_M:
Q4 means roughly 4 bits per weight.
K means a k-quant scheme, which assigns precision per block of weights rather than uniformly. K-quants generally beat the older uniform methods at the same bit budget.
M is the size variant within that scheme: S for small, M for medium, L for large. Larger variants keep more precision on the weights that matter most.
So Q4_K_M is a 4-bit k-quant, medium variant. Q5_K_M is the 5-bit equivalent, larger on disk and a touch more faithful. Q8_0 is 8-bit, close to full precision and rarely worth it over the smaller k-quants for most uses.
You can predict a quantized file's footprint with simple arithmetic. Multiply the parameter count by the bits per weight, then divide by eight to get bytes:
bytes = parameters x bits_per_weight / 8
An 8B-parameter model at 4 bits is roughly 8e9 x 4 / 8 = 4 GB of weights. At 8 bits it is about 8 GB. On top of the weights you need headroom for the context window (the KV cache), which grows with how many tokens you feed in. Budget at least a couple of extra gigabytes beyond the raw weight size, more for long contexts.
For most people on most hardware, start at Q4_K_M. It is the widely used sweet spot: a 4-bit model is small enough to run on modest machines while keeping quality close to the full-precision original. From there, adjust:
If you have RAM or VRAM to spare and want the last few percent of quality, step up to Q5_K_M.
If the model barely fits or generation is too slow, step down to Q4_K_S or a smaller parameter count.
Reach for Q8_0 only when you are comparing against full precision or doing work where small quality differences matter a lot.
The single biggest quality lever is usually parameter count, not the last bit of quantization. A Q4_K_M 8B model will typically beat a Q8_0 3B model. Spend your memory on more parameters first, then on more bits.
Numbers on paper are a starting estimate. Confirm the fit by loading the model and watching three things: does it load without paging weights to disk, does generation run at a usable speed, and does the output quality hold up on your own prompts. If the model loads but answers feel degraded, move up one quantization step before blaming the model itself. If it will not fit, move down a step or shrink the parameter count.
The most common mistake is chasing higher bit counts when the real problem is too few parameters. Eight bits of a tiny model is still a tiny model. Decide on parameter count first, then choose the largest quantization that fits.
The second pitfall is forgetting the context window. A long prompt or a big retrieval payload inflates the KV cache, and a model that fit comfortably with short prompts can run out of memory once you push the context. If you plan to feed in large documents, leave more headroom than the weight math alone suggests.
Finally, do not assume a fixed quality penalty for 4-bit across all model sizes. Larger models tend to tolerate aggressive quantization better than small ones, because they have more redundancy to spare. Test on your actual workload rather than trusting a single rule of thumb.
Adapt a small open model to your task using LoRA: prepare a clean instruction dataset, train lightweight adapters, and know when fine-tuning is the wrong tool entirely.
Discussion