Bigger is not automatically better. A decision framework for matching model size to the job, the latency budget, and the hardware you actually have.
A deep read — the full picture, with the receipts.
The instinct is to reach for the largest model available and call it done. That instinct is expensive and often wrong. Model size is a trade-off across four things you care about: quality on your task, latency per request, memory footprint, and running cost. A bigger model wins on raw capability and loses on the other three. The right size is the smallest one that clears your quality bar, not the largest one you can run. Here is how to decide.
Tasks fall on a rough spectrum of how much capability they actually demand.
The failure mode is assuming every task is open-ended reasoning. Most production tasks are narrow or bounded, and running them on the largest model means paying for capability they never use.
Larger models are more capable in general, but the curve flattens. Past a certain size, a task that a mid-sized model already handles well gains little from a bigger one. The only way to know where your task sits on that curve is to evaluate candidates on your own data.
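To make "evaluate candidates on your own data" concrete, here is a minimal sketch of a comparison harness. Everything in it is an assumption for illustration: the two "models" are keyword stubs standing in for real model calls, the four examples stand in for your evaluation set, and exact-match accuracy is only one possible metric.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the prediction exactly matches the expected output."""
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)


def compare(candidates: Dict[str, Callable[[str], str]],
            examples: Sequence[Example]) -> Dict[str, float]:
    """Score every candidate on the same evaluation set."""
    return {name: accuracy(fn, examples) for name, fn in candidates.items()}


# Hypothetical stand-ins for real model calls.
def small_stub(text: str) -> str:
    return "positive" if "good" in text else "negative"


def large_stub(text: str) -> str:
    return "negative" if ("not" in text or "bad" in text) else "positive"


# Hypothetical evaluation set; in practice this is a sample of your own traffic.
EVAL_SET = [
    ("good service", "positive"),
    ("bad service", "negative"),
    ("not good at all", "negative"),
    ("good value", "positive"),
]

scores = compare({"small": small_stub, "large": large_stub}, EVAL_SET)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The point of the shape is that every candidate sees the same examples and the same metric, so the only thing that varies is the model.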
Latency scales with size. On the same hardware, a larger model generates tokens more slowly and takes longer to return its first token. For an interactive feature where a person is waiting, a fast small model that is slightly less capable often delivers a better experience than a slow large one.
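A rough way to reason about this is to split response time into time to first token plus generation time. The sketch below assumes that simple two-term model, and the numbers in it are made up for illustration rather than measured from any real model.

```python
def response_time_s(time_to_first_token_s: float,
                    output_tokens: int,
                    tokens_per_second: float) -> float:
    """Estimated wall-clock time for one response: first-token wait plus generation."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token_s + output_tokens / tokens_per_second


# Hypothetical figures for a 300-token answer; measure your own before deciding.
small = response_time_s(time_to_first_token_s=0.2, output_tokens=300, tokens_per_second=150)
large = response_time_s(time_to_first_token_s=0.8, output_tokens=300, tokens_per_second=40)

print(f"small: {small:.1f} s, large: {large:.1f} s")
```

Under these assumed numbers the gap is several seconds per response, which is the kind of difference a waiting user notices.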
Size sets the floor on memory. A model has to fit, with room for its working state, or it does not run at all on your hardware. This axis is binary: it fits or it does not.
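The fit check can be approximated with back-of-envelope arithmetic: parameter count times bytes per parameter, plus an allowance for working state. In the sketch below, the 2 bytes per parameter corresponds to 16-bit weights, while the 20% overhead allowance is purely an assumption; real working state depends on context length and batch size, so treat the result as a first filter, not a guarantee.

```python
def required_memory_gb(params_billions: float,
                       bytes_per_param: float = 2.0,
                       overhead_fraction: float = 0.2) -> float:
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    overhead_fraction is an assumed allowance for working state, not a measured value.
    """
    # billions of params * bytes per param = GB of weights
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1.0 + overhead_fraction)


def fits(params_billions: float, available_gb: float,
         bytes_per_param: float = 2.0, overhead_fraction: float = 0.2) -> bool:
    """True if the estimated footprint is within the available memory."""
    return required_memory_gb(params_billions, bytes_per_param, overhead_fraction) <= available_gb


# A hypothetical 7-billion-parameter model with 16-bit weights.
needed = required_memory_gb(7)
print(f"estimated footprint: {needed:.1f} GB")
print("fits in 16 GB:", fits(7, available_gb=16))
print("fits in 24 GB:", fits(7, available_gb=24))
```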
Whether you pay per token through an API or in electricity and hardware locally, a larger model costs more per request. At low volume this is noise. At high volume it dominates everything else.
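The volume effect is easy to see with arithmetic. This sketch assumes per-token API pricing quoted per million tokens; the prices and token counts are hypothetical placeholders, not any provider's actual rates.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_million: float,
                     price_out_per_million: float) -> float:
    """Cost of one request given prices quoted per million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


def monthly_cost(per_request: float, requests_per_day: int, days: int = 30) -> float:
    """Total cost over a month at a steady daily volume."""
    return per_request * requests_per_day * days


# Hypothetical prices (currency units per million tokens) and request shape.
small_req = cost_per_request(1000, 300, price_in_per_million=0.25, price_out_per_million=1.25)
large_req = cost_per_request(1000, 300, price_in_per_million=3.00, price_out_per_million=15.00)

for volume in (100, 1_000_000):
    small_month = monthly_cost(small_req, volume)
    large_month = monthly_cost(large_req, volume)
    print(f"{volume:>9} req/day  small: {small_month:,.2f}  large: {large_month:,.2f}")
```

With these assumed prices the ratio between the two models is constant, but the absolute gap grows from pocket change at 100 requests a day to a budget line at a million.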
Work through these in order.
Reach for a large model when the task is genuinely open-ended, when errors are costly and correctness matters more than speed or price, or when you have evaluated smaller options and they measurably fail on your data. Those are real reasons. "It felt safer" is not.
Prefer the smaller model when latency is user-facing, when volume is high enough that per-request cost matters, when you want to run locally on constrained hardware, or when the task is narrow and a small model already clears the bar. In all of these, the smaller model is not a compromise. It is the better engineering choice.
Model sizing is not a one-time pick. Capability at a given size improves over time, so a task that needed a large model can later run on a smaller one. Keep your evaluation set, and revisit the size choice whenever a new option appears or your volume changes. The discipline is the same every time: find the smallest model that clears the bar, and resist paying for capability you do not use.
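That discipline can be written down as a small selection routine you rerun whenever the candidate list or your constraints change. The sketch below is one possible shape, not a prescribed method: the `Candidate` fields, the model names, and all the numbers are hypothetical, and the scores are assumed to come from your own evaluation set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    params_billions: float
    eval_score: float      # score on your own evaluation set
    p95_latency_s: float   # measured on your hardware or API
    memory_gb: float       # estimated or measured footprint


def smallest_that_clears(candidates: Sequence[Candidate],
                         quality_bar: float,
                         latency_budget_s: float,
                         memory_limit_gb: float) -> Optional[Candidate]:
    """Return the smallest candidate meeting every constraint, or None if none does."""
    for c in sorted(candidates, key=lambda c: c.params_billions):
        if (c.eval_score >= quality_bar
                and c.p95_latency_s <= latency_budget_s
                and c.memory_gb <= memory_limit_gb):
            return c
    return None


# Hypothetical candidates with made-up measurements.
CANDIDATES = [
    Candidate("large", 70, eval_score=0.95, p95_latency_s=3.5, memory_gb=168),
    Candidate("small", 3, eval_score=0.88, p95_latency_s=0.4, memory_gb=8),
    Candidate("mid", 14, eval_score=0.93, p95_latency_s=1.1, memory_gb=34),
]

choice = smallest_that_clears(CANDIDATES, quality_bar=0.90,
                              latency_budget_s=2.0, memory_limit_gb=48)
print(choice.name if choice else "no candidate clears the bar")
```

Because the search runs from smallest to largest, a newly released small model that clears the bar displaces the incumbent automatically the next time the routine runs.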

Most production AI tasks are routine, and a new class of small models handles them at a fraction of the cost. The frontier models are becoming the exception, not the default.
Adil R. · Jun 13, 2026 · 3 min read