The instinct is to reach for the largest model available and call it done. That instinct is expensive and often wrong. Model size is a trade-off across four things you care about: quality on your task, latency per request, memory footprint, and running cost. A bigger model wins on raw capability and loses on the other three. The right size is the smallest one that clears your quality bar, not the largest one you can run. Here is how to decide.

Start with the task, not the model

Tasks fall on a rough spectrum of how much capability they actually demand.

- Narrow, well-defined tasks like classification, extraction, formatting, and routing rarely need a large model. A small model, given clear instructions, frequently matches a large one here.
- Bounded reasoning like summarizing a document, rewriting in a fixed style, or answering from provided context sits in the middle. A mid-size model usually clears the bar.
- Open-ended reasoning like multi-step problem solving, nuanced judgement, or generating from a blank page is where large models earn their cost.

The failure mode is assuming every task is open-ended reasoning. Most production tasks are narrow or bounded, and running them on a large model means paying for capability they never use.

The four axes of the trade-off

Quality

Larger models are more capable in general, but the curve flattens. Once a mid-size model already does a task well, a bigger one adds little. The only way to know where your task sits on that curve is to evaluate candidates on your own data.
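One way to see the flattening is to score each candidate on the same data and look at the gain from each step up in size. The sketch below assumes you already have one score per size. The labels and numbers are made up for illustration, not measurements.

```python
def marginal_gains(scores):
    """Return the score gained at each step up in size.

    `scores` is a list of (size_label, score) pairs ordered from
    smallest to largest model.
    """
    gains = []
    for (prev_label, prev_score), (label, score) in zip(scores, scores[1:]):
        gains.append((prev_label, label, round(score - prev_score, 4)))
    return gains


# Hypothetical scores on one task, invented for this example.
scores = [("small", 0.81), ("mid", 0.93), ("large", 0.94)]

for smaller, larger, gain in marginal_gains(scores):
    print(f"{smaller} -> {larger}: {gain:+.2f}")
```

With these invented numbers the step from small to mid buys far more than the step from mid to large, which is the pattern to look for in your own results.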

Latency

Latency scales with size. A larger model generates tokens more slowly and takes longer to produce its first token. For an interactive feature where a person is waiting, a fast small model that is slightly less capable often delivers a better experience than a slow large one.
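A rough way to compare candidates is to add the time to first token to the time spent generating the rest of the response. This is a simplified model that assumes a steady generation rate. The function name and all figures below are hypothetical, so substitute numbers you measure yourself.

```python
def estimate_latency_s(time_to_first_token_s, tokens_per_second, output_tokens):
    """Estimate end-to-end response time for one request, in seconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token_s + output_tokens / tokens_per_second


# Hypothetical measurements for a 200-token reply.
small = estimate_latency_s(time_to_first_token_s=0.2, tokens_per_second=100, output_tokens=200)
large = estimate_latency_s(time_to_first_token_s=0.8, tokens_per_second=25, output_tokens=200)

print(f"small: {small:.1f}s, large: {large:.1f}s")
```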

Memory

Size sets the floor on memory. A model has to fit, with room for its working state, or it does not run at all on your hardware. This axis is binary: it fits or it does not.
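A back-of-the-envelope check for whether a model fits is to multiply parameter count by bytes per parameter and add headroom for working state. The sketch below assumes a flat 20% overhead, which is a placeholder: real working memory depends on context length, batch size, and the runtime. The parameter counts and hardware figure are also illustrative.

```python
def memory_check(params_billion, bytes_per_param, available_gb, overhead_fraction=0.2):
    """Return (required_gb, fits) for a model on given hardware.

    Uses 1 GB = 1e9 bytes, so billions of parameters times bytes per
    parameter gives weight size in GB directly. `overhead_fraction` is
    an assumed allowance for working state, not a measured value.
    """
    weights_gb = params_billion * bytes_per_param
    required_gb = weights_gb * (1 + overhead_fraction)
    return required_gb, required_gb <= available_gb


# Hypothetical models stored at 2 bytes per parameter, on 24 GB of memory.
for label, params in [("7B", 7), ("70B", 70)]:
    required, fits = memory_check(params, bytes_per_param=2, available_gb=24)
    print(f"{label}: needs about {required:.1f} GB, fits: {fits}")
```

Lowering `bytes_per_param` (for example through quantization) changes the estimate, so a model that fails this check at one precision may pass at another.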

Cost

Whether you pay per token through an API or in electricity and hardware locally, a larger model costs more per request. At low volume this is noise. At high volume it dominates everything else.
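The arithmetic behind "noise at low volume, dominant at high volume" is worth running with your own figures. The sketch below assumes per-token API pricing. The prices and volumes are invented for illustration and do not reflect any real provider.

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_million_tokens, days=30):
    """Estimate monthly spend under simple per-token pricing."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical prices per million tokens.
SMALL_PRICE = 0.50
LARGE_PRICE = 10.00

for volume in (100, 1_000_000):
    small = monthly_cost(volume, tokens_per_request=1000, price_per_million_tokens=SMALL_PRICE)
    large = monthly_cost(volume, tokens_per_request=1000, price_per_million_tokens=LARGE_PRICE)
    print(f"{volume:>9} requests/day: small ${small:,.2f}, large ${large:,.2f}")
```

The ratio between the two models is the same at both volumes. What changes is whether the absolute gap is a rounding error or a budget line.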

A decision framework

Work through these in order.

1. Write the task and its quality bar. State plainly what the model must do and how good is good enough. This is the anchor for every later choice.
2. Start one size down from your instinct. If you assumed large, evaluate mid first. If you assumed mid, try small. You are looking for the floor, so start low.
3. Evaluate on your own data. Run the candidate against a real gold set and score it against the bar. Vibes do not count here.
4. If it clears the bar, stop. You found a model that does the job for less money and less latency. Going bigger now buys nothing you need.
5. If it falls short, look at the failures before sizing up. Many gaps close with a better prompt, an example or two, or supplying context, none of which require a bigger model.
6. Only then size up. When the failures are genuinely about capability and not instructions, move up one size and re-evaluate.
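The evaluate-then-inspect part of these steps can be sketched as a small harness. This is a minimal illustration, not a full evaluation framework: it assumes exact-match scoring, treats a model as any function from prompt to text, and uses a toy stand-in (`small_model`) and a four-item gold set invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Report:
    score: float
    passed: bool
    # Each failure is (prompt, expected, got), kept for manual review.
    failures: list = field(default_factory=list)


def evaluate(model: Callable[[str], str], gold, bar: float) -> Report:
    """Score a candidate on a gold set and compare it to the quality bar."""
    if not gold:
        raise ValueError("gold set must not be empty")
    failures = []
    for prompt, expected in gold:
        got = model(prompt).strip()
        if got != expected:
            failures.append((prompt, expected, got))
    score = 1 - len(failures) / len(gold)
    return Report(score=score, passed=score >= bar, failures=failures)


# Toy stand-in for a real model call (hypothetical).
def small_model(prompt: str) -> str:
    return "positive" if "great" in prompt else "negative"


# Tiny illustrative gold set; a real one should come from your own data.
gold = [
    ("great product", "positive"),
    ("terrible service", "negative"),
    ("not great at all", "negative"),
    ("works fine", "positive"),
]

report = evaluate(small_model, gold, bar=0.75)
print(f"score={report.score:.2f} passed={report.passed}")
for prompt, expected, got in report.failures:
    print(f"  FAIL {prompt!r}: expected {expected}, got {got}")
```

Keeping the failures in the report is the point: they are what you read before deciding whether the gap is a prompt problem or a capability problem.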

When bigger is actually right

Reach for a large model when the task is genuinely open-ended, when errors are costly and correctness matters more than speed or price, or when you have evaluated smaller options and they measurably fail on your data. Those are real reasons. "It felt safer" is not.

When smaller wins

Prefer the smaller model when latency is user-facing, when volume is high enough that per-request cost matters, when you want to run locally on constrained hardware, or when the task is narrow and a small model already clears the bar. In all of these, the smaller model is not a compromise. It is the better engineering choice.

The recurring decision

Model sizing is not a one-time pick. Capability at a given size improves over time, so a task that needed a large model can later run on a smaller one. Keep your evaluation set, and revisit the size choice whenever a new option appears or your volume changes. The discipline is the same every time: find the smallest model that clears the bar, and resist paying for capability you do not use.

Choosing the right model size for your task
