Choosing hardware for local AI: CPU, GPU, or unified memory

Muniba K. | May 17, 2026 | 4 min

The hardware question for running models locally comes down to memory and bandwidth more than raw compute. A framework for picking the right path.

Why memory comes first

Two numbers decide most of the outcome.

Capacity is whether the model fits. If the weights plus working state exceed available memory, the model does not run, or it spills to slower storage and crawls. This is a hard wall.

Bandwidth is how fast memory feeds the processor. Each generated token requires reading through the model's weights again, so the speed of that memory path, more than the processor's compute, is usually the ceiling on tokens per second.

Keep both in mind as you read the three options, because each one trades them differently.
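As a rough illustration of the bandwidth ceiling, here is a minimal sketch. It assumes that every generated token reads all of the weights exactly once and that nothing else limits speed, which is a simplification. The function name and the numbers are hypothetical placeholders, not measurements of any real device.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on generation speed, assuming each token reads all weights once."""
    if model_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    # Time per token is model_gb / bandwidth, so the rate is the inverse.
    return bandwidth_gb_per_s / model_gb


# Placeholder figures chosen only to show the proportionality.
for label, bandwidth in [("slower memory path", 50.0), ("faster memory path", 500.0)]:
    ceiling = max_tokens_per_second(10.0, bandwidth)
    print(f"{label}: at most {ceiling:.0f} tokens/s for a 10 GB model")
```

Under this assumption the ceiling scales linearly with bandwidth and inversely with model size, which is why the same model can feel very different on two machines with similar compute.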

The three paths

CPU and system memory

Every machine can do this, and system memory is cheap to expand, so capacity is rarely the problem. The catch is bandwidth: ordinary system memory has far less of it than dedicated graphics memory, so generation is slow. CPU inference is a reasonable fit for small models, for batch jobs where you do not mind waiting, or for getting started before committing money. It is a poor fit for anything interactive at larger sizes.

Discrete GPU

A dedicated GPU pairs strong compute with high-bandwidth graphics memory, which is why it is the default for serious local AI. The decisive number is VRAM. The model has to fit in the card's memory, and that capacity, not the compute, is usually what caps the size you can run. The trade-off is that VRAM is comparatively expensive and fixed once you buy the card. You can sometimes split a model across memory tiers, but the moment significant weights live outside VRAM, speed drops sharply. Buy for the memory you need, not the headline compute number.
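To see why speed drops sharply once weights live outside VRAM, consider a simplified model in which each token reads the fast-tier portion and then the slow-tier portion, one after the other. This is a sketch under that assumption only. It ignores transfer overhead and any overlap between tiers, and the function name and bandwidth figures are hypothetical.

```python
def tokens_per_second_split(model_gb: float, fast_fraction: float,
                            fast_gb_per_s: float, slow_gb_per_s: float) -> float:
    """Estimated rate when a fraction of the weights sits in the fast tier."""
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    # Per-token time is the sum of the time spent reading each tier.
    fast_time = model_gb * fast_fraction / fast_gb_per_s
    slow_time = model_gb * (1.0 - fast_fraction) / slow_gb_per_s
    return 1.0 / (fast_time + slow_time)


# Placeholder numbers: a 20 GB model, fast tier 500 GB/s, slow tier 50 GB/s.
for fraction in (1.0, 0.75, 0.5):
    rate = tokens_per_second_split(20.0, fraction, 500.0, 50.0)
    print(f"{fraction:.0%} of weights in fast memory: about {rate:.1f} tokens/s")
```

With these placeholder numbers, moving a quarter of the weights to the slow tier costs far more than a quarter of the speed, because the slow tier dominates the per-token time.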

Unified memory

Unified-memory systems share one large, reasonably fast memory pool between the CPU and an integrated GPU. The advantage is capacity: because the pool is large, these systems can hold bigger models than a typical discrete card's VRAM, without splitting across tiers. The trade-off is bandwidth that usually sits between system memory and dedicated graphics memory, so peak generation speed is good but not class-leading. Memory is also generally fixed at purchase, so you size it up front. This path shines when you want to run larger models on one machine without buying a high-VRAM card.

A decision framework

Find your model's memory footprint. Before choosing hardware, know roughly how much memory your target model needs, including working state. This single number drives the rest.

Filter by capacity first. Eliminate any option that cannot hold the model. A faster device that does not fit the model is useless for it.

Then rank by bandwidth for your usage. If responses are user-facing, weight bandwidth heavily, since it sets generation speed. If you run batch jobs and can wait, weight it less.

Match the path to the pattern. Small models or patient batch work: CPU is fine and cheapest. Interactive use at common sizes: a discrete GPU with enough VRAM. Larger models on one box without a high-VRAM card: unified memory.

Size memory for tomorrow, not just today. GPU VRAM and unified memory are fixed at purchase. Buying tight on memory is the regret people report most, because you cannot add more later.
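The framework above can be sketched as a filter followed by a sort. This is one possible reading, not a definitive procedure. The `Option` fields, the relative `cost` value, and all the numbers are hypothetical placeholders, and ranking batch work by cost first is an assumption drawn from the "cheapest" remark above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    memory_gb: float           # fast memory available to the model
    bandwidth_gb_per_s: float  # speed of that memory path
    cost: float                # relative cost, hypothetical units


def choose(options, footprint_gb, interactive):
    """Drop options that cannot hold the model, then rank the rest."""
    fitting = [o for o in options if o.memory_gb >= footprint_gb]
    if interactive:
        # User-facing work: bandwidth first, cost as the tie-breaker.
        return sorted(fitting, key=lambda o: (-o.bandwidth_gb_per_s, o.cost))
    # Patient batch work: cost first, bandwidth as the tie-breaker.
    return sorted(fitting, key=lambda o: (o.cost, -o.bandwidth_gb_per_s))


# Placeholder figures, not real products.
options = [
    Option("cpu", memory_gb=64, bandwidth_gb_per_s=50, cost=1),
    Option("discrete-gpu", memory_gb=16, bandwidth_gb_per_s=500, cost=3),
    Option("unified", memory_gb=96, bandwidth_gb_per_s=250, cost=4),
]

print([o.name for o in choose(options, footprint_gb=12, interactive=True)])
print([o.name for o in choose(options, footprint_gb=40, interactive=True)])
print([o.name for o in choose(options, footprint_gb=12, interactive=False)])
```

Filtering before ranking keeps the capacity wall absolute: an option that cannot hold the model never appears in the ranking, however fast it is.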

Common mistakes

Optimizing for compute numbers. The headline teraflops rarely decide local AI outcomes. Capacity and bandwidth do.

Buying tight on memory. A card or system that just barely fits today's model leaves no room for a slightly larger one tomorrow, and you cannot expand it.

Ignoring the spill cliff. Splitting a model so part of it sits outside fast memory does not degrade gracefully. It falls off a cliff. Plan to fit the model in fast memory.

Forgetting working state. The weights are not the whole footprint. Leave headroom for the runtime's working memory or you will hit out-of-memory at the worst time.
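The headroom point can be made concrete with a back-of-the-envelope check. This sketch assumes the weights take parameter count times bits per weight, and it treats working state as a figure you supply yourself, since the real amount depends on the runtime and workload. The function names and example values are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB taken as 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         working_state_gb: float, memory_gb: float) -> bool:
    """True if the weights plus working state fit in the given memory."""
    total = weights_gb(params_billions, bits_per_weight) + working_state_gb
    return total <= memory_gb


# Placeholder example: 8 billion parameters, 2 GB assumed for working state.
print(weights_gb(8, 16))                                     # weights alone at 16 bits
print(fits(8, 16, working_state_gb=2.0, memory_gb=16.0))     # weights fit, total does not
print(fits(8, 4, working_state_gb=2.0, memory_gb=16.0))      # smaller weights leave room
```

The middle case is the trap described above: the weights alone match the memory size, but the total does not fit once working state is counted.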

The honest summary: decide by memory, not by compute. Confirm the model fits, weight bandwidth by how interactive your use is, and buy more memory than today's model strictly needs. Do that and the hardware stops being the thing that limits what you can run.
