
The hardware question for running models locally comes down to memory and bandwidth more than raw compute. A framework for picking the right path.
A deep read — the full picture, with the receipts.

The hardware question for running models locally comes down to memory and bandwidth more than raw compute. A framework for picking the right path.
A deep read — the full picture, with the receipts.
When people ask what hardware they need to run models locally, they usually ask about compute speed. That is the wrong first question. For running models, the binding constraints are memory capacity and memory bandwidth. A model has to fit in memory to run at all, and how fast that memory feeds the processor largely sets how fast it generates. Compute matters, but it is rarely what stops you. Here is how to think about the three paths: CPU, discrete GPU, and unified memory.
Two numbers decide most of the outcome.
Keep both in mind as you read the three options, because each one trades them differently.

Every machine can do this, and system memory is cheap to expand, so capacity is rarely the problem. The catch is bandwidth: ordinary system memory has far less of it than dedicated graphics memory, so generation is slow. CPU inference is a reasonable fit for small models, for batch jobs where you do not mind waiting, or for getting started before committing money. It is a poor fit for anything interactive at larger sizes.
A dedicated GPU pairs strong compute with high-bandwidth graphics memory, which is why it is the default for serious local AI. The decisive number is VRAM. The model has to fit in the card's memory, and that capacity, not the compute, is usually what caps the size you can run. The trade-off is that VRAM is comparatively expensive and fixed once you buy the card. You can sometimes split a model across memory tiers, but the moment significant weights live outside VRAM, speed drops sharply. Buy for the memory you need, not the headline compute number.
Unified-memory systems share one large, reasonably fast memory pool between CPU and integrated GPU. The advantage is capacity: because the pool is large, these systems can hold bigger models than a typical discrete card's VRAM, without splitting across tiers. The trade-off is bandwidth that usually sits between system memory and dedicated graphics memory, so peak generation speed is good but not class-leading. Memory is also generally fixed at purchase, so you size it up front. This path shines when you want to run larger models on one machine without buying a high-VRAM card.
The honest summary: decide by memory, not by compute. Confirm the model fits, weight bandwidth by how interactive your use is, and buy more memory than today's model strictly needs. Do that and the hardware stops being the thing that limits what you can run.

Putting the CPU, GPU, and memory close together reshaped how modern chips perform. It also made old spec comparisons unreliable.
BitByteCore Research · Jun 2, 2026 · 3 min read
Discussion