
Running models on your own machine is a memory problem first and a thermal problem second. Here is how to read a spec sheet for local inference instead of generic performance.
A deep read — the full picture, with the receipts.

Running AI models on your own laptop is a different buying problem from gaming or video editing, even though the parts overlap. The constraint that decides whether a model runs at all is memory, specifically the memory your accelerator can reach. Everything else affects how pleasant the experience is, not whether it happens.
This guide is for people who want to run models locally for privacy, offline use, cost, or experimentation. If that is you, read the spec sheet in a specific order.
A local model has to fit in memory to run well. The size of model you can load is set by how much memory your accelerator can use, so this is the first and most important number.
There are two architectures to understand. On machines with a discrete GPU, the model wants to live in the GPU's dedicated video memory, and that pool is usually smaller and separate from system RAM. On machines with a unified-memory design, the CPU and GPU share one large pool, so the accelerator can reach far more memory than a typical discrete GPU offers. For local inference, a large unified-memory pool often beats a faster discrete GPU with a small dedicated pool, because a model that does not fit has to spill into slower memory and crawls. Decide which architecture you are buying into before you compare anything else.
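As a rough way to check the "does it fit" question, the sketch below estimates a model's footprint from its parameter count and the bits stored per weight. The function names, the 20% overhead allowance for context and runtime buffers, and the example memory figures are all assumptions made for illustration, not measurements of any real machine.

```python
# Back-of-the-envelope fit check. All defaults here are illustrative assumptions.

def model_footprint_gb(params_billions, bits_per_weight, overhead_fraction=0.2):
    """Estimate memory needed: weights plus an assumed allowance for context and buffers."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


def fits(params_billions, bits_per_weight, usable_memory_gb, overhead_fraction=0.2):
    """True if the estimated footprint is within the memory the accelerator can reach."""
    needed = model_footprint_gb(params_billions, bits_per_weight, overhead_fraction)
    return needed <= usable_memory_gb


# Hypothetical comparison: a 70-billion-parameter model stored at 4 bits per weight.
small_dedicated_pool_gb = 8    # assumed discrete GPU video memory
large_unified_pool_gb = 48     # assumed share of a unified pool the GPU may use

print(round(model_footprint_gb(70, 4), 1))      # estimated GB needed
print(fits(70, 4, small_dedicated_pool_gb))     # small dedicated pool
print(fits(70, 4, large_unified_pool_gb))       # large unified pool
```

The real overhead depends on context length and the runtime you use, so treat the result as a first filter rather than a guarantee.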

Once a model fits, how fast it generates text is governed largely by memory bandwidth, not raw compute. Producing each token means reading most of the model's weights out of memory, so the rate at which the machine can move that data is the practical speed limit.
This is why two machines that both fit the same model can feel very different. The one with higher bandwidth produces tokens faster. When you compare options that both clear the memory-capacity bar, let bandwidth break the tie.
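The bandwidth argument can be turned into a simple ceiling: if every token requires one pass over the weights, tokens per second cannot exceed bandwidth divided by the size of the weights. The sketch below assumes exactly that simplified model; the function name and the bandwidth figures are hypothetical, and real throughput will land below the ceiling.

```python
# Upper bound on generation speed, assuming one full read of the weights per token.

def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    """Bandwidth-limited ceiling on tokens per second (a simplification, not a benchmark)."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Two hypothetical machines that both fit the same 4 GB model.
weights_gb = 4
for name, bandwidth in [("machine A", 100), ("machine B", 400)]:
    print(name, max_tokens_per_second(bandwidth, weights_gb), "tokens/s at most")
```

Under these assumptions the machine with four times the bandwidth has four times the ceiling, which is why bandwidth is a sensible tie-breaker once capacity is settled.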
Local inference is a sustained load, not a quick burst. A thin laptop can post strong numbers for a minute, then throttle as it heats up. For anything beyond short prompts, the cooling system is part of the performance story.
Look for a chassis with real cooling headroom if you plan long sessions. A slightly thicker machine that holds its clocks under load will outperform a thinner one that throttles, even if their peak figures look similar. Battery life under this kind of load is also poor across the board, so assume you will run plugged in for serious work.
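To see why sustained speed matters more than the peak, the sketch below averages throughput over a session using a deliberately crude model: the machine runs at its peak rate until it throttles, then at a lower rate for the rest of the session. The step-change model, the function name, and every number are assumptions for illustration only.

```python
# Average speed over a long session under a simple "peak, then throttled" model.

def average_tokens_per_second(peak_tps, throttled_tps, seconds_before_throttle, session_seconds):
    """Mean tokens per second across the session, assuming one abrupt drop in speed."""
    peak_time = min(seconds_before_throttle, session_seconds)
    throttled_time = session_seconds - peak_time
    total_tokens = peak_tps * peak_time + throttled_tps * throttled_time
    return total_tokens / session_seconds


session = 30 * 60  # a 30-minute session, in seconds

# Hypothetical thin laptop: higher peak, throttles hard after a minute.
thin = average_tokens_per_second(30, 15, 60, session)
# Hypothetical thicker laptop: slightly lower peak, holds close to it.
thick = average_tokens_per_second(28, 26, 600, session)

print(round(thin, 1), round(thick, 1))
```

With these made-up figures the thicker machine finishes the session well ahead even though its peak number is lower.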
Model files are large, so storage fills quickly; size up if you intend to keep several models on hand. The CPU matters for loading and for any work that is not offloaded to the accelerator, but it is rarely the bottleneck for inference itself. Get something competent and move on.
The honest summary: for local AI, ask first whether the model fits, then how fast it runs, then whether that speed survives a long session. That order is memory capacity, memory bandwidth, and cooling. Get those three right and a modest machine will run real models comfortably. Get them wrong and the most expensive laptop will stutter on the thing you bought it for.
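The ordering above can be written down as a small shortlist routine: drop every machine the model does not fit on, then rank the rest by the bandwidth they can sustain. This is only one way to encode the advice. The `Laptop` fields, the idea of multiplying bandwidth by a sustained fraction, and the candidate figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Laptop:
    name: str
    usable_memory_gb: float    # memory the accelerator can reach
    bandwidth_gb_s: float      # memory bandwidth on the spec sheet
    sustained_fraction: float  # assumed share of peak speed held under long load


def shortlist(laptops, model_gb):
    """Keep machines the model fits on, ranked by sustained bandwidth (highest first)."""
    fitting = [m for m in laptops if m.usable_memory_gb >= model_gb]
    return sorted(
        fitting,
        key=lambda m: m.bandwidth_gb_s * m.sustained_fraction,
        reverse=True,
    )


# Hypothetical candidates, not real products.
candidates = [
    Laptop("thin-discrete", 8, 500, 0.6),
    Laptop("thick-unified", 48, 400, 0.9),
    Laptop("thin-unified", 24, 300, 0.7),
]

ranked = [m.name for m in shortlist(candidates, model_gb=20)]
print(ranked)
```

Note that the machine with the highest headline bandwidth never makes the list here, because the model does not fit in its memory in the first place.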

Adil R. · Jun 1, 2026 · 4 min read