
Install Ollama, pull a model, and chat with it offline in about ten minutes. No cloud account, no API key, and nothing leaves your laptop.
Step-by-step — built to follow along.
By the end of this tutorial you will have a working large language model running entirely on your own computer, reachable from both a terminal and a local HTTP endpoint. You need a machine with at least 8 GB of RAM (16 GB is more comfortable), a few gigabytes of free disk space, and a working internet connection for the one-time model download. A GPU helps but is not required. Ollama wraps the llama.cpp inference engine, so everything here runs on CPU if you have no graphics card.
Ollama ships installers for macOS, Windows, and Linux. On macOS and Windows you download the app and run it. On Linux the one-line script is the fastest path:
curl -fsSL https://ollama.com/install.sh | sh
When it finishes, confirm the binary is on your PATH:
ollama --version
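If that prints a version string, the binary is installed. The installer normally starts the background service as well, listening on localhost. If the command also warns that it could not connect to a running Ollama instance, start the service with ollama serve (or launch the desktop app) before continuing.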
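To check the server itself from code rather than from the CLI, a minimal sketch like the following may help. It assumes the default address http://localhost:11434 used later in this tutorial; the function name ollama_is_up is made up for illustration.

```python
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat all as "not reachable".
        return False
```

Calling ollama_is_up() before any other request gives a script a clear early failure instead of a stack trace halfway through a job.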
Models live in a registry, much like container images. Pulling one downloads the weights to your disk once, then caches them. Start with a small instruction-tuned model so the download stays modest:
ollama run llama3.2
The first invocation downloads the weights, then drops you into an interactive prompt. Type a question and you get a reply generated locally. Wrap multi-line input in triple quotes ("""), and type /bye to exit. The model stays cached, so the next ollama run starts almost instantly.
Pick a model whose size fits your RAM. A rough rule: budget about 1 GB of free memory per billion parameters for a 4-bit quantized model. The weights alone take about half a byte per parameter; the rest is headroom for the context cache and runtime overhead. A 3B model wants about 3 to 4 GB free; an 8B model wants closer to 8 GB. If your machine swaps hard or the reply crawls out one token at a time, drop to a smaller model.
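The rule of thumb above can be turned into a small helper. This is only a sketch of that rough rule, not a measurement: the CANDIDATES table is an illustrative list with nominal parameter counts, and the function names are invented for this example.

```python
from typing import Dict, Optional

# Nominal parameter counts in billions. Illustrative list, not a registry query.
CANDIDATES: Dict[str, float] = {
    "llama3.2:1b": 1.0,
    "llama3.2": 3.0,
    "llama3.1:8b": 8.0,
}


def budget_gb(params_billions: float) -> float:
    """Tutorial rule of thumb for 4-bit models: about 1 GB per billion parameters."""
    return params_billions


def largest_fit(free_gb: float,
                candidates: Dict[str, float] = CANDIDATES) -> Optional[str]:
    """Return the largest candidate whose budget fits in free_gb, or None."""
    fitting = {name: p for name, p in candidates.items() if budget_gb(p) <= free_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

With 6 GB free, largest_fit(6.0) picks the 3B model. Treat the answer as a starting point and watch real memory use on the first run.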
Over time you will collect several models. These commands keep the set tidy:
ollama list
ollama rm llama3.2
list shows each model with its on-disk size so you can reclaim space, and rm deletes weights you no longer want.
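The same inventory is available over HTTP from the local server's /api/tags endpoint, which returns a JSON object with a models array. The sketch below sorts that array by size so the largest candidates for removal come first. The helper names and the one-gigabyte-equals-1e9-bytes conversion are choices made for this example.

```python
import json
import urllib.request
from typing import List, Tuple


def fetch_models(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the installed-model listing from a running local server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def by_size(tags: dict) -> List[Tuple[str, float]]:
    """Return (name, size in GB) pairs, largest first."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A script could call by_size(fetch_models()) and print the first few rows to see where the disk space went.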
The interactive prompt is convenient, but the real power is the local server Ollama runs on port 11434. Any program on your machine can hit it. A plain curl request looks like this:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain a hash map in two sentences.",
  "stream": false
}'
The response is a single JSON object with the generated text in its response field. Streaming is the default: leave out "stream": false (or set it to true) and the server returns a series of JSON objects, one per line, each carrying the next fragment of text, which is what you want for a chat UI. Because the endpoint speaks a stable JSON shape, you can wire it into scripts, editor plugins, or a small web app without any external dependency.
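A script can consume the streaming form with nothing but the standard library. In streaming mode the server sends one JSON object per line, each with a response fragment, and marks the last one with "done": true. The sketch below assumes the default URL from this tutorial; build_request, iter_text, and generate are names made up for the example.

```python
import json
import urllib.request
from typing import Iterable, Iterator

GENERATE_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build a POST request for the generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        GENERATE_URL, data=body, headers={"Content-Type": "application/json"}
    )


def iter_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each streamed JSON line until done is true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def generate(model: str, prompt: str) -> str:
    """Send a prompt to a running local server and return the full reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return "".join(iter_text(resp))
```

Keeping iter_text separate from the network call means a chat UI can print fragments as they arrive, while a batch script can simply join them.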
Loading weights into memory takes a few seconds. If you are calling the API repeatedly, that cold start adds up. Send an empty prompt to load the model and keep it resident:
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": "30m"}'
The keep_alive field tells Ollama how long to hold the model in memory after the last request; the default is five minutes. Raise it for an interactive session, lower it to free memory between bursts.
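A long-running script can do the same warm-up before its first real request. This is a minimal sketch under the same assumptions as the curl call (default port, llama3.2 already pulled). The function names are hypothetical, and "stream": false is added here only so the reply comes back as a single JSON object.

```python
import json
import urllib.request


def preload_request(model: str, keep_alive: str = "30m",
                    base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a prompt-less generate request that only loads the model."""
    body = json.dumps(
        {"model": model, "keep_alive": keep_alive, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )


def preload(model: str, keep_alive: str = "30m") -> bool:
    """Load the model on a running local server; True if the server reports done."""
    with urllib.request.urlopen(preload_request(model, keep_alive), timeout=120) as resp:
        return bool(json.load(resp).get("done", False))
```

Calling preload("llama3.2") at startup moves the cold start out of the first user-facing request.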
The most common failure is memory pressure. If a model is larger than your free RAM, the OS pages weights to disk and generation slows to a near halt, or the process is killed outright. Watch system memory and step down a model size before you blame the tool.
The second trap is expecting a small local model to match a frontier cloud model. A 3B model is genuinely useful for drafting, summarizing, and classification, but it hallucinates more and reasons less reliably than the largest hosted systems. Match the model to the job.
Finally, a quantized model trades some quality for size. If answers feel noticeably worse than you expected, try a less aggressive quantization or a larger parameter count before concluding the model is bad. The next tutorial in this cluster covers exactly how to read those quantization labels.

BitByteCore Research · May 19, 2026 · 3 min read