By the end of this tutorial you will have a working large language model running entirely on your own computer, reachable from both a terminal and a local HTTP endpoint. You need a machine with at least 8 GB of RAM (16 GB is more comfortable), a few gigabytes of free disk space, and a working internet connection for the one-time model download. A GPU helps but is not required. Ollama wraps the llama.cpp inference engine, so everything here runs on CPU if you have no graphics card.

Step 1: Install Ollama

Ollama ships installers for macOS, Windows, and Linux. On macOS and Windows you download the app and run it. On Linux the one-line script is the fastest path:

curl -fsSL https://ollama.com/install.sh | sh

When it finishes, confirm the binary is on your PATH:

ollama --version

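If that prints a version string, the install worked. On macOS and Windows the desktop app starts the background service for you, and on Linux the install script normally registers it as a systemd service, so it should already be listening on localhost. If the command also warns that it could not connect to a running Ollama instance, start the server yourself with ollama serve.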
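To confirm from code that the service is answering on localhost, a small probe is enough. The sketch below assumes the default address http://localhost:11434 and only checks for an HTTP 200 on the root path. The function name server_is_up is made up for this example.

```python
import urllib.request


def server_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers 200 at base_url, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failures, HTTP errors and timeouts
        # are all OSError subclasses in urllib.
        return False


if __name__ == "__main__":
    if server_is_up():
        print("Ollama server is reachable")
    else:
        print("No server on localhost:11434; try running `ollama serve`")
```

A False result usually means the service is not running yet, not that the install failed.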

Step 2: Pull and run a model

Models live in a registry, much like container images. Pulling one downloads the weights to your disk once, then caches them. Start with a small instruction-tuned model so the download stays modest:

ollama run llama3.2

The first invocation downloads the weights, then drops you into an interactive prompt. Type a question and you get a reply generated locally. Wrap multi-line input in triple quotes (""") and type /bye to exit. The weights stay cached on disk, so the next ollama run skips the download and only has to load them into memory.

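Pick a model whose size fits your RAM. A rough rule: a 4-bit quantized model needs about as many gigabytes of memory as it has billions of parameters. The weights alone take roughly half that at 4 bits; the rest is headroom for the context cache and everything else running on the machine. A 3B model wants about 3 to 4 GB free; an 8B model wants closer to 8 GB. If your machine swaps hard or the reply crawls out one token at a time, drop to a smaller model.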
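The sizing rule above is simple enough to write down as arithmetic. This is only a sketch of that rule of thumb, not a measurement: the function names and the list of candidate sizes are hypothetical, chosen for illustration.

```python
def estimated_ram_gb(params_billions):
    """Rule of thumb from the text: ~1 GB of RAM per billion parameters at 4-bit."""
    return params_billions * 1.0


def largest_fitting(free_ram_gb, sizes_billions=(1, 3, 8, 14, 70)):
    """Return the largest candidate size (in billions of parameters) that fits.

    sizes_billions is an illustrative list, not a catalogue of real models.
    Returns None when even the smallest candidate does not fit.
    """
    fitting = [s for s in sizes_billions if estimated_ram_gb(s) <= free_ram_gb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for free in (4, 8, 16):
        print(f"{free} GB free -> up to {largest_fitting(free)}B parameters")
```

Treat the answer as an upper bound; long prompts and other running applications eat into the same memory.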

Step 3: List and manage what you have

Over time you will collect several models. These commands keep the set tidy:

ollama list
ollama rm llama3.2

list shows each model with its on-disk size so you can reclaim space, and rm deletes weights you no longer want.
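The same inventory is available over HTTP, which is handy in scripts. The sketch below assumes the server is on the default port and that its /api/tags endpoint returns a models array whose entries carry name and size (in bytes) fields. The helper names are made up for this example.

```python
import json
import urllib.request


def format_models(tags_payload):
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    models = tags_payload.get("models", [])
    rows = [(m["name"], m["size"] / 1e9) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally stored models from the running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        for name, gb in format_models(fetch_tags()):
            print(f"{name:30s} {gb:6.1f} GB")
    except OSError as exc:
        print(f"Could not reach the server: {exc}")
```

Sorting largest first puts the best candidates for ollama rm at the top.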

Step 4: Call the local HTTP API

The interactive prompt is convenient, but the real power is the local server Ollama runs on port 11434. Any program on your machine can hit it. A plain curl request looks like this:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain a hash map in two sentences.",
  "stream": false
}'

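With "stream": false, the response is a single JSON object with the generated text in its response field. Setting "stream": true (also the default if you omit the field) instead returns a sequence of newline-delimited JSON objects, each carrying the next fragment of text, with done set to true on the last one. That is what you want for a chat UI. Because the endpoint speaks a stable JSON shape, you can wire it into scripts, editor plugins, or a small web app without any external dependency.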
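The same call from a script needs nothing beyond the standard library. This is a minimal sketch, assuming the default port and the llama3.2 model from Step 2. The helper names (build_request, generate, join_stream, generate_streaming) are made up for this example and are not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, stream=False):
    """Build the POST request that the curl example sends."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt):
    """Non-streaming call: one JSON object back, text in its 'response' field."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.load(resp)["response"]


def join_stream(lines):
    """Assemble streamed output: one JSON object per line, stop when done is true."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate_streaming(model, prompt):
    """Streaming call: iterating the HTTP response yields one line per chunk."""
    request = build_request(model, prompt, stream=True)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return join_stream(resp)


if __name__ == "__main__":
    try:
        print(generate("llama3.2", "Explain a hash map in two sentences."))
    except OSError as exc:
        print(f"Could not reach the server: {exc}")
```

A real chat UI would print each fragment as it arrives instead of joining them at the end; join_stream collects them only to keep the sketch short.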

Step 5: Keep a model warm

Loading weights into memory takes a few seconds. If you are calling the API repeatedly, that cold start adds up. Send a request with no prompt to load the model and keep it resident:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": "30m"}'

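The keep_alive field tells Ollama how long to hold the model in RAM after the last request. By default a model stays loaded for about five minutes. Raise the value for an interactive session, lower it to free memory between bursts, or set it to 0 to unload the model as soon as the request finishes.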
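A script that is about to make many calls can do the same warm-up first. The sketch below assumes the default port and adds "stream": false, which the curl example does not send, so that the reply is a single JSON object. The function names are hypothetical.

```python
import json
import urllib.request


def keep_warm_payload(model, keep_alive="30m"):
    """Body for a prompt-less request that loads the model and sets its keep_alive."""
    # "stream": False is an addition for this sketch so the reply is one JSON object.
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def preload(model, keep_alive="30m", base_url="http://localhost:11434"):
    """Ask the server to load the model; return True if it reports done."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(keep_warm_payload(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return bool(json.load(resp).get("done", False))


if __name__ == "__main__":
    try:
        print("loaded" if preload("llama3.2") else "server did not report done")
    except OSError as exc:
        print(f"Could not reach the server: {exc}")
```

Calling preload once at startup moves the cold start out of the first real request.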

Where this breaks

The most common failure is memory pressure. If a model is larger than your free RAM, the OS pages weights to disk and generation slows to a near halt, or the process is killed outright. Watch system memory and step down a model size before you blame the tool.
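On Linux, one way to watch system memory from a script is to read the MemAvailable line of /proc/meminfo. The sketch below is Linux-only by assumption; macOS and Windows need a different source for the same number. The function name is made up for this example.

```python
import os


def mem_available_gb(meminfo_text):
    """Extract MemAvailable from /proc/meminfo text and convert it to gigabytes.

    The kernel reports the value in kibibytes (labelled "kB").
    Returns None if the line is missing.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib * 1024 / 1e9
    return None


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as fh:
            available = mem_available_gb(fh.read())
        print(f"Available memory: {available:.1f} GB" if available is not None
              else "MemAvailable not reported")
    else:
        print("No /proc/meminfo here; this sketch only covers Linux")
```

Compare the result against the sizing rule from Step 2 before pulling a larger model.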

The second trap is expecting a small local model to match a frontier cloud model. A 3B model is genuinely useful for drafting, summarizing, and classification, but it hallucinates more and reasons less reliably than the largest hosted systems. Match the model to the job.

Finally, a quantized model trades some quality for size. If answers feel noticeably worse than you expected, try a less aggressive quantization or a larger parameter count before concluding the model is bad. The next tutorial in this cluster covers exactly how to read those quantization labels.

