Serve a local model as an API endpoint

Muniba K. · May 20, 2026 · 3 min

Turn a model running on your machine into a clean HTTP endpoint your apps can call, with the concurrency and memory traps spelled out.

Step-by-step — built to follow along.

Prerequisites

A working compute stack (driver, toolkit, framework) if you are serving on a GPU, verified with a tiny workload first.

The model weights downloaded locally, with their license read.

An inference server you have chosen for your model family. Prefer a maintained one over a hand-rolled loop.

A way to test HTTP from the terminal, such as curl.

Step 1: Load the model once, not per request

The single most important rule of serving: load the weights at startup and hold them in memory. Loading per request is the classic mistake: the first call takes many seconds, and every call after it is just as slow. A maintained inference server handles this for you by loading the weights once at startup and keeping them resident.

# Start the inference server, pointing it at your local weights.
# It loads the model once and stays resident.
# Bind to localhost by default; use 0.0.0.0 only with real auth in front.
inference-server --model /path/to/model --host 127.0.0.1 --port 8000
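If you do end up writing a small handler yourself, the load-once rule can be sketched in Python as below. This is a minimal illustration, not how any particular server does it: `get_model` and `handle_request` are hypothetical names, and the "weights" are a stand-in list rather than a real model.

```python
import functools


@functools.lru_cache(maxsize=1)
def get_model(path: str):
    """Load the weights once; later calls with the same path reuse the cached object."""
    # Stand-in for an expensive load (hypothetical; a real loader would read from `path`).
    return {"path": path, "weights": [0.0] * 1000}


def handle_request(prompt: str, model_path: str = "/path/to/model") -> str:
    """Serve one request using the resident model instead of reloading it."""
    model = get_model(model_path)  # cheap after the first call
    return f"echo({len(model['weights'])} weights): {prompt}"


if __name__ == "__main__":
    get_model("/path/to/model")  # load at startup, before serving traffic
    print(handle_request("hello"))
    print(handle_request("again"))
    # cache_info().misses counts how many times the load actually ran.
    print("loads:", get_model.cache_info().misses)
```

Calling `get_model` once at startup moves the slow load out of the first user-facing request, which is the behavior a resident server gives you by default.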

Step 4: Set sane limits

An endpoint with no limits is a denial-of-service waiting to happen. Cap the maximum output length and the number of requests processed at once. The right concurrency number depends on available memory (VRAM on a GPU), not on your CPU core count.

inference-server --model /path/to/model \
  --max-tokens 512 \
  --max-concurrent 4
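If your server does not expose such limits, or you put a thin wrapper in front of it, the same two caps can be sketched in Python. The values mirror the `--max-tokens 512` and `--max-concurrent 4` flags above; `run_limited`, `clamp_tokens`, and `Busy` are hypothetical names, and rejecting when full (rather than queueing) is one possible design choice, not the only one.

```python
import threading

MAX_TOKENS = 512      # mirrors --max-tokens above
MAX_CONCURRENT = 4    # mirrors --max-concurrent above

# One slot per request allowed in flight at the same time.
SLOTS = threading.BoundedSemaphore(MAX_CONCURRENT)


class Busy(Exception):
    """Raised when all slots are taken; a server could map this to HTTP 429 or 503."""


def clamp_tokens(requested: int) -> int:
    """Never generate more than MAX_TOKENS, whatever the caller asks for."""
    return max(1, min(requested, MAX_TOKENS))


def run_limited(generate, prompt: str, requested_tokens: int):
    """Run generate(prompt, n) only if a slot is free; reject instead of queueing."""
    if not SLOTS.acquire(blocking=False):
        raise Busy("too many requests in flight")
    try:
        return generate(prompt, clamp_tokens(requested_tokens))
    finally:
        SLOTS.release()  # always free the slot, even if generation fails


if __name__ == "__main__":
    def fake_generate(prompt: str, n: int) -> str:
        # Stand-in for a real generation call.
        return f"{prompt} ({n} tokens max)"

    print(run_limited(fake_generate, "hello", 10_000))
```

Rejecting early keeps memory use bounded: a request that cannot get a slot never allocates anything, so overload shows up as fast errors instead of an out-of-memory crash.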

Pitfalls

Reloading per request. If your first and second calls are equally slow, you are loading the model every time. Use a server that keeps it resident.

Unbounded concurrency. Each in-flight request consumes memory. Too many at once cause out-of-memory crashes, not a graceful slowdown. Cap concurrency to what your VRAM allows.

No output cap. Without a max-tokens limit, a single request can run for a very long time and starve everything else.

Binding to all interfaces by accident. Listening on 0.0.0.0 exposes the endpoint to the network. On a shared machine, bind to localhost unless you have put real auth in front.

No timeout on callers. A hung generation with no client timeout looks like an app freeze. Set one everywhere you call the endpoint.

Skipping the health check. Wiring three services to an endpoint that never actually started turns a one-line fix into a debugging session. Confirm the server is up first.
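The last two pitfalls, missing timeouts and skipped health checks, can be guarded against from the caller's side. The sketch below uses only Python's standard library. The address, the `/health` and `/generate` routes, and the JSON request shape are all assumptions for illustration; check your server's documentation for its real routes and payload.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed address of the local server
HEALTH_PATH = "/health"             # hypothetical route
GENERATE_PATH = "/generate"         # hypothetical route


def is_up(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True only if the health route answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(base_url + HEALTH_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


def generate(prompt: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """POST a prompt as JSON with an explicit timeout on the socket."""
    body = json.dumps({"prompt": prompt, "max_tokens": 512}).encode("utf-8")
    req = urllib.request.Request(
        base_url + GENERATE_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    if is_up():
        print(generate("Say hello in one sentence."))
    else:
        print("Server is not up; fix that before wiring anything to it.")
```

Note that the `timeout` passed to `urlopen` applies to each blocking socket operation rather than to the whole call, so a server that trickles bytes slowly can still exceed it. If you need a hard wall-clock limit, enforce it separately in the calling code.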

An inference endpoint is infrastructure. Treat it like any other service: one stable address, explicit limits, health checks, and timeouts. Once those are in place, swapping the model behind the URL becomes a config change rather than a rewrite.
