
Wire up retrieval-augmented generation from scratch: chunk your documents, embed them, store the vectors, and feed the right context into a local model so it answers from your data.
A deep read — the full picture, with the receipts.

By the end of this tutorial you will understand how to build a retrieval-augmented generation (RAG) pipeline that lets a language model answer questions using your own documents instead of only what it learned in training. You need a local LLM runtime (an Ollama setup works well), an embedding model, and a small collection of text you want to query: notes, docs, a knowledge base, anything. No cloud services are required.
RAG exists to work around two limits: a model only knows what it was trained on, and its context window is finite. It fetches the few passages most relevant to a question and pastes them into the prompt, so the model answers from fresh, specific, private data it never saw during training.
Models read a fixed-size context window, so you cannot dump a whole library into one prompt. Split each document into chunks of a few hundred tokens. Overlap adjacent chunks slightly so a sentence split across a boundary still lands intact in at least one piece.
chunk_size = 400 tokens
overlap = 50 tokens
Chunk too small and each piece loses the context that makes it meaningful. Chunk too large and retrieval gets coarse and you waste prompt space. A few hundred tokens with light overlap is a sound default to tune from.
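The sliding-window split described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and whitespace-separated words stand in for real tokens, so actual token counts will differ depending on your model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a list of tokens into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=400, overlap=50):
    """Chunk raw text. Assumption: one whitespace-separated word ~ one token."""
    words = text.split()
    return [" ".join(piece) for piece in chunk_tokens(words, chunk_size, overlap)]
```

With the defaults, each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 of the next.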
An embedding model turns a chunk of text into a vector, a list of numbers that captures its meaning. Passages about similar topics land near each other in this vector space. Run every chunk through a local embedding model:
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Your chunk of text goes here."
}'
The response contains an embedding array. Embed your query with the same model you used for the chunks. Mixing embedding models breaks the comparison, because two different models do not place text in the same coordinate system.
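The same request can be made from Python with only the standard library. The sketch below mirrors the curl call above; the helper names are hypothetical, and it assumes the response JSON carries the vector under an `embedding` key, so check the response your runtime actually returns.

```python
import json
import urllib.request

# Endpoint and model name taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"


def build_embedding_request(text, model=EMBED_MODEL, url=OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body):
    """Assumption: the vector is stored under the 'embedding' key."""
    return json.loads(body)["embedding"]


def embed(text, model=EMBED_MODEL):
    """Embed one piece of text. Requires the local runtime to be running."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request) as response:
        return parse_embedding_response(response.read())
```

Routing both chunks and queries through a single `embed` function with one default model is a simple way to keep the two sides in the same vector space.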
Keep each chunk's vector alongside the original text and a reference back to its source. For a few thousand chunks, a lightweight vector store or even an in-memory list works. Past that, use a dedicated vector database that indexes the vectors for fast nearest-neighbor search. Each stored record needs three things: the embedding vector, the original chunk text, and metadata that points back to the source.
The metadata is what lets you cite where an answer came from later.
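For a small collection, the in-memory option might look like the sketch below. The class and field names are hypothetical, chosen for illustration; the only point is that the vector, the text, and the source metadata travel together in one record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkRecord:
    vector: List[float]   # embedding of the chunk
    text: str             # original chunk text, pasted into the prompt later
    metadata: Dict[str, object] = field(default_factory=dict)  # e.g. source file, position


class InMemoryStore:
    """A plain list of records; adequate for a few thousand chunks."""

    def __init__(self):
        self.records: List[ChunkRecord] = []

    def add(self, vector, text, **metadata):
        record = ChunkRecord(list(vector), text, dict(metadata))
        self.records.append(record)
        return record

    def __len__(self):
        return len(self.records)


# Usage: the keyword arguments become the record's metadata.
store = InMemoryStore()
store.add([0.12, -0.03, 0.88], "Backups run nightly at 02:00.", source="ops-notes.md", position=0)
```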
At query time, embed the user's question, then find the stored vectors closest to it. Closeness is measured with cosine similarity, which compares the angle between two vectors rather than their length. Return the top few matches, commonly three to five:
query_vector = embed(question)
top_k = nearest_by_cosine(query_vector, stored_vectors, k=4)
Those top-k chunks are your evidence. Retrieve too many and the prompt fills with noise that dilutes the model's attention; retrieve too few and you risk missing the passage that holds the answer.
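A brute-force version of the nearest-neighbor step can be written in a few lines of plain Python. This is a sketch that scans every stored vector, which is fine for small stores; it reuses the name `nearest_by_cosine` from the pseudocode above, but the return shape (index and score pairs) is an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; vector length is ignored."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def nearest_by_cosine(query_vector, stored_vectors, k=4):
    """Return (index, score) pairs for the k most similar stored vectors."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(stored_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indexes point back into the stored list, so the matching text and metadata can be looked up afterwards.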
Now build the final prompt by stitching the retrieved chunks into a clear instruction. The structure matters more than the wording:
Use only the context below to answer. If the answer is not
in the context, say you do not know.
Context:
<retrieved chunk 1>
<retrieved chunk 2>
Question: <the user question>
Send that to your local model. Because the relevant passages sit right there in the prompt, the model can ground its answer in your data and quote specifics. The instruction to admit ignorance when the context lacks an answer is what keeps it from filling gaps with invented facts.
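Assembling the template above is plain string formatting. The sketch below uses a hypothetical `build_prompt` helper and separates chunks with a blank line, which is a layout choice rather than a requirement.

```python
def build_prompt(question, chunks):
    """Stitch retrieved chunks and the question into the grounding template."""
    context = "\n\n".join(chunks)  # blank line between chunks (a layout assumption)
    return (
        "Use only the context below to answer. If the answer is not\n"
        "in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Usage with two retrieved chunks.
prompt = build_prompt(
    "When do backups run?",
    ["Backups run nightly at 02:00.", "Backups are kept for 30 days."],
)
```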
Since every chunk carries metadata, append the source of each retrieved passage to the answer. This turns an opaque response into one a reader can verify, and it is the single most valuable habit in any RAG system.
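One way to attach sources is to list each distinct source once, in retrieval order, below the answer. The sketch assumes each retrieved chunk is a dict with a `source` key; that shape and the `append_sources` name are hypothetical.

```python
def append_sources(answer, retrieved):
    """Append a numbered, de-duplicated source list to the model's answer."""
    sources = []
    for chunk in retrieved:
        source = chunk.get("source", "unknown source")
        if source not in sources:  # keep first-seen order, drop repeats
            sources.append(source)
    if not sources:
        return answer
    lines = [f"[{number}] {source}" for number, source in enumerate(sources, start=1)]
    return answer.rstrip() + "\n\nSources:\n" + "\n".join(lines)
```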
The failure most people hit first is bad chunking. If retrieval keeps returning passages that are almost-but-not-quite relevant, your chunk size is usually the culprit. Re-chunk with different sizes and overlap before touching anything else.
The second is the embedding mismatch: using one model to embed documents and another to embed the query. The vectors live in different spaces and similarity becomes meaningless. Always embed both sides with the same model.
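A cheap guard against this mistake is to record the embedding model's name with the index and refuse vectors that claim a different one. The class below is a hypothetical sketch of that idea, not part of any particular vector store.

```python
class EmbeddingIndex:
    """Holds vectors and remembers which model produced them."""

    def __init__(self, model_name):
        self.model_name = model_name
        self.vectors = []

    def _check(self, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}"
            )

    def add(self, vector, model_name):
        """Store a chunk vector, rejecting vectors from another model."""
        self._check(model_name)
        self.vectors.append(vector)

    def check_query(self, model_name):
        """Call before searching to confirm the query used the same model."""
        self._check(model_name)
```

Persisting the model name next to the stored vectors also tells you when a model change means the whole collection needs re-embedding.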
Third, retrieval is only as good as what is in the store. If a fact was never ingested, no amount of prompt tuning will surface it, and the model will either say it does not know or quietly hallucinate. RAG extends a model's knowledge; it does not make the model omniscient. When answers go wrong, check what was retrieved before you blame the model.
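Checking what was retrieved is easier with a small report that shows each hit's score, source, and a text preview. The helper below is a hypothetical sketch and assumes hits arrive as `(score, source, text)` tuples.

```python
def format_retrieval_report(question, hits, preview_chars=80):
    """Summarize retrieved chunks so a bad answer can be traced to bad retrieval."""
    lines = [f"Question: {question}"]
    if not hits:
        lines.append("No chunks retrieved; the fact may never have been ingested.")
    for rank, (score, source, text) in enumerate(hits, start=1):
        # Collapse whitespace so each hit fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"{rank}. score={score:.3f} source={source} text={preview!r}")
    return "\n".join(lines)
```

If the passage that holds the answer never appears in the report, the problem lies in ingestion, chunking, or embedding rather than in the model.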

Muniba K. · May 22, 2026 · 4 min read