ArticleaiDeep read

What a context window actually is

Signal DeskMay 11, 20264 minUpdated Jul 11, 2026

A context window is not memory. It is the fixed amount of text a model can look at in a single pass, and everything outside it simply does not exist to the model.

A deep read — the full picture, with the receipts.

Signalstrong4independent sources

The phrase "context window" gets thrown around as if it were the model's memory, and that framing causes real mistakes. People assume a model with a large context window remembers your whole history, or that pasting a long document means the model has read and retained it. Neither is true. A context window is the maximum amount of text, measured in tokens, that a model can take in for a single forward pass. It is closer to working memory held in your hand than to long-term memory in your head, and everything outside it is invisible. Understanding the mechanics explains a lot of otherwise baffling behavior.

Tokens, not words or characters#

The window is measured in tokens, and a token is not a word. Tokenizers split text into common chunks: frequent words become a single token, rare words get broken into pieces, and whitespace and punctuation count too. A rough rule for English is that one token is about four characters, so a 1,000-word document might be 1,300 to 1,500 tokens. Code, non-English text, and unusual formatting tokenize less efficiently and eat the window faster.

The window covers both directions at once. Your prompt, any system instructions, the document you pasted, and the model's own reply all draw from the same budget. If a model has an 8,000-token window and your input is 7,500 tokens, the reply has only 500 tokens of room. People who get mysteriously truncated answers are often just running out of window.

Why there is a limit at all#

The limit is not arbitrary. It comes from attention, the mechanism transformers use to relate every token to every other token. In the standard form, the cost of attention grows with the square of the sequence length. Double the context and you roughly quadruple the work and the memory needed to hold the intermediate scores.

Think of it as a meeting where every participant must shake hands with every other participant. Ten people is forty-five handshakes. A hundred people is nearly five thousand. The handshakes are the pairwise comparisons attention performs, and they are why a longer context is not a free upgrade. It costs more compute and more memory for every token you add, which is why large windows arrived slowly and why they are more expensive to use.

Inside the window, not all positions are equal#

Here is the part that surprises people. Having a token inside the window does not guarantee the model uses it well. Models tend to attend most reliably to the beginning and the end of the context, and information buried in the middle of a long input can get effectively overlooked. This is sometimes called the lost-in-the-middle effect. So a 200,000-token window does not mean the model weighs all 200,000 tokens evenly. It means they all fit. Where you place the important material still matters.

The window is stateless between calls#

This is the most consequential point for anyone building with these models. The window does not persist on its own. Each request is independent. The only reason a chat app seems to remember earlier turns is that the application re-sends the prior conversation as part of the input every single time.

The model itself stores nothing between calls.
The chat history you see is replayed into the window on each turn.
When the conversation grows past the window, something has to be dropped or compressed.

That last point is why long chats start to forget their own beginning. The app is silently trimming the oldest turns to make room. The model is not forgetting. It is being shown less.

What this means in practice#

Belief	Reality
A big window means the model remembers me	It only means more text fits in one call
The model read my whole document	It can only use what fits, and favors the ends
The chat remembers our history	The app re-sends the history each turn

The practical advice falls out of the mechanics. Put the most important instructions and material near the start or end of your input, not buried in the middle. Do not assume a long paste was absorbed evenly. And if you need the model to recall something across many turns or many documents, the context window is the wrong tool. You want retrieval, where relevant chunks are fetched and placed into the window on demand, which keeps the window small, focused, and within budget.

Frequently asked questions

What is a context window in an AI model?

A context window is the maximum amount of text, measured in tokens, that a model can take in for a single forward pass. It is like working memory held in your hand rather than long-term memory, and anything outside it is invisible to the model.

Is a context window the same as memory?

No. The window is stateless between calls and the model stores nothing between requests. A chat app only seems to remember earlier turns because the application re-sends the prior conversation as part of the input every time.

How many tokens is a word?

A token is not a word. A rough rule for English is that one token is about four characters, so a 1,000-word document might be 1,300 to 1,500 tokens. Code, non-English text, and unusual formatting use the window faster.

Why do context windows have a size limit?

The limit comes from attention, which relates every token to every other token. Its cost grows with the square of the sequence length, so doubling the context roughly quadruples the compute and memory needed, making large windows slower to arrive and more expensive to use.

Does a large context window mean the model uses all the text equally?

No. Models attend most reliably to the beginning and end of the context, while information buried in the middle can be overlooked, an effect called lost-in-the-middle. A large window means the text fits, not that it is weighed evenly, so placement still matters.

What should you use instead of a large context window for recalling information across many turns or documents?

You want retrieval, where relevant chunks are fetched and placed into the window on demand. This keeps the window small, focused, and within budget.

Sources

Vaswani et al. — Attention Is All You Need (arXiv)arxiv.org
Liu et al. — Lost in the Middle: How Language Models Use Long Contexts (arXiv)arxiv.org
Anthropic — Context windowsplatform.claude.com
OpenAI — Conversation state (API guide)developers.openai.com
Hugging Face — Subword tokenization (LLM Course, Ch. 2)huggingface.co

Taggedcontext-window tokens llm memory attention

What a context window actually is

Tokens, not words or characters#

Why there is a limit at all#

Inside the window, not all positions are equal#

The window is stateless between calls#

What this means in practice#

Frequently asked questions

Sources

More in aiMore in ai→

How a transformer model actually works

RAG vs Fine-Tuning: Which One Your Use Case Actually Needs

The real difference between training and inference

Discussion