A context window is not memory. It is the fixed amount of text a model can look at in a single pass, and everything outside it simply does not exist to the model.
The phrase "context window" gets thrown around as if it were the model's memory, and that framing causes real mistakes. People assume a model with a large context window remembers your whole history, or that pasting a long document means the model has read and retained it. Neither is true. A context window is the maximum amount of text, measured in tokens, that a model can take in during a single forward pass. It is closer to a page of notes held in your hand than to long-term memory in your head, and everything outside it is invisible. Understanding the mechanics explains a lot of otherwise baffling behavior.
Tokens, not words or characters
The window is measured in tokens, and a token is not a word. Tokenizers split text into common chunks: frequent words become a single token, rare words get broken into pieces, and whitespace and punctuation count too. A rough rule for English is that one token is about four characters, or roughly three-quarters of a word, so a 1,000-word document might be 1,300 to 1,500 tokens. Code, non-English text, and unusual formatting tokenize less efficiently and eat the window faster.
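As a rough illustration of the four-characters rule, here is a minimal sketch that estimates a token count from character length alone. The names `estimate_tokens` and `CHARS_PER_TOKEN` are made up for this example, and the result is only a ballpark: a real tokenizer will give different numbers, especially for code or non-English text.

```python
import math

# Assumption: the rough English heuristic of about four characters per token.
# This is not a real tokenizer; it only approximates the order of magnitude.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based on character count."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


sample = "The window is measured in tokens, and a token is not a word."
print(f"{len(sample)} characters, roughly {estimate_tokens(sample)} tokens")
```

For anything that matters, such as billing or hard limits, count with the tokenizer that belongs to the model you are actually calling.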
The window covers both directions at once: what goes in and what comes out. Your prompt, any system instructions, the document you pasted, and the model's own reply all draw from the same budget. If a model has an 8,000-token window and your input is 7,500 tokens, the reply has only 500 tokens of room. People who get mysteriously truncated answers are often just running out of window.
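The shared budget is plain subtraction. A minimal sketch, using a hypothetical helper called `reply_budget` and the 8,000-token example from the paragraph above:

```python
def reply_budget(window_tokens: int, input_tokens: int) -> int:
    """Tokens left for the reply once the input is counted against the window.

    Hypothetical helper for illustration; real APIs may also reserve tokens
    for formatting or enforce a separate cap on output length.
    """
    if window_tokens < 0 or input_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # If the input already fills the window, there is no room left at all.
    return max(window_tokens - input_tokens, 0)


print(reply_budget(8_000, 7_500))  # 500
print(reply_budget(8_000, 9_000))  # 0: the input alone overflows the window
```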
Why there is a limit at all
The limit is not arbitrary. It comes from attention, the mechanism transformers use to relate every token to every other token. In the standard form, the cost of attention grows with the square of the sequence length. Double the context and you roughly quadruple the work and the memory needed to hold the intermediate scores.
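To make the scaling concrete, here is a small sketch that counts the pairwise scores standard attention computes for a single head in a single layer. The helper name `attention_scores` and the figure of 2 bytes per score (16-bit floats) are assumptions for illustration, not a description of any particular model.

```python
def attention_scores(seq_len: int) -> int:
    """Pairwise scores for one attention head: every token against every token."""
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    return seq_len * seq_len


# Assumption: each score is stored as a 16-bit float (2 bytes).
BYTES_PER_SCORE = 2

for n in (1_000, 2_000, 4_000):
    scores = attention_scores(n)
    megabytes = scores * BYTES_PER_SCORE / 1e6
    print(f"{n:>5} tokens -> {scores:>10,} scores (~{megabytes:.0f} MB per head)")
```

Each doubling of the sequence length multiplies the score count by four, which is the quadratic growth described above.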






