Attention is not the model reading your text like a person. It is a weighted lookup that lets every word pull context from every other word at once.
A deep read — the full picture, with the receipts.
Most explanations of transformers start with a diagram of arrows and the word "attention" in a box, then stop. That leaves people thinking attention is some kind of focus, as if the model is reading carefully and deciding what to care about. It is simpler and stranger than that. Attention is a math operation that lets each position in a sequence gather information from every other position, all in one pass, with no notion of reading order built in. The transformer is the architecture that stacks this operation into something that can model language. Here is what is actually happening under the hood.
Everything becomes a vector first
A transformer cannot work with words. It works with numbers. So the first thing it does is chop your text into tokens (roughly word-fragments) and map each token to a list of numbers called an embedding. An embedding might be 768 or 4096 numbers long, and its job is to place that token somewhere in a high-dimensional space where similar meanings sit near each other.
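To make the text-to-vector step concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than how any particular model does it: the three-word `VOCAB`, the whitespace `tokenize` function (real tokenizers split into sub-word fragments), the 4-number `EMBED_DIM`, and the random `EMBEDDING_TABLE` (a real model learns these values during training).

```python
import random

# Hypothetical toy vocabulary; real models learn tens of thousands of sub-word tokens.
VOCAB = {"dog": 0, "bites": 1, "man": 2}
EMBED_DIM = 4  # toy size; the article mentions 768 or 4096 for real models

rng = random.Random(0)
# One row of EMBED_DIM numbers per token id. Random here, learned in a real model.
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB
]


def tokenize(text):
    """Map text to token ids. A whitespace split stands in for a real tokenizer."""
    return [VOCAB[word] for word in text.lower().split()]


def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


token_ids = tokenize("Dog bites man")
vectors = embed(token_ids)
print(token_ids)     # three integer ids, one per token
print(len(vectors))  # three vectors, each EMBED_DIM numbers long
```

The embedding step is nothing more than a table lookup: the token id picks a row. What makes the rows meaningful is training, which this sketch leaves out.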
There is a catch. Once you turn a sentence into a bag of vectors, you have lost the order. "Dog bites man" and "man bites dog" become the same set. So the model adds a positional signal to each embedding, a second set of numbers that encodes where the token sits. Now position is baked into the vector itself, not into the structure of the network. This is why a transformer can look at the whole sequence at once instead of marching through it left to right.
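The order problem and its fix can be sketched in a few lines. This example assumes the sinusoidal position signal, which is one common scheme; many models learn their position vectors instead, and the text above does not say which is meant. The embeddings in `EMB` are made-up values for illustration.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position signal: sine on even indices, cosine on odd ones."""
    signal = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        signal.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return signal


def add_positions(embeddings):
    """Add the position signal element-wise to each token embedding."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]


# Hypothetical 4-number embeddings for three tokens.
EMB = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "bites": [0.5, 0.6, 0.7, 0.8],
    "man": [0.9, 1.0, 1.1, 1.2],
}

a = [EMB[w] for w in ["dog", "bites", "man"]]
b = [EMB[w] for w in ["man", "bites", "dog"]]

# Without positions, both sentences are the same collection of vectors.
same_bag = sorted(a) == sorted(b)

# With positions added, the two collections no longer match.
differ_with_positions = sorted(add_positions(a)) != sorted(add_positions(b))

print(same_bag, differ_with_positions)
```

Sorting is only a device for comparing the two sentences as unordered collections. The point is that after `add_positions`, the vector for "dog" in first place is a different vector from "dog" in third place.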
Attention is a weighted lookup
This is the part everyone hand-waves. For every token, the model produces three vectors by multiplying the token's embedding by three learned weight matrices. The results are conventionally called the query, the key, and the value.
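A minimal sketch of that projection step and the weighted lookup it feeds, in dependency-free Python. It assumes the standard scaled dot-product form of attention with a single head: each token's query (`Q`) is scored against every token's key (`K`), the scores are turned into weights by a softmax, and the output is a weighted average of the values (`V`). The matrices `W_q`, `W_k`, `W_v` and the input `X` are random stand-ins for learned and embedded values, and the sizes (3 tokens, 4 input numbers, 2 output numbers) are arbitrary.

```python
import math
import random


def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over the rows of X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    weights = []
    for q in Q:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all value vectors.
    outputs = matmul(weights, V)
    return outputs, weights


rng = random.Random(0)


def random_matrix(rows, cols):
    """Random stand-in for learned weights or embedded input."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


X = random_matrix(3, 4)  # 3 tokens, 4 numbers each
W_q, W_k, W_v = (random_matrix(4, 2) for _ in range(3))
outputs, weights = self_attention(X, W_q, W_k, W_v)

for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Every row of `weights` has one entry per token in the sequence, which is the "every position gathers from every other position" idea in numeric form. Nothing in `self_attention` depends on the order of the rows of `X`, so any sense of position has to come from the vectors themselves.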






