How a transformer model actually works

Signal Desk | May 13, 2026 | 4 min | Updated Jul 11, 2026

Attention is not the model reading your text like a person. It is a weighted lookup that lets every word pull context from every other word at once.

A deep read — the full picture, with the receipts.

Signal: strong | 2 independent sources

Most explanations of transformers start with a diagram of arrows and the word "attention" in a box, then stop. That leaves people thinking attention is some kind of focus, as if the model is reading carefully and deciding what to care about. It is simpler and stranger than that. Attention is a math operation that lets each position in a sequence gather information from every other position, all in one pass, with no notion of reading order built in. The transformer is the architecture that stacks this operation into something that can model language. Here is what is actually happening under the hood.

Everything becomes a vector first

A transformer cannot work with words. It works with numbers. So the first thing it does is chop your text into tokens (roughly word-fragments) and map each token to a list of numbers called an embedding. An embedding might be 768 or 4096 numbers long, and its job is to place that token somewhere in a high-dimensional space where similar meanings sit near each other.
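As a rough illustration of the text-to-numbers step, here is a minimal sketch in plain Python. The tiny vocabulary, the whitespace splitting, and the names `VOCAB`, `EMBEDDING_TABLE`, `tokenize`, and `embed` are all assumptions made for the example. Real tokenizers split text into learned sub-word fragments, and real embedding values are learned during training, not drawn at random.

```python
import random

# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of fragments.
VOCAB = {"dog": 0, "bites": 1, "man": 2, "<unk>": 3}
EMBED_DIM = 4  # real models use hundreds or thousands of numbers per token

rng = random.Random(0)
# One row of numbers per token id. Random here; learned during training in a real model.
EMBEDDING_TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]


def tokenize(text):
    """Split on whitespace and map each piece to an id (unknown pieces share one id)."""
    return [VOCAB.get(piece, VOCAB["<unk>"]) for piece in text.lower().split()]


def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [EMBEDDING_TABLE[i] for i in token_ids]


ids = tokenize("Dog bites man")
vectors = embed(ids)
print(ids)                            # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

The point of the sketch is only the shape of the data: text goes in, and one fixed-length list of numbers per token comes out.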

There is a catch. Attention treats its input as an unordered set, so a sentence handed over as plain embeddings has lost its order. "Dog bites man" and "man bites dog" become the same set. So the model adds a positional signal to each embedding, a second set of numbers that encodes where the token sits. Now position is baked into the vector itself, not into the structure of the network. This is why a transformer can look at the whole sequence at once instead of marching through it left to right.
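One way to build that positional signal is the sinusoidal scheme from the original transformer design, sketched below. This is an assumption about which scheme to show, since the article does not name one, and many current models use learned or rotary position schemes instead. The embeddings and the function names `positional_encoding` and `add_positions` are hypothetical.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position vector: sine on even indices, cosine on odd ones."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(embeddings):
    """Add each token's position vector to its embedding, element by element."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]


# Hypothetical 4-number embeddings for three tokens.
dog, bites, man = [0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.9, 0.8, 0.1, 0.3]

a = add_positions([dog, bites, man])  # "dog bites man"
b = add_positions([man, bites, dog])  # "man bites dog"

# Without positions the two sentences hold the same set of vectors.
# With positions, "dog" in slot 0 no longer equals "dog" in slot 2.
print(a[0] != b[2])  # True
```

After the addition, the two word orders produce different inputs even though they contain the same three tokens.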

Attention is a weighted lookup

This is the part everyone hand-waves. For every token, the model produces three vectors by multiplying the embedding against three learned weight matrices:

a query: what this token is looking for
a key: what this token offers to others
a value: the actual content this token will hand over

To decide how much one token should pull from another, the model takes the dot product of one token's query with another token's key. A high score means the key matches what the query is looking for. The scores are divided by the square root of the key length to keep them in a workable range, then passed through a softmax, which turns them into positive weights that sum to one. Each token's new representation is then a weighted blend of all the value vectors in the sequence.

Think of it like a room where everyone shouts a question (query) and holds up a label (key). You read all the labels, weight them by how well they match your question, and then average what those people are saying (values) into your own notes. The word "it" in a sentence can score highly against the noun it refers to and pull that meaning in, without anyone hard-coding a grammar rule.
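The query, key, and value lookup can be sketched in a few lines of plain Python. This is a minimal single-head version under stated assumptions: the 2-number embeddings and the identity weight matrices are made up for readability, and the helper names are hypothetical. It includes the usual division by the square root of the key length before the softmax.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(embeddings, w_q, w_k, w_v):
    """Single-head self-attention over a list of embedding vectors."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))

    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every key, then normalize.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend all value vectors using those weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-number embeddings and hand-picked 2x2 weight matrices.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention(tokens, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Each printed row is one token's attention weights over the whole sequence. Production code does the same arithmetic as a handful of matrix multiplications so every position is processed at once.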

Many heads, looking for different things

One attention operation tends to capture one kind of relationship. Real language has many at once: subject-verb agreement, what a pronoun points to, the topic of the paragraph. So transformers run several attention operations in parallel, called heads, each with its own query, key, and value matrices. One head might track syntax while another tracks long-range topic. The heads' outputs are concatenated and passed through one more learned matrix that mixes them back together. This is "multi-head attention," and it is the reason a single layer can model more than one type of dependency.
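A hedged sketch of the multi-head idea follows. The sizes (model width 4, two heads of width 2), the random weights, and the names `one_head`, `multi_head_attention`, and `w_out` are assumptions for illustration. In a trained model every matrix here would be learned.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_head(embeddings, w_q, w_k, w_v):
    """One attention head: project, score, normalize, blend."""
    qs = [matvec(w_q, e) for e in embeddings]
    ks = [matvec(w_k, e) for e in embeddings]
    vs = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(ks[0]))
    out = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) for j in range(len(vs[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(embeddings, heads, w_out):
    """Run every head, concatenate per-token outputs, then mix with w_out."""
    per_head = [one_head(embeddings, *h) for h in heads]
    mixed = []
    for t in range(len(embeddings)):
        concatenated = [x for head_out in per_head for x in head_out[t]]
        mixed.append(matvec(w_out, concatenated))
    return mixed


# Hypothetical sizes: model width 4, two heads of width 2 each.
rng = random.Random(0)
MODEL_DIM, NUM_HEADS, HEAD_DIM = 4, 2, 2
# Each head gets its own (query, key, value) matrices.
heads = [
    tuple(random_matrix(HEAD_DIM, MODEL_DIM, rng) for _ in range(3))
    for _ in range(NUM_HEADS)
]
w_out = random_matrix(MODEL_DIM, NUM_HEADS * HEAD_DIM, rng)

tokens = [random_matrix(1, MODEL_DIM, rng)[0] for _ in range(3)]
result = multi_head_attention(tokens, heads, w_out)
print(len(result), len(result[0]))  # 3 4
```

Because each head works in a smaller space than the full model width, running several heads costs about the same as one full-width attention operation.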

The stack is where depth comes from

A transformer block is attention followed by a small feed-forward network applied to each position, wrapped in two helpers that make deep networks trainable:

Component	What it does
Multi-head attention	Mixes information across tokens
Feed-forward layer	Transforms each token on its own
Residual connection	Adds the input back to the output so gradients survive depth
Layer normalization	Keeps the numbers in a stable range

Stack dozens of these blocks and each layer refines the representation a little more. Early layers tend to capture surface patterns. Later layers capture more abstract structure. Nobody programmed that division of labor. It emerged from training on enough text.
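To make the wiring concrete, here is a simplified sketch of one block and a small stack. Several things are assumptions made to keep it short: the attention step skips the learned query, key, and value matrices, the feed-forward step is a fixed stand-in function instead of a learned network, layer normalization is applied after each residual add (some models apply it before), and every block reuses the same functions where a real stack gives each block its own weights.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Rescale one vector to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Simplified attention: each vector serves as its own query, key, and value."""
    scale = math.sqrt(len(xs[0]))
    out = []
    for q in xs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in xs])
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(len(q))])
    return out


def feed_forward(vec):
    """Stand-in for the per-token network: a fixed scale followed by ReLU."""
    return [max(0.0, 2.0 * x) for x in vec]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def transformer_block(xs):
    """Attention, then feed-forward, each with a residual add and layer norm."""
    attended = self_attention(xs)
    xs = [layer_norm(add(x, a)) for x, a in zip(xs, attended)]  # mixes across tokens
    xs = [layer_norm(add(x, feed_forward(x))) for x in xs]      # each token on its own
    return xs


def run_stack(xs, num_blocks):
    """Apply the block repeatedly; the output of one block is the input of the next."""
    for _ in range(num_blocks):
        xs = transformer_block(xs)
    return xs


# Hypothetical 4-number token vectors.
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.3], [0.2, 0.2, 0.9, -0.7]]
final = run_stack(tokens, num_blocks=3)
print(len(final), len(final[0]))  # 3 4
```

The shape never changes from block to block: three vectors of four numbers go in and come out, which is what lets the blocks stack.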

What it does not do

A transformer has no separate planning step that runs before it writes. A decoder-style model produces one token at a time, feeds that token back in, and predicts the next. Nothing in the architecture stores an outline. "Attention" is not attention in the human sense, and a head is not a little homunculus that understands grammar. These are weighted averages over learned vectors, repeated at scale.
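The one-token-at-a-time loop can be sketched as follows. The score table below is a hypothetical stand-in for a trained model and looks only at the last token, which a real transformer does not do. It is there so the loop runs without any weights. The loop itself, predict, append, repeat, is the part that mirrors the description above, using greedy selection as one simple choice among several decoding strategies.

```python
# Hypothetical stand-in for a trained model: a table of next-token scores.
# A real decoder would run the whole transformer stack over `tokens` here.
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "dog": 0.1},
    "the": {"dog": 0.7, "man": 0.3},
    "dog": {"bites": 0.8, "<end>": 0.2},
    "bites": {"the": 0.6, "man": 0.4},
    "man": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the highest-scoring next token given the sequence so far."""
    scores = NEXT_TOKEN_SCORES[tokens[-1]]  # toy model: looks only at the last token
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=10):
    """Greedy decoding: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)  # the new token becomes part of the next input
    return tokens


print(generate(["<start>"], max_new_tokens=3))  # ['<start>', 'the', 'dog', 'bites']
```

Nothing outside the growing `tokens` list carries state from one step to the next in this sketch.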

That scale is the whole trick. The architecture is not especially clever on its own. It is parallelizable, it connects any two positions in its context directly instead of relaying information through a chain of steps, and it keeps working as you add parameters and data. Those three properties are why it displaced the recurrent networks that came before it, and why almost every large language model today is some variation on this same stack.

Frequently asked questions

What is attention in a transformer model?

Attention is a math operation, specifically a weighted lookup, that lets each position in a sequence gather information from every other position in a single pass. It is not the model reading carefully or focusing in the human sense.

Why do transformers add positional information to embeddings?

Turning a sentence into a set of vectors loses word order, so "dog bites man" and "man bites dog" would look identical. The model adds a positional signal to each embedding so position is baked into the vector, letting it process the whole sequence at once instead of left to right.

What are query, key, and value vectors?

For each token the model produces three vectors by multiplying the embedding against learned weight matrices: a query (what the token is looking for), a key (what it offers others), and a value (the content it hands over). Attention scores come from the dot product of one token's query with another's key.

What is multi-head attention?

It runs several attention operations in parallel, called heads, each with its own query, key, and value matrices. One head might track syntax while another tracks long-range topic, so a single layer can model more than one type of dependency.

Why did transformers replace recurrent networks?

Transformers are parallelizable, connect any two positions in their context directly instead of relaying information through a chain of steps, and keep improving as you add parameters and data. Those three properties let them displace the recurrent networks that came before them.