“A meeting room where every word decides which other words to listen to.”
The problem attention solves
Take the sentence "The trophy didn't fit in the brown suitcase because it was too big." What does "it" refer to? Obviously the trophy. Now flip it: "…because it was too small." Now "it" refers to the suitcase. To resolve "it," you have to look at the rest of the sentence and decide what matters.
Earlier architectures (RNNs, LSTMs) processed text left-to-right, one word at a time, hoping the relevant context survived in a hidden state by the time "it" arrived. They were slow, hard to train, and often forgot. The transformer (2017) said: forget left-to-right. Let every word look at every other word, *at the same time*, and decide what to weight. That's attention.
Self-attention, conceptually
Every token has three roles played by three vectors derived from its embedding: a query ("what am I looking for?"), a key ("what do I match?"), and a value ("what do I contribute if matched?").
For a given token, compare its query against every token's key, its own included (a dot product gives a similarity score, which in practice is scaled by the square root of the key dimension). Softmax those scores into a probability distribution over all tokens. Now sum every token's value, weighted by that token's softmax weight. The result is a new representation for this token, blended from the parts of the sentence it found relevant.
Do this for every token in parallel. The whole layer reduces to a handful of large matrix multiplications, which is why it runs on a GPU in milliseconds even for thousands of tokens.
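The steps above can be sketched in plain Python. This is a minimal illustration, not an optimised implementation: it assumes a single head, uses lists instead of tensors, and feeds the same toy vectors in as queries, keys and values (a real model derives each from the embedding through learned projections). The names `softmax`, `dot` and `self_attention` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this token's query with every token's key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # new representation: weighted sum of all value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy input: three tokens with 2-dimensional vectors (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)  # one blended vector per token
```

Each output vector is a weighted average of the value vectors, so it always lies between the smallest and largest value in each dimension.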
Multiple heads, multiple conversations
A single attention computation learns one "way of looking." But a sentence has many simultaneous structures (syntactic, semantic, coreference, sentiment). So transformers run attention in parallel several times per layer ("heads"), each with its own learned Q/K/V projections. One head might track "what is the subject of this verb?" Another might track "what is this pronoun referring to?" Another might track "is this a question?"
Large production transformers typically have somewhere between 16 and 128 attention heads per layer, and somewhere between 30 and 120 layers stacked on top of each other. Multiply the two and a single token passes through hundreds or thousands of separate attention computations on its way through the model.
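A rough sketch of how several heads run side by side, again in plain Python. The assumptions here: the projection matrices are random rather than learned, the sizes (`d_model = 4`, two heads) are toy values, and the final output projection that real models apply after concatenation is left out. All function names are illustrative.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    per_head = []
    for w_q, w_k, w_v in heads:
        # each head projects the same tokens with its own matrices
        q = [matvec(w_q, tok) for tok in x]
        k = [matvec(w_k, tok) for tok in x]
        v = [matvec(w_v, tok) for tok in x]
        per_head.append(attention(q, k, v))
    # concatenate the head outputs token by token
    return [sum((head[t] for head in per_head), []) for t in range(len(x))]

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
heads = [tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
         for _ in range(n_heads)]
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
y = multi_head_attention(x, heads)  # 3 tokens, each d_model wide again
```

Because each head works in a smaller space (`d_head = d_model // n_heads`), running several heads costs about the same as running one full-width head.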
The full transformer block
A transformer is a stack of identical blocks. Each block does two things:
- Self-attention layer: every token attends to every other token and updates its representation.
- Feed-forward layer: each token's representation passes through a private neural network (the same network for every token, applied independently, with no mixing between tokens). This is where most of the model's raw "knowledge" lives, parameter-wise.
Both layers have residual connections (the input is added back to the output) and a normalisation step. The residual connection is what lets you stack 100+ layers without the signal degrading. Without it, deep transformers don't train.
# one transformer block, conceptually
def block(x):
    x = x + multi_head_attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x))
    return x
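To make the role of the residual connection concrete, here is a runnable variant of the same block. It is a sketch under assumptions: the layer norm has no learned scale or shift, and `zero_attention` and `zero_ffn` are hypothetical stand-ins that contribute nothing, used only to show that the input passes straight through the residual path.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalise one token vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    """Element-wise sum: the residual connection."""
    return [x + y for x, y in zip(a, b)]

def block(x, attention_fn, feed_forward_fn):
    """Pre-norm block over a list of token vectors.

    attention_fn maps a list of vectors to a list of vectors (mixes tokens);
    feed_forward_fn maps one vector to one vector (applied per token).
    """
    normed = [layer_norm(tok) for tok in x]
    x = [add(tok, upd) for tok, upd in zip(x, attention_fn(normed))]
    x = [add(tok, feed_forward_fn(layer_norm(tok))) for tok in x]
    return x

# Hypothetical stand-in sub-layers that output all zeros.
def zero_attention(tokens):
    return [[0.0] * len(tok) for tok in tokens]

def zero_ffn(tok):
    return [0.0] * len(tok)

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = block(x, zero_attention, zero_ffn)  # identical to x
```

Each sub-layer only has to learn a correction to its input, and the input itself always has a direct route to the next block.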
Where does position come in?
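Attention itself has no notion of order: shuffle the input tokens and the outputs are simply shuffled the same way (formally, it is permutation-equivariant). Sentences clearly do care about order ("dog bites man" vs "man bites dog"). So transformers inject position information separately. The original 2017 paper added a fixed sinusoidal positional encoding to each token's embedding at the input. Many recent models instead use rotary positional embeddings (RoPE), which are not added at the input but rotate the query and key vectors inside each attention layer by an angle that depends on position. Either way, the model learns to use that signal to encode order.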
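The fixed sinusoidal scheme is short enough to write out. The sketch below follows the formula from the 2017 paper (sine on even dimensions, cosine on odd ones, with wavelengths growing geometrically up to a base of 10000) and assumes an even `d_model`. The function names and the toy embedding values are illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Positional encoding vector for one position (d_model assumed even)."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc

def add_positions(embeddings):
    """Add the encoding for each position to that position's embedding."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(emb, sinusoidal_encoding(pos, d_model))]
            for pos, emb in enumerate(embeddings)]

# The same toy embedding repeated at three positions.
same_word = [[0.5, 0.5, 0.5, 0.5]] * 3
encoded = add_positions(same_word)
# encoded[0] is [0.5, 1.5, 0.5, 1.5]; later positions get different vectors
```

After the addition, identical words at different positions no longer look identical to the attention layers, which is the whole point.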
Causal masking: making it a language model
A language model predicts the next token given previous tokens. So when computing attention for token *t*, it must not be allowed to see tokens *t+1, t+2…*, as that would be cheating. The fix is a causal mask: set the attention scores from each token to all future tokens to negative infinity before the softmax, so those tokens end up with exactly zero weight. The model is forced to predict only from the past.
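A small sketch of the mask in action. One common way to implement it, assumed here, is to replace the scores for future positions with negative infinity before the softmax, which makes their weights come out as exactly zero. The function name `causal_softmax` and the uniform toy scores are illustrative.

```python
import math

def causal_softmax(scores):
    """scores[t][s] is the raw score from token t to token s.

    Returns attention weights in which every future position (s > t)
    has weight zero.
    """
    weights = []
    for t, row in enumerate(scores):
        # hide the future: exp(-inf) is 0, so these positions drop out
        masked = [s if j <= t else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # always finite: a token can see itself
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores for three tokens.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
w = causal_softmax(scores)
# w[0] == [1.0, 0.0, 0.0]  (the first token sees only itself)
# w[1] == [0.5, 0.5, 0.0]
# w[2] spreads its weight evenly over all three tokens
```

The weight matrix is lower-triangular: row *t* has non-zero entries only for positions up to and including *t*.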
This is the difference between GPT-family models (causal, decoder-only) and BERT-family models (bidirectional, encoder-only, no causal mask). GPT-style models are built to generate text; BERT-style models are built to encode text for tasks such as classification. Most modern LLMs are decoder-only, largely because a model that generates well turns out to handle most understanding tasks too.
Why this scales so well
Every token attending to every token is O(n²) in sequence length. For n=4000 tokens that's 16 million dot products per head per layer. Big, but trivially parallel on a GPU. The transformer's computational pattern (massive matrix multiplications) is exactly what modern accelerators are built for. That's why it uses hardware so efficiently, and a large part of why scaling it up has kept paying off.
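The arithmetic is easy to check. The helper below just counts query-key dot products for full self-attention. The 32-head, 32-layer configuration in the last line is a hypothetical example, not a description of any particular model.

```python
def attention_dot_products(n_tokens, n_heads=1, n_layers=1):
    """Query-key dot products for full self-attention: n^2 per head per layer."""
    return n_tokens ** 2 * n_heads * n_layers

per_head_4k = attention_dot_products(4000)   # 16_000_000
per_head_8k = attention_dot_products(8000)   # 64_000_000
# Hypothetical model with 32 heads and 32 layers:
whole_model_4k = attention_dot_products(4000, n_heads=32, n_layers=32)  # 16_384_000_000
```

Doubling the sequence length quadruples the count, which is the quadratic growth in practice.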
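The quadratic cost is also why context windows are hard to grow. Much of the engineering effort behind "1M token context" models goes into taming it, either with approximations (sparse attention, sliding windows, linear-attention variants) or with exact optimisations such as KV caching, which avoids recomputing the keys and values of earlier tokens during generation.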
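Of the techniques listed, KV caching is the simplest to sketch. During generation each new token only needs its own query plus the keys and values of everything before it, so those can be stored rather than recomputed. The sketch below assumes one head and one layer, and uses the same toy vector as query, key and value. The names `attend` and `KVCache` are illustrative.

```python
import math

def attend(query, keys, values):
    """Attention output for a single query over the given keys and values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Keys and values of the tokens seen so far (one head, one layer)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Store the new token's key/value, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors, q = k = v

# Generation with a cache: one step per new token.
cache = KVCache()
cached = [cache.step(t, t, t) for t in tokens]

# Reference: rebuild the key/value lists from scratch at every step.
recomputed = [attend(tokens[i], tokens[:i + 1], tokens[:i + 1])
              for i in range(len(tokens))]
```

In this sketch `cached` and `recomputed` are identical, since the cache changes how the work is organised and not the result. The price is memory: the cache grows by one key and one value per token, per head, per layer.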
In one line each
- Self-attention lets every token decide which other tokens it cares about, in parallel, all at once.
- Multi-head attention runs many parallel "ways of looking"; the feed-forward layers store most of the raw knowledge.
- Residual connections + layer norm are what make 100+ layer transformers trainable.
- Position is injected separately because attention alone has no notion of order; the KV cache is what keeps token-by-token generation affordable.
Where to go next