The attention mechanism is the core innovation that makes modern AI models like GPT-4 and Claude possible. It allows a model to focus on the most relevant parts of its input when producing each piece of output — similar to how you pay more attention to key words when reading a sentence.
The problem attention solves: Imagine translating "The cat sat on the mat" to French. When generating the French word for "cat," the model needs to focus heavily on "cat" in the English sentence and less on "on" or "the." Before attention, models tried to compress the entire input into a single fixed-size representation, which lost information — especially for long sequences.
How attention works mathematically:
For each token (word piece) in the sequence, the model computes three vectors:
- Query (Q): "What information am I looking for?"
- Key (K): "What information do I contain?"
- Value (V): "What information should I pass along?"
The attention score between two tokens is the dot product of one token's Query with another token's Key; a high dot product signals high relevance. In practice the scores are also divided by the square root of the key dimension, which keeps their magnitudes stable. The scaled scores are then normalized (using softmax) into weights that sum to 1 and used to form a weighted combination of Value vectors.
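The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrices are random toy data, and the dimensions (4 tokens, 8-dimensional vectors) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted combination of Values

# Toy example: 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out, w = attention(Q, K, V)
```

Each row of `w` is one token's attention distribution over all tokens, and `out` carries, for each token, a blend of the Value vectors weighted by that distribution.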
Self-attention is when a sequence attends to itself. In "The bank by the river was muddy," self-attention helps the model connect "bank" with "river" and "muddy" to determine that "bank" means riverbank, not a financial institution.
Multi-head attention runs multiple attention computations in parallel, each with different learned weight matrices. Different heads can capture different types of relationships — one head might track grammatical structure, another might track semantic similarity, another might track coreference (what "it" refers to).
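A compact sketch of multi-head attention follows, assuming (as is common but not required) that the model dimension splits evenly across heads; the weight matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads attention computations in parallel and merge the results."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (n_heads, n, d_head).
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    heads = softmax(scores) @ Vh                           # (n_heads, n, d_head)
    # Concatenate heads back to (n, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

# Toy usage: 6 tokens, model dimension 16, 4 heads of dimension 4 each.
rng = np.random.default_rng(1)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = rng.normal(size=(4, d_model, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Because each head works with its own slice of the projected vectors, the heads are free to specialize in the different relationship types described above.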
Types of attention in practice:
- Encoder self-attention: Every token attends to every other token (bidirectional). Used in models like BERT for understanding text.
- Decoder self-attention: Each token can only attend to previous tokens (masked/causal). Used in GPT-style models for text generation.
- Cross-attention: Output sequence attends to input sequence. Used in translation and image captioning.
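The masked (causal) variant used in decoders can be illustrated directly: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so each token receives zero weight from future tokens. The uniform scores here are a toy assumption to make the resulting weights easy to read.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.ones((n, n))  # toy scores: every pair equally relevant
# Causal mask: True above the diagonal marks "future" positions to block.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf    # exp(-inf) = 0, so these get zero attention weight
weights = softmax(scores)
```

With uniform scores, row `i` of `weights` spreads its attention evenly over tokens 0 through `i` and gives exactly zero to everything after, which is what lets GPT-style models generate text left to right.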
Why attention is computationally expensive: Computing attention between all pairs of tokens in a sequence of length N requires on the order of N² operations. For a context window of 100,000 tokens, that's 10 billion attention scores per layer. This is why longer context windows are more expensive and why researchers develop efficient attention variants like FlashAttention, sparse attention, and linear attention.
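A quick back-of-the-envelope check of the quadratic growth (counting only score-matrix entries, not the full arithmetic):

```python
def attention_pairs(n: int) -> int:
    """Number of (query, key) score entries for a sequence of length n."""
    return n * n

# Doubling the context length quadruples the number of pairwise scores.
small = attention_pairs(50_000)    # 2.5 billion
large = attention_pairs(100_000)   # 10 billion, matching the figure above
```

This quadratic scaling, repeated at every layer, is the cost that efficient-attention research aims to reduce.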
The practical impact: Attention is why modern AI models understand context so well. It's the difference between a model that understands "Apple released a new product" is about technology and "I ate an apple" is about fruit.