What It Is

The transformer is a neural network architecture introduced in the 2017 Google paper "Attention Is All You Need." It replaced recurrent architectures (RNNs, LSTMs) for sequence processing tasks and has since become the foundation of nearly every frontier AI system — large language models like GPT-4 and Claude, vision models like ViT, and multimodal AI systems that process text, images, and audio simultaneously.

The key innovation is the self-attention mechanism, which allows every element in a sequence to attend to every other element, capturing long-range dependencies without the sequential bottleneck of recurrent networks. This parallelism also makes transformers highly efficient on modern AI chips like GPUs and TPUs, which excel at matrix multiplication.

How Self-Attention Works

Self-attention computes relationships between all pairs of elements in a sequence. For each element (a token in text, a patch in an image), the mechanism generates three vectors: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?).

Attention scores are computed by taking the dot product of each query with all keys, scaling by the square root of the dimension, and applying a softmax to get normalized weights. These weights determine how much each element attends to every other element. The output is a weighted sum of value vectors.
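The scaled dot-product attention described above can be sketched in a few lines of NumPy. This is a minimal illustration with toy shapes, not production code; the function name and dimensions are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key scores, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: rows sum to 1
    return weights @ V, weights                     # output is a weighted sum of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                                         # 4 tokens, dimension 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` holds one token's attention distribution over all tokens, which is why it must sum to 1.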

Multi-head attention runs this process multiple times in parallel with different learned projections, allowing the model to attend to different types of relationships simultaneously — one head might capture syntactic structure while another captures semantic similarity.
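A sketch of multi-head attention, assuming single-sequence NumPy arrays and randomly initialized projection matrices (in a real model `Wq`, `Wk`, `Wv`, `Wo` are learned):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Project x into per-head Q/K/V, attend per head in parallel, concat, project."""
    n, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Reshape to (num_heads, n, d_head) so every head attends independently.
    split = lambda t: t.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax within each head
    heads = w @ Vh                                  # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                              # output projection

rng = np.random.default_rng(0)
n, d_model, H = 5, 16, 4
x = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, H)
```

Because each head gets its own slice of the projections, the different heads can learn different attention patterns at no extra cost in output dimensionality.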

The computation scales quadratically with sequence length (every token attends to every other token), which creates challenges for very long sequences. Solutions include sparse attention, sliding window attention (as in Mistral), and linear attention approximations.
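Sliding-window attention, one of the mitigations mentioned above, amounts to a band-shaped attention mask. A minimal sketch of such a mask (the function name is illustrative):

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where attention is allowed: each token attends only to itself
    and the previous (window - 1) tokens, so cost grows as O(n * window)."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (i - j < window)

m = sliding_window_mask(6, 3)
```

Stacking layers lets information still propagate beyond the window, since each layer extends the effective receptive field by `window - 1` positions.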

Architecture Components

A standard transformer consists of stacked layers, each containing:

Self-attention block — computes attention as described above, with residual connections and layer normalization. Decoder-only models (GPT, Claude) use causal masking so each token can only attend to previous tokens — this prevents information leakage during autoregressive generation.
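Causal masking can be sketched by setting every future-position score to negative infinity before the softmax, so those positions receive exactly zero weight. A toy NumPy illustration:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask positions above the diagonal to -inf before softmax, so
    token i can only attend to positions <= i (autoregressive setting)."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    masked = np.where(future, -np.inf, scores)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# With all-zero scores, each token attends uniformly over its visible prefix.
w = causal_attention_weights(np.zeros((4, 4)))
```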

Feed-forward network (FFN) — a two-layer MLP applied independently to each position. Recent research suggests FFN layers store factual knowledge, while attention layers handle reasoning and information routing. The FFN typically expands the hidden dimension by 4x, applies an activation function (GeLU or SwiGLU), then projects back down.
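The position-wise FFN with a 4x expansion and GeLU can be sketched as follows (using the tanh approximation of GeLU popularized by GPT-2; weights here are random stand-ins for learned parameters):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Applied independently at each position: expand 4x, nonlinearity, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, n = 8, 3
W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.1   # expansion to 4 * d_model
b1 = np.zeros(4 * d_model)
W2 = rng.standard_normal((4 * d_model, d_model)) * 0.1   # projection back to d_model
b2 = np.zeros(d_model)
x = rng.standard_normal((n, d_model))
y = feed_forward(x, W1, b1, W2, b2)
```

Note that no information moves between positions here; all token mixing happens in the attention block.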

Positional encoding — since self-attention is permutation-invariant (it doesn't inherently know word order), transformers add positional information. Original transformers used sinusoidal encodings. Modern models use Rotary Position Embeddings (RoPE), which encode relative positions and generalize to sequence lengths beyond training.
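The core idea of RoPE, sketched below, is to rotate consecutive dimension pairs of each query/key vector by an angle proportional to the token's position; dot products between rotated vectors then depend only on relative position. This is a simplified single-sequence illustration, not a faithful reimplementation of any particular library:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate dimension pairs (0,1), (2,3), ... of each row of x by a
    position-dependent angle; frequencies follow the sinusoidal schedule."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                   # (n, 1) token positions
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = pos * freqs                          # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # standard 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
x_rot = rope(x)
```

Because each pair is rotated (not scaled), vector norms are preserved, and only the angles between queries and keys change with position.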

Transformer Variants

Encoder-only — processes the full input bidirectionally. BERT is the classic example. Used for classification, named entity recognition, and embedding generation. Each token attends to all other tokens.

Decoder-only — processes tokens autoregressively, predicting the next token. GPT, Claude, LLaMA, and most modern LLMs use this architecture. Simpler to scale and surprisingly effective at understanding tasks despite being trained only to predict the next token.

Encoder-decoder — the original transformer design. The encoder processes the input; the decoder generates the output while attending to encoder representations. T5 and BART use this structure. Common for translation and summarization.

Vision Transformers (ViT) — apply transformers to images by splitting them into patches (typically 16x16 pixels), embedding each patch as a token, and processing with standard transformer layers. ViTs now match or exceed CNNs on computer vision benchmarks.
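The patch-splitting step for a ViT can be sketched with a pair of reshapes. For a standard 224x224 RGB image and 16x16 patches this yields 14 x 14 = 196 tokens of dimension 16 * 16 * 3 = 768 (the function name is illustrative; a real ViT follows this with a learned linear embedding of each patch):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patches,
    each flattened to a vector of length patch * patch * C."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)       # (num_patches, patch_dim)

img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)
```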

Scaling and Efficiency

Transformers exhibit predictable scaling laws — performance improves smoothly with more parameters, more training data, and more compute. This predictability, demonstrated by Kaplan et al. (2020) and refined by Chinchilla (2022), is why organizations invest billions in training larger models.

Mixture of Experts (MoE) — routes each token to a subset of specialized FFN layers ("experts"), allowing models to have many more parameters without proportional compute costs. Mixtral 8x7B has 47 billion parameters but only activates 13 billion per token.
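Top-k expert routing can be sketched as a gating network that picks the best k experts per token and mixes their outputs; only the chosen experts run, which is where the compute savings come from. A toy illustration with random weights and simple ReLU-FFN experts (not any production router):

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Route each token to its top-k experts; output is the
    gate-weighted sum of the chosen experts' FFN outputs."""
    logits = x @ gate_W                               # (n_tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]        # indices of the k best experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = logits[i, topk[i]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                          # softmax over the chosen experts only
        for g, e in zip(gates, topk[i]):
            W1, W2 = experts[e]
            out[i] += g * (np.maximum(token @ W1, 0) @ W2)  # run only selected experts
    return out

rng = np.random.default_rng(0)
n, d, num_experts = 4, 8, 4
x = rng.standard_normal((n, d))
gate_W = rng.standard_normal((d, num_experts))
experts = [(rng.standard_normal((d, 4 * d)) * 0.1,
            rng.standard_normal((4 * d, d)) * 0.1) for _ in range(num_experts)]
y = moe_layer(x, gate_W, experts)
```

With k = 2 of 4 experts, only half the expert parameters touch each token, mirroring how Mixtral activates a fraction of its total weights per token.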

Efficient attention — Flash Attention computes exact attention in tiles held in fast on-chip memory, never materializing the full n x n attention matrix, which cuts memory use and enables longer sequences. Grouped Query Attention (GQA) shares key-value heads across query heads, shrinking the key-value cache during inference.
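The GQA sharing pattern can be sketched directly: with 8 query heads and a group size of 4, only 2 key-value heads are stored, a 4x reduction in the KV cache. This is a shape-level illustration, not an optimized implementation:

```python
import numpy as np

def grouped_query_attention(Q, K, V, group_size):
    """Q has num_q heads; K and V have num_q // group_size heads.
    Each KV head is shared by group_size query heads."""
    num_q, n, d = Q.shape
    outputs = []
    for h in range(num_q):
        kv = h // group_size                          # which shared KV head this query head uses
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V[kv])
    return np.stack(outputs)                          # (num_q, n, d)

rng = np.random.default_rng(0)
num_q, group, n, d = 8, 4, 5, 16
Q = rng.standard_normal((num_q, n, d))
K = rng.standard_normal((num_q // group, n, d))       # only 2 KV heads stored
V = rng.standard_normal((num_q // group, n, d))
out = grouped_query_attention(Q, K, V, group)
```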

Quantization — reducing weight precision from FP16 to INT8 or INT4 cuts memory use by 2x or 4x respectively and increases inference speed, with minimal accuracy loss for well-calibrated models.
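A minimal sketch of symmetric per-tensor INT8 quantization: each float weight is mapped to an 8-bit integer times a single scale, so the reconstruction error per weight is at most half the scale. Real quantization schemes (per-channel scales, calibration) are more elaborate:

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor quantization: W ~= q * scale, with q in int8."""
    scale = np.abs(W).max() / 127.0                   # map the largest weight to +/-127
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(W)
err = np.abs(dequantize(q, s) - W).max()              # bounded by scale / 2 (rounding)
```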

Impact and Legacy

The transformer has unified AI. Before 2017, different architectures dominated different domains — CNNs for vision, RNNs for language, specialized models for speech. Transformers now achieve state-of-the-art results across all modalities. This architectural convergence has enabled multimodal AI systems that process everything through a shared transformer backbone.

The architecture also enabled the era of foundation models — large pre-trained models that can be adapted to thousands of downstream tasks through fine-tuning or prompt engineering.

Challenges

  • Quadratic attention cost — self-attention scales as O(n^2) with sequence length, limiting context windows. Processing a million-token context requires fundamentally different approaches than short sequences.
  • Training instability — very large transformers can experience loss spikes, divergence, and other training instabilities. Mitigations (careful learning rate schedules, gradient clipping, architecture tweaks) are partly empirical.
  • Interpretability — understanding what transformer layers learn and how attention patterns relate to model behavior remains an active research area. See explainable AI.
  • Compute requirements — frontier transformers require thousands of GPUs training for months. This concentrates capability in well-funded organizations.
  • Architectural plateau — despite extensive research, no architecture has convincingly surpassed the transformer for general-purpose AI since 2017. State space models (Mamba) show promise for specific tasks but haven't displaced transformers broadly.