In Depth

In transformer models, multi-head attention divides the attention mechanism into multiple parallel 'heads,' each operating on a different learned projection of the input. Each attention head independently computes which parts of the input to focus on, and the results from all heads are concatenated and passed through a learned output projection. This allows the model to simultaneously attend to different aspects of the input: one head might track syntactic relationships while another captures semantic similarity.
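The mechanism above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: the function name, the single-matrix weight layout, and the absence of masking and batching are simplifications for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learned projections."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Project the input, then split the model dimension into heads:
    # (seq, d_model) -> (n_heads, seq, d_head)
    def project_and_split(W):
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = map(project_and_split, (Wq, Wk, Wv))

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    out = softmax(scores) @ v                            # (n_heads, seq, d_head)

    # Concatenate heads back into the model dimension and apply the
    # learned output projection
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ Wo
```

Each head sees only a `d_head`-sized slice of the projected representation, which is what lets different heads specialize without interfering with one another.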

The number of attention heads is a key architectural hyperparameter. GPT-style models range from around 12 heads in small models (GPT-2 small) to 96 in the largest GPT-3 variant, with the count generally growing alongside model width. Research has shown that different heads specialize in different linguistic patterns: some track subject-verb agreement, others handle coreference resolution, and others capture positional relationships. However, not all heads are equally important; some can be pruned with minimal performance impact.
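The per-head dimension is simply the model width divided by the head count, so the two hyperparameters are chosen together. The configurations below are drawn from publicly documented model cards and papers; treat them as illustrative of the arithmetic rather than an authoritative survey.

```python
# (d_model, n_heads) for a few published transformer configurations
configs = {
    "GPT-2 small": (768, 12),
    "GPT-3 175B": (12288, 96),
    "LLaMA-2 70B": (8192, 64),
}

for name, (d_model, n_heads) in configs.items():
    d_head = d_model // n_heads  # dimension each head operates in
    print(f"{name}: {n_heads} heads x {d_head} dims = {d_model}")
```

Notably, the per-head dimension stays in a narrow band (64 to 128) across three orders of magnitude of model size; scaling up mostly means adding heads, not widening them.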

Understanding attention heads is important for model interpretability, efficiency optimization, and architecture design. Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) keep the full set of query heads but shrink the number of key-value heads: MQA shares a single key-value head across all query heads, while GQA shares one per group of query heads. This significantly reduces KV cache memory during inference, and these variants have become standard in modern LLM architectures for their balance of quality and efficiency.
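The KV cache saving follows directly from the head counts, since the cache stores one key and one value vector per KV head, per layer, per token. The sketch below uses a hypothetical 32-query-head model at fp16 to show the scaling; the specific layer count and dimensions are assumptions for illustration, not a real model's configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch,
                   bytes_per_elem=2):
    """KV cache size: keys + values (factor of 2) cached for every
    layer, KV head, and token position, at fp16 by default."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical model: 32 layers, 32 query heads, d_head=128,
# 4096-token context, batch size 1
base = dict(n_layers=32, d_head=128, seq_len=4096, batch=1)
for name, n_kv in [("MHA (32 KV heads)", 32),
                   ("GQA (8 KV heads)", 8),
                   ("MQA (1 KV head)", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv, **base) / 2**30
    print(f"{name}: {gib:.3f} GiB")
```

Under these assumptions the cache shrinks from 2 GiB with full multi-head attention to 0.5 GiB with 8 KV-head GQA and about 64 MiB with MQA, a direct 4x and 32x reduction matching the KV-head ratio.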