In Depth

The attention score between two tokens is the dot product of one token's query vector with the other's key vector, scaled by the square root of the key dimension. A softmax over all positions turns these scores into a probability distribution, which is then used to take a weighted sum of the value vectors. Multi-head attention runs several such attention operations in parallel, each with its own learned query, key, and value projections, so different heads can capture different types of relationships. Because every token attends to every other token, full attention costs O(n²) in the sequence length n, which drives ongoing research into efficient attention approximations.
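
The sketch below is a minimal NumPy illustration of these steps, not a production implementation: the projection matrices are random stand-ins for learned weights, and the function and variable names (scaled_dot_product_attention, W_q, W_k, W_v, W_o) are chosen here for clarity. Note that the scores matrix has shape (n, n), which is exactly the quadratic cost mentioned above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) pairwise query-key scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional model, 2 heads with 4-dim projections.
rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(n, d_model))

# One projection triple (W_q, W_k, W_v) per head; random stand-ins for trained weights.
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v))

# Concatenate the head outputs and mix them with a final output projection.
W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o
print(output.shape)  # (4, 8)
```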