In Depth
Standard self-attention in transformers computes relationships between every pair of positions in the input, resulting in computational cost that grows quadratically with sequence length. Sparse attention addresses this by restricting each position to attend to only a subset of other positions, typically using local windows, strided connections, or learned sparsity patterns.
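These patterns are usually expressed as boolean masks over the attention matrix. As a minimal sketch (the function names here are illustrative, not from any particular library), a local window mask keeps a band around the diagonal, while a strided mask keeps every position in the same residue class:

```python
import numpy as np

def local_mask(n, window):
    # True where position i may attend to position j: a band of width
    # 2*window + 1 around the diagonal (local / sliding-window pattern).
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def strided_mask(n, stride):
    # Each position attends to positions with the same index modulo
    # `stride` (a strided pattern, as in block-sparse transformers).
    idx = np.arange(n)
    return (idx[:, None] % stride) == (idx[None, :] % stride)
```

In practice such masks are applied by setting the masked-out attention logits to negative infinity before the softmax, so excluded positions receive zero weight.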
Variants include Longformer (combining local and global attention), BigBird (adding random attention connections), and block-sparse approaches. These methods reduce the O(n^2) complexity of full attention to O(n log n) or even O(n), enabling models to process much longer sequences with the same computational resources.
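To see where the savings come from, consider the local-window case: each of the n positions computes scores against only a fixed-size neighborhood, so the cost is O(n · w) for window radius w rather than O(n^2). A self-contained sketch (assuming single-head, unbatched inputs for clarity):

```python
import numpy as np

def local_attention(q, k, v, window=2):
    # Sliding-window sparse attention: each position attends only to
    # positions within `window` steps on either side, so total cost is
    # O(n * window) instead of the O(n^2) of full attention.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # scaled dot-product
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax over the window
        out[i] = weights @ v[lo:hi]
    return out
```

With `window` at least n, this reduces to ordinary full attention; production implementations vectorize the band computation rather than looping per position.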
Sparse attention has been critical for extending context windows in large language models and for applying transformers to domains with very long sequences, such as genomics, long document processing, and high-resolution image generation. The trade-off is that sparse patterns may miss long-range dependencies that full attention would capture, so the sparsity pattern must be chosen carefully for each application.