In Depth

Flash Attention, introduced by Tri Dao and collaborators in 2022, is an exact attention algorithm that achieves roughly 2-4x speedup over standard implementations by rethinking how attention is computed at the hardware level. Rather than materializing the full attention matrix in GPU high-bandwidth memory (HBM), it computes attention in tiles that fit in the much faster on-chip SRAM, dramatically reducing memory reads and writes.

The key insight is that attention computation is memory-bandwidth-bound, not compute-bound: standard implementations spend most of their time moving large intermediate matrices between slow HBM and fast SRAM. Flash Attention restructures the computation to minimize these data transfers, using an online softmax that processes the key/value sequence tile by tile while maintaining running statistics, so it produces exactly the same result with far fewer memory operations. FlashAttention-2 further improves parallelism and work partitioning across thread blocks and warps, and FlashAttention-3 exploits newer hardware features such as asynchrony on Hopper GPUs.
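The tiling trick can be illustrated in plain NumPy. This is a minimal sketch of the online-softmax idea, not the actual CUDA kernel: the function names, the tile size, and the single-head, unbatched shapes are all illustrative assumptions. It shows that processing keys and values one tile at a time, while keeping only a running max, running denominator, and running output per query row, yields exactly the same result as materializing the full attention matrix.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=16):
    # Flash-style pass: visit K/V in tiles, never storing the N x N matrix.
    # Per query row we keep only O(1) state: running max m, running
    # softmax denominator l, and the unnormalized output accumulator O.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row max (for numerical stability)
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale                 # N x tile scores for this tile
        m_new = np.maximum(m, S.max(axis=-1))  # updated running max
        corr = np.exp(m - m_new)               # rescales previous accumulators
        P = np.exp(S - m_new[:, None])         # tile-local unnormalized probs
        l = l * corr + P.sum(axis=-1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

In the real kernel, each tile of scores lives in SRAM and is discarded after updating the accumulators, which is where the bandwidth savings come from; the rescaling by `corr` is what makes the incremental softmax exact rather than approximate.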

Flash Attention has become essential infrastructure for training and serving large language models. Because it never stores the full attention matrix, attention's activation memory grows linearly rather than quadratically with sequence length, enabling longer context windows alongside faster training and more efficient inference. It is now the default attention implementation in most major frameworks and is integrated into libraries like PyTorch, Hugging Face Transformers, and vLLM.