In Depth
In autoregressive text generation, a transformer model generates tokens one at a time. Without caching, each new token would require recomputing the attention keys and values for all previous tokens, so the cost of each step grows linearly with sequence length and the total cost of generation grows quadratically. The KV (key-value) cache stores these computed tensors, so only the new token's keys and values need to be calculated at each step.
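The decode loop can be sketched as follows. This is a minimal single-head attention step with a KV cache; the dimensions and random projection weights are illustrative assumptions, not any particular model's configuration.

```python
import numpy as np

d = 8                                  # head dimension (made-up for the sketch)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []              # grows by one entry per generated token

def decode_step(x):
    """Attend the new token's query over all cached keys/values."""
    q = x @ W_q
    k_cache.append(x @ W_k)            # only the NEW token's K and V are computed
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)              # (seq_len, d) -- all past keys, from cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)        # one dot product per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token

for _ in range(5):                     # generate 5 tokens
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))                    # cache holds one key per token so far
```

Each step does O(seq_len) work against the cache instead of recomputing every past key and value from scratch.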
The KV cache trades memory for computation. For large models with long context windows, the cache can consume enormous amounts of GPU memory. For example, a 70B parameter model with a 32K context window might need 10+ GB of KV cache per request. This memory pressure is one of the primary constraints on how many concurrent requests a model server can handle.
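The arithmetic behind a figure like that is straightforward. The sketch below uses a hypothetical 70B-class configuration (80 layers, 8 grouped KV heads, head dimension 128, fp16 values); the specific numbers are assumptions chosen to match the ballpark above, not a published model spec.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size: 2x for keys and values, per layer,
    per KV head, per cached position, at the given element width."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config at a 32K context window, fp16 (2 bytes):
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB per request")  # → 10.0 GiB per request
```

Note that the cache scales linearly with context length and with concurrent requests, which is why a server's batch capacity is often memory-bound rather than compute-bound.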
Optimizing KV cache management has become a critical area in LLM serving. Techniques include PagedAttention (used in vLLM) for efficient memory allocation, multi-query and grouped-query attention to reduce cache size, KV cache quantization to compress cached values, and prefix caching to share common prompt prefixes across requests. These optimizations directly impact the cost and performance of serving language models at scale.
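As one example from that list, prefix caching can be sketched as a lookup keyed on shared prompt prefixes. This is a toy illustration under stated assumptions: real systems such as vLLM hash fixed-size blocks of tokens rather than whole prefixes, and `compute_kv` here is a stand-in for the actual per-token key/value computation.

```python
prefix_cache = {}  # maps a token tuple to its list of per-token KV entries

def get_kv(prompt_tokens, compute_kv):
    """Return KV entries for prompt_tokens, reusing the longest cached prefix."""
    best_key, best_len = None, 0
    for key in prefix_cache:
        n = 0
        while n < min(len(key), len(prompt_tokens)) and key[n] == prompt_tokens[n]:
            n += 1
        if n > best_len:
            best_key, best_len = key, n
    cached = prefix_cache[best_key][:best_len] if best_key else []
    # Only the uncached suffix needs its keys/values computed.
    new_kv = cached + [compute_kv(t) for t in prompt_tokens[best_len:]]
    prefix_cache[tuple(prompt_tokens)] = new_kv
    return new_kv

# Two requests sharing a 3-token system-prompt prefix:
calls = []
fake_kv = lambda t: calls.append(t) or t * 10   # records which tokens get computed
get_kv([1, 2, 3, 7], fake_kv)                   # computes KV for all 4 tokens
get_kv([1, 2, 3, 9], fake_kv)                   # reuses the prefix, computes only 1
print(calls)                                    # → [1, 2, 3, 7, 9]
```

The second request recomputes keys and values only for the token where it diverges, which is the saving prefix caching delivers when many requests share a long system prompt.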