In Depth

vLLM is an open-source library for high-throughput, memory-efficient LLM inference and serving, originally developed at UC Berkeley. Its key innovation is PagedAttention, which manages KV cache memory the way an operating system manages virtual memory: the cache is split into fixed-size blocks that can live in non-contiguous memory pages, eliminating the fragmentation and over-reservation that come with traditional contiguous allocation.
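The idea can be sketched in a few lines. This is a deliberately simplified model, not vLLM's real internals: the class, method names, and the pool size are invented for illustration, and only the block size (16 tokens) matches vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class PagedKVCache:
    """Toy PagedAttention-style allocator: each request gets a block table
    mapping its logical blocks to whichever physical blocks were free."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens written so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # last block is full (or this is token 0)
            table.append(self.free_blocks.pop(0))
        self.num_tokens[request_id] = n + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        del self.num_tokens[request_id]

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(16):
    cache.append_token("a")   # request "a" fills physical block 0
for _ in range(5):
    cache.append_token("b")   # request "b" takes block 1
cache.append_token("a")       # "a" grows again and gets block 2
print(cache.block_tables)     # "a" maps to non-contiguous blocks [0, 2]
```

Because blocks are handed out on demand, the only wasted memory is the unfilled tail of each request's last block (at most `BLOCK_SIZE - 1` token slots), rather than a large contiguous region reserved up front for the maximum possible sequence length.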

PagedAttention reduces KV cache memory waste from the 60-80% typical of contiguous allocation to near zero, allowing vLLM to serve 2-4x more concurrent requests than naive implementations with the same GPU memory. vLLM also implements continuous batching (new requests join the running batch at each decoding step instead of waiting for the whole batch to finish), prefix caching (reusing the KV cache for prompt prefixes shared across requests), and speculative decoding.
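The benefit of continuous batching over static batching can be shown with a toy scheduler. The request lengths, batch capacity, and step counting here are invented for the comparison; real schedulers also account for memory pressure and prefill vs. decode phases.

```python
def static_batching(lengths, capacity):
    """Static batching: each batch of requests runs until its longest
    member finishes, so short requests wait on long ones."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_batching(lengths, capacity):
    """Continuous batching: a finished request's slot is refilled from
    the waiting queue at the very next decoding step."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < capacity:
            running.append(waiting.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

# One long request plus three short ones, two GPU slots:
lengths = [100, 10, 10, 10]
print(static_batching(lengths, capacity=2))      # 110 steps
print(continuous_batching(lengths, capacity=2))  # 100 steps
```

In the static case the second batch cannot start until the 100-token request finishes; in the continuous case the three short requests all slip through the second slot while the long one is still running.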

vLLM has become the de facto standard for self-hosting open-weight language models in production. It supports most popular model architectures (Llama, Mistral, Phi, Falcon, etc.), provides an OpenAI-compatible API for easy migration from hosted services, and supports quantization formats such as GPTQ and AWQ for reduced-memory deployment. Its combination of high throughput, ease of use, and active open-source development has made it among the most widely used LLM serving frameworks.
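Because the server speaks the OpenAI chat-completions protocol, existing client code typically needs only a changed base URL. The sketch below builds a request payload in that shape; the model name and port are placeholders, and the actual HTTP call is left commented out since it requires a running server (started with something like `vllm serve <model>`).

```python
import json

BASE_URL = "http://localhost:8000/v1"  # placeholder address for a local vLLM server

# Standard OpenAI-style chat-completions payload; the model name here is
# a placeholder for whatever model the server was launched with.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize PagedAttention."}],
    "max_tokens": 128,
    "temperature": 0.2,
}
body = json.dumps(payload).encode()

# With a server running, this could be POSTed to the chat endpoint, or the
# official `openai` client could be pointed at BASE_URL instead:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions", data=body,
#     headers={"Content-Type": "application/json"})
print(len(body))
```

The same compatibility also extends to the `/v1/completions` and `/v1/models` endpoints, which is what makes migrating an existing OpenAI-based application largely a configuration change.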