In Depth
Speculative decoding addresses a fundamental bottleneck in language model inference: large models generate text one token at a time, and each token requires a full forward pass through the model. This serializes generation, because the computation for each token cannot begin until the previous token has been sampled. Speculative decoding uses a small, fast 'draft' model to predict several tokens ahead, then verifies all of them with the large model in a single parallel pass.
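The draft-then-verify loop can be sketched with a toy example. Both "models" below are hypothetical stand-in functions that map a prefix to a next token (a real system would run neural networks and batch the verification into one forward pass); the phrases and the draft length `k` are illustrative assumptions, and the matching rule shown is the simple greedy variant, where a drafted token is accepted only if it equals the target model's own prediction.

```python
def draft_model(prefix):
    # Fast approximate model: close to the target, but not identical (toy assumption).
    phrase = "the quick brown fox jumps over the lazy cat".split()
    return phrase[len(prefix)] if len(prefix) < len(phrase) else "<eos>"

def target_model(prefix):
    # Slow exact model: its output is what decoding must reproduce token-for-token.
    phrase = "the quick brown fox jumps over the lazy dog".split()
    return phrase[len(prefix)] if len(prefix) < len(phrase) else "<eos>"

def speculative_decode(k=4, max_len=16):
    tokens = []
    target_passes = 0
    while len(tokens) < max_len and (not tokens or tokens[-1] != "<eos>"):
        # 1. Draft k tokens autoregressively with the cheap model.
        drafted = []
        for _ in range(k):
            drafted.append(draft_model(tokens + drafted))
        # 2. One verification pass: the target model scores all k drafted
        #    positions at once (simulated here by a list comprehension,
        #    counted as a single pass).
        target_passes += 1
        verified = [target_model(tokens + drafted[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix where draft and target agree, then
        #    append the target's own next token, which is always correct.
        n = 0
        while n < k and drafted[n] == verified[n] and drafted[n] != "<eos>":
            n += 1
        tokens.extend(drafted[:n])
        tokens.append(verified[n])
        if "<eos>" in tokens:
            tokens = tokens[:tokens.index("<eos>")]
            break
    return tokens, target_passes

tokens, passes = speculative_decode()
print(" ".join(tokens), "| target passes:", passes)
```

Because the two toy models agree on most positions, the nine-token output needs only three target-model passes instead of nine, which is the entire source of the speedup.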
The verification step ensures that the output is identical in distribution to what the large model would have generated on its own. If the draft model's predictions are accepted, the large model effectively generates multiple tokens in the time it would normally take for one. The speedup depends on how well the draft model approximates the large model; typical acceptance rates of 60-80% translate to 2-3x throughput improvements.
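The distributional guarantee comes from a rejection-sampling rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the draft model's; on rejection, a replacement token is sampled from the residual distribution max(0, p - q), renormalized. The sketch below demonstrates this rule for a single position over a toy three-token vocabulary; the distributions `p` and `q` are made-up numbers for illustration.

```python
import random

def accept_or_resample(token, p, q, rng):
    """Accept a drafted token with probability min(1, p/q); otherwise
    resample from the residual distribution max(0, p - q), renormalized.
    The accepted output is then distributed exactly according to p."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    r = rng.random() * total
    for i, w in enumerate(residual):
        r -= w
        if r <= 0:
            return i
    return len(p) - 1  # guard against floating-point underflow

p = [0.6, 0.3, 0.1]  # target model's next-token distribution (assumed)
q = [0.3, 0.5, 0.2]  # draft model's next-token distribution (assumed)
rng = random.Random(0)

# Draft tokens from q, filter through the rule, and tally the results:
# empirically the accepted tokens follow p, not q.
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    drafted = rng.choices(range(3), weights=q)[0]
    counts[accept_or_resample(drafted, p, q, rng)] += 1
print([c / n for c in counts])
```

The identity behind this is that q(x)·min(1, p(x)/q(x)) + P(reject)·residual(x) simplifies to min(p, q) + max(0, p - q) = p, so no accuracy is traded away: only speed changes.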
Speculative decoding is particularly valuable for production deployment of large language models, where latency directly impacts user experience and cost. It requires no changes to model weights and produces identical outputs, so the only price is the extra compute and memory spent running the draft model. Many inference engines, including vLLM and TensorRT-LLM, now support speculative decoding as a standard optimization.