Speculative Decoding
Plain Explanation
LLMs usually generate one token at a time, and each step calls the large target model again, so long answers become slow and expensive. Speculative decoding reduces that bottleneck by letting a smaller draft model, or extra speculative heads, propose several candidate tokens first. The target model then verifies the whole draft in a single pass, keeps the longest accepted prefix, replaces the first rejected token with one of its own, and continues from there.
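Below is a minimal sketch of one draft-then-verify step under greedy (exact-match) acceptance. `draft_next` and `target_next` are placeholder callables standing in for the small and large models; a real server scores all draft positions in a single batched target pass rather than token by token.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # placeholder: small model's next-token pick
    target_next: Callable[[List[int]], int],  # placeholder: target model's next-token pick
    draft_len: int = 4,
) -> List[int]:
    """Tokens appended to `prefix` in one speculative step (greedy acceptance)."""
    # 1. The drafter proposes `draft_len` tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(draft_len):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position; in production this is one
    #    batched forward pass over prefix + draft, not a per-token loop.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        verified = target_next(ctx)
        if verified == tok:        # match: the drafted token survives
            accepted.append(tok)
            ctx.append(tok)
        else:                      # first mismatch: keep the target's token, drop the rest
            accepted.append(verified)
            break
    else:
        # Every drafted token matched, so the same target pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted
```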
Examples & Analogies
- Typist and editor: a fast assistant writes a pencil draft, and the expert editor keeps only the part they would have written.
- Chat responses: common phrases often have high acceptance, so the system can advance multiple tokens per target pass.
- Code completion: boilerplate can benefit a lot, while uncertain or creative text may trigger more rejection.
At a Glance
| Method | Core idea | Benefit | Watch out |
|---|---|---|---|
| Standard decoding | Target model generates one token at a time | Simple and stable | Slow for long outputs |
| Speculative decoding | Drafter proposes, target verifies | Fewer target passes | Sensitive to acceptance rate |
| Assisted generation | Library-level assistant model path | Easier to try | Model pair choice matters |
| Multi-head speculation | Future-token heads inside one model | Less model orchestration | Training and serving integration needed |
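For the assisted-generation row, the quickest way to try this is the library-level path. A minimal sketch, assuming a recent Hugging Face transformers release; the model pair is illustrative, and the drafter and target must use compatible tokenizers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "facebook/opt-1.3b"    # illustrative target model
drafter_name = "facebook/opt-125m"   # illustrative drafter from the same family

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
drafter = AutoModelForCausalLM.from_pretrained(drafter_name)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
# Passing `assistant_model` switches generate() onto the assisted (speculative)
# path; keeping every other decoding setting identical makes A/B tests clean.
outputs = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```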
Where and Why It Matters
This is a serving optimization, not a way to make the model know more. In lossless variants, the final output distribution remains the target model’s distribution while the runtime tries to move several tokens forward per target-model pass. Real gains depend on acceptance rate, draft length, KV cache handling, and batch scheduling. It is most relevant for chatbots, IDE assistants, and agent loops where response latency is part of product quality.
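A back-of-envelope way to see the acceptance-rate dependence is the expected-tokens-per-pass formula from the speculative decoding paper, shown here under its simplifying assumption that each drafted token is accepted independently with probability alpha; real speedup is lower once the drafter's own cost and rollback overhead are subtracted.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """E[tokens advanced per target pass] = (1 - alpha**(gamma + 1)) / (1 - alpha),
    with draft length gamma and i.i.d. per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 2))
# 0.6 -> ~2.31, 0.8 -> ~3.36, 0.9 -> ~4.10 tokens per target pass instead of 1
```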
Common Misconceptions
- “The small model answers instead” → the small model drafts; the target model decides what survives.
- “It is always faster” → if the drafter is often wrong, verification and rollback erase the benefit.
- “Quality necessarily drops” → lossless speculative sampling preserves the target distribution when the acceptance rule is implemented correctly; see the sketch after this list.
- “It is just a short code trick” → production serving still needs cache, batching, tokenizer, and fallback handling.
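A minimal sketch of that acceptance rule for a single drafted token, assuming `p` and `q` are the target and draft next-token distributions as NumPy arrays; the function name is illustrative.

```python
import numpy as np

def accept_or_resample(p: np.ndarray, q: np.ndarray, x: int, rng=None):
    """Lossless acceptance rule for one drafted token; returns (token, was_accepted)."""
    if rng is None:
        rng = np.random.default_rng()
    # Accept the drafted token x with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual distribution max(0, p - q),
    # renormalized; this correction is what keeps the output distributed as p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```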
How It Sounds in Conversation
- “Let’s measure acceptance rate before increasing draft length.”
- “Track rollback cost and p95 latency, not just average tokens per second.”
- “The assistant and target tokenizers need to be compatible.”
- “Run the A/B with identical decoding settings and only speculative mode changed.”
Related Reading
- Fast Inference from Transformers via Speculative Decoding
Original paper describing draft-model proposals verified by the target model while preserving the target distribution.
- Accelerating Large Language Model Decoding with Speculative Sampling
Explains acceptance/rejection sampling for lossless acceleration of autoregressive decoding.
- Assisted generation
Official library-facing documentation for assistant-model generation.
- A Hitchhiker's Guide to Speculative Decoding
Engineering guide covering speculative heads, verification, and serving integration concerns.
- Looking back at speculative decoding
Retrospective on when the method helps and why acceptance rate controls real speedup.