Vol.01 · No.10 CS · AI · Infra May 15, 2026

AI Glossary


Speculative Decoding


Plain Explanation

LLMs usually generate one token at a time. Each step calls the large target model again, so long answers become slow and expensive. Speculative decoding reduces that bottleneck by letting a smaller draft model, or extra speculative heads, propose several candidate tokens first. The target model then verifies the whole draft in a single forward pass, keeps the longest accepted prefix, and continues from the first rejected position with a token of its own.
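The draft-then-verify loop can be sketched in a few lines. This is a toy greedy-verification version, not a production implementation: `draft_model` and `target_model` are hypothetical stand-in functions that each return one next token, and a real runtime scores all draft positions in a single batched target forward pass over the KV cache.

```python
# Toy sketch of one draft-then-verify step with greedy verification.
# `draft_model` and `target_model` are hypothetical stand-ins: each maps
# a token sequence to that model's next-token choice.

def speculative_step(prefix, draft_model, target_model, k=4):
    """Advance the sequence by one target-model pass."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    ctx = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. Verify: keep the longest prefix the target agrees with, then
    #    take one token from the target itself (a correction at the
    #    first mismatch, or a bonus token if the whole draft survives).
    accepted = []
    for i, token in enumerate(draft):
        target_choice = target_model(prefix + draft[:i])
        if target_choice == token:
            accepted.append(token)            # draft token survives
        else:
            accepted.append(target_choice)    # first mismatch: correct and stop
            break
    else:
        accepted.append(target_model(prefix + draft))  # bonus token
    return list(prefix) + accepted

# A perfect drafter advances k + 1 tokens per pass; an always-wrong
# drafter still advances 1 token (the target's own correction).
target = lambda seq: "ab"[len(seq) % 2]       # toy target: a, b, a, b, ...
print(speculative_step(["a"], target, target, k=4))         # advances 5 tokens
print(speculative_step(["a"], lambda s: "x", target, k=4))  # advances 1 token
```

Even in this toy form the key asymmetry is visible: the drafter runs `k` times, but the target is consulted once per step regardless of how many draft tokens survive.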

Examples & Analogies

  • Typist and editor: a fast assistant writes a pencil draft, and the expert editor keeps only the part they would have written.
  • Chat responses: common phrases often have high acceptance, so the system can advance multiple tokens per target pass.
  • Code completion: boilerplate can benefit a lot, while uncertain or creative text may trigger more rejection.

At a Glance

Method                | Core idea                                  | Benefit                  | Watch out
----------------------|--------------------------------------------|--------------------------|------------------------------------------
Standard decoding     | Target model generates one token at a time | Simple and stable        | Slow for long outputs
Speculative decoding  | Drafter proposes, target verifies          | Fewer target passes      | Sensitive to acceptance rate
Assisted generation   | Library-level assistant model path         | Easier to try            | Model pair choice matters
Multi-head speculation| Future-token heads inside one model        | Less model orchestration | Training and serving integration needed

Where and Why It Matters

This is a serving optimization, not a way to make the model know more. In lossless variants, the final output distribution remains the target model’s distribution while the runtime tries to move several tokens forward per target-model pass. Real gains depend on acceptance rate, draft length, KV cache handling, and batch scheduling. It is most relevant for chatbots, IDE assistants, and agent loops where response latency is part of product quality.
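The dependence on acceptance rate can be made concrete with a common simplifying model: if each draft token is accepted independently with probability α, the expected number of tokens gained per target pass with draft length k is (1 − α^(k+1)) / (1 − α). Real acceptance varies with position and content, so treat this as a back-of-the-envelope estimate, not a guarantee.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens advanced per target-model pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a simplifying model). The +1 in the closed form
    comes from the target's own correction or bonus token.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, expected_tokens_per_pass(alpha, 4), expected_tokens_per_pass(alpha, 8))
```

At α = 0.8, going from k = 4 to k = 8 only raises the expectation from about 3.4 to about 4.3 tokens per pass while the drafter does twice the work, which is why draft length is tuned against measured acceptance rather than simply maximized.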

Common Misconceptions

  • “The small model answers instead” → the small model drafts; the target model decides what survives.
  • “It is always faster” → if the drafter is often wrong, verification and rollback erase the benefit.
  • “Quality necessarily drops” → lossless speculative sampling preserves the target distribution when the acceptance rule is implemented correctly.
  • “It is just a short code trick” → production serving still needs cache, batching, tokenizer, and fallback handling.
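The lossless acceptance rule behind the third point can be sketched directly. This is an illustrative version over explicit probability dictionaries, not a tensorized implementation: accept drafted token t with probability min(1, q(t)/p(t)), where q is the target distribution and p is the draft distribution t was sampled from, and on rejection resample from the renormalized residual max(0, q − p). The `rng` parameter is a stand-in for a uniform sampler.

```python
import random

def accept_or_resample(token, q, p, rng=random.random):
    """One step of the lossless acceptance rule (speculative sampling).

    q: target distribution over tokens (dict: token -> probability)
    p: draft distribution the token was actually sampled from
    Accept with probability min(1, q[token] / p[token]); on rejection,
    resample from the renormalized residual max(0, q - p). Under this
    rule the returned token is distributed exactly according to q.
    """
    if rng() < min(1.0, q.get(token, 0.0) / p[token]):
        return token, True                      # draft token accepted

    # Rejected: sample from the residual distribution instead.
    residual = {t: max(0.0, q.get(t, 0.0) - p.get(t, 0.0)) for t in q}
    z = sum(residual.values())                  # > 0 whenever rejection can occur
    r = rng() * z
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t, False
    return t, False                             # guard against float rounding
```

Because the ratio test needs only the two probabilities of the drafted token, a batched implementation can verify every draft position from a single target forward pass; getting this rule exactly right is what makes the variant lossless.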

How It Sounds in Conversation

  • “Let’s measure acceptance rate before increasing draft length.”
  • “Track rollback cost and p95 latency, not just average tokens per second.”
  • “The assistant and target tokenizers need to be compatible.”
  • “Run the A/B with identical decoding settings and only speculative mode changed.”
