Vol. 01 · No. 10 · CS · AI · Infra · May 14, 2026

AI Glossary

LLM & Generative AI

Prompt Caching

Plain Explanation

Long prompts often repeat the same content, such as policies, tool definitions, or examples. Reprocessing those thousands of tokens on every request slows apps and drives up input-token cost. Providers introduced prompt caching to reuse the work already done for that repeated beginning. Picture a reader who leaves a bookmark at the end of a shared introduction: next time, they skip the intro and jump straight to the new chapter.

With prompt caching, if the first part of the prompt is exactly the same and long enough (commonly 1,024+ tokens), the system reuses computations and starts from where that shared part ends. Concretely, services identify identical prefixes and route those requests so they land on the same cache. On a cache hit, they reuse stored key/value tensors from the model’s attention layers, skipping prefill for those tokens. Vendors may expose controls like a prompt_cache_key and retention options that keep cached prefixes in memory for minutes or, on supported models, for up to 24 hours. Reported metrics (for example, cached_tokens) confirm how much was reused.
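
A minimal sketch of a cache-friendly request, assuming OpenAI's Python SDK; the prompt_cache_key parameter and the usage.prompt_tokens_details.cached_tokens field follow OpenAI's documented API, and other providers expose similar controls under different names:

```python
# A sketch of a cache-friendly request, assuming OpenAI's Python SDK.
from openai import OpenAI

client = OpenAI()

# Static, byte-identical prefix: policies, tool schemas, few-shot examples.
# Keep it first and unchanged so every request shares the same prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant. Tone: friendly and concise. "
    "Escalation rules: ... Refund policy: ..."  # in practice 1,024+ tokens
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": question},                # varies per call
        ],
        prompt_cache_key="support-bot-v1",  # stable key to keep requests on one cache
    )
    # On a cache hit, the provider reports how many prefix tokens were reused.
    usage = response.usage
    print(f"cached: {usage.prompt_tokens_details.cached_tokens} / {usage.prompt_tokens}")
    return response.choices[0].message.content
```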

Examples & Analogies

  • Policy-heavy support assistant: A customer-service flow starts with a long, fixed system message describing tone, escalation rules, and refund policy. Each ticket adds just a short, user-specific question at the end, so identical prefixes are reused and responses come back faster.
  • Tool-enabled coding helper: The app sends the same tool definitions and usage schema at the top of every request, with only the current file diff appended later. Because the front matter matches exactly, the cache skips recomputing those tokens.
  • Campaign copy generator: A brand brief and style guide stay fixed while marketers generate many variants of headlines and CTAs. The identical prefix triggers cache hits, reducing per-variant latency and cost.

At a Glance

|                   | Prompt caching                             | Response caching            | Semantic caching                      |
|-------------------|--------------------------------------------|-----------------------------|---------------------------------------|
| What is reused    | Model's computed prefix state (KV tensors) | The full previous answer    | A stored answer for a similar prompt  |
| Match requirement | Exact identical prefix                     | Exact identical request     | Embedding similarity ("close enough") |
| Output each call  | Recomputed fresh                           | Replayed verbatim           | Replayed from best match              |
| Main benefit      | Lower latency and input cost               | Zero compute on repeats     | Works when text isn't identical       |
| Typical risk      | Minor edits break the match                | Stale or misapplied answers | Wrong match for look-alikes           |

Prompt caching speeds up requests without changing outputs, while response and semantic caching trade freshness or precision for reuse.
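
A toy sketch that makes the match-requirement row concrete; the names here are invented for illustration, not any provider's implementation:

```python
# Toy contrast of the three lookup rules; all names here are invented.
import numpy as np

# Response caching: replay an answer only for an exactly identical request.
response_cache: dict[str, str] = {}

def response_lookup(prompt: str) -> str | None:
    return response_cache.get(prompt)

# Semantic caching: replay the closest stored answer if its embedding is
# "close enough", even when the text differs.
def semantic_lookup(vec: np.ndarray, entries: list[tuple[np.ndarray, str]],
                    threshold: float = 0.95) -> str | None:
    best, best_score = None, threshold
    for cached_vec, answer in entries:
        score = float(vec @ cached_vec /
                      (np.linalg.norm(vec) * np.linalg.norm(cached_vec)))
        if score >= best_score:
            best, best_score = answer, score
    return best

# Prompt caching: reuse computation only for the exact shared token prefix;
# the model still generates a fresh answer for everything after it.
def shared_prefix_len(tokens: list[str], cached: list[str]) -> int:
    n = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        n += 1
    return n

print(shared_prefix_len(["You", "are", "a", "helper.", "Hi"],
                        ["You", "are", "a", "helper.", "Review"]))  # 4
```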

Where and Why It Matters

  • Cost and latency: Reusing long, identical prefixes can substantially cut both request latency and input-token cost.
  • Engineering practice shift: Teams front-load static instructions and examples and push user-specific details to the end to maximize exact-prefix matches, often tracking cached_tokens (a tracking sketch follows this list).
  • Billing and retention: Some providers apply discounts or distinct pricing to cached tokens, with in-memory caching for minutes and optional extended retention on supported models (up to 24 hours).
  • Operational constraints: Benefits occur only with long prompts (commonly 1,024+ tokens) and exact early-token matches; small edits in the initial window result in misses.
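
A minimal tracking sketch under assumed, placeholder prices; real cached-token discounts vary by provider and model:

```python
# Hypothetical hit-rate and savings report built from per-request usage
# metrics. Both rates below are placeholders, not real provider pricing.
FULL_RATE_PER_1K = 0.005     # assumed $ per 1K uncached input tokens
CACHED_RATE_PER_1K = 0.0025  # assumed $ per 1K cached input tokens

def cache_report(requests: list[dict]) -> dict:
    """Each entry: {"prompt_tokens": int, "cached_tokens": int}."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    return {
        "hit_rate": cached / total if total else 0.0,  # share of input reused
        "estimated_savings": cached / 1000 * (FULL_RATE_PER_1K - CACHED_RATE_PER_1K),
    }

print(cache_report([
    {"prompt_tokens": 4096, "cached_tokens": 3072},
    {"prompt_tokens": 4100, "cached_tokens": 0},  # a miss after a prefix edit
]))
```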

Common Misconceptions

  • ❌ Myth: It replays old answers. → ✅ Reality: Only the shared prefix’s computations are reused; the model still generates a fresh response each time.
  • ❌ Myth: Close-enough text will hit the cache. → ✅ Reality: Providers require an exact prefix match in the early tokens; small edits cause misses, as the sketch after this list shows.
  • ❌ Myth: It always cuts costs. → ✅ Reality: Savings depend on your hit rate and pricing for cache reads and writes.
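
A tiny character-level demo of the second point; real caches match on tokens rather than characters, but the effect is the same:

```python
# One edited character near the start leaves almost nothing to reuse.
a = "You are a support agent. Refund policy: 30 days."
b = "You are a support agent! Refund policy: 30 days."

shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1

print(shared)  # 23: only the text before the edit can ever hit the cache
```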

How It Sounds in Conversation

  • "Let’s move the static tool schema to the top so cached_tokens actually spikes over 1k."
  • "We set a stable prompt_cache_key; hit rate should hold as long as the prefix is byte-identical."
  • "The first user message changed a character in the intro, so we got 0 cached_tokens—make the intro immutable."
  • "We should schedule a weekly cost/latency review to confirm savings from cache hits."
