Prompt Caching
Plain Explanation
Long prompts often repeat the same content, such as policies, tool definitions, or examples. Reprocessing those thousands of tokens on every request slows apps and drives up input-token cost. Providers introduced prompt caching so the work done on that repeated beginning can be reused. Picture a reader who places a bookmark at the end of a shared introduction: next time they skip the intro and jump straight to the new chapter.
With prompt caching, if the first part of the prompt is exactly the same and long enough (commonly 1,024+ tokens), the system reuses computations and starts from where that shared part ends. Concretely, services identify identical prefixes and route those requests so they land on the same cache. On a cache hit, they reuse stored key/value tensors from the model’s attention layers, skipping prefill for those tokens. Vendors may expose controls like a prompt_cache_key and retention options that keep cached prefixes in memory for minutes or, on supported models, for up to 24 hours. Reported metrics (for example, cached_tokens) confirm how much was reused.
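The minimal sketch below shows the pattern in code, assuming an OpenAI-style Chat Completions client: the long static policy goes first and stays byte-identical, the per-ticket question goes last, and the usage metadata reports how many prompt tokens were served from cache. The prompt_cache_key parameter, the support_policy.md file, and the model name are illustrative; availability and exact field names vary by provider and SDK version.

```python
from openai import OpenAI

client = OpenAI()

# Static, byte-identical prefix: the long system policy goes first (ideally 1,024+ tokens).
STATIC_PREFIX = open("support_policy.md").read()  # hypothetical policy document

def answer_ticket(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": STATIC_PREFIX},  # identical on every call
            {"role": "user", "content": question},         # variable part goes last
        ],
        # Assumed parameter: groups requests so they route to the same cache.
        # Requires an API/SDK version that supports prompt_cache_key.
        prompt_cache_key="support-assistant-v1",
    )
    # Field names vary by provider; this is where OpenAI-style responses report reused tokens.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"cached_tokens={cached} of {resp.usage.prompt_tokens} prompt tokens")
    return resp.choices[0].message.content

answer_ticket("Can I get a refund on an order placed 45 days ago?")
```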
Examples & Analogies
- Policy-heavy support assistant: A customer-service flow starts with a long, fixed system message describing tone, escalation rules, and refund policy. Each ticket adds just a short, user-specific question at the end, so identical prefixes are reused and responses come back faster.
- Tool-enabled coding helper: The app sends the same tool definitions and usage schema at the top of every request, with only the current file diff appended later. Because the front matter matches exactly, the cache skips recomputing those tokens (see the prompt-assembly sketch after this list).
- Campaign copy generator: A brand brief and style guide stay fixed while marketers generate many variants for headlines and CTAs. The identical prefix triggers cache hits, reducing per-variant latency and cost.
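Here is a small prompt-assembly sketch for the tool-enabled coding helper, with hypothetical tool names. The only rule it enforces is ordering: the static front matter is serialized deterministically and placed first, while the per-request diff is appended last so it never disturbs the cacheable prefix.

```python
import json

# Static front matter shared by every request (kept byte-identical and long enough to qualify):
# hypothetical tool definitions and a usage schema for the coding helper.
TOOL_DEFS = [
    {
        "name": "apply_patch",
        "description": "Apply a unified diff to the working tree.",
        "parameters": {"type": "object", "properties": {"diff": {"type": "string"}}},
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return failing cases.",
        "parameters": {"type": "object", "properties": {}},
    },
]

SYSTEM_PROMPT = (
    "You are a coding assistant. Use the tools below exactly as specified.\n"
    + json.dumps(TOOL_DEFS, indent=2, sort_keys=True)  # stable serialization keeps the prefix byte-identical
)

def build_messages(current_diff: str) -> list[dict]:
    """Static prefix first, per-request detail last, so the cacheable prefix never changes."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},                       # never changes
        {"role": "user", "content": "Review this diff:\n" + current_diff},  # changes each call
    ]

messages = build_messages("--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n")
```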
At a Glance
| | Prompt caching | Response caching | Semantic caching |
|---|---|---|---|
| What is reused | Model's computed prefix state (KV tensors) | The full previous answer | A stored answer for a similar prompt |
| Match requirement | Exact identical prefix | Exact identical request | Embedding similarity ("close enough") |
| Output each call | Recomputed fresh | Replayed verbatim | Replayed from best match |
| Main benefit | Lower latency and input cost | Zero compute on repeats | Works when text isn’t identical |
| Typical risk | Minor edits break the match | Stale or misapplied answers | Wrong match for look‑alikes |
Prompt caching speeds up requests without changing outputs, while response and semantic caching trade freshness or precision for reuse.
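As a rough illustration of the different match rules (not an implementation of any provider's cache), the sketch below contrasts an exact-hash response cache, an embedding-similarity semantic cache with an arbitrary 0.92 threshold, and the exact-prefix-length check that determines how much attention state prompt caching could reuse.

```python
import hashlib

# Response caching: the key is a hash of the entire request; replay only on exact repeats.
response_cache: dict[str, str] = {}

def response_cache_key(request_body: str) -> str:
    return hashlib.sha256(request_body.encode()).hexdigest()

# Semantic caching: a "close enough" match via embedding similarity.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_vec: list[float],
                    store: list[tuple[list[float], str]],
                    threshold: float = 0.92):
    """store holds (embedding, cached_answer) pairs; returns the best answer above the threshold."""
    if not store:
        return None
    best_vec, best_answer = max(store, key=lambda item: cosine(query_vec, item[0]))
    return best_answer if cosine(query_vec, best_vec) >= threshold else None

# Prompt caching: only the length of the exact shared prefix matters; the answer is still generated fresh.
def shared_prefix_length(prev_tokens: list[int], new_tokens: list[int]) -> int:
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n  # the provider reuses stored KV state for these tokens only
```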
Where and Why It Matters
- Cost/latency reductions: Reusing a long, identical prefix substantially cuts latency and input-token cost.
- Engineering practice shift: Teams front‑load static instructions/examples and push user‑specific details to the end to maximize exact-prefix matches, often tracking cached_tokens.
- Billing and retention: Some providers apply discounts or distinct pricing to cached tokens, with in-memory caching for minutes and optional extended retention on supported models (up to 24 hours); a back-of-envelope cost sketch follows this list.
- Operational constraints: Benefits occur only with long prompts (commonly 1,024+ tokens) and exact early-token matches; small edits in the initial window result in misses.
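The estimate below uses hypothetical prices standing in for a real rate card; it shows how savings scale with the number of cached tokens per request.

```python
# Hypothetical per-million-token prices; substitute your provider's actual rate card.
INPUT_PRICE = 2.50    # $ per 1M uncached input tokens
CACHED_PRICE = 1.25   # $ per 1M cached input tokens (e.g., a 50% discount)

def estimated_input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """Input cost of one request, given how many of its prompt tokens were served from cache."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * INPUT_PRICE + cached_tokens * CACHED_PRICE) / 1_000_000

# Example: a 3,000-token prompt where a 2,048-token prefix hits the cache.
full_price = estimated_input_cost(3_000, 0)
with_hit = estimated_input_cost(3_000, 2_048)
print(f"no cache: ${full_price:.6f}  cache hit: ${with_hit:.6f}  saving: {1 - with_hit / full_price:.0%}")
```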
Common Misconceptions
- ❌ Myth: It replays old answers. → ✅ Reality: Only the shared prefix’s computations are reused; the model still generates a fresh response each time.
- ❌ Myth: Close-enough text will hit the cache. → ✅ Reality: Providers require an exact prefix match in the early tokens; small edits cause misses.
- ❌ Myth: It always cuts costs. → ✅ Reality: Savings depend on your hit rate and pricing for cache reads and writes.
How It Sounds in Conversation
- "Let’s move the static tool schema to the top so cached_tokens actually spikes over 1k."
- "We set a stable prompt_cache_key; hit rate should hold as long as the prefix is byte-identical."
- "The first user message changed a character in the intro, so we got 0 cached_tokens—make the intro immutable."
- "We should schedule a weekly cost/latency review to confirm savings from cache hits."
References
- GenCache: Generalized Context Caching for LLMs
An overview of research trends on prefix-based KV reuse.
- Prompt caching
Official guide: exact-prefix matching, 1,024+ tokens, routing hash, prompt_cache_key, retention.
- Prompt caching with Azure OpenAI in Microsoft Foundry Models
Azure details on retention policies, billing discounts, and cache-hit reporting.
- Prompt caching | Generative AI on Vertex AI
Vertex AI (Claude) explicit caching, TTLs, and pricing for cache reads/writes.
- Prompt Caching Explained
Plain-language intro with why it’s faster/cheaper and how exact-prefix reuse works.