Vol. 01 · No. 10 · CS · AI · Infra · May 14, 2026

AI Glossary

LLM & Generative AI

Prompt Caching

Plain Explanation

Long prompts often repeat the same content, such as policies, tool definitions, or examples. Reprocessing those thousands of tokens on every request slows apps and drives up input-token cost. Providers introduced prompt caching to reuse the work already done for that repeated beginning. Picture a reader who leaves a bookmark at the end of a shared introduction: next time, they skip the intro and jump straight to the new chapter.

With prompt caching, if the first part of the prompt is exactly the same and long enough (commonly 1,024+ tokens), the system reuses computations and starts from where that shared part ends. Concretely, services identify identical prefixes and route those requests so they land on the same cache. On a cache hit, they reuse stored key/value tensors from the model’s attention layers, skipping prefill for those tokens. Vendors may expose controls like a prompt_cache_key and retention options that keep cached prefixes in memory for minutes or, on supported models, for up to 24 hours. Reported metrics (for example, cached_tokens) confirm how much was reused.
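
A minimal sketch of a cache-friendly request, assuming OpenAI's Python SDK; the prompt_cache_key parameter and the usage.prompt_tokens_details.cached_tokens field follow OpenAI's documented API, and other providers expose similar controls under different names:

```python
# A sketch of a cache-friendly request, assuming OpenAI's Python SDK.
from openai import OpenAI

client = OpenAI()

# Static, byte-identical prefix: policies, tool schemas, few-shot examples.
# Keep it first and unchanged so every request shares the same prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant. Tone: friendly and concise. "
    "Escalation rules: ... Refund policy: ..."  # in practice 1,024+ tokens
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": question},                # varies per call
        ],
        prompt_cache_key="support-bot-v1",  # stable key to keep requests on one cache
    )
    # On a cache hit, the provider reports how many prefix tokens were reused.
    usage = response.usage
    print(f"cached: {usage.prompt_tokens_details.cached_tokens} / {usage.prompt_tokens}")
    return response.choices[0].message.content
```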

Examples & Analogies

  • Policy-heavy support assistant: A customer-service flow starts with a long, fixed system message describing tone, escalation rules, and refund policy. Each ticket adds just a short, user-specific question at the end, so identical prefixes are reused and responses come back faster.
  • Tool-enabled coding helper: The app sends the same tool definitions and usage schema at the top of every request, with only the current file diff appended later. Because the front matter matches exactly, the cache skips recomputing those tokens.
  • Campaign copy generator: A brand brief and style guide stay fixed while marketers generate many variants of headlines and CTAs. The identical prefix triggers cache hits, reducing per-variant latency and cost.

At a Glance

|                   | Prompt caching                             | Response caching            | Semantic caching                      |
|-------------------|--------------------------------------------|-----------------------------|---------------------------------------|
| What is reused    | Model's computed prefix state (KV tensors) | The full previous answer    | A stored answer for a similar prompt  |
| Match requirement | Exact identical prefix                     | Exact identical request     | Embedding similarity ("close enough") |
| Output each call  | Recomputed fresh                           | Replayed verbatim           | Replayed from best match              |
| Main benefit      | Lower latency and input cost               | Zero compute on repeats     | Works when text isn't identical       |
| Typical risk      | Minor edits break the match                | Stale or misapplied answers | Wrong match for look-alikes           |

Prompt caching speeds up requests without changing outputs, while response and semantic caching trade freshness or precision for reuse.
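
A toy sketch that makes the match-requirement row concrete; the names here are invented for illustration, not any provider's implementation:

```python
# Toy contrast of the three lookup rules; all names here are invented.
import numpy as np

# Response caching: replay an answer only for an exactly identical request.
response_cache: dict[str, str] = {}

def response_lookup(prompt: str) -> str | None:
    return response_cache.get(prompt)

# Semantic caching: replay the closest stored answer if its embedding is
# "close enough", even when the text differs.
def semantic_lookup(vec: np.ndarray, entries: list[tuple[np.ndarray, str]],
                    threshold: float = 0.95) -> str | None:
    best, best_score = None, threshold
    for cached_vec, answer in entries:
        score = float(vec @ cached_vec /
                      (np.linalg.norm(vec) * np.linalg.norm(cached_vec)))
        if score >= best_score:
            best, best_score = answer, score
    return best

# Prompt caching: reuse computation only for the exact shared token prefix;
# the model still generates a fresh answer for everything after it.
def shared_prefix_len(tokens: list[str], cached: list[str]) -> int:
    n = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        n += 1
    return n

print(shared_prefix_len(["You", "are", "a", "helper.", "Hi"],
                        ["You", "are", "a", "helper.", "Review"]))  # 4
```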

Where and Why It Matters

  • Cost and latency: Reusing long, identical prefixes can substantially cut both request latency and input-token cost.
  • Engineering practice shift: Teams front-load static instructions and examples and push user-specific details to the end to maximize exact-prefix matches, often tracking cached_tokens (a tracking sketch follows this list).
  • Billing and retention: Some providers apply discounts or distinct pricing to cached tokens, with in-memory caching for minutes and optional extended retention on supported models (up to 24 hours).
  • Operational constraints: Benefits occur only with long prompts (commonly 1,024+ tokens) and exact early-token matches; small edits in the initial window result in misses.
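
A minimal tracking sketch under assumed, placeholder prices; real cached-token discounts vary by provider and model:

```python
# Hypothetical hit-rate and savings report built from per-request usage
# metrics. Both rates below are placeholders, not real provider pricing.
FULL_RATE_PER_1K = 0.005     # assumed $ per 1K uncached input tokens
CACHED_RATE_PER_1K = 0.0025  # assumed $ per 1K cached input tokens

def cache_report(requests: list[dict]) -> dict:
    """Each entry: {"prompt_tokens": int, "cached_tokens": int}."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    return {
        "hit_rate": cached / total if total else 0.0,  # share of input reused
        "estimated_savings": cached / 1000 * (FULL_RATE_PER_1K - CACHED_RATE_PER_1K),
    }

print(cache_report([
    {"prompt_tokens": 4096, "cached_tokens": 3072},
    {"prompt_tokens": 4100, "cached_tokens": 0},  # a miss after a prefix edit
]))
```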

Common Misconceptions

  • ❌ Myth: It replays old answers. → ✅ Reality: Only the shared prefix’s computations are reused; the model still generates a fresh response each time.
  • ❌ Myth: Close-enough text will hit the cache. → ✅ Reality: Providers require an exact prefix match in the early tokens; small edits cause misses, as the sketch after this list shows.
  • ❌ Myth: It always cuts costs. → ✅ Reality: Savings depend on your hit rate and pricing for cache reads and writes.
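
A tiny character-level demo of the second point; real caches match on tokens rather than characters, but the effect is the same:

```python
# One edited character near the start leaves almost nothing to reuse.
a = "You are a support agent. Refund policy: 30 days."
b = "You are a support agent! Refund policy: 30 days."

shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1

print(shared)  # 23: only the text before the edit can ever hit the cache
```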

How It Sounds in Conversation

  • "Let’s move the static tool schema to the top so cached_tokens actually spikes over 1k."
  • "We set a stable prompt_cache_key; hit rate should hold as long as the prefix is byte-identical."
  • "The first user message changed a character in the intro, so we got 0 cached_tokens—make the intro immutable."
  • "We should schedule a weekly cost/latency review to confirm savings from cache hits."
