Vol. 01 · No. 10 · CS · AI · Infra · May 15, 2026

AI Glossary

Infra & Hardware · LLM & Generative AI

KV Cache

Key-Value Cache


Plain Explanation

An LLM generates one token at a time. For each new token, the model attends to all previous tokens. Without caching, the key and value projections for every earlier token would be recomputed at each decoding step. A KV cache stores the key and value tensors produced for earlier tokens and reuses them while decoding later tokens. This is why KV cache behavior has a direct impact on first-token latency, throughput, memory pressure, and the cost of serving long-context models.
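
The following is a minimal sketch of the mechanism, using a toy single-head attention loop in NumPy. The random projection matrices and tiny dimensions are illustrative only; real models use many heads, many layers, and learned weights. The point is that the key/value rows for earlier tokens are computed once, appended to a cache, and reused at every later decode step.

    # Toy single-head attention decode loop with a KV cache.
    # Purely illustrative: projections are random and dimensions are tiny.
    import numpy as np

    d_model = 8
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

    k_cache, v_cache = [], []  # one cached key/value row per past token

    def decode_step(x):
        """Compute attention output for one new token, reusing cached K/V."""
        q = x @ W_q
        k_cache.append(x @ W_k)   # K/V for the new token are computed once...
        v_cache.append(x @ W_v)   # ...then reused on every later step
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d_model)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

    for t in range(5):
        decode_step(rng.standard_normal(d_model))  # stand-in token embeddings
        print(f"step {t}: {len(k_cache)} cached key/value pairs")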

Examples & Analogies

  • Bookmark analogy: the model does not reread the whole book from page one; it keeps structured notes on what it has already read.
  • Support chatbot: if every request starts with the same long policy prompt, prefix caching can avoid repeated prefill work.
  • Coding assistant: repeated questions over the same file depend heavily on how efficiently previous context is cached.

At a Glance

Approach          | Benefit                       | Cost                             | Best fit
No cache          | Simple memory behavior        | Recomputes the prefix            | Small demos, teaching
Standard KV cache | Faster decoding               | Memory grows with context length | Most LLM serving
Prefix caching    | Reuses shared prompt prefixes | Needs reliable matching keys     | Shared system prompts and RAG templates
Offloaded KV      | Reduces GPU memory pressure   | Adds CPU/SSD transfer latency    | Long context under tight GPU memory
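
To make the “memory grows with context length” and “reduces GPU memory pressure” rows above concrete, here is a back-of-envelope estimate of KV-cache size. It assumes the common layout of two cached tensors (keys and values) per layer with shape (num_kv_heads, seq_len, head_dim); the model dimensions in the example are hypothetical, not taken from any specific model card.

    # Rough KV-cache footprint: 2 tensors (K and V) per layer, per request.
    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                       batch=1, dtype_bytes=2):           # 2 bytes ~ fp16/bf16
        return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

    # Hypothetical 32-layer model with 8 KV heads of dimension 128:
    per_request = kv_cache_bytes(32, 8, 128, seq_len=32_768)
    print(f"{per_request / 2**30:.1f} GiB per 32k-token request")  # -> 4.0 GiB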

Where and Why It Matters

The KV cache is less about model intelligence and more about serving economics. Longer context windows create larger key/value tensors, which can reduce batch size and concurrency. That is why serving stacks such as Hugging Face Transformers, TensorRT-LLM, and vLLM expose different cache strategies, eviction policies, offloading paths, and prefix-reuse mechanisms. When a release note mentions long context, lower time to first token (TTFT), prefix caching, or serving throughput, the KV cache is usually part of the explanation.
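
As a concrete illustration with Hugging Face Transformers, the sketch below runs one prefill pass over a prompt, keeps the returned past_key_values, and then decodes greedily one token at a time without re-running the prefix. It assumes the transformers and torch packages are installed and that the small gpt2 checkpoint is available; the exact cache return type differs across library versions.

    # Prefill once, then decode token-by-token while reusing the KV cache.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt_ids = tok("The KV cache stores", return_tensors="pt").input_ids

    with torch.no_grad():
        out = model(prompt_ids, use_cache=True)       # prefill builds the cache
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        pieces = [next_id]
        for _ in range(10):                           # decode: one token per step
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values                # cache grows by one position
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            pieces.append(next_id)

    print(tok.decode(torch.cat(pieces, dim=-1)[0]))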

Common Misconceptions

  • “The cache stores text” → it stores attention key/value tensors, not plain text.
  • “More cache always means faster” → memory pressure can reduce batching or introduce slower CPU/SSD transfers.
  • “A larger context window solves it” → larger windows also create larger KV memory requirements.
  • “Prefix reuse is automatically safe” → tenant, adapter, model, and system-prompt boundaries must be part of the cache key (a key sketch follows this list).
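
A hypothetical sketch of what “part of the cache key” means in practice: everything that changes the cached attention state, such as the model weights, the adapter, the tenant, and the exact tokenized prefix, goes into the lookup key. The field names below are illustrative and not taken from any particular serving runtime.

    # Illustrative prefix-cache key; all field names are hypothetical.
    import hashlib
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PrefixCacheKey:
        model_id: str      # weights/version that produced the K/V tensors
        adapter_id: str    # a LoRA or other adapter changes the cached values
        tenant_id: str     # prevents cross-tenant reuse of cached state
        prefix_hash: str   # hash of the tokenized shared prefix / system prompt

    def make_key(model_id, adapter_id, tenant_id, prefix_token_ids):
        digest = hashlib.sha256(repr(prefix_token_ids).encode()).hexdigest()
        return PrefixCacheKey(model_id, adapter_id, tenant_id, digest)

    key = make_key("demo-model-v1", "support-lora", "tenant-42", (101, 2023, 102))
    print(key)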

How It Sounds in Conversation

  • “The shared system prompt is large, so let’s measure prefix-cache hit rate.”
  • “We need to separate model compute from KV eviction when diagnosing time-to-first-token (TTFT) spikes.”
  • “Offload only helps if transfer cost is lower than recomputation or failed batching.” (a break-even sketch follows this list)
  • “For multi-tenant serving, cache keys need tenant and adapter isolation.”
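
Behind the offload remark above is a simple break-even comparison: restoring the cached tensors has to be faster than recomputing the prefix. The sketch below uses placeholder numbers for link bandwidth and prefill speed; they are not measurements.

    # Offload pays off only if restoring the cache beats recomputing the prefix.
    def offload_worth_it(kv_bytes, link_gbps, prefill_tokens, prefill_tok_per_s):
        transfer_s = kv_bytes / (link_gbps * 1e9 / 8)      # time to move KV back in
        recompute_s = prefill_tokens / prefill_tok_per_s   # time to rebuild the prefix
        return transfer_s < recompute_s, transfer_s, recompute_s

    ok, t, r = offload_worth_it(kv_bytes=4 * 2**30, link_gbps=64,
                                prefill_tokens=32_768, prefill_tok_per_s=20_000)
    print(f"offload wins: {ok} (transfer {t:.2f}s vs recompute {r:.2f}s)")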
