KV Cache
Key-Value Cache
Plain Explanation
An LLM generates one token at a time, and each new token must attend to all of the tokens before it. Without caching, the keys and values for those earlier tokens would be recomputed at every decoding step. A KV cache stores the key and value tensors produced for earlier tokens and reuses them while decoding later ones. This is why KV cache behavior has a direct impact on first-token latency, throughput, memory pressure, and the cost of serving long-context models.
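A minimal sketch of the mechanism in NumPy, with a single toy attention head; the dimensions, random weights, and the `decode_step` helper are illustrative assumptions, not code from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
rng = np.random.default_rng(0)
# Toy projection matrices for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per decoded token

def decode_step(x_new):
    """Attend the newest token over every cached key/value pair."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)  # K and V for this token are computed once...
    v_cache.append(x_new @ W_v)  # ...and reused on every later decode step
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d_model))
    return attn @ V

for _ in range(4):  # decode four toy tokens
    decode_step(rng.standard_normal(d_model))
print(len(k_cache), "cached key vectors")  # -> 4
```

Without the two `append` lines and the lists behind them, every step would have to recompute `K` and `V` for the entire history, which is exactly the repeated work the cache avoids.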
Examples & Analogies
- Bookmark analogy: the model does not reread the whole book from page one; it keeps structured notes from what it already read.
- Support chatbot: if every request starts with the same long policy prompt, prefix caching can avoid repeated prefill work (a toy sketch of the hit path follows this list).
- Coding assistant: repeated questions over the same file depend heavily on how efficiently previous context is cached.
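To make the support-chatbot bullet concrete, here is a toy block-hash sketch of prefix reuse in the spirit of automatic prefix caching; the block size, the dictionary store, and the `prefill` helper are assumptions for illustration, not how any specific runtime lays out its blocks:

```python
import hashlib

BLOCK = 16      # toy block size; real runtimes pick their own
kv_blocks = {}  # hypothetical map: chained block hash -> cached K/V for that block

def prefill(prompt_tokens):
    """Reuse cached K/V block by block from the start; prefill only the rest."""
    chain, reused, still_matching = "", 0, True
    for i in range(0, len(prompt_tokens), BLOCK):
        block = tuple(prompt_tokens[i:i + BLOCK])
        # Each block's key hashes the block *and* everything before it, so a
        # hit guarantees the entire prefix up to this point is identical.
        chain = hashlib.sha256((chain + repr(block)).encode()).hexdigest()
        if still_matching and chain in kv_blocks:
            reused += len(block)
        else:
            still_matching = False
            kv_blocks[chain] = "K/V tensors for this block would live here"
    return f"reused {reused} tokens, prefilled {len(prompt_tokens) - reused}"

policy = list(range(512))              # shared 512-token policy prompt
print(prefill(policy + [7, 7, 7]))     # first request pays full prefill
print(prefill(policy + [8, 8, 8, 8]))  # second request reuses the shared prefix
```

The second call reuses all 512 policy-prompt tokens and prefills only the four new ones, which is the effect a prefix-cache hit rate is meant to capture.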
At a Glance
| Approach | Benefit | Cost | Best fit |
|---|---|---|---|
| No cache | Simple memory behavior | Recomputes keys/values for the whole context at each step | Small demos, teaching |
| Standard KV cache | Faster decoding | Memory grows with context length | Most LLM serving |
| Prefix caching | Reuses shared prompt prefixes | Needs reliable matching keys | Shared system prompts and RAG templates |
| Offloaded KV | Reduces GPU memory pressure | Adds CPU/SSD transfer latency | Long context under tight GPU memory |
Where and Why It Matters
KV cache is less about model intelligence and more about serving economics. Longer context windows create larger K/V tensors, which can shrink batch size and concurrency. That is why frameworks and serving runtimes such as Hugging Face Transformers, TensorRT-LLM, and vLLM expose different cache strategies, eviction policies, offloading paths, and prefix-reuse mechanisms. When a release note mentions long context, lower time to first token (TTFT), prefix caching, or serving throughput, the KV cache is usually part of the explanation.
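A back-of-envelope sizing shows why context length dominates the memory budget; the layer count, head counts, and head dimension below describe a generic 7B-class shape chosen for illustration, not a measurement of any particular model:

```python
def kv_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence: K and V, per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, head_dim 128, fp16 cache entries.
full_attn = kv_bytes(32, 32, 128, seq_len=32_768)  # 32 KV heads (no GQA)
gqa       = kv_bytes(32,  8, 128, seq_len=32_768)  # 8 KV heads (with GQA)
print(f"{full_attn / 2**30:.1f} GiB vs {gqa / 2**30:.1f} GiB per 32k-token sequence")
# -> 16.0 GiB vs 4.0 GiB: a handful of long requests can consume a whole GPU,
#    which is what squeezes batch size and concurrency.
```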
Common Misconceptions
- “The cache stores text” → it stores attention key/value tensors, not plain text.
- “More cache always means faster” → memory pressure can reduce batching or introduce slower CPU/SSD transfers.
- “A larger context window solves it” → larger windows also create larger KV memory requirements.
- “Prefix reuse is automatically safe” → tenant, adapter, model, and system-prompt boundaries must be part of the cache key (see the sketch after this list).
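A minimal sketch of what “part of the cache key” can mean in practice, in the spirit of the cache-salting idea mentioned in the references; the `CacheScope` fields and the hashing scheme are illustrative assumptions, not any runtime's actual API:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheScope:
    """Everything that must match before two requests may share KV blocks.
    The fields here are illustrative; a deployment picks its own boundaries."""
    model: str
    adapter: str           # e.g. a LoRA adapter name, or "" for the base model
    tenant: str
    system_prompt_sha: str

def block_key(scope: CacheScope, prefix_token_ids: tuple) -> str:
    # Salting the hash with the scope means a byte-identical prefix coming from
    # another tenant, adapter, or model can never produce a cache hit.
    salt = repr(scope).encode()
    return hashlib.sha256(salt + repr(prefix_token_ids).encode()).hexdigest()

scope_a = CacheScope("llama-8b", "support-lora", "tenant-a", "sha-of-system-prompt")
scope_b = CacheScope("llama-8b", "support-lora", "tenant-b", "sha-of-system-prompt")
same_prefix = tuple(range(128))
print(block_key(scope_a, same_prefix) == block_key(scope_b, same_prefix))  # False
```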
How It Sounds in Conversation
- “The shared system prompt is large, so let’s measure prefix-cache hit rate.”
- “We need to separate model compute from KV eviction when diagnosing TTFT spikes.”
- “Offload only helps if transfer cost is lower than recomputation or failed batching.” (A back-of-envelope version of this comparison follows this list.)
- “For multi-tenant serving, cache keys need tenant and adapter isolation.”
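The offloading remark above is easy to sanity-check with rough arithmetic; the PCIe bandwidth and prefill throughput below are placeholder assumptions meant to show the shape of the comparison, not benchmark numbers:

```python
# Back-of-envelope: is fetching cached K/V from CPU memory faster than recomputing it?
kv_gib            = 4.0      # KV footprint of the prefix we want back on the GPU
pcie_gib_per_s    = 20.0     # assumed effective host-to-device bandwidth
prefix_tokens     = 32_768
prefill_tok_per_s = 8_000.0  # assumed prefill throughput for this model and GPU

transfer_s  = kv_gib / pcie_gib_per_s            # pull the cache back over the bus
recompute_s = prefix_tokens / prefill_tok_per_s  # rebuild it from the raw tokens

print(f"transfer {transfer_s:.2f}s vs recompute {recompute_s:.2f}s")
# -> 0.20s vs 4.10s under these assumptions: offloading wins here, but a slower
#    link or faster prefill can flip the result, so measure both sides.
```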
Related Reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention. Canonical paper framing KV cache memory waste, fragmentation, sharing, and PagedAttention.
- Cache strategies. Official guide to KV cache basics and the Dynamic, Static, Quantized, and Offloaded cache choices.
- KV Cache System. Runtime view of KV blocks, reuse across requests, eviction, and secure reuse via cache salting.
- Automatic Prefix Caching: Implementation. Explains KV blocks, prefix hashing, global hash tables, and automatic prefix caching.
- KV cache offloading. Practical explanation of when offloading trades speed for lower GPU memory pressure.