KV Cache
Key-Value Cache
Plain Explanation
An LLM generates one token at a time, and each new token must attend to all of the tokens before it. Without caching, the keys and values for those earlier tokens would be recomputed at every decoding step. A KV cache stores the key and value tensors produced for earlier tokens and reuses them while decoding later ones. This is why KV cache behavior has a direct impact on first-token latency, throughput, memory pressure, and the cost of serving long-context models.
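A minimal sketch of the mechanism in NumPy, with a single toy attention head; the dimensions, random weights, and the `decode_step` helper are illustrative assumptions, not code from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
rng = np.random.default_rng(0)
# Toy projection matrices for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per decoded token

def decode_step(x_new):
    """Attend the newest token over every cached key/value pair."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)  # K and V for this token are computed once...
    v_cache.append(x_new @ W_v)  # ...and reused on every later decode step
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d_model))
    return attn @ V

for _ in range(4):  # decode four toy tokens
    decode_step(rng.standard_normal(d_model))
print(len(k_cache), "cached key vectors")  # -> 4
```

Without the two `append` lines and the lists behind them, every step would have to recompute `K` and `V` for the entire history, which is exactly the repeated work the cache avoids.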
Examples & Analogies
- Bookmark analogy: the model does not reread the whole book from page one; it keeps structured notes from what it already read.
- Support chatbot: if every request starts with the same long policy prompt, prefix caching can avoid repeated prefill work (a toy sketch of the hit path follows this list).
- Coding assistant: repeated questions over the same file depend heavily on how efficiently previous context is cached.
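To make the support-chatbot bullet concrete, here is a toy block-hash sketch of prefix reuse in the spirit of automatic prefix caching; the block size, the dictionary store, and the `prefill` helper are assumptions for illustration, not how any specific runtime lays out its blocks:

```python
import hashlib

BLOCK = 16      # toy block size; real runtimes pick their own
kv_blocks = {}  # hypothetical map: chained block hash -> cached K/V for that block

def prefill(prompt_tokens):
    """Reuse cached K/V block by block from the start; prefill only the rest."""
    chain, reused, still_matching = "", 0, True
    for i in range(0, len(prompt_tokens), BLOCK):
        block = tuple(prompt_tokens[i:i + BLOCK])
        # Each block's key hashes the block *and* everything before it, so a
        # hit guarantees the entire prefix up to this point is identical.
        chain = hashlib.sha256((chain + repr(block)).encode()).hexdigest()
        if still_matching and chain in kv_blocks:
            reused += len(block)
        else:
            still_matching = False
            kv_blocks[chain] = "K/V tensors for this block would live here"
    return f"reused {reused} tokens, prefilled {len(prompt_tokens) - reused}"

policy = list(range(512))              # shared 512-token policy prompt
print(prefill(policy + [7, 7, 7]))     # first request pays full prefill
print(prefill(policy + [8, 8, 8, 8]))  # second request reuses the shared prefix
```

The second call reuses all 512 policy-prompt tokens and prefills only the four new ones, which is the effect a prefix-cache hit rate is meant to capture.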
At a Glance
| Approach | Benefit | Cost | Best fit |
|---|---|---|---|
| No cache | Simple memory behavior | Recomputes keys/values for the whole context at each step | Small demos, teaching |
| Standard KV cache | Faster decoding | Memory grows with context length | Most LLM serving |
| Prefix caching | Reuses shared prompt prefixes | Needs reliable matching keys | Shared system prompts and RAG templates |
| Offloaded KV | Reduces GPU memory pressure | Adds CPU/SSD transfer latency | Long context under tight GPU memory |
Where and Why It Matters
KV cache is less about model intelligence and more about serving economics. Longer context windows create larger K/V tensors, which can shrink batch size and concurrency. That is why frameworks and serving runtimes such as Hugging Face Transformers, TensorRT-LLM, and vLLM expose different cache strategies, eviction policies, offloading paths, and prefix-reuse mechanisms. When a release note mentions long context, lower time to first token (TTFT), prefix caching, or serving throughput, the KV cache is usually part of the explanation.
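A back-of-envelope sizing shows why context length dominates the memory budget; the layer count, head counts, and head dimension below describe a generic 7B-class shape chosen for illustration, not a measurement of any particular model:

```python
def kv_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence: K and V, per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, head_dim 128, fp16 cache entries.
full_attn = kv_bytes(32, 32, 128, seq_len=32_768)  # 32 KV heads (no GQA)
gqa       = kv_bytes(32,  8, 128, seq_len=32_768)  # 8 KV heads (with GQA)
print(f"{full_attn / 2**30:.1f} GiB vs {gqa / 2**30:.1f} GiB per 32k-token sequence")
# -> 16.0 GiB vs 4.0 GiB: a handful of long requests can consume a whole GPU,
#    which is what squeezes batch size and concurrency.
```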
Common Misconceptions
- “The cache stores text” → it stores attention key/value tensors, not plain text.
- “More cache always means faster” → memory pressure can reduce batching or introduce slower CPU/SSD transfers.
- “A larger context window solves it” → larger windows also create larger KV memory requirements.
- “Prefix reuse is automatically safe” → tenant, adapter, model, and system-prompt boundaries must be part of the cache key (see the sketch after this list).
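A minimal sketch of what “part of the cache key” can mean in practice, in the spirit of the cache-salting idea mentioned in the references; the `CacheScope` fields and the hashing scheme are illustrative assumptions, not any runtime's actual API:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheScope:
    """Everything that must match before two requests may share KV blocks.
    The fields here are illustrative; a deployment picks its own boundaries."""
    model: str
    adapter: str           # e.g. a LoRA adapter name, or "" for the base model
    tenant: str
    system_prompt_sha: str

def block_key(scope: CacheScope, prefix_token_ids: tuple) -> str:
    # Salting the hash with the scope means a byte-identical prefix coming from
    # another tenant, adapter, or model can never produce a cache hit.
    salt = repr(scope).encode()
    return hashlib.sha256(salt + repr(prefix_token_ids).encode()).hexdigest()

scope_a = CacheScope("llama-8b", "support-lora", "tenant-a", "sha-of-system-prompt")
scope_b = CacheScope("llama-8b", "support-lora", "tenant-b", "sha-of-system-prompt")
same_prefix = tuple(range(128))
print(block_key(scope_a, same_prefix) == block_key(scope_b, same_prefix))  # False
```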
How It Sounds in Conversation
- “The shared system prompt is large, so let’s measure prefix-cache hit rate.”
- “We need to separate model compute from KV eviction when diagnosing TTFT spikes.”
- “Offload only helps if transfer cost is lower than recomputation or failed batching.” (A back-of-envelope version of this comparison follows this list.)
- “For multi-tenant serving, cache keys need tenant and adapter isolation.”
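The offloading remark above is easy to sanity-check with rough arithmetic; the PCIe bandwidth and prefill throughput below are placeholder assumptions meant to show the shape of the comparison, not benchmark numbers:

```python
# Back-of-envelope: is fetching cached K/V from CPU memory faster than recomputing it?
kv_gib            = 4.0      # KV footprint of the prefix we want back on the GPU
pcie_gib_per_s    = 20.0     # assumed effective host-to-device bandwidth
prefix_tokens     = 32_768
prefill_tok_per_s = 8_000.0  # assumed prefill throughput for this model and GPU

transfer_s  = kv_gib / pcie_gib_per_s            # pull the cache back over the bus
recompute_s = prefix_tokens / prefill_tok_per_s  # rebuild it from the raw tokens

print(f"transfer {transfer_s:.2f}s vs recompute {recompute_s:.2f}s")
# -> 0.20s vs 4.10s under these assumptions: offloading wins here, but a slower
#    link or faster prefill can flip the result, so measure both sides.
```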
Related Reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention. Canonical paper framing KV cache memory waste, fragmentation, sharing, and PagedAttention.
- Cache strategies. Official guide to KV cache basics and the Dynamic, Static, Quantized, and Offloaded cache choices.
- KV Cache System. Runtime view of KV blocks, reuse across requests, eviction, and secure reuse via cache salting.
- Automatic Prefix Caching: Implementation. Explains KV blocks, prefix hashing, global hash tables, and automatic prefix caching.
- KV cache offloading. Practical explanation of when offloading trades speed for lower GPU memory pressure.