Vol.01 · No.10 · CS · AI · Infra · May 13, 2026

AI Glossary

Infra & Hardware

KV Offloading

Plain Explanation

During prefill, large language models build Key/Value (KV) tensors that encode the prompt for attention. With long contexts or heavy concurrency, these tensors fill scarce GPU memory, forcing evictions and recomputation that hurt latency and throughput. KV offloading moves less-active KV from fast but small GPU memory to a larger, cheaper tier such as CPU DRAM, then reloads it when it is needed again. This keeps GPUs focused on active decoding while cached context is reused from the external tier. In vLLM, one concrete implementation, a Connector interface integrates with the request lifecycle to asynchronously store and load KV to and from external memory tiers.
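To make the mechanics concrete, here is a minimal sketch, assuming PyTorch and a CUDA device, of how KV blocks can be copied to pinned CPU buffers on a side stream and pulled back on demand. It is illustrative only: the class and method names are hypothetical, not vLLM's actual Connector API.

```python
# Illustrative sketch only: CpuKVOffloader and its methods are hypothetical names,
# not vLLM's Connector API. Assumes PyTorch with a CUDA device available.
import torch


class CpuKVOffloader:
    """Moves KV blocks between GPU memory and pinned CPU DRAM."""

    def __init__(self) -> None:
        self.cpu_store: dict[str, torch.Tensor] = {}   # block_id -> pinned host tensor
        self.copy_stream = torch.cuda.Stream()         # side stream so copies overlap decoding

    def store(self, block_id: str, kv_block: torch.Tensor) -> None:
        """Asynchronously copy a GPU-resident KV block out to pinned host memory."""
        host_buf = torch.empty(kv_block.shape, dtype=kv_block.dtype,
                               device="cpu", pin_memory=True)
        with torch.cuda.stream(self.copy_stream):
            host_buf.copy_(kv_block, non_blocking=True)   # device -> host DMA
        self.cpu_store[block_id] = host_buf

    def load(self, block_id: str, gpu_out: torch.Tensor) -> bool:
        """Reload a previously offloaded block into a GPU buffer; False means a miss."""
        host_buf = self.cpu_store.get(block_id)
        if host_buf is None:
            return False                                  # miss: caller falls back to recompute
        with torch.cuda.stream(self.copy_stream):
            gpu_out.copy_(host_buf, non_blocking=True)    # host -> device DMA
        return True

    def synchronize(self) -> None:
        """Block until outstanding copies finish before reusing the GPU blocks."""
        self.copy_stream.synchronize()
```

A real connector also has to coordinate with the scheduler so blocks are not freed or overwritten while a copy is still in flight; synchronize() stands in for that handshake here.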

Examples & Analogies

  • IDE code assistants analyzing files with long shared headers can store the header KV on CPU and reload it in later sessions to avoid recompute (a small sketch of this reuse pattern follows the list).
  • Document QA across many users benefits by keeping prologue/table-of-contents KV in CPU DRAM so multiple sessions can reuse it.
  • Analytics dashboards running repeated prompt templates can store KV on first run and load it for subsequent variants to stabilize TTFT.
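
A rough sketch of the reuse pattern behind these examples; all names here (prefix_key, get_prefix_kv, cpu_prefix_cache) are hypothetical, and prefill_kv stands in for the model's real prefill step.

```python
# Illustrative sketch of shared-prefix KV reuse across sessions; the names are
# hypothetical, and prefill_kv stands in for the model's actual prefill computation.
import hashlib
import torch

cpu_prefix_cache: dict[str, torch.Tensor] = {}   # prefix hash -> KV held in CPU DRAM


def prefix_key(prompt_tokens: list[int], prefix_len: int) -> str:
    """Stable key for the first prefix_len tokens (e.g., a long shared header)."""
    raw = ",".join(map(str, prompt_tokens[:prefix_len])).encode()
    return hashlib.sha256(raw).hexdigest()


def get_prefix_kv(prompt_tokens, prefix_len, prefill_kv):
    """Return the prefix KV on the GPU, recomputing only on a cache miss."""
    key = prefix_key(prompt_tokens, prefix_len)
    kv = cpu_prefix_cache.get(key)
    if kv is None:                                   # first session pays prefill once
        kv = prefill_kv(prompt_tokens[:prefix_len]).cpu()
        cpu_prefix_cache[key] = kv
    return kv.to("cuda")                             # later sessions reload instead of recompute
```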

At a Glance

                         GPU-only KV cache                CPU offloading                           Always recompute
Effective capacity       Limited by GPU                   Expanded via DRAM                        No cache
Reload latency           Lowest                           Low (host↔device transfer)               Highest (full compute)
Best for                 Short context, low concurrency   Long context, high concurrency/reuse     Little to no reuse
Preemption handling      Recompute after eviction         Reload from CPU                          Recompute
Operational complexity   Low                              Medium (transfer/policy)                 Low

Offloading trades small transfer costs for much larger effective capacity and broader reuse, enabling a single GPU to handle longer contexts and more sessions without unnecessary recomputation.
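
As a back-of-the-envelope illustration of that trade, the snippet below assumes a Llama-7B-like shape (32 layers, 32 KV heads, head dim 128, fp16), a host-to-device link of about 16 GB/s, and a prefill rate of 4,000 tokens/s; every number is a placeholder, not a measurement.

```python
# Back-of-envelope comparison of reloading offloaded KV vs. recomputing it.
# Every number below is an assumption for illustration, not a measurement.
layers, kv_heads, head_dim = 32, 32, 128       # Llama-7B-like shape
bytes_per_elem = 2                             # fp16
tokens = 8192                                  # long shared context

kv_bytes = tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem   # keys + values
pcie_bytes_per_s = 16e9                        # assumed host<->device bandwidth
prefill_tok_per_s = 4000                       # assumed prefill throughput

reload_s = kv_bytes / pcie_bytes_per_s         # pull KV back from CPU DRAM
recompute_s = tokens / prefill_tok_per_s       # redo the prefill instead

print(f"KV for {tokens} tokens: {kv_bytes / 2**30:.2f} GiB")
print(f"reload ~{reload_s * 1e3:.0f} ms  vs  recompute ~{recompute_s * 1e3:.0f} ms")
```

Under these assumptions the reload (~270 ms) is roughly an order of magnitude cheaper than recomputing the 8K-token prefill (~2 s), which is the gap the table above summarizes.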

Where and Why It Matters

  • vLLM’s Connector can asynchronously store/load KV to CPU DRAM or other tiers, reducing recomputation after preemptions and stabilizing time-to-first-token.
  • Backends documented for production (e.g., LMCache, NVIDIA Dynamo) let you choose the backing medium per workload: local DRAM, disk, or remote tiers.
  • Ops focus shifts from simply adding GPU memory to sizing CPU buffers and tuning hit rates and eviction policies for better throughput (a rough sizing sketch follows this list).
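
A rough sizing sketch under the same assumed model shape, with a trivial hit-rate counter; the 64 GiB budget and all names are illustrative and do not correspond to any specific backend's configuration.

```python
# Rough CPU offload buffer sizing plus a trivial hit-rate counter (illustrative only;
# the model shape and the 64 GiB budget are assumptions, not a backend's real config).
layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2
kv_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem   # keys + values

cpu_buffer_gib = 64
budget_tokens = cpu_buffer_gib * 2**30 // kv_bytes_per_token
print(f"{cpu_buffer_gib} GiB of DRAM holds KV for roughly {budget_tokens:,} tokens")

# Track hits vs. misses to judge whether the buffer or eviction policy needs tuning.
hits = misses = 0

def record_lookup(hit: bool) -> None:
    global hits, misses
    if hit:
        hits += 1
    else:
        misses += 1

def hit_rate() -> float:
    total = hits + misses
    return hits / total if total else 0.0
```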

Common Misconceptions

  • Myth: Offloading always increases latency → Reality: With asynchronous transfers, copies overlap ongoing generation, so users rarely notice the added delay.
  • Myth: It only helps with shared prefixes → Reality: It also helps under preemptions and high concurrency by avoiding recomputation after evictions.
  • Myth: More GPU memory makes it obsolete → Reality: For long contexts and many users, a tiered cache remains cost-effective and scalable.

How It Sounds in Conversation

  • "Let’s raise the CPU offloading buffer and watch hit rate and queue wait times together."
  • "Preemptions are frequent—verify loads/stores are asynchronous so the scheduler doesn’t stall."
  • "Shared-prefix reuse is high and TTFT is spiky; prioritize the connector’s hit path."
  • "Use a remote tier only for cold KV; keep hot reuse in DRAM."
