KV Offloading
Plain Explanation
During prefill, large language models create Key/Value (KV) tensors that summarize the prompt. With long contexts or heavy concurrency, these tensors fill scarce GPU memory, forcing evictions and recomputation that hurt latency and throughput. KV offloading moves less-active KV from fast but small GPU memory to a larger, cheaper tier such as CPU DRAM, then reloads it when needed. This keeps GPUs focused on active decoding while reusing cached context from an external tier. In vLLM, one concrete implementation, a Connector interface hooks into the request lifecycle to asynchronously store and load KV to and from external memory tiers.
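The core mechanic can be sketched as a two-tier cache. This is a toy illustration, not vLLM's actual Connector API: plain Python dicts stand in for GPU and CPU memory, and eviction offloads to the larger tier instead of discarding.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'CPU' tier.

    Real systems move tensors with asynchronous host<->device copies; here
    plain dicts stand in for both tiers to show the offload/reload flow.
    """

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> KV payload (hot tier, LRU order)
        self.cpu = {}             # block_id -> KV payload (offload tier)

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            # Offload the least-recently-used block instead of discarding it.
            victim, payload = self.gpu.popitem(last=False)
            self.cpu[victim] = payload

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            # Reload from CPU: a host->device copy, far cheaper than recompute.
            self.put(block_id, self.cpu.pop(block_id))
            return self.gpu[block_id]
        return None  # true miss: the caller must recompute this block's KV
```

A `get` that misses the GPU tier but hits CPU triggers a reload (and possibly another offload), while a full miss falls back to recomputation, exactly the three outcomes compared in the table below.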
Examples & Analogies
- IDE code assistants analyzing files with long shared headers can store the header KV on CPU and reload it in later sessions to avoid recompute.
- Document QA across many users benefits by keeping prologue/table-of-contents KV in CPU DRAM so multiple sessions can reuse it.
- Analytics dashboards running repeated prompt templates can store KV on first run and load it for subsequent variants to stabilize TTFT.
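The shared-prefix pattern behind all three examples can be sketched as a content-addressed store, where a hash of the shared prefix keys its KV. The `compute_kv` helper is a hypothetical stand-in for the expensive prefill pass:

```python
import hashlib

prefix_store = {}  # digest -> KV for that prefix (the DRAM tier in a real system)

def compute_kv(text: str) -> str:
    """Hypothetical stand-in for the prefill pass; returns a fake KV payload."""
    return f"KV({len(text)} chars)"

def kv_for_prompt(shared_prefix: str, suffix: str):
    key = hashlib.sha256(shared_prefix.encode()).hexdigest()
    if key not in prefix_store:
        # Only the first session pays the prefill cost for the shared prefix.
        prefix_store[key] = compute_kv(shared_prefix)
    # Later sessions reload the prefix KV and prefill only their suffix.
    return prefix_store[key], compute_kv(suffix)
```

Every later request with the same header, table of contents, or template reuses the stored entry and prefills only its unique suffix.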
At a Glance
| Aspect | GPU-only KV cache | CPU offloading | Always recompute |
|---|---|---|---|
| Effective capacity | Limited by GPU | Expanded via DRAM | No cache |
| Reload latency | Lowest | Low (host↔device transfer) | Highest (full compute) |
| Best for | Short context, low concurrency | Long context, high concurrency/reuse | Little to no reuse |
| Preemption handling | Recompute after eviction | Reload from CPU | Recompute |
| Operational complexity | Low | Medium (transfer/policy) | Low |
Offloading trades small transfer costs for much larger effective capacity and broader reuse, enabling a single GPU to handle longer contexts and more sessions without unnecessary recomputation.
Where and Why It Matters
- vLLM’s Connector can asynchronously store/load KV to CPU DRAM or other tiers, reducing recomputation after preemptions and stabilizing time-to-first-token.
- Backends documented for production (e.g., LMCache, NVIDIA Dynamo) let you choose media per workload: local DRAM, disk, or remote tiers.
- Ops focus shifts from only adding GPU memory to sizing CPU buffers and tuning hit rate and policies for better throughput.
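Operationally, enabling offloading is a configuration choice. vLLM exposes a `--kv-transfer-config` JSON flag for selecting a connector, and the production-stack tutorial uses an LMCache-backed connector; treat the exact names below as assumptions to verify against the docs for your installed versions:

```python
import json

# Assumed connector and role names; verify against current vLLM/LMCache docs.
kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",  # LMCache-backed CPU offloading
    "kv_role": "kv_both",                  # this instance both saves and loads KV
}
flag = f"--kv-transfer-config '{json.dumps(kv_transfer_config)}'"
print(flag)
```

The CPU buffer size itself is typically configured on the backend side (e.g., via LMCache's settings), which is where the sizing and hit-rate tuning mentioned above happens.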
Common Misconceptions
- Myth: Offloading always increases latency → Reality: With asynchronous transfers, copies overlap with ongoing decoding, so the perceived delay is minimal.
- Myth: It only helps with shared prefixes → Reality: It also helps under preemptions and high concurrency by avoiding recomputation after evictions.
- Myth: More GPU memory makes it obsolete → Reality: For long contexts and many users, a tiered cache remains cost-effective and scalable.
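The first myth above hinges on overlap: an asynchronous copy hides behind compute, so user-visible latency approaches max(transfer, compute) rather than their sum. A minimal sketch, using threads and sleeps as stand-ins for a device-to-host copy and a decode step:

```python
import threading
import time

def offload_kv(blocks):
    """Stand-in for an asynchronous device->host copy of evicted KV blocks."""
    time.sleep(0.2)  # simulated transfer time

def decode_step():
    """Stand-in for the GPU decoding work the user is actually waiting on."""
    time.sleep(0.2)  # simulated compute time

# The copy runs on a background thread while decoding continues, so total
# wall time is ~max(0.2, 0.2) = 0.2s, not the 0.4s a blocking copy would cost.
start = time.perf_counter()
copier = threading.Thread(target=offload_kv, args=(["blk0", "blk1"],))
copier.start()
decode_step()
copier.join()
elapsed = time.perf_counter() - start
print(f"overlapped: {elapsed:.2f}s")
```

Real connectors get the same effect with CUDA streams and copy engines instead of threads, but the scheduling principle is the same.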
How It Sounds in Conversation
- "Let’s raise the CPU offloading buffer and watch hit rate and queue wait times together."
- "Preemptions are frequent—verify loads/stores are asynchronous so the scheduler doesn’t stall."
- "Shared-prefix reuse is high and TTFT is spiky; prioritize the connector’s hit path."
- "Use a remote tier only for cold KV; keep hot reuse in DRAM."
References
- KV Cache Offloading | NVIDIA Dynamo Documentation
Overview of vLLM backends (KVBM, LMCache, FlexKV) and serving topologies for offloading.
- KV Cache Offloading — production-stack
Tutorial showing how to enable LMCache-based offloading and configure CPU buffer size.
- vLLM API: vllm.v1.kv_offload
Module reference for KV offloading components and abstractions in vLLM v1.
- AIBrix KVCache Offloading Framework
Explains L1 DRAM caching and optional L2 remote cache for distributed KV reuse.
- Inside vLLM’s New KV Offloading Connector
Design, motivation, and CLI usage for vLLM’s asynchronous CPU offloading connector.