Vol.01 · No.10 · CS · AI · Infra · May 13, 2026

AI Glossary

Infra & Hardware

KV Offloading

Plain Explanation

During prefill, large language models build Key/Value (KV) tensors that encode the prompt for attention. With long contexts or heavy concurrency, these tensors fill scarce GPU memory, forcing evictions and recomputation that hurt latency and throughput. KV offloading moves less-active KV from fast but small GPU memory to a larger, cheaper tier such as CPU DRAM, then reloads it when it is needed again. This keeps GPUs focused on active decoding while cached context is reused from the external tier. In vLLM, one concrete implementation, a Connector interface integrates with the request lifecycle to asynchronously store and load KV to and from external memory tiers.
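To make the mechanics concrete, here is a minimal sketch, assuming PyTorch and a CUDA device, of how KV blocks can be copied to pinned CPU buffers on a side stream and pulled back on demand. It is illustrative only: the class and method names are hypothetical, not vLLM's actual Connector API.

```python
# Illustrative sketch only: CpuKVOffloader and its methods are hypothetical names,
# not vLLM's Connector API. Assumes PyTorch with a CUDA device available.
import torch


class CpuKVOffloader:
    """Moves KV blocks between GPU memory and pinned CPU DRAM."""

    def __init__(self) -> None:
        self.cpu_store: dict[str, torch.Tensor] = {}   # block_id -> pinned host tensor
        self.copy_stream = torch.cuda.Stream()         # side stream so copies overlap decoding

    def store(self, block_id: str, kv_block: torch.Tensor) -> None:
        """Asynchronously copy a GPU-resident KV block out to pinned host memory."""
        host_buf = torch.empty(kv_block.shape, dtype=kv_block.dtype,
                               device="cpu", pin_memory=True)
        with torch.cuda.stream(self.copy_stream):
            host_buf.copy_(kv_block, non_blocking=True)   # device -> host DMA
        self.cpu_store[block_id] = host_buf

    def load(self, block_id: str, gpu_out: torch.Tensor) -> bool:
        """Reload a previously offloaded block into a GPU buffer; False means a miss."""
        host_buf = self.cpu_store.get(block_id)
        if host_buf is None:
            return False                                  # miss: caller falls back to recompute
        with torch.cuda.stream(self.copy_stream):
            gpu_out.copy_(host_buf, non_blocking=True)    # host -> device DMA
        return True

    def synchronize(self) -> None:
        """Block until outstanding copies finish before reusing the GPU blocks."""
        self.copy_stream.synchronize()
```

A real connector also has to coordinate with the scheduler so blocks are not freed or overwritten while a copy is still in flight; synchronize() stands in for that handshake here.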

Examples & Analogies

  • IDE code assistants analyzing files with long shared headers can store the header KV on CPU and reload it in later sessions to avoid recompute (a small sketch of this reuse pattern follows the list).
  • Document QA across many users benefits by keeping prologue/table-of-contents KV in CPU DRAM so multiple sessions can reuse it.
  • Analytics dashboards running repeated prompt templates can store KV on first run and load it for subsequent variants to stabilize TTFT.
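
A rough sketch of the reuse pattern behind these examples; all names here (prefix_key, get_prefix_kv, cpu_prefix_cache) are hypothetical, and prefill_kv stands in for the model's real prefill step.

```python
# Illustrative sketch of shared-prefix KV reuse across sessions; the names are
# hypothetical, and prefill_kv stands in for the model's actual prefill computation.
import hashlib
import torch

cpu_prefix_cache: dict[str, torch.Tensor] = {}   # prefix hash -> KV held in CPU DRAM


def prefix_key(prompt_tokens: list[int], prefix_len: int) -> str:
    """Stable key for the first prefix_len tokens (e.g., a long shared header)."""
    raw = ",".join(map(str, prompt_tokens[:prefix_len])).encode()
    return hashlib.sha256(raw).hexdigest()


def get_prefix_kv(prompt_tokens, prefix_len, prefill_kv):
    """Return the prefix KV on the GPU, recomputing only on a cache miss."""
    key = prefix_key(prompt_tokens, prefix_len)
    kv = cpu_prefix_cache.get(key)
    if kv is None:                                   # first session pays prefill once
        kv = prefill_kv(prompt_tokens[:prefix_len]).cpu()
        cpu_prefix_cache[key] = kv
    return kv.to("cuda")                             # later sessions reload instead of recompute
```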

At a Glance

                         GPU-only KV cache                CPU offloading                           Always recompute
Effective capacity       Limited by GPU                   Expanded via DRAM                        No cache
Reload latency           Lowest                           Low (host↔device transfer)               Highest (full compute)
Best for                 Short context, low concurrency   Long context, high concurrency/reuse     Little to no reuse
Preemption handling      Recompute after eviction         Reload from CPU                          Recompute
Operational complexity   Low                              Medium (transfer/policy)                 Low

Offloading trades small transfer costs for much larger effective capacity and broader reuse, enabling a single GPU to handle longer contexts and more sessions without unnecessary recomputation.
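
As a back-of-the-envelope illustration of that trade, the snippet below assumes a Llama-7B-like shape (32 layers, 32 KV heads, head dim 128, fp16), a host-to-device link of about 16 GB/s, and a prefill rate of 4,000 tokens/s; every number is a placeholder, not a measurement.

```python
# Back-of-envelope comparison of reloading offloaded KV vs. recomputing it.
# Every number below is an assumption for illustration, not a measurement.
layers, kv_heads, head_dim = 32, 32, 128       # Llama-7B-like shape
bytes_per_elem = 2                             # fp16
tokens = 8192                                  # long shared context

kv_bytes = tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem   # keys + values
pcie_bytes_per_s = 16e9                        # assumed host<->device bandwidth
prefill_tok_per_s = 4000                       # assumed prefill throughput

reload_s = kv_bytes / pcie_bytes_per_s         # pull KV back from CPU DRAM
recompute_s = tokens / prefill_tok_per_s       # redo the prefill instead

print(f"KV for {tokens} tokens: {kv_bytes / 2**30:.2f} GiB")
print(f"reload ~{reload_s * 1e3:.0f} ms  vs  recompute ~{recompute_s * 1e3:.0f} ms")
```

Under these assumptions the reload (~270 ms) is roughly an order of magnitude cheaper than recomputing the 8K-token prefill (~2 s), which is the gap the table above summarizes.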

Where and Why It Matters

  • vLLM’s Connector can asynchronously store/load KV to CPU DRAM or other tiers, reducing recomputation after preemptions and stabilizing time-to-first-token.
  • Backends documented for production (e.g., LMCache, NVIDIA Dynamo) let you choose the backing medium per workload: local DRAM, disk, or remote tiers.
  • Ops focus shifts from simply adding GPU memory to sizing CPU buffers and tuning hit rates and eviction policies for better throughput (a rough sizing sketch follows this list).
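
A rough sizing sketch under the same assumed model shape, with a trivial hit-rate counter; the 64 GiB budget and all names are illustrative and do not correspond to any specific backend's configuration.

```python
# Rough CPU offload buffer sizing plus a trivial hit-rate counter (illustrative only;
# the model shape and the 64 GiB budget are assumptions, not a backend's real config).
layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2
kv_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem   # keys + values

cpu_buffer_gib = 64
budget_tokens = cpu_buffer_gib * 2**30 // kv_bytes_per_token
print(f"{cpu_buffer_gib} GiB of DRAM holds KV for roughly {budget_tokens:,} tokens")

# Track hits vs. misses to judge whether the buffer or eviction policy needs tuning.
hits = misses = 0

def record_lookup(hit: bool) -> None:
    global hits, misses
    if hit:
        hits += 1
    else:
        misses += 1

def hit_rate() -> float:
    total = hits + misses
    return hits / total if total else 0.0
```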

Common Misconceptions

  • Myth: Offloading always increases latency → Reality: With asynchronous transfers, copies overlap ongoing generation, so users rarely notice the added delay.
  • Myth: It only helps with shared prefixes → Reality: It also helps under preemptions and high concurrency by avoiding recomputation after evictions.
  • Myth: More GPU memory makes it obsolete → Reality: For long contexts and many users, a tiered cache remains cost-effective and scalable.

How It Sounds in Conversation

  • "Let’s raise the CPU offloading buffer and watch hit rate and queue wait times together."
  • "Preemptions are frequent—verify loads/stores are asynchronous so the scheduler doesn’t stall."
  • "Shared-prefix reuse is high and TTFT is spiky; prioritize the connector’s hit path."
  • "Use a remote tier only for cold KV; keep hot reuse in DRAM."
