PagedAttention
Plain Explanation
Large batches of LLM requests often hit a memory wall: the attention key–value (KV) cache grows token by token and can cap throughput well before compute is saturated. Traditional systems pre-reserve a big contiguous region per request, which wastes space when most responses are short and lengths vary. PagedAttention fixes this by slicing each request’s KV cache into fixed-size blocks and accessing them through a small lookup table, similar to how an operating system pages memory. Because blocks are allocated only as needed, short requests use few blocks, long ones grow smoothly, and blocks don’t have to be adjacent in memory. At each decoding step, the server consults a per-sequence table that maps logical token positions to physical KV blocks so attention kernels can read the right keys/values without moving data. When the current block fills, a new block is allocated and a table entry is added; when a request finishes or is truncated, its blocks are freed and returned to a pool. For multiple candidates from the same prompt (beam search or parallel sampling), the common-prefix blocks are shared and only diverging tokens trigger copy-on-write into new blocks, eliminating redundant KV copies.
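To make the bookkeeping concrete, here is a minimal Python sketch of the block-table idea described above. It is not vLLM's implementation: the names (BlockPool, Sequence, append_token, fork) and the 16-token block size are illustrative assumptions, and the actual K/V tensors stored inside each block are omitted.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative, not vLLM's code).
# Only the mapping from logical token positions to physical blocks is modeled;
# the K/V tensors that live inside each block are omitted.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (an assumed value)


class BlockPool:
    """Fixed pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}  # physical block id -> number of users

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; caller must wait or preempt")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


@dataclass
class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0


def append_token(seq: Sequence, pool: BlockPool) -> None:
    """Reserve KV space for one new token."""
    if seq.num_tokens % BLOCK_SIZE == 0:
        # Last block is full (or the sequence is empty): allocate a fresh one.
        seq.block_table.append(pool.allocate())
    else:
        last = seq.block_table[-1]
        if pool.ref_count[last] > 1:
            # Partially filled block is shared with another sequence:
            # copy-on-write into a private block before appending.
            new_block = pool.allocate()
            # (a real system would copy the block's K/V tensors here)
            pool.release(last)
            seq.block_table[-1] = new_block
    seq.num_tokens += 1


def fork(parent: Sequence, pool: BlockPool) -> Sequence:
    """Create a child (e.g., a beam candidate) sharing all parent blocks."""
    for block in parent.block_table:
        pool.share(block)
    return Sequence(block_table=list(parent.block_table),
                    num_tokens=parent.num_tokens)


def free_sequence(seq: Sequence, pool: BlockPool) -> None:
    """Return a finished sequence's blocks to the pool."""
    for block in seq.block_table:
        pool.release(block)
    seq.block_table.clear()
```

In this sketch, fork is what parallel sampling or beam search would call: children share the parent's blocks until one of them writes into a shared partial block, at which point the copy-on-write branch in append_token gives it a private copy.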
Examples & Analogies
- CI platform with diverse code checks: most requests are brief format or commit-message suggestions, while a few need long static-analysis explanations. PagedAttention lets short jobs occupy just a handful of blocks and long runs grow as needed, so both coexist without wasting KV memory.
- Multimodal triage: many tickets get short classification-style replies, but some image+text cases require lengthy, step-by-step reasoning. With paged KV blocks on demand, the stack packs many short cases per GPU and still handles occasional long chains without pre-reserving huge buffers.
- A/B with beam/parallel sampling: teams generate several ad-copy variants from the same prompt. PagedAttention shares KV blocks for the common prefix across variants and allocates new blocks only when each candidate diverges, cutting memory for multi-output decoding (a back-of-the-envelope calculation follows after this list).
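As a rough illustration of the last example, the snippet below compares KV memory for several candidates decoded from one shared prompt, with and without prefix sharing. The model dimensions (a Llama-7B-like configuration in fp16) and the sequence lengths are assumptions chosen to keep the arithmetic simple, not measurements.

```python
# Back-of-the-envelope KV memory for parallel sampling from one prompt,
# with and without prefix sharing. All numbers are illustrative assumptions.
layers, hidden, dtype_bytes = 32, 4096, 2                # Llama-7B-like, fp16
kv_bytes_per_token = 2 * layers * hidden * dtype_bytes   # K and V: ~0.5 MiB

prompt_tokens, new_tokens, candidates = 1024, 128, 8

# Contiguous per-candidate KV: each candidate stores its own copy of the prompt.
naive = candidates * (prompt_tokens + new_tokens) * kv_bytes_per_token

# Paged with prefix sharing: the prompt's blocks are stored once; only the
# divergent tokens are stored per candidate.
shared = (prompt_tokens + candidates * new_tokens) * kv_bytes_per_token

gib = 1024 ** 3
print(f"naive: {naive / gib:.2f} GiB, shared prefix: {shared / gib:.2f} GiB")
# naive: 4.50 GiB, shared prefix: 1.00 GiB
```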
At a Glance
| Aspect | PagedAttention | Contiguous KV per request | FlashAttention |
|---|---|---|---|
| Memory layout | Non-contiguous paged blocks + lookup | Single pre-reserved chunk | Unchanged (compute tiling/IO cuts) |
| Fragmentation | Low; on-demand blocks | High; internal/external waste | N/A (compute optimization) |
| Prefix sharing | Yes (COW) | No | N/A |
| Kernel needs | Paged-aware attention kernels | Standard kernels | Specialized fast kernels |
| Best fit | Mixed-length, multi-user serving | Uniform, fixed-length jobs | Any; speeds attention compute |
FlashAttention speeds attention math while PagedAttention fixes KV memory waste; they are complementary.
Where and Why It Matters
- The original vLLM/PagedAttention paper reports 2–4× higher throughput at comparable latency versus FasterTransformer/Orca in its evaluated workloads, with larger gains for longer sequences, bigger models, and complex decoding; results depend on workload and kernel support.
- Most value on memory-bound, mixed-length traffic: paging removes fragmentation so schedulers can admit more concurrent sequences per GPU (a rough sizing calculation follows after this list).
- Decoding strategies that share prefixes: beam/parallel sampling reuse common-prefix blocks, avoiding KV duplication and lowering memory per candidate.
- Deployment trade-offs: non-contiguous KV requires paged-compatible attention kernels; this adds engineering and maintenance overhead.
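To see why paging translates into higher admitted concurrency, here is a back-of-the-envelope comparison of how many sequences fit in a fixed KV budget when each request pre-reserves its maximum length versus holding only the blocks it actually uses. The per-token KV size, memory budget, and lengths are all illustrative assumptions.

```python
# Rough concurrency estimate under a fixed KV budget: pre-reserving max length
# (contiguous) vs. holding only the blocks actually used (paged, on demand).
# All numbers are illustrative assumptions.
import math

kv_bytes_per_token = 524_288          # ~0.5 MiB/token (Llama-7B-like, fp16)
kv_budget_bytes = 40 * 1024 ** 3      # assume 40 GiB left for KV after weights
max_len, avg_len, block_size = 4096, 512, 16

# Contiguous: every admitted request reserves max_len tokens up front.
contiguous_slots = kv_budget_bytes // (max_len * kv_bytes_per_token)

# Paged: a request holds roughly ceil(avg_len / block_size) blocks on average.
blocks_per_seq = math.ceil(avg_len / block_size)
paged_slots = kv_budget_bytes // (blocks_per_seq * block_size * kv_bytes_per_token)

print(f"contiguous: ~{contiguous_slots} concurrent sequences")  # ~20
print(f"paged:      ~{paged_slots} concurrent sequences")       # ~160
```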
Common Misconceptions
- ❌ PagedAttention is the same as FlashAttention → ✅ They target different bottlenecks: KV memory layout vs attention compute speed.
- ❌ Works everywhere without kernel changes → ✅ Requires paged-aware kernels and extra bookkeeping in the runtime.
- ❌ Shrinks model weights significantly → ✅ Focuses on KV cache; weights still occupy GPU memory.
How It Sounds in Conversation
- "Let’s switch the KV cache to paged blocks so mixed request lengths stop fragmenting our GPUs."
- "Do our attention kernels support the block-pointer layout, or do we need a paged-compatible path?"
- "Prefix sharing should cut memory for beam=8; watch the allocated block count during spikes."
- "Hold p99 latency constant and see if paging lets us raise concurrency without OOMs."
- "Even with FlashAttention, we’re still memory-bound; PagedAttention should lift tokens/s per GPU."
Related Reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)
Original paper introducing PagedAttention and vLLM with 2–4× throughput results.
- PagedAttention · Hugging Face (Text Generation Inference)
Conceptual docs on partitioning the KV cache and sharing across generations.
- Paged Attention - vLLM
Design overview of paged KV blocks, lookup, and prefix sharing in vLLM.
- PagedAttention: Efficient Memory Management for LLMs
Trade-offs and alternatives, including kernel-compatibility considerations.
- vLLM Explained: PagedAttention, Continuous Batching, and Deploying High-Throughput LLM Inference in Production
Production-focused guide linking PagedAttention with continuous batching.
- Private LLM Deployment Guide: Llama + vLLM + TensorRT — Self-Hosted Architecture
Explains how PagedAttention complements FlashAttention in deployments.