Vol.01 · No.10 CS · AI · Infra May 13, 2026

AI Glossary

Infra & Hardware · LLM & Generative AI

PagedAttention


Plain Explanation

Large batches of LLM requests often hit a memory wall: the attention key–value (KV) cache grows token by token and can cap throughput well before compute is saturated. Traditional systems pre-reserve a big contiguous region per request, which wastes space when most responses are short and lengths vary.

PagedAttention fixes this by slicing each request’s KV cache into fixed-size blocks and accessing them through a small lookup table, similar to how an operating system pages memory. Because blocks are allocated only as needed, short requests use few blocks, long ones grow smoothly, and blocks don’t have to be adjacent in memory. At each decoding step, the server consults a per-sequence table that maps logical token positions to physical KV blocks, so attention kernels can read the right keys and values without moving data.

When the current block fills, a new block is allocated and a table entry is added; when a request finishes or is truncated, its blocks are freed and returned to a pool. For multiple candidates from the same prompt (beam search or parallel sampling), the common-prefix blocks are shared, and only diverging tokens trigger copy-on-write into new blocks, eliminating redundant KV copies.
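The bookkeeping described above can be sketched in a few lines. This is an illustrative toy, not vLLM's implementation: the block size, pool, and class names are assumptions, and real engines manage GPU memory in C++/CUDA.

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only; the block
# size and data structures here are assumptions, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks."""
    def __init__(self, pool):
        self.pool = pool      # shared free-block pool
        self.blocks = []      # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.pool.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate logical position -> (block id, offset), like a page table.
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self):
        # Return blocks to the pool when the request finishes.
        self.pool.extend(self.blocks)
        self.blocks.clear()
        self.num_tokens = 0

pool = list(range(100))   # 100 free physical blocks
seq = BlockTable(pool)
for _ in range(20):       # decode 20 tokens
    seq.append_token()
print(len(seq.blocks))        # 2 blocks cover 20 tokens at 16 tokens/block
print(seq.physical_slot(17))  # (second block's id, offset 1)
```

Note that the two blocks need not be adjacent: attention kernels follow the table's pointers rather than assuming one contiguous buffer.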

Examples & Analogies

  • CI platform with diverse code checks: most requests are brief format or commit-message suggestions, while a few need long static-analysis explanations. PagedAttention lets short jobs occupy just a handful of blocks and long runs grow as needed, so both coexist without wasting KV memory.
  • Multimodal triage: many tickets get short classification-style replies, but some image+text cases require lengthy, step-by-step reasoning. With paged KV blocks on demand, the stack packs many short cases per GPU and still handles occasional long chains without pre-reserving huge buffers.
  • A/B with beam/parallel sampling: teams generate several ad-copy variants from the same prompt. PagedAttention shares KV blocks for the common prefix across variants and allocates new blocks only when each candidate diverges, cutting memory for multi-output decoding.
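The prefix-sharing case in the last example can be sketched with reference counts and copy-on-write. Everything below is a hedged toy under assumed names; a real engine would also copy the KV data on the GPU, which is elided here.

```python
# Hedged sketch of copy-on-write (COW) prefix sharing across candidates.
# Block ids and refcounts are illustrative, not a real engine's layout.
free = list(range(10))   # free physical block ids
refcount = {}            # physical block id -> number of sequences using it

def fork(parent_blocks):
    """A new candidate shares the parent's blocks; just bump refcounts."""
    for b in parent_blocks:
        refcount[b] += 1
    return list(parent_blocks)

def write_last_block(blocks):
    """Before a candidate appends into its last block, copy it if shared."""
    last = blocks[-1]
    if refcount[last] > 1:       # shared -> copy-on-write
        new = free.pop()
        refcount[last] -= 1
        refcount[new] = 1
        blocks[-1] = new         # (real code also copies the KV contents)
    return blocks

# Prompt fills two blocks, then two sampling candidates fork from it.
prompt = [free.pop() for _ in range(2)]
for b in prompt:
    refcount[b] = 1
cand_a, cand_b = fork(prompt), fork(prompt)
write_last_block(cand_a)          # candidate A diverges: last block copied
print(cand_a[-1] != cand_b[-1])   # True: only the diverging block differs
print(cand_a[0] == cand_b[0])     # True: the prefix block is still shared
```

Only the block where the candidates diverge gets duplicated; the shared prefix costs KV memory once, no matter how many candidates fork from it.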

At a Glance

|  | PagedAttention | Contiguous KV per request | FlashAttention |
|---|---|---|---|
| Memory layout | Non-contiguous paged blocks + lookup table | Single pre-reserved chunk | Unchanged (compute tiling/IO cuts) |
| Fragmentation | Low; blocks allocated on demand | High; internal/external waste | N/A (compute optimization) |
| Prefix sharing | Yes (copy-on-write) | No | N/A |
| Kernel needs | Paged-aware attention kernels | Standard kernels | Specialized fast kernels |
| Best fit | Mixed-length, multi-user serving | Uniform, fixed-length jobs | Any; speeds up attention compute |

FlashAttention speeds attention math while PagedAttention fixes KV memory waste; they are complementary.

Where and Why It Matters

  • The original vLLM/PagedAttention paper reports 2–4× higher throughput at comparable latency versus FasterTransformer and Orca on its evaluated workloads, with larger gains for longer sequences, bigger models, and more complex decoding; results depend on workload and kernel support.
  • Most value on memory-bound, mixed-length traffic: paging removes fragmentation so schedulers can admit more concurrent sequences per GPU.
  • Decoding strategies that share prefixes: beam/parallel sampling reuse common-prefix blocks, avoiding KV duplication and lowering memory per candidate.
  • Deployment trade-offs: non-contiguous KV requires paged-compatible attention kernels; this adds engineering and maintenance overhead.
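A back-of-envelope calculation shows why the KV cache, not the weights, is the bottleneck these bullets describe. The model shape below is a hypothetical 7B-class configuration (an assumption for illustration), using the standard per-token KV size: 2 (keys and values) × layers × KV heads × head dim × bytes per element.

```python
# Back-of-envelope KV-cache sizing; the model shape is a hypothetical
# 7B-class config (32 layers, 32 KV heads, head_dim 128, fp16).
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(kv_bytes_per_token)             # 524288 bytes = 0.5 MiB per token

# Contiguous allocation reserves for the max length up front; paging
# only allocates blocks the response actually fills.
reserved = 2048 * kv_bytes_per_token  # pre-reserved for a 2048-token cap
used = 200 * kv_bytes_per_token       # a short 200-token response
print(f"wasted: {(reserved - used) / 2**30:.2f} GiB per request")
```

At roughly 0.9 GiB of stranded memory per short request under this assumed config, reclaiming that waste is what lets a paged scheduler admit more concurrent sequences per GPU.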

Common Misconceptions

  • ❌ PagedAttention is the same as FlashAttention → ✅ They target different bottlenecks: KV memory layout vs attention compute speed.
  • ❌ Works everywhere without kernel changes → ✅ Requires paged-aware kernels and extra bookkeeping in the runtime.
  • ❌ Shrinks model weights significantly → ✅ Focuses on KV cache; weights still occupy GPU memory.

How It Sounds in Conversation

  • "Let’s switch the KV cache to paged blocks so mixed request lengths stop fragmenting our GPUs."
  • "Do our attention kernels support the block-pointer layout, or do we need a paged-compatible path?"
  • "Prefix sharing should cut memory for beam=8; watch the allocated block count during spikes."
  • "Hold p99 latency constant and see if paging lets us raise concurrency without OOMs."
  • "Even with FlashAttention, we’re still memory-bound; PagedAttention should lift tokens/s per GPU."
