Vol.01 · No.10 CS · AI · Infra May 30, 2026

AI Glossary


GPU

Plain Explanation

Modern AI repeats the same numeric patterns at huge scale: matrix multiplication, normalization, sampling, and vector scoring. CPUs are strong at complex control flow; GPUs are built for SIMT (single instruction, multiple threads) execution, where thousands of lightweight threads apply the same operation to different pieces of data. HBM (High Bandwidth Memory) is the memory packaged next to the GPU, and its bandwidth often determines whether inference feels fast, because model weights and KV cache data must be re-read for every generated token.

For LLM serving, split the problem into prefill and decode. Prefill processes the input context in larger chunks and is usually more compute-heavy. Decode creates new tokens one step at a time and repeatedly touches weights and KV cache, so memory bandwidth and GPU-to-GPU communication can dominate. In practice, "how many GPUs?" is less useful than "does the model fit in HBM?", "are shards inside the same NVLink/NVSwitch domain?", and "does the network path use RDMA or GPUDirect RDMA?"
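The prefill/decode split can be made concrete with a roofline-style check. The sketch below is illustrative only: the accelerator figures (400 TFLOP/s, 2 TB/s) and the 7B-parameter model are hypothetical, the "about 2 FLOPs per parameter per token" rule is a common approximation for dense models, and KV cache and activation traffic are ignored.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def is_memory_bound(flops, bytes_moved, peak_flops, mem_bandwidth) -> bool:
    """Roofline test: memory-bound when intensity is below the machine balance."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs per byte the chip can sustain
    return arithmetic_intensity(flops, bytes_moved) < machine_balance


# Hypothetical accelerator and model, chosen only for illustration.
PEAK_FLOPS = 400e12        # 400 TFLOP/s
HBM_BANDWIDTH = 2e12       # 2 TB/s
PARAMS = 7e9               # 7B parameters
WEIGHT_BYTES = PARAMS * 2  # 16-bit weights


def step_cost(tokens: int):
    """Approximate one forward pass: ~2 FLOPs per parameter per token,
    with the weights read once no matter how many tokens share the pass."""
    return 2 * PARAMS * tokens, WEIGHT_BYTES


prefill_flops, prefill_bytes = step_cost(tokens=2048)  # whole prompt in one pass
decode_flops, decode_bytes = step_cost(tokens=1)       # one new token per pass

prefill_memory_bound = is_memory_bound(
    prefill_flops, prefill_bytes, PEAK_FLOPS, HBM_BANDWIDTH
)
decode_memory_bound = is_memory_bound(
    decode_flops, decode_bytes, PEAK_FLOPS, HBM_BANDWIDTH
)

# Lower bound on one decode step: time to stream the weights from HBM once.
decode_floor_s = decode_bytes / HBM_BANDWIDTH
```

Under these assumptions prefill lands on the compute side of the roofline and single-sequence decode on the memory side, with a floor of roughly 7 ms per token from weight reads alone. Batching several sequences into one decode step raises the intensity, which is why batch size shifts the bottleneck.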

Examples & Analogies

  • Long‑document LLM: the first pass is math‑heavy; token generation stresses memory bandwidth.
  • Streaming ASR→vocoder: early stages are compute‑centric, later ones bandwidth‑sensitive; separating them on the GPU can lower latency.
  • Large embedding scoring: parallelize many identical dot products to shrink batch windows.
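The embedding-scoring case can be sketched in plain Python to show why it parallelizes well: every dot product is independent of the others, which is the pattern a GPU would spread across threads. The function names and sample vectors here are hypothetical, and the loop runs serially on the CPU; it only illustrates the structure of the work.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_all(query, embeddings):
    """Score every embedding against the query.

    No iteration depends on another, so on a GPU each score could be
    computed by its own thread (or group of threads).
    """
    return [dot(query, e) for e in embeddings]


def top_k(query, embeddings, k):
    """Indices of the k highest-scoring embeddings, best first."""
    scores = score_all(query, embeddings)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Tiny illustrative data set.
query = [1.0, 0.0, 1.0]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
best = top_k(query, embeddings, k=2)
```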

At a Glance

                     GPU                         Specialized AI accelerator
Ecosystem/tooling    Broad and mature            Varies by platform
Latency/throughput   Strong at larger batches    Often best at tiny‑batch latency
Scale                NVLink/NVSwitch scale‑up    Topology‑specialized fabrics

Where and Why It Matters

  • Model fit and memory: weights, KV cache, and batch headroom must fit in HBM for stable serving.
  • Throughput vs latency: larger batches can raise tokens/sec while hurting decode p95/p99, so microbatching and queue policy matter.
  • Multi-GPU placement: tensor or pipeline parallelism splits compute but adds collectives; keeping shards inside one NVLink/NVSwitch island reduces synchronization time.
  • Cluster networking: RAG, multimodal, and distributed inference often move data across nodes, where SR-IOV, RDMA, and GPUDirect RDMA reduce tail latency.
  • Cost control: the most expensive GPU is not always the right answer. Smaller models or low-QPS workloads may fit on CPUs or smaller accelerators; long-context, high-QPS workloads often justify GPU pools.
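The "model fit and memory" point comes down to simple arithmetic. As a rough sketch, assuming a standard attention layout in which each layer stores one key and one value tensor per token, the footprint can be estimated as below. The model shape, the 80 GB capacity, and the 10% headroom are hypothetical values chosen for illustration.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the model weights."""
    return num_params * bytes_per_param


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """KV cache size: the leading 2 counts one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def fits_in_hbm(total_bytes, hbm_bytes, headroom=0.10):
    """True if the footprint fits while leaving a fraction of HBM free."""
    return total_bytes <= hbm_bytes * (1 - headroom)


# Hypothetical 7B-parameter model with 16-bit weights and KV cache.
weights = weight_bytes(num_params=7_000_000_000, bytes_per_param=2)
kv_batch_8 = kv_cache_bytes(
    layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8, bytes_per_elem=2
)
kv_batch_32 = kv_cache_bytes(
    layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=32, bytes_per_elem=2
)

HBM_BYTES = 80_000_000_000  # hypothetical 80 GB device

fits_at_batch_8 = fits_in_hbm(weights + kv_batch_8, HBM_BYTES)
fits_at_batch_32 = fits_in_hbm(weights + kv_batch_32, HBM_BYTES)
```

With these numbers the KV cache at batch 8 is already larger than the weights, and batch 32 no longer fits on one device, which is where sharding or a smaller batch becomes necessary.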

Common Misconceptions

  • ❌ More GPUs always lower latency → ✅ Communication can dominate and raise p99.
  • ❌ GPU memory capacity is the only spec that matters → ✅ Bandwidth, interconnect, kernel efficiency, and queueing matter too.
  • ❌ One benchmark generalizes → ✅ Bottlenecks shift with model, batch size, sequence length, precision, and topology.
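The first misconception can be illustrated with a deliberately simple toy model: per-step compute divides evenly across GPUs, while communication cost grows with the number of peers. Both the linear communication term and the millisecond values are assumptions made for illustration, not measurements of any real interconnect.

```python
def step_latency_ms(num_gpus: int, compute_ms: float, comm_ms_per_peer: float) -> float:
    """Toy per-step latency: ideal compute scaling plus a per-peer sync cost."""
    compute = compute_ms / num_gpus
    comm = comm_ms_per_peer * (num_gpus - 1)  # zero on a single GPU
    return compute + comm


# Hypothetical workload: 8 ms of compute per step, 0.5 ms of sync per peer.
gpu_counts = [1, 2, 4, 8, 16]
latencies = {
    n: step_latency_ms(n, compute_ms=8.0, comm_ms_per_peer=0.5) for n in gpu_counts
}

# GPU count with the lowest modeled latency.
best_count = min(gpu_counts, key=lambda n: latencies[n])
```

In this model latency improves up to 4 GPUs and then worsens; at 16 GPUs it is back to the single-GPU figure. Real systems have different curves, but the shape is the reason the bottleneck has to be measured per model and topology.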

How It Sounds in Conversation

  • "Decode is bandwidth‑bound; let’s microbatch."
  • "Pin shards to the same NVLink domain; avoid cross‑node hops."
  • "Overlap copies with compute using async transfers to hide stalls."
