GPU
Plain Explanation
Modern AI workloads repeat the same numeric patterns at huge scale: matrix multiplication, normalization, sampling, and vector scoring. CPUs are strong at complex control flow; GPUs are built for SIMT (single instruction, multiple threads) execution, where thousands of lightweight threads apply the same operation to different pieces of data. HBM is the high-bandwidth memory packaged next to the GPU, and its bandwidth often determines how fast inference feels, because model weights and KV cache data must be read continuously.
For LLM serving, split the problem into prefill and decode. Prefill processes the input context in larger chunks and is usually more compute-heavy. Decode creates new tokens one step at a time and repeatedly touches weights and KV cache, so memory bandwidth and GPU-to-GPU communication can dominate. In practice, "how many GPUs?" is less useful than "does the model fit in HBM?", "are shards inside the same NVLink/NVSwitch domain?", and "does the network path use RDMA or GPUDirect RDMA?"
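The prefill/decode split can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: the function names, the approximation of 2 FLOPs per parameter per token, and the accelerator figures (400 TFLOP/s, 2 TB/s) are assumptions, not specifications of any particular GPU. It also ignores activations, KV cache reads, and kernel efficiency.

```python
def flops_per_weight_byte(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate arithmetic intensity of one pass over the weights.

    Assumption: each parameter does one multiply and one add per token
    (2 FLOPs) and is read from HBM once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """FLOPs per byte above which the device is compute-bound."""
    return peak_flops / hbm_bytes_per_s


def bound(tokens_per_pass: int, peak_flops: float, hbm_bytes_per_s: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a pass as 'compute' or 'memory' bound under this simple model."""
    intensity = flops_per_weight_byte(tokens_per_pass, bytes_per_param)
    if intensity >= ridge_point(peak_flops, hbm_bytes_per_s):
        return "compute"
    return "memory"


# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FLOPS = 400e12      # 400 TFLOP/s
HBM_BYTES_PER_S = 2e12   # 2 TB/s

print("decode, 1 token per pass:", bound(1, PEAK_FLOPS, HBM_BYTES_PER_S))
print("prefill, 2048-token chunk:", bound(2048, PEAK_FLOPS, HBM_BYTES_PER_S))
```

Under these assumptions a single-sequence decode step does about 1 FLOP per weight byte, far below the ridge point of 200, while a 2048-token prefill chunk sits well above it. Batching several sequences into one decode step raises the intensity in the same way.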
Examples & Analogies
- Long‑document LLM: the first pass is math‑heavy; token generation stresses memory bandwidth.
- Streaming ASR→vocoder: early stages are compute‑centric, later ones bandwidth‑sensitive; separating the stages on the GPU can lower latency.
- Large embedding scoring: parallelize many identical dot products to shrink batch windows.
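The embedding-scoring example has the shape GPUs are built for: every score is the same dot product applied to a different row, with no dependency between rows. The following pure-Python sketch only shows that data-parallel structure; the function names are hypothetical, and on a GPU each row (or tile of rows) would be handled by its own threads rather than a Python loop.

```python
def score_embeddings(query, embeddings):
    """Return one dot-product score per embedding row.

    Each row is scored independently, which is what makes the work
    easy to spread across many GPU threads.
    """
    return [sum(q * e for q, e in zip(query, row)) for row in embeddings]


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


query = [1.0, 0.0]
embeddings = [[0.5, 0.5], [2.0, 1.0], [-1.0, 3.0]]
scores = score_embeddings(query, embeddings)
print(scores, top_k(scores, 2))
```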
At a Glance
| | GPU | Specialized AI accelerator |
|---|---|---|
| Ecosystem/tooling | Broad and mature | Varies by platform |
| Latency/throughput | Strong at larger batches | Often best at tiny‑batch latency |
| Scale | NVLink/NVSwitch scale‑up | Topology‑specialized fabrics |
Where and Why It Matters
- Model fit and memory: weights, KV cache, and batch headroom must fit in HBM for stable serving.
- Throughput vs latency: larger batches can raise tokens/sec while hurting decode p95/p99, so microbatching and queue policy matter.
- Multi-GPU placement: tensor or pipeline parallelism splits compute but adds collectives; keeping shards inside one NVLink/NVSwitch island reduces synchronization time.
- Cluster networking: RAG, multimodal, and distributed inference often move data across nodes, where SR-IOV, RDMA, and GPUDirect RDMA reduce tail latency.
- Cost control: the most expensive GPU is not always the right answer. Smaller models or low-QPS workloads may fit on CPUs or smaller accelerators; long-context, high-QPS workloads often justify GPU pools.
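The "does it fit in HBM?" question from the first bullet can be estimated with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a sizing tool: the function names, the 10% headroom, and the model shape (7e9 parameters, 32 layers, 8 KV heads, head dimension 128, FP16) are all assumptions chosen for illustration. It ignores activations, framework overhead, and fragmentation.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights (2 bytes per parameter for FP16/BF16)."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Memory for the KV cache: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def fits_in_hbm(hbm_bytes: float, weights: float, kv: float,
                headroom_fraction: float = 0.1) -> bool:
    """True if weights plus KV cache fit while leaving some HBM free."""
    return weights + kv <= hbm_bytes * (1.0 - headroom_fraction)


# Hypothetical model and device, for illustration only.
W = weight_bytes(7e9)                                # 14 GB of FP16 weights
HBM = 80e9                                           # 80 GB device

for batch in (16, 64):
    kv = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=batch)
    print(batch, round(kv / 1e9, 1), fits_in_hbm(HBM, W, kv))
```

With these assumed numbers, a batch of 16 sequences at 8192 tokens needs about 17 GB of KV cache and fits, while a batch of 64 needs about 69 GB and does not. This is why context length and batch size, not just parameter count, decide model fit.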
Common Misconceptions
- ❌ More GPUs always lower latency → ✅ Communication can dominate and raise p99.
- ❌ GPU memory capacity is the only spec that matters → ✅ Bandwidth, interconnect, kernel efficiency, and queueing matter too.
- ❌ One benchmark generalizes → ✅ Bottlenecks shift with model, batch size, sequence length, precision, and topology.
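The first misconception can be illustrated with a toy cost model. The sketch below assumes tensor parallelism with a ring all-reduce costed by the standard latency-plus-bandwidth formula; the function names, the 64 all-reduces per decode step, the 8 KB message size, and both link profiles are hypothetical values. Real collective libraries choose algorithms based on topology, so treat this as a qualitative picture only.

```python
def ring_allreduce_s(n_gpus, msg_bytes, link_latency_s, link_bytes_per_s):
    """Ring all-reduce time: 2(n-1) steps, each moving msg_bytes/n."""
    if n_gpus == 1:
        return 0.0
    steps = 2 * (n_gpus - 1)
    return steps * link_latency_s + (steps / n_gpus) * msg_bytes / link_bytes_per_s


def decode_step_ms(n_gpus, weight_bytes, hbm_bytes_per_s, allreduces_per_step,
                   msg_bytes, link_latency_s, link_bytes_per_s):
    """Toy decode step: sharded weight reads plus collective time."""
    read_s = (weight_bytes / n_gpus) / hbm_bytes_per_s
    comm_s = allreduces_per_step * ring_allreduce_s(
        n_gpus, msg_bytes, link_latency_s, link_bytes_per_s)
    return (read_s + comm_s) * 1e3


# Hypothetical workload and links, for illustration only.
MODEL = dict(weight_bytes=14e9, hbm_bytes_per_s=2e12,
             allreduces_per_step=64, msg_bytes=8192)
FAST_LINK = dict(link_latency_s=5e-6, link_bytes_per_s=400e9)   # scale-up style link
SLOW_LINK = dict(link_latency_s=50e-6, link_bytes_per_s=25e9)   # cross-node style link

for name, link in (("fast", FAST_LINK), ("slow", SLOW_LINK)):
    for n in (1, 2, 4, 8):
        print(name, n, round(decode_step_ms(n, **MODEL, **link), 2))
```

With these made-up numbers, the fast link improves step time up to 4 GPUs and then gets worse at 8, because the latency term grows with the number of ring steps while the saved memory reads shrink. On the slow link, every added GPU makes the step slower than running on one.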
How It Sounds in Conversation
- "Decode is bandwidth‑bound; let’s microbatch."
- "Pin shards to the same NVLink domain; avoid cross‑node hops."
- "Overlap copies with compute using async transfers to hide stalls."
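When someone says "let's microbatch", the underlying policy is usually some variant of "close a batch when it is full or when the oldest request has waited long enough". The sketch below is one minimal way to express that rule over a list of arrival times; the function name and the parameters `max_batch` and `max_wait_ms` are hypothetical, and a real serving stack would apply the rule to a live queue rather than a precomputed list.

```python
def form_batches(arrival_ms, max_batch, max_wait_ms):
    """Group sorted arrival times (in ms) into microbatches.

    A batch is closed when it already holds max_batch requests, or when
    the next request arrives more than max_wait_ms after the oldest
    request in the batch.
    """
    batches = []
    current = []
    for t in arrival_ms:
        if current and (len(current) == max_batch or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


print(form_batches([0, 1, 2, 3, 20, 21, 50], max_batch=3, max_wait_ms=5))
```

A larger `max_batch` or `max_wait_ms` tends to raise throughput at the cost of tail latency, which is the trade-off to tune against p95/p99 targets.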
References
- CUDA C++ Programming Guide
Reference for kernels, warps, blocks, memory hierarchy, and CUDA execution.
- CUDA C++ Best Practices Guide
Performance guidance for occupancy, memory coalescing, and transfers.
- NVIDIA Collective Communication Library Documentation
Operational reference for multi-GPU collectives and topology behavior.
- GPUDirect RDMA Documentation
Explains low-latency data paths between GPU memory and network devices.
- NVIDIA Inference Reference Architecture
Inference guidance for batching, networking, locality, and validation.