GPU

Plain Explanation

Modern AI repeats the same numeric patterns at huge scale: matrix multiplication, normalization, sampling, and vector scoring. CPUs are strong at complex control flow; GPUs are built for SIMT (single instruction, multiple threads) execution, where thousands of lightweight threads apply the same operation to different pieces of data. HBM is the high-bandwidth memory packaged next to the GPU, and its bandwidth often determines how fast inference feels, because model weights and KV cache data must be read continuously.
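The SIMT idea can be sketched in plain Python. This is only an emulation under stated assumptions: `saxpy_kernel` and `launch` are hypothetical names, and the "threads" run one after another in a loop, whereas a real GPU runs them in parallel in hardware.

```python
# Emulates SIMT in plain Python: one kernel body, many thread indices.
# On a real GPU the threads run in parallel; here they run sequentially.

def saxpy_kernel(thread_id, a, x, y, out):
    """Body that every thread runs; only thread_id differs between threads."""
    if thread_id < len(x):  # bounds guard, as in a typical GPU kernel
        out[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, num_threads, *args):
    """Hypothetical launcher: calls the kernel once per thread index."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, 8, 2.0, x, y, out)  # 8 threads launched, 4 do useful work
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Every thread executes the same instructions on its own element, which is why uniform workloads such as matrix multiplication and vector scoring map well to this model.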

For LLM serving, split the problem into prefill and decode. Prefill processes the input context in larger chunks and is usually more compute-heavy. Decode creates new tokens one step at a time and repeatedly touches weights and KV cache, so memory bandwidth and GPU-to-GPU communication can dominate. In practice, "how many GPUs?" is less useful than "does the model fit in HBM?", "are shards inside the same NVLink/NVSwitch domain?", and "does the network path use RDMA or GPUDirect RDMA?"
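A rough roofline-style check illustrates why prefill tends to be compute-heavy and decode bandwidth-heavy. It is a minimal sketch, not a performance model: it counts only weight reads (ignoring KV cache and activations), assumes about 2 FLOPs per parameter per token, and uses hypothetical hardware numbers rather than any vendor's specifications.

```python
# Roofline-style check: is a pass limited by math or by HBM bandwidth?
# All hardware numbers below are hypothetical placeholders, not vendor specs.

def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """FLOPs per byte of weights read: about 2 FLOPs per parameter per token."""
    return 2.0 * tokens_per_pass / bytes_per_param

def classify(tokens_per_pass, bytes_per_param, peak_flops, hbm_bytes_per_s):
    """Compare the workload's intensity with the machine's FLOPs-per-byte balance."""
    machine_balance = peak_flops / hbm_bytes_per_s
    if arithmetic_intensity(tokens_per_pass, bytes_per_param) >= machine_balance:
        return "compute-bound"
    return "memory-bound"

def min_decode_step_seconds(num_params, bytes_per_param, hbm_bytes_per_s):
    """Lower bound on one decode step: every weight is read once from HBM."""
    return num_params * bytes_per_param / hbm_bytes_per_s

PEAK_FLOPS = 1e15  # hypothetical accelerator peak, FLOP/s
HBM_BW = 3e12      # hypothetical HBM bandwidth, bytes/s

print(classify(2048, 2, PEAK_FLOPS, HBM_BW))  # prefill over 2048 tokens: compute-bound
print(classify(1, 2, PEAK_FLOPS, HBM_BW))     # decode, one token: memory-bound
```

Under these assumptions, batching more sequences into each decode step raises `tokens_per_pass` and moves decode closer to the compute-bound side, at the cost of per-request latency.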

Examples & Analogies

Long‑document LLM: the first pass is math‑heavy; token generation stresses memory bandwidth.
Streaming ASR→vocoder: early stages are compute‑centric, later ones bandwidth‑sensitive; scheduling the two kinds of stages separately on the GPU can lower latency.
Large embedding scoring: parallelize many identical dot products to shrink batch windows.

At a Glance

	GPU	Specialized AI accelerator
Ecosystem/tooling	Broad and mature	Varies by platform
Latency/throughput	Strong at larger batches	Often best at tiny‑batch latency
Scale	NVLink/NVSwitch scale‑up	Topology‑specialized fabrics

Where and Why It Matters

Model fit and memory: weights, KV cache, and batch headroom must fit in HBM for stable serving.
Throughput vs latency: larger batches can raise tokens/sec while hurting decode p95/p99, so microbatching and queue policy matter.
Multi-GPU placement: tensor or pipeline parallelism splits compute but adds collectives; keeping shards inside one NVLink/NVSwitch island reduces synchronization time.
Cluster networking: RAG, multimodal, and distributed inference often move data across nodes, where SR-IOV, RDMA, and GPUDirect RDMA reduce tail latency.
Cost control: the most expensive GPU is not always the right answer. Smaller models or low QPS may fit CPU/smaller accelerators; long-context high-QPS workloads often justify GPU pools.
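The model-fit question above can be approximated with simple arithmetic. The sketch below uses the common estimate of one K and one V tensor per layer for the KV cache; the model shape, the HBM sizes, and the 10% headroom default are hypothetical values chosen for illustration, and real servers also need memory for activations and runtime overhead.

```python
# Back-of-the-envelope HBM budget for serving. Model shape, HBM sizes, and the
# headroom fraction are hypothetical; activations and runtime overhead are ignored.

def weight_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    # The factor 2 covers one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def fits_in_hbm(total_bytes, hbm_bytes, headroom_fraction=0.1):
    """True if the total fits while leaving a fraction of HBM free."""
    return total_bytes <= hbm_bytes * (1 - headroom_fraction)

weights = weight_bytes(7 * 10**9, 2)  # hypothetical 7B-parameter model in 16-bit
kv = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                    seq_len=4096, batch_size=8, bytes_per_elem=2)
total = weights + kv

print(kv / 2**30)                     # 16.0 (GiB of KV cache)
print(fits_in_hbm(total, 80 * 10**9)) # True
print(fits_in_hbm(total, 24 * 10**9)) # False
```

In this example the KV cache is larger than the weights, which is why sequence length and batch size belong in the fit calculation alongside parameter count.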

Common Misconceptions

❌ More GPUs always lower latency → ✅ Communication can dominate and raise p99.
❌ GPU memory capacity is the only spec that matters → ✅ Bandwidth, interconnect, kernel efficiency, and queueing matter too.
❌ One benchmark generalizes → ✅ Bottlenecks shift with model, batch size, sequence length, precision, and topology.
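The first misconception can be illustrated with a simplified cost model. It assumes compute divides evenly across GPUs and that each collective is a ring all-reduce with 2(n-1) steps of size message/n; all timings, message sizes, and link speeds are hypothetical, and real libraries choose algorithms by topology.

```python
# Simplified step-time model: compute shrinks with GPU count, collectives do not.
# All numbers are hypothetical; a ring all-reduce is assumed for every collective.

def ring_allreduce_seconds(num_gpus, msg_bytes, link_bytes_per_s, hop_latency_s):
    """Ring all-reduce: 2*(n-1) steps, each moving msg_bytes/n over one link."""
    if num_gpus <= 1:
        return 0.0
    steps = 2 * (num_gpus - 1)
    return steps * (msg_bytes / num_gpus / link_bytes_per_s + hop_latency_s)

def step_seconds(num_gpus, single_gpu_compute_s, collectives_per_step,
                 msg_bytes, link_bytes_per_s, hop_latency_s):
    compute = single_gpu_compute_s / num_gpus
    comm = collectives_per_step * ring_allreduce_seconds(
        num_gpus, msg_bytes, link_bytes_per_s, hop_latency_s)
    return compute + comm

def model(num_gpus, hop_latency_s):
    # Hypothetical decode step: 2 ms on one GPU, 64 small all-reduces per step.
    return step_seconds(num_gpus, 2e-3, 64, 65536, 50e9, hop_latency_s)

for n in (1, 2, 8):
    # Columns: GPU count, step time with 1 us hops, step time with 10 us hops.
    print(n, model(n, 1e-6), model(n, 10e-6))
```

With the low per-hop latency, two GPUs beat one in this model; with the higher per-hop latency, adding GPUs makes each step slower, which is the pattern behind keeping shards inside one NVLink/NVSwitch domain.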

How It Sounds in Conversation

"Decode is bandwidth‑bound; let’s microbatch."
"Pin shards to the same NVLink domain; avoid cross‑node hops."
"Overlap copies with compute using async transfers to hide stalls."
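The last remark, about overlapping copies with compute, can be sketched as a two-stage pipeline estimate. This is an idealized model with hypothetical function names: it assumes equal-sized chunks, one copy engine, one compute engine, and perfect overlap, with no scheduling overhead.

```python
# Idealized two-stage pipeline: copy chunk i+1 while computing on chunk i.
# Assumes equal chunks and perfect overlap; times are in arbitrary units.

def serial_time(num_chunks, copy_s, compute_s):
    """Each chunk is copied, then computed, with no overlap."""
    return num_chunks * (copy_s + compute_s)

def overlapped_time(num_chunks, copy_s, compute_s):
    """First copy and last compute are exposed; the slower stage sets the pace."""
    if num_chunks == 0:
        return 0.0
    return copy_s + (num_chunks - 1) * max(copy_s, compute_s) + compute_s

print(serial_time(4, 1.0, 2.0))      # 12.0
print(overlapped_time(4, 1.0, 2.0))  # 9.0
```

The saving is bounded by the faster stage: overlap hides whichever of copy or compute is shorter, so it helps most when the two are of similar duration.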

References

CUDA C++ Programming Guide (NVIDIA): Reference for kernels, warps, blocks, memory hierarchy, and CUDA execution.
CUDA C++ Best Practices Guide (NVIDIA): Performance guidance for occupancy, memory coalescing, and transfers.
NVIDIA Collective Communication Library (NCCL) Documentation (NVIDIA): Operational reference for multi-GPU collectives and topology behavior.
GPUDirect RDMA Documentation (NVIDIA): Explains low-latency data paths between GPU memory and network devices.
NVIDIA Inference Reference Architecture (NVIDIA): Inference guidance for batching, networking, locality, and validation.
