Model parallelism
Plain Explanation
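Big models outgrew single‑GPU memory. Even if you shrink the batch, a large language model can hold dozens to over a hundred layers and billions of parameters, easily exceeding one device. FP16 stores 2 bytes per parameter, so a 10B‑parameter model needs roughly 20 GB for weights alone, before activations, KV cache, or (in training) gradients and optimizer state. Past a point (often around 10B+ parameters in FP16), you can’t fit the weights and activations on one card at all.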
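The memory arithmetic can be sketched in a few lines. This is a back-of-the-envelope estimate, not a sizing tool: the function names are hypothetical, and `usable_fraction` (the share of GPU memory left for weights after activations, KV cache, and runtime overhead) is an assumed placeholder you would replace with a measured value.

```python
import math

def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; 2 bytes per parameter corresponds to FP16."""
    return num_params * bytes_per_param

def min_gpus_for_weights(num_params: int, gpu_mem_bytes: int,
                         usable_fraction: float = 0.7,
                         bytes_per_param: int = 2) -> int:
    """Smallest GPU count whose combined usable memory holds the weights.

    usable_fraction is an assumption, not a measured figure.
    """
    usable = gpu_mem_bytes * usable_fraction
    return math.ceil(weight_bytes(num_params, bytes_per_param) / usable)

GB = 10**9
print(weight_bytes(10 * 10**9) / GB)              # weights of a 10B model, in GB
print(min_gpus_for_weights(70 * 10**9, 80 * GB))  # 70B model on 80 GB cards
```

The estimate covers inference-style weight storage only; training needs considerably more per parameter because gradients and optimizer state are stored as well.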
Model parallelism solves this by spreading the model across multiple GPUs and letting data flow through them. Picture an assembly line: each GPU owns part of the model, so the intermediate results (activations) move from one GPU to the next until the full forward pass is done. You can split by stages (pipeline parallelism), slice big tensors so multiple GPUs multiply chunks in parallel (tensor parallelism), or send tokens to specialized sub‑networks (expert parallelism) when using mixture‑of‑experts.
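To make the tensor-parallel idea concrete, here is a minimal pure-Python sketch that simulates GPUs as list slices. It assumes a column-wise split of the weight matrix, which is one common choice rather than the only one; all function names are illustrative, and real frameworks do this with GPU tensors and collective communication.

```python
def matmul(x, w):
    """Plain matrix multiply on lists of lists: (n x k) @ (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, num_shards):
    """Give each simulated GPU a contiguous block of W's columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must divide evenly across shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]

def tensor_parallel_matmul(x, w, num_shards):
    """Each shard multiplies the full input by its column block of W."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Stand-in for an all-gather: concatenate the column blocks row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, num_shards=2) == matmul(x, w))
```

Each simulated GPU stores only half of `w`, which is where the memory saving comes from; the concatenation step is the communication cost paid in exchange.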
Mechanically, this reduces per‑GPU memory because no single device holds the whole model. The trade‑off is communication: activations and partial results must hop across GPUs, so throughput and latency depend on interconnect bandwidth and scheduling. Pipeline parallelism improves device utilization versus naïve layer‑splitting by feeding micro‑batches through the stages so that several GPUs work at once, while tensor parallelism accelerates large matrix ops by sharing them across GPUs; both still rely on fast GPU‑to‑GPU links for good performance.
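The utilization gain from pipelining can be estimated with the standard idle-time ("bubble") fraction for a simple schedule in which every stage takes the same time per micro-batch. That equal-time assumption, and the function name, are simplifications for illustration; interleaved schedules and uneven stages behave differently.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Share of time a GPU sits idle in a simple, equal-stage pipeline schedule.

    With p stages and m micro-batches the idle share is (p - 1) / (m + p - 1).
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# One micro-batch is the naive layer split: only one of four GPUs works at a time.
print(bubble_fraction(num_stages=4, num_microbatches=1))   # 0.75
# More micro-batches keep more stages busy and shrink the idle share.
print(round(bubble_fraction(num_stages=4, num_microbatches=16), 3))
```

The formula also shows why adding stages without adding micro-batches hurts: the numerator grows with the stage count.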
Examples & Analogies
- 4‑GPU layer split for a 32‑layer LLM: A 32‑layer model is deployed so each GPU holds 8 layers. Each request’s activations flow GPU‑to‑GPU in sequence via point‑to‑point communication, allowing the model to run even though it can’t fit on a single card.
- Mixture‑of‑Experts routing: An MoE model routes each token only to a few active experts spread across GPUs. Inactive experts don’t compute, reducing per‑token compute, but expert outputs must be gathered and recombined across GPUs, increasing network traffic on the interconnect fabric.
- Serving with sharded weights and moving KV cache blocks: During inference, model data services move weights, tensors, and KV cache blocks between tiers and devices to manage memory pressure. This helps keep hot data on GPU while staging large payloads needed to serve LLM requests.
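The 4‑GPU layer split from the first example can be sketched as follows. The helper names are hypothetical, GPUs are simulated by plain loops, and the comment marks where a real system would do a point-to-point send.

```python
def assign_layers(num_layers: int, num_gpus: int):
    """Split layer indices into contiguous, near-even stages, one per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # early stages absorb any remainder
        stages.append(range(start, start + size))
        start += size
    return stages

def forward(x, layers, stages):
    """Run the stages in order, handing the activation from one GPU to the next."""
    for layer_ids in stages:
        for i in layer_ids:
            x = layers[i](x)
        # A real deployment would send x to the next GPU here (point-to-point).
    return x

stages = assign_layers(num_layers=32, num_gpus=4)
print([len(s) for s in stages])          # layers per GPU
layers = [lambda v: v + 1] * 32          # toy layers: each adds 1
print(forward(0, layers, stages))        # activation after all 32 layers
```

An even split by layer count is only a starting point; stages are often rebalanced by measured time or memory, since embedding and output layers rarely cost the same as the middle blocks.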
At a Glance
| | Tensor parallelism | Pipeline parallelism | Expert parallelism (MoE) |
|---|---|---|---|
| What is split | Matrices/tensors within a layer | Layers/stages across GPUs | Experts/sub‑networks across GPUs |
| Goal | Speed big matmuls, cut per‑GPU memory | Fit deep models, keep GPUs busy | Activate few experts per token to save compute |
| Communication | Frequent all‑reduce/all‑gather | Stage‑to‑stage activations | Token routing + result gathering |
| Utilization behavior | Parallel within layers | Parallel across micro‑batches | Parallel across active experts |
| Sensitivity | Interconnect bandwidth/latency | Scheduling + interconnect | Interconnect for routing/merge |
These model‑parallel flavors trade memory relief for added cross‑GPU communication, and the best choice depends on where your bottleneck sits: within layers, across depth, or in sparse expert routing.
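The "token routing + result gathering" pattern in the expert-parallel column can be illustrated with a small sketch. It assumes top-k routing on router scores, a common MoE scheme; the scores, the expert-to-GPU placement, and the function names are made up for illustration.

```python
def route_top_k(router_scores, k):
    """For each token, pick the ids of its k highest-scoring experts."""
    return [
        sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for scores in router_scores
    ]

def tokens_per_gpu(assignments, expert_to_gpu):
    """Group token ids by the GPU hosting each chosen expert (the dispatch step)."""
    dispatch = {}
    for token_id, experts in enumerate(assignments):
        for expert in experts:
            dispatch.setdefault(expert_to_gpu[expert], []).append(token_id)
    return dispatch

# Two tokens, four experts; experts 0-1 live on GPU 0, experts 2-3 on GPU 1.
scores = [[0.1, 0.7, 0.2, 0.0],
          [0.5, 0.1, 0.1, 0.3]]
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1}

assignments = route_top_k(scores, k=2)
print(assignments)
print(tokens_per_gpu(assignments, expert_to_gpu))
```

In this toy case both tokens need experts on both GPUs, so every token crosses the interconnect twice: once to reach its experts and once to bring the outputs back for combining.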
Where and Why It Matters
- NVIDIA Megatron‑based training stacks: Mix tensor, pipeline, sequence, context, and expert parallelism to scale LLMs, reducing per‑GPU memory while sustaining training performance.
- Serving large LLMs under memory pressure: Model‑data planes move weights, tensors, and KV cache blocks so models that exceed a single GPU can still be served, with deployment bottlenecks shifting to model load and data movement.
- When a single GPU is not enough: Once model memory exceeds a single card (typical for large LLMs around 10B+ parameters in FP16), model parallelism becomes a practical requirement to train or deploy.
- Engineering practice shift: Teams benchmark different parallelism mixes on their interconnect fabric, since throughput and latency hinge on GPU‑to‑GPU bandwidth and scheduling rather than just raw FLOPs.
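When benchmarking parallelism mixes, a useful first step is listing which combinations are even possible for a given GPU count. The sketch below assumes the usual constraint that tensor, pipeline, and data parallel sizes multiply to the total number of GPUs, and ignores sequence, context, and expert parallelism for simplicity. The function name and the `max_tp` cap are illustrative assumptions, not part of any framework.

```python
def parallelism_mixes(world_size: int, max_tp: int = 8):
    """List (tensor, pipeline, data) parallel sizes whose product is world_size.

    max_tp caps tensor parallelism, reflecting the common (assumed) practice of
    keeping tensor-parallel groups within one node's fast interconnect.
    """
    mixes = []
    for tp in range(1, min(max_tp, world_size) + 1):
        if world_size % tp:
            continue
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                mixes.append((tp, pp, rest // pp))
    return mixes

for tp, pp, dp in parallelism_mixes(world_size=8):
    print(f"TP={tp} PP={pp} DP={dp}")
```

Each candidate would then be benchmarked on the actual fabric, since the list says nothing about which mix performs best.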
Common Misconceptions
- ❌ Myth: Model parallelism always speeds up a single request. → ✅ Reality: Extra cross‑GPU communication can add latency; benefits are often memory capacity and throughput, not per‑request speed.
- ❌ Myth: It’s only a training trick. → ✅ Reality: Inference stacks also shard models and move KV cache blocks and tensors to serve large LLMs that don’t fit on one GPU.
- ❌ Myth: More GPUs always means better performance. → ✅ Reality: Network topology and bandwidth can bottleneck; the wrong split can under‑utilize GPUs.
How It Sounds in Conversation
- "Let’s bump tensor_model_parallel_size to 2 and pipeline_model_parallel_size to 4; the current shard config is memory‑safe but under‑utilizing the fabric."
- "Our latency spike aligns with cross‑GPU all‑gathers; we need to profile interconnect bandwidth before increasing TP again."
- "For the 32‑layer deployment on 4 GPUs, interleave stages so pipeline bubbles shrink and throughput improves."
- "KV cache growth is pushing us into model‑data plane transfers; can Kubernetes placement keep pipeline neighbors on the same host?"
- "MoE routing looks chatty because active experts span nodes; we should co‑locate experts or cap expert_model_parallel_size for this SLA."
References
- Parallelisms Guide — Megatron Bridge
Official guide covering tensor, pipeline, sequence, context, and expert parallelism.
- NVIDIA Inference Reference Architecture
Reference stack for serving LLMs, including model-data movement of weights, tensors, and KV cache.
- Model Deployment and Orchestration
Deployment context: Kubernetes, scalability, and observability for production models.
- What is Inference Parallelism and How it Works
Shows practical parallelism combinations and notes interconnect-driven latency.
- The Complete Guide to Multi-GPU Training: Scaling AI Models Beyond Single-Card Limitations
Explains model vs pipeline parallelism and why inter-GPU communication is critical.