MoE
Mixture of Experts
Plain Explanation
Language models often improve as parameters grow, but dense designs make every token traverse all weights, raising cost. MoE (Mixture of Experts) adds many specialist feed-forward networks (experts) and a router that picks a small top‑k subset per token. Like a switchboard, each token is connected to a few relevant specialists rather than all of them. In practice, a Transformer block’s single FFN is replaced by a bank of expert FFNs; the router scores experts for each token, selects top‑k, and a combiner merges those outputs for the next layer. Total capacity can grow with the number of experts, while per‑token compute stays near k FFN passes.
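The sketch below, written in PyTorch, shows this routing-and-combining step in its simplest form; the class name, dimensions, and the plain linear router are illustrative assumptions rather than any particular model's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k MoE layer: a linear router scores experts per token,
    the top-k expert FFNs run, and their outputs are combined with the
    renormalized router weights."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # one score per expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten to a list of tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = self.router(tokens)                         # (n_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over the selected k

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                          # expert chosen in this slot, per token
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                              # tokens routed to expert e via this slot
                if mask.any():
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape_as(x)
```

The per-token Python loop over experts is kept for readability; real systems group tokens by expert and dispatch them in batches, which is where the systems challenges discussed below come from.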
Examples & Analogies
- Multilingual text: the router steers tokens toward experts tuned to particular language patterns.
- Numeric/symbol-heavy inputs: tokens with mathematical or tabular structure activate experts that capture such patterns.
- Multi-domain corpora: tokens from different domains are routed to specialists, retaining capacity without running all experts.
At a Glance
| | Dense Transformer (single FFN) | MoE (sparse experts) |
|---|---|---|
| FFN structure | One FFN per block | Many expert FFNs per block |
| Activation per token | All weights used | Only top‑k experts run |
| Total parameters | Equal to active parameters | Grows with expert count (E); active params ~k FFNs |
| Compute per token | ~1 FFN pass | ~k FFN passes + routing/combining |
| System implications | Single path | Routing and token dispatch required |
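To make the last two rows concrete, here is a back-of-the-envelope comparison; the layer sizes below are made-up round numbers rather than figures from any specific model.

```python
# Hypothetical round numbers, chosen only to make the dense-vs-MoE contrast visible.
d_model, d_ff = 1024, 4096
num_experts, k = 64, 2

ffn_params = 2 * d_model * d_ff           # up-projection + down-projection, biases ignored

dense_total = ffn_params                  # dense: total equals active
moe_total   = num_experts * ffn_params    # MoE: capacity grows with the number of experts
moe_active  = k * ffn_params              # MoE: per-token compute stays near k FFN passes

print(f"dense FFN params:        {dense_total:>12,}")  # ~8.4M
print(f"MoE total FFN params:    {moe_total:>12,}")    # ~537M
print(f"MoE active per token:    {moe_active:>12,}")   # ~16.8M
```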
Where and Why It Matters
- Capacity scaling: increase the number of experts (E) to expand representational capacity while keeping per‑token compute near k FFN passes.
- Common integration: replace the FFN inside Transformer blocks with an MoE layer.
- Systems focus: sparse, dynamic routing makes communication, buffering, and scheduling critical.
- Interpretability angle: analyzing which experts activate for which inputs is a common research direction.
Common Misconceptions
- ❌ All experts run on every token → ✅ Only a small top‑k subset is activated.
- ❌ MoE is automatically cheaper → ✅ Routing/communication add overhead; engineering is required to realize gains.
- ❌ MoE changes attention → ✅ Typically, it replaces the FFN path, not attention.
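To underline the last point, the sketch below shows where the swap happens inside a block: attention stays exactly as in a dense Transformer, and only the FFN slot becomes an MoE layer. It assumes the MoELayer sketch from earlier is in scope, and the pre-norm layout is a simplifying assumption.

```python
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Pre-norm block sketch: the attention path is the usual one; only the
    FFN slot is replaced by the MoELayer sketched earlier."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe_ffn = MoELayer(d_model, d_ff, num_experts, k)  # the only MoE-specific change

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # attention is untouched by the MoE swap
        x = x + attn_out
        return x + self.moe_ffn(self.norm2(x))
```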
How It Sounds in Conversation
- "Swap this block’s FFN for an MoE layer and start with k=2."
- "Routing collapses onto a few experts; let’s add a balancing term and re‑train."
- "Routing and combining are driving latency; profile dispatch and sharding."
Related Reading
- A Survey on Mixture of Experts
Comprehensive MoE survey: structure, gating designs, system issues, and applications.
- Mixture of Experts Made Intrinsically Interpretable
Explores MoE variants aimed at improving interpretability of expert activations.
- The Rise of Sparse Mixture-of-Experts: A Survey from Algorithmic Foundations to Decentralized Architectures and Vertical Domain Applications
Survey emphasizing sparse MoE, routing, and decentralized system directions and domains.
- Mixture-of-Experts (MoE) LLMs
Clear overview of MoE in LLMs; explains experts in FFN and routing trade-offs.
- A Visual Guide to Mixture of Experts (MoE)
Visual explanation of experts and router behavior in MoE architectures.