MoE
Mixture of Experts
Plain Explanation
Language models often improve as parameters grow, but dense designs make every token traverse all weights, raising cost. MoE (Mixture of Experts) adds many specialist feed-forward networks (experts) and a router that picks a small top‑k subset per token. Like a switchboard, each token is connected to a few relevant specialists rather than all of them. In practice, a Transformer block’s single FFN is replaced by a bank of expert FFNs; the router scores experts for each token, selects top‑k, and a combiner merges those outputs for the next layer. Total capacity can grow with the number of experts, while per‑token compute stays near k FFN passes.
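The sketch below, written in PyTorch, shows this routing-and-combining step in its simplest form; the class name, dimensions, and the plain linear router are illustrative assumptions rather than any particular model's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k MoE layer: a linear router scores experts per token,
    the top-k expert FFNs run, and their outputs are combined with the
    renormalized router weights."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # one score per expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten to a list of tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = self.router(tokens)                         # (n_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over the selected k

        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                          # expert chosen in this slot, per token
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                              # tokens routed to expert e via this slot
                if mask.any():
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape_as(x)
```

The per-token Python loop over experts is kept for readability; real systems group tokens by expert and dispatch them in batches, which is where the systems challenges discussed below come from.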
Examples & Analogies
- Multilingual text: the router steers tokens toward experts tuned to particular language patterns.
- Numeric/symbol-heavy inputs: tokens with mathematical or tabular structure activate experts that capture such patterns.
- Multi-domain corpora: tokens from different domains are routed to specialists, retaining capacity without running all experts.
At a Glance
| | Dense Transformer (single FFN) | MoE (sparse experts) |
|---|---|---|
| FFN structure | One FFN per block | Many expert FFNs per block |
| Activation per token | All weights used | Only top‑k experts run |
| Total parameters | Equal to active parameters | Grows with expert count (E); active params ~k FFNs |
| Compute per token | ~1 FFN pass | ~k FFN passes + routing/combining |
| System implications | Single path | Routing and token dispatch required |
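To make the last two rows concrete, here is a back-of-the-envelope comparison; the layer sizes below are made-up round numbers rather than figures from any specific model.

```python
# Hypothetical round numbers, chosen only to make the dense-vs-MoE contrast visible.
d_model, d_ff = 1024, 4096
num_experts, k = 64, 2

ffn_params = 2 * d_model * d_ff           # up-projection + down-projection, biases ignored

dense_total = ffn_params                  # dense: total equals active
moe_total   = num_experts * ffn_params    # MoE: capacity grows with the number of experts
moe_active  = k * ffn_params              # MoE: per-token compute stays near k FFN passes

print(f"dense FFN params:        {dense_total:>12,}")  # ~8.4M
print(f"MoE total FFN params:    {moe_total:>12,}")    # ~537M
print(f"MoE active per token:    {moe_active:>12,}")   # ~16.8M
```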
Where and Why It Matters
- Capacity scaling: increase the number of experts (E) to expand representational capacity while keeping per‑token compute near k FFN passes.
- Common integration: replace the FFN inside Transformer blocks with an MoE layer.
- Systems focus: sparse, dynamic routing makes communication, buffering, and scheduling critical.
- Interpretability angle: analyzing which experts activate for which inputs is a common research direction.
Common Misconceptions
- ❌ All experts run on every token → ✅ Only a small top‑k subset is activated.
- ❌ MoE is automatically cheaper → ✅ Routing/communication add overhead; engineering is required to realize gains.
- ❌ MoE changes attention → ✅ Typically, it replaces the FFN path, not attention.
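To underline the last point, the sketch below shows where the swap happens inside a block: attention stays exactly as in a dense Transformer, and only the FFN slot becomes an MoE layer. It assumes the MoELayer sketch from earlier is in scope, and the pre-norm layout is a simplifying assumption.

```python
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Pre-norm block sketch: the attention path is the usual one; only the
    FFN slot is replaced by the MoELayer sketched earlier."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe_ffn = MoELayer(d_model, d_ff, num_experts, k)  # the only MoE-specific change

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # attention is untouched by the MoE swap
        x = x + attn_out
        return x + self.moe_ffn(self.norm2(x))
```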
How It Sounds in Conversation
- "Swap this block’s FFN for an MoE layer and start with k=2."
- "Routing collapses onto a few experts; let’s add a balancing term and re‑train."
- "Routing and combining are driving latency; profile dispatch and sharding."
Related Reading
- A Survey on Mixture of Experts
Comprehensive MoE survey: structure, gating designs, system issues, and applications.
- Mixture of Experts Made Intrinsically Interpretable
Explores MoE variants aimed at improving interpretability of expert activations.
- The Rise of Sparse Mixture-of-Experts: A Survey from Algorithmic Foundations to Decentralized Architectures and Vertical Domain Applications
Survey emphasizing sparse MoE, routing, and decentralized system directions and domains.
- Mixture-of-Experts (MoE) LLMs
Clear overview of MoE in LLMs; explains experts in FFN and routing trade-offs.
- A Visual Guide to Mixture of Experts (MoE)
Visual explanation of experts and router behavior in MoE architectures.