Vol.01 · No.10 CS · AI · Infra May 13, 2026

AI Glossary

LLM & Generative AI · Deep Learning

MoE

Mixture of Experts


Plain Explanation

Language models often improve as parameters grow, but dense designs make every token traverse all weights, raising cost. MoE (Mixture of Experts) adds many specialist feed-forward networks (experts) and a router that picks a small top‑k subset per token. Like a switchboard, each token is connected to a few relevant specialists rather than all of them. In practice, a Transformer block’s single FFN is replaced by a bank of expert FFNs; the router scores experts for each token, selects top‑k, and a combiner merges those outputs for the next layer. Total capacity can grow with the number of experts, while per‑token compute stays near k FFN passes.
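The routing described above can be sketched in a few lines. This is a minimal NumPy sketch of one MoE layer applied to a single token; the names (`moe_forward`, `W_router`), the tanh expert FFNs, and all sizes are illustrative assumptions, not part of the entry.

```python
import numpy as np

def moe_forward(x, W_router, experts, k=2):
    """Sketch: route one token through the top-k of E expert FFNs.

    x:        (d,) token hidden state
    W_router: (d, E) router weights
    experts:  list of E callables, each mapping (d,) -> (d,)
    """
    logits = x @ W_router                       # router scores every expert
    topk = np.argsort(logits)[-k:]              # indices of the k best experts
    gates = np.exp(logits[topk])
    gates /= gates.sum()                        # softmax over the selected k only
    # Only the chosen experts run; the other E - k stay idle for this token.
    return sum(g * experts[e](x) for g, e in zip(gates, topk))

# Tiny demo with assumed sizes: d=4 hidden units, E=8 experts, top-2 routing.
rng = np.random.default_rng(0)
d, E = 4, 8
W_router = rng.normal(size=(d, E))
experts = [(lambda W: (lambda x: np.tanh(x @ W)))(rng.normal(size=(d, d)))
           for _ in range(E)]
y = moe_forward(rng.normal(size=d), W_router, experts, k=2)
```

The combiner here is just the gate-weighted sum; real systems batch tokens and dispatch them to experts in parallel, which is where the systems complexity discussed below comes from.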

Examples & Analogies

  • Multilingual text: the router steers tokens toward experts tuned to particular language patterns.
  • Numeric/symbol-heavy inputs: tokens with mathematical or tabular structure activate experts that capture such patterns.
  • Multi-domain corpora: tokens from different domains are routed to specialists, retaining capacity without running all experts.

At a Glance

Aspect               | Dense Transformer (single FFN) | MoE (sparse experts)
FFN structure        | One FFN per block              | Many expert FFNs per block
Activation per token | All weights used               | Only top‑k experts run
Total parameters     | Scales with active compute     | Scales via experts; active params ~k FFNs
Compute per token    | ~1 FFN pass                    | ~k FFN passes + routing/combining
System implications  | Single path                    | Routing and token dispatch required
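The total-vs-active parameter split can be made concrete with back-of-envelope arithmetic; the widths, expert count, and top-k value below are hypothetical, chosen only to illustrate the ratio.

```python
# Hypothetical sizes to illustrate the total vs. active parameter rows.
d_model, d_ff = 4096, 16384      # assumed hidden and FFN widths
E, k = 8, 2                      # assumed: 8 experts, top-2 routing

ffn_params = 2 * d_model * d_ff  # one FFN: up-projection + down-projection
total = E * ffn_params           # parameters the MoE layer stores
active = k * ffn_params          # parameters each token actually touches

print(total // active)           # → 4: 4x the capacity at top-2 compute
```

With these numbers the layer holds 4x the parameters of its active path, which is the sense in which capacity scales with E while per-token compute stays near k.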

Where and Why It Matters

  • Capacity scaling: increase experts (E) to expand representational capacity while keeping per‑token compute near k.
  • Common integration: replace the FFN inside Transformer blocks with an MoE layer.
  • Systems focus: sparse, dynamic routing makes communication, buffering, and scheduling critical.
  • Interpretability angle: analyzing which experts activate for which inputs is a common research direction.

Common Misconceptions

  • ❌ All experts run on every token → ✅ Only a small top‑k subset is activated.
  • ❌ MoE is automatically cheaper → ✅ Routing/communication add overhead; engineering is required to realize gains.
  • ❌ MoE changes attention → ✅ Typically, it replaces the FFN path, not attention.

How It Sounds in Conversation

  • "Swap this block’s FFN for an MoE layer and start with k=2."
  • "Routing collapses onto a few experts; let’s add a balancing term and re‑train."
  • "Routing and combining are driving latency; profile dispatch and sharding."
