Vol.01 · No.10 Daily Dispatch May 10, 2026

Latest AI News


EMO makes sparse models modular, keeping near full performance with 12.5–25% of experts

A Mixture-of-Experts model with 1B active and 14B total parameters, trained on 1 trillion tokens, keeps near full-model quality when loading just 25% of its experts (≈1% drop) or 12.5% (≈3% drop) — a concrete path to lower memory use without giving up capability.


One-Line Summary

Modularity takes center stage: EMO enables selective expert loading in sparse models, while new papers show planning-first agents, stackable VLM adapters, and retrieval-by-grep for stronger agentic search.

LLM & SOTA Models

EMO trains experts to cluster by domain for selective loading

EMO lets you run only a small set of the model’s specialized “experts” for a domain (like code or math) while keeping performance close to the full model — offering a practical way to shrink memory and compute for large language models (LLMs). It is a 1B-active, 14B-total Mixture of Experts (MoE) trained on 1 trillion tokens; keeping 25% of experts costs about 1% absolute performance, and 12.5% costs about 3%. 1
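To make the memory claim concrete, here is a back-of-envelope sketch. The split between always-resident shared parameters and expert parameters is an assumption for illustration, not the paper's actual architecture:

```python
def moe_memory_gb(total_params_b, shared_frac, expert_frac_loaded, bytes_per_param=2):
    """Back-of-envelope resident memory for selectively loading experts (fp16).

    shared_frac is the assumed share of non-expert parameters (attention,
    embeddings) that must always stay loaded -- an illustrative split.
    """
    shared_b = total_params_b * shared_frac
    expert_b = total_params_b * (1 - shared_frac) * expert_frac_loaded
    # params (in billions) * bytes/param == gigabytes
    return (shared_b + expert_b) * bytes_per_param

full = moe_memory_gb(14, shared_frac=0.1, expert_frac_loaded=1.0)     # ~28 GB
eighth = moe_memory_gb(14, shared_frac=0.1, expert_frac_loaded=0.125)  # ~6 GB
```

Under these assumed numbers, loading one-eighth of the experts cuts resident memory by roughly 4–5x for the quoted ≈3% quality cost.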

EMO bakes modularity into pretraining by nudging tokens from the same document to rely on the same small pool of experts. A router first selects a document-level pool by averaging token preferences and then constrains all tokens in that document to route within that pool; load balancing is applied globally across many documents, and the pool size is randomly sampled during training so the model supports different subset sizes at inference. 2
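The document-level routing described above can be sketched roughly as follows; names, shapes, and the top-k scheme are illustrative stand-ins, not the paper's implementation (global load balancing and the randomly sampled pool size during training are omitted):

```python
import numpy as np

def route_document(token_logits, pool_size, top_k=2):
    """token_logits: (n_tokens, n_experts) router scores for one document.

    Returns per-token expert choices constrained to a document-level pool.
    """
    # 1) Document-level pool: average token preferences, keep the best experts.
    doc_pref = token_logits.mean(axis=0)
    pool = np.argsort(doc_pref)[-pool_size:]
    # 2) Constrain every token in the document to route within that pool.
    masked = np.full_like(token_logits, -np.inf)
    masked[:, pool] = token_logits[:, pool]
    # 3) Standard top-k routing, now restricted to the pool.
    return np.argsort(masked, axis=1)[:, -top_k:]

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 64))              # 16 tokens, 64 experts
choices = route_document(logits, pool_size=8)   # every token stays in the pool
```

Because all tokens of a document share one small pool, experts that co-fire on a domain's documents end up clustered, which is what makes loading only that cluster viable later.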

Unlike standard MoEs that often specialize in punctuation or function words, EMO’s experts emerge along semantic lines — clusters such as Health, Code, or News — which is why small subsets still behave like real capabilities. Across general-purpose benchmarks, EMO matches a standard MoE when fully active, and remains robust under selective use: keeping 25% of experts costs about 1% absolute, and 12.5% costs about 3%, before or after fine-tuning. 1

The team releases models, a matched standard-MoE baseline, code, and an interactive visualization that shows how expert groups form, positioning EMO as a composable architecture for memory-efficient deployment of large, sparse models and inviting further work on expert selection and composition. 1

Research Papers

StraTA: planning-first reinforcement learning for language agents

StraTA adds a planning step to agent training: from the initial state the agent samples a compact strategy, then conditions its actions on that plan, improving long-horizon control for Large Language Model (LLM) agents. Using a hierarchical Group Relative Policy Optimization (GRPO) design, StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop, and a 63.5% overall score on SciWorld, surpassing strong baselines and even frontier closed-source models on the latter. 3

The framework jointly trains the strategy generator and the action executor, augments exploration with diverse strategy rollouts, and adds critical self-judgment to tighten credit assignment over long trajectories — addressing two persistent issues for agentic Reinforcement Learning (RL): exploration in sparse tasks and delayed rewards. 3
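The plan-first rollout loop might look roughly like this toy sketch; the environment, policy, and strategy space are invented stand-ins, and the GRPO-style updates and self-judgment are omitted:

```python
class ToyEnv:
    """Tiny stand-in environment: walk to position `goal` within 8 steps."""
    def __init__(self, goal=3):
        self.goal, self.pos, self.t = goal, 0, 0
    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos
    def step(self, action):            # action in {-1, +1}
        self.pos += action
        self.t += 1
        done = self.pos == self.goal or self.t >= 8
        reward = 1.0 if self.pos == self.goal else 0.0
        return self.pos, reward, done

def plan_first_rollout(env, strategies, policy):
    """Sample a strategy up front, then condition every action on it.

    Rolling out several diverse strategies and keeping the best trajectory
    mirrors the exploration augmentation described above.
    """
    best = (-1.0, None)
    for strat in strategies:
        state, done, total, traj = env.reset(), False, 0.0, []
        while not done:
            action = policy(state, strat)      # action conditioned on the plan
            state, r, done = env.step(action)
            total += r
            traj.append((strat, action, r))
        best = max(best, (total, traj), key=lambda x: x[0])
    return best

policy = lambda state, strat: strat            # trivial: strategy is a direction
score, traj = plan_first_rollout(ToyEnv(), strategies=[-1, +1], policy=policy)
```

In a real system the strategy would be sampled text conditioning an LLM executor, and both levels would receive policy-gradient updates.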

For practitioners, the value is sample efficiency and stability on multi-step web and science tasks; the open questions are how well the abstraction generalizes across tools and environments and how to detect when to revise a plan mid-episode. 3

GeoStack: modular composition to preserve VLM knowledge

GeoStack composes multiple domain experts into a single Vision-Language Model (VLM) while preserving what the base model already knows. It does this by stacking adapters with geometric and structural constraints on the adapter manifold, and proves a “weight-folding” property that gives constant-time (O(1)) inference regardless of how many experts you integrate. 4

Across multi-domain adaptation and class-incremental learning, GeoStack mitigates catastrophic forgetting and maintains efficiency, providing a modular path to add skills without retraining the whole model; code is available in the accompanying repository. 4

Contextually, this speaks to the stability–plasticity trade-off in continual learning: complementary work on “Forgetting through Adaptive Decay (FADE)” shows that dynamically learning per-parameter decay can halve tracking error versus AdamW in streaming settings, underscoring the push toward mechanisms that keep models adaptable without erasing old knowledge. 5

Direct corpus interaction: agentic search without embeddings

Direct Corpus Interaction (DCI) lets an agent search the raw corpus with general-purpose tools like grep, file reads, shell commands, or lightweight scripts — no embedding model, vector index, or retrieval Application Programming Interface (API) required. That removes offline indexing and makes it easy to adapt to evolving local files. 6
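A minimal stand-in for such a tool — a pure-Python grep over raw files, with no index or embeddings — might look like the following; the file layout and pattern are invented for illustration:

```python
import pathlib
import re
import tempfile

def grep(pattern, root):
    """Minimal grep-like corpus search: scan raw text files and return
    (filename, line_no, line) hits -- no offline indexing required."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.txt")):
        for i, line in enumerate(path.read_text().splitlines(), 1):
            if rx.search(line):
                hits.append((path.name, i, line))
    return hits

with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "notes.txt").write_text("alpha\nmixture of experts\n")
    (pathlib.Path(d) / "log.txt").write_text("routing table updated\n")
    hits = grep(r"experts?", d)
```

Because the agent composes such calls itself (narrowing patterns, reading matched files), the corpus never goes stale the way a prebuilt vector index can.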

On Information Retrieval (IR) benchmarks and end-to-end agentic search tasks, this simple interface substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop Question Answering (QA) — without any conventional semantic retriever. 6

The takeaway is that as agents get stronger, not just reasoning quality but also the resolution of their retrieval interface matters; DCI opens more flexible, inspectable ways for agents to interrogate data, at the cost of writing more targeted search commands. 6

Open Source & Repos

Pi agent toolkit ships unified LLM API and coding CLI

Pi is a mono-repo “agent harness” with batteries-included pieces for building coding agents: an interactive coding agent Command-Line Interface (CLI), a unified Large Language Model (LLM) Application Programming Interface (API), text and web UI libraries, a Slack bot, and deployment scaffolding for vLLM pods. 7

The maintainers note that new issues and pull requests from first-time contributors are auto-closed and later reviewed, and the v0.74.0 release updates repository links and package scopes to earendil-works/pi-mono and @earendil-works/* namespaces. 7

For teams prototyping agent workflows across terminal, chat, and web surfaces, this consolidates common plumbing in one place; expect incremental updates rather than a one-click framework, and check the repo’s scopes and contribution policy before adopting. 7

Why It Matters

This batch spotlights modularity as a practical lever: EMO shows sparse experts can be trained to line up with real domains so a small subset runs close to full quality, while GeoStack demonstrates you can add domain modules to VLMs without wiping out prior knowledge. 1

At the same time, StraTA and DCI argue for better interfaces — one at the decision-making level (plan first, then act), the other at the data level (search precisely, not just by similarity) — together pointing toward agents that are both more sample-efficient and easier to deploy within real memory and compute limits. 6

This Week: Try It

  1. Explore EMO’s interactive expert-cluster visualization to see how domains emerge during training (no install needed). 1
  2. Install the Pi coding agent CLI to scaffold a local agent and experiment with a unified LLM API. 7

