Vol.01 · No.10 Daily Dispatch April 9, 2026

Latest AI News


Google’s Gemma 4 pushes open-weight multimodality and agent readiness to the Pareto frontier

A 31B dense and 26B MoE with 256K context, native tool calling, and day‑zero vLLM/TPU support reframes what “open” means. But independent ablations and full safety evals still lag the hype.


One-Line Summary

Google DeepMind releases Gemma 4 as a fully Apache 2.0 open-weight, multimodal family spanning phones to single H100s, while Microsoft debuts a 15B Phi-4 vision-reasoning model, and new studies probe expert-merging for MoE compression, agent safety in office-like workflows, and per-document utility in RAG.

LLM & SOTA Models

Gemma 4: Frontier-class performance, fully Apache 2.0

Google DeepMind unveils Gemma 4 in four sizes — Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE), and 31B Dense — all open-weight under the Apache 2.0 license. The 31B ranks #3 and the 26B ranks #6 among open models on Arena AI, with 128K–256K token context windows, native function-calling, structured JSON, and vision/audio inputs (audio on E2B/E4B). Notably, the 26B MoE activates only about 3.8B parameters per token for low latency, and both 26B and 31B fit on a single 80 GB NVIDIA H100 in bfloat16. 1
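
The single-H100 claim is easy to sanity-check with back-of-envelope arithmetic: bfloat16 stores two bytes per parameter, so weight memory scales linearly with model size. A minimal sketch (weights only; KV cache and activations add on top):

```python
def bf16_weight_gib(n_params_billion: float) -> float:
    """Approximate weight memory in GiB for bfloat16 (2 bytes/param)."""
    return n_params_billion * 1e9 * 2 / 2**30

# 31B dense: ~57.7 GiB of weights, leaving ~22 GiB of an 80 GiB H100
# for KV cache and activations. The 26B MoE keeps all experts resident
# (~48.4 GiB) even though only ~3.8B parameters fire per token.
print(round(bf16_weight_gib(31), 1))  # 57.7
print(round(bf16_weight_gib(26), 1))  # 48.4
```

Note the MoE asymmetry: sparse activation cuts compute per token, not weight memory, which is why expert-merging work like REAM (below) targets the memory side.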

The release emphasizes “intelligence-per-parameter,” citing cases where Gemma 4 outcompetes models up to 20× larger in blind preferences. The family is designed to be practical across hardware: E2B/E4B target mobile and IoT with offline, near-zero latency multimodality, while larger models power local IDEs and agentic workflows. All variants support 140+ languages, with explicit support across popular tooling (Transformers, vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM/NeMo, and more) from day one. 2

Early third-party testing highlights reasoning efficiency: on GPQA Diamond, Gemma 4 31B scores 85.7% (second among sub-40B open models), using around 1.2M output tokens — fewer than similarly strong Qwen3.5 entries — translating to lower inference cost for chain-of-thought and agent flows. Community reports note strong gains on MMLU-Pro, AIME 2026, BigBench Hard, and multimodal tasks, and edge variants aim for on-device utility gains (reports cite up to 4× speed and 60% lower battery than prior gen). 3

Phi-4-reasoning-vision-15B: Smaller multimodal reasoning, tuned by data and fusion choices

Microsoft Research introduces Phi-4-reasoning-vision-15B, an open‑weight 15B multimodal reasoning model that targets math/science and user-interface understanding. It follows a “mid-fusion” approach: a pretrained SigLIP‑2 vision encoder projects visual tokens into a Phi‑4‑Reasoning backbone, balancing accuracy and compute vs. costlier “early-fusion.” The team trained with about 200B multimodal tokens and leveraged a 16B-token Phi‑4‑Reasoning stage on a 400B-unique-token Phi‑4 base — far less than 1T+ token regimes reported for recent open VLMs. 4

They report a Pareto-frontier shift in accuracy vs. compute/time vs. output tokens across ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2. An ablation on image processing finds dynamic-resolution encoders strongest on high-res data; using SigLIP‑2’s Naflex variant with up to roughly 3,600 visual tokens boosts performance on high-DPI UI tasks (e.g., ScreenSpot‑Pro), compared with multi-crop or dynamic S2-only pipelines. The takeaway: careful fusion, vision encoders, and data curation can match much larger systems on practical workloads. 4
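
The visual-token budget matters because patch-based encoders tile the image: token count grows with resolution until it hits the cap. A toy sketch of that accounting, assuming illustrative values (16 px patches, 3,600-token cap) rather than the actual SigLIP‑2 Naflex configuration:

```python
import math

def visual_tokens(width: int, height: int, patch: int = 16,
                  cap: int = 3600) -> int:
    """Rough visual-token count for a patch-based vision encoder.
    patch=16 and cap=3600 are illustrative assumptions, not the
    real Naflex settings."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    return min(tokens, cap)

# A 1920x1080 screenshot yields 120 * 68 = 8160 raw patches,
# clipped to the 3600-token budget -- high-DPI UI tasks like
# ScreenSpot-Pro are exactly where this budget binds.
print(visual_tokens(1920, 1080))  # 3600
```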

Beyond capabilities, the blog details a data triage pipeline: manual sampling to flag errors, programmatic fixes, GPT‑4o/o4‑mini regenerations where viable, and turning quality images into new VQA/caption datasets. This targeted, quality-first curation — rather than sheer scale — underpins the model’s efficiency and competitive math/UI grounding, offering a lightweight option for modest hardware without sacrificing structured reasoning. 4

Open Source & Repos

vLLM Day‑0 support for Gemma 4 across Nvidia, AMD, Intel, and TPUs

Inference engine vLLM rolls out same‑day Gemma 4 support, spanning Nvidia, AMD, Intel XPUs, and Google TPUs. That brings native handling of Gemma 4’s multimodal inputs, agentic features (function-calling, structured JSON, system instructions), and long contexts (128K for edge, 256K for larger) into a widely used open stack. For practitioners, it means immediate deployment options from laptops to datacenter accelerators. 5

vLLM emphasizes performance portability: one stack to scale from GKE/GCE with Trillium/Ironwood TPUs to Nvidia GPUs, while also supporting edge-like use cases where image/video inputs and extended contexts are critical. This complements Google’s promise of broad ecosystem integrations (Transformers, llama.cpp, MLX, Ollama, NIM/NeMo) and eases fine-tuning/serving for teams standardizing on vLLM. 5
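
In practice, day-zero support means the standard vLLM entry point should work unchanged. A hedged sketch — the model identifier below is hypothetical (check the actual Hugging Face repo name at release), while the flags are standard vLLM options:

```shell
# Launch vLLM's OpenAI-compatible server for a Gemma 4 checkpoint.
# "google/gemma-4-27b" is a placeholder id, not confirmed by the release.
# --max-model-len bounds the context window (256K for larger variants).
vllm serve google/gemma-4-27b \
  --max-model-len 262144 \
  --tensor-parallel-size 1
```

Any OpenAI-compatible client can then hit the server's `/v1/chat/completions` endpoint, including with function-calling requests.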

For agent developers, the match is pragmatic: Gemma 4’s native agent tools get an execution layer optimized for throughput and memory via paged attention, while the family’s “intelligence-per-parameter” design keeps hardware requirements low for comparable quality, a key lever for cost-sensitive deployments. 5

Microsoft Bing’s Harrier: Open multilingual embedding models top MTEB v2

Microsoft’s Bing team open-sources Harrier, a multilingual embedding model suite trained on over 2B examples (plus GPT‑5 synthetic data), reporting a top rank on multilingual MTEB v2. The flagship harrier‑oss‑v1‑27b (approx. 27B total params, 25.6B active) uses a 5,376-dim embedding, supports up to 131,072 tokens, and ships alongside 0.6B and 270M variants — all under the MIT license on Hugging Face. 6

Embedding models turn text into vectors for semantic search and retrieval, a backbone of Retrieval‑Augmented Generation (RAG) and agents. Harrier’s reported zero-shot gains over proprietary baselines (OpenAI, Amazon) and coverage of 100+ languages make it attractive for global search/RAG stacks where licensing flexibility and long contexts matter. 6
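
The retrieval loop behind such a stack is simple: embed query and documents, rank by cosine similarity. A stdlib-only sketch with toy 3-dimensional vectors standing in for Harrier's 5,376-dimensional embeddings (the model call itself is out of scope here):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

# Toy "embeddings"; a real RAG stack would call the embedding model
# behind this same interface and store vectors in a vector DB.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.05, 0.0], docs))  # [0, 1]
```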

Microsoft plans to integrate Harrier into Bing and new grounding services for AI agents, which could tighten loops between retrieval quality, tool-use reliability, and end-to-end agent task success — especially in multilingual or long-document settings. 6

Research Papers

REAM: Merging experts to compress MoE without steep quality loss

Router-weighted Expert Activation Merging (REAM) proposes grouping and merging experts in Mixture‑of‑Experts (MoE) LLMs rather than pruning them outright, aiming to preserve capacity while cutting memory. Compared to Router-weighted Expert Activation Pruning (REAP), REAM often yields better Pareto trade-offs between multiple-choice (MC) and generative (GEN) performance by controlling calibration data mixes (general, math, coding). In many cases, REAM approaches uncompressed baselines on diverse QA and generation benchmarks. 7

Why merging vs. pruning? Removing experts can delete rare-but-crucial skills; merging retains weight space structure guided by router signals. The authors surface a systematic MC–GEN trade-off and show how calibration data selection shifts the frontier — practical when compressing MoE models for constrained hardware. For readers new to MoE: a router activates a small subset of specialized “experts” per token, yielding big capacity with sparse compute. 8
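
The core merging idea can be sketched as a router-weighted average of expert weight matrices. This is a toy illustration of the principle only; REAM's actual procedure also decides which experts to group and how to weight them from calibration-data activations:

```python
def merge_experts(experts, router_weights):
    """Router-weighted average of expert weight matrices (toy sketch).
    Experts that the router activates more often contribute more to
    the merged expert, preserving weight-space structure instead of
    deleting experts outright as pruning does."""
    total = sum(router_weights)
    rows, cols = len(experts[0]), len(experts[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for w, expert in zip(router_weights, experts):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += (w / total) * expert[i][j]
    return merged

# Two 2x2 "experts"; the second fires twice as often on calibration
# data, so it dominates the merged weights.
e1 = [[1.0, 0.0], [0.0, 1.0]]
e2 = [[0.0, 1.0], [1.0, 0.0]]
print(merge_experts([e1, e2], [1.0, 2.0]))
```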

Security context matters: recent work shows MoE-specific backdoors by poisoning “dormant experts” (BadMoE), and pruning-time attacks where models behave benignly until pruned — then display malicious behavior with up to 95.7% jailbreak success. Compression in the wild must therefore combine utility-aware methods like REAM with hardening and validation pipelines. 9 8 10

ClawsBench: Realistic, stateful office workflows to test LLM agents

ClawsBench introduces five high-fidelity mock services — Gmail, Slack, Google Calendar, Google Docs, Google Drive — with full state and snapshot/restore, plus 44 tasks spanning single-service, cross-service, and safety-critical scenarios. With full scaffolding (domain skills + meta prompt), agents score 39–64% task success, but exhibit 7–33% unsafe action rates; on OpenClaw, top models cluster at 53–63% success with 7–23% unsafe rates, and no clean success–safety ordering emerges. 11

The study decomposes agent scaffolding into “knowledge injection” vs. “coordination,” revealing how each lever independently shifts outcomes — a useful lens when tuning real systems. It also catalogs eight recurring unsafe behaviors (e.g., multi-step sandbox escalation, silent contract modification), underscoring that capability gains must be paired with isolation and guardrails. 11
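
The snapshot/restore mechanic is what makes stateful benchmarking tractable: every agent run starts from an identical world and unsafe actions can be rolled back and inspected. A minimal sketch of the pattern (hypothetical API, not ClawsBench's own):

```python
import copy

class MockService:
    """Minimal stateful mock service with snapshot/restore, the
    pattern a ClawsBench-style harness uses to reset world state
    between agent runs."""
    def __init__(self):
        self.state = {"inbox": [], "docs": {}}
        self._snapshots = []

    def snapshot(self) -> int:
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def restore(self, snap_id: int) -> None:
        self.state = copy.deepcopy(self._snapshots[snap_id])

svc = MockService()
svc.state["inbox"].append("welcome")
sid = svc.snapshot()
svc.state["inbox"].append("agent-sent spam")  # unsafe action observed
svc.restore(sid)                              # roll back cleanly
print(svc.state["inbox"])  # ['welcome']
```

Deep copies matter here: a shallow snapshot would alias the mutable inbox list and the "restore" would silently keep the unsafe mutation.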

Context with frontier agents: some systems like Kimi K2.5 emphasize native multi-agent “swarm” orchestration (up to 100 agents), strong browsing (74.9% BrowseComp), and 256K context via MoE (1T params, 32B active). As orchestration moves inside models, ClawsBench-like, stateful sandboxes and permission models become more critical to bound failures. 12 13

CUE‑R: Measuring per‑evidence utility in RAG via lightweight interventions

CUE‑R evaluates Retrieval‑Augmented Generation (RAG) by perturbing individual retrieved items — REMOVE, REPLACE, DUPLICATE — and measuring shifts in correctness, a proxy grounding score, confidence error, and trace divergence. On HotpotQA and 2WikiMultihopQA with Qwen‑3 8B and GPT‑5.2, REMOVE/REPLACE consistently hurt correctness/grounding with large trace changes, while DUPLICATE is often redundant but not fully neutral. Multi-hop ablations show non‑additive interactions: removing two supports hurts far more than either alone. 14

This reframes RAG evaluation from answer-only to per-document operational utility: which context chunk actually mattered? It complements standard metrics (exact match, citation faithfulness) and helps debug retrieval pipelines by identifying harmful or redundant items — a practical step for high-stakes deployments. 14
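
The intervention logic reduces to re-answering with a perturbed context and comparing outcomes. A toy sketch using only the REMOVE intervention and a correctness delta as the utility proxy (CUE‑R's real metrics are richer, covering grounding, confidence error, and trace divergence):

```python
def remove(ctx, i):
    """REMOVE intervention: drop retrieved document i."""
    return ctx[:i] + ctx[i+1:]

def utility(answer_fn, question, ctx, i, gold):
    """Per-document utility as the drop in correctness when
    document i is removed -- a toy proxy for CUE-R's metrics."""
    base = float(answer_fn(question, ctx) == gold)
    ablated = float(answer_fn(question, remove(ctx, i)) == gold)
    return base - ablated

# Toy "model": answers correctly only if the supporting fact is present.
def toy_answer(q, ctx):
    return "Paris" if "capital(France)=Paris" in ctx else "unknown"

ctx = ["capital(France)=Paris", "population(France)=68M"]
print(utility(toy_answer, "capital of France?", ctx, 0, "Paris"))  # 1.0
print(utility(toy_answer, "capital of France?", ctx, 1, "Paris"))  # 0.0
```

Document 0 carries all the utility; document 1 is retrieval dead weight, exactly the distinction answer-only metrics cannot make.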

Practitioner takeaway: pair CUE‑R’s per-item utility view with retrieval hygiene. Business guides emphasize RAG’s components (vector DB, retriever, embeddings, orchestration), and engineering notes show multi-stage pruning — dense retrieval → cross-encoder rerank → semantic thresholds → de-duplication — can curb hallucinations, token bloat, and sharpen attention. 15 16 17
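
The multi-stage pruning pipeline above can be sketched schematically: rerank candidates, drop low scorers, de-duplicate. The `rerank_fn` below stands in for a cross-encoder, and the exact-match de-dup is a simplification of the semantic de-duplication real systems use:

```python
def prune_context(candidates, rerank_fn, threshold=0.5):
    """Multi-stage context pruning: rerank, apply a score
    threshold, then de-duplicate (schematic sketch)."""
    scored = [(rerank_fn(doc), doc) for doc in candidates]
    kept, seen = [], set()
    for score, doc in sorted(scored, reverse=True):
        if score < threshold:
            break                 # everything below threshold is cut
        if doc not in seen:       # exact-match de-dup; real systems
            seen.add(doc)         # check semantic near-duplicates
            kept.append(doc)
    return kept

docs = ["fact A", "fact A", "fact B", "noise"]
scores = {"fact A": 0.9, "fact B": 0.7, "noise": 0.2}
print(prune_context(docs, scores.get))  # ['fact A', 'fact B']
```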

Community Pulse

Hacker News (93↑) — Progress on smaller, efficient multimodal models is welcomed, but many doubt benchmark gains equal real-world understanding.

"I'm very happy to read about this progress but I don't find it particularly surprising... a model with 100x few parameters can achieve the same scores." — Hacker News

"...impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass... A four year old can do this and we’re just barely on step one and a half." — Hacker News

Hacker News (186↑) — ClawsBench is criticized as overhyped; commenters argue prior, larger benchmarks did the hard work without attention.

"...attention credit... is very unfairly distributed... a ridiculous benchmark of 10 question scored big." — Hacker News

"We will learn if the magical capabilities attributed to these tools are really true or not. They're not... there is nothing novel in this benchmark." — Hacker News

Hacker News (69↑) — Mixed views on CUE‑R: debate over biology-inspired methods vs. analogies and whether the setup matches real detection tasks.

"...apply the methods of biology... akin to phylogeny... not from a biological analogy but from a biologically-inspired method." — Hacker News

"AI detection is generally focused on determining the writer of a single document... Totally different problem." — Hacker News

Why It Matters

Today’s updates connect three threads: open, efficient frontier models (Gemma 4, Phi‑4 RV), infrastructure ready on day zero (vLLM), and evaluation/compression work that pressure-tests agents and MoE models under realistic constraints. The numbers — single‑GPU fits, 128K–256K context, 85.7% GPQA, 39–64% agent success with 7–33% unsafe actions — quantify where capability and safety actually stand. 1 11

For builders, the strategic shift is clear: performance‑per‑parameter and licensing freedom (Apache 2.0, MIT) expand what can run locally, while new benchmarks and methods (REAM, CUE‑R, ClawsBench) push us to validate not only how well models answer but how safely they act and which evidence they truly use. 2 7 14

