Vol.01 · No.10 Daily Dispatch April 3, 2026

Latest AI News

AI · PapersDaily CurationOpen Access
AI NewsResearch
7 min read

Google’s Gemma 4 goes Apache 2.0, pushes local multimodal LLMs from phones to H100s

Open weights with real license freedom, 256k context, and edge variants tuned by Pixel’s silicon partners — plus NVIDIA’s 1M-token agent model and Microsoft’s new MAI stack. Here’s what actually changed.

Reading Mode

One-Line Summary

Google’s Gemma 4 arrives as four open-weight models under Apache 2.0, pushing local reasoning, coding, and multimodal performance from phones to workstations while rivals focus on agent throughput and audio/image foundations.

LLM & SOTA Models

Google Gemma 4 (Apache 2.0, edge-to-workstation lineup)

Google releases Gemma 4 in four sizes — Effective 2B (E2B), Effective 4B (E4B) for devices, plus 26B Mixture of Experts (MoE, a model type that activates only a subset of parameters per token) and 31B Dense — all under the permissive Apache 2.0 license. The large variants run unquantized in bfloat16 on a single 80GB NVIDIA H100; quantized builds target consumer GPUs. The 26B MoE activates only 3.8B parameters at inference for lower latency, while 31B Dense aims for higher quality. Context windows are 128K tokens on edge models and 256K on 26B/31B; all are trained across 140+ languages and natively support function calling and structured JSON (JavaScript Object Notation) output for agent workflows. 1 2

On the Arena AI open-model leaderboard, Gemma 31B debuts at No. 3 and the 26B MoE at No. 6, with Google claiming they outperform models up to 20× larger on that benchmark. The family is multimodal (images and video), with E2B/E4B also adding native audio input for speech recognition. For code generation, Google positions Gemma 4 as delivering high-quality offline coding comparable to cloud systems, contingent on hardware. The edge models target “near-zero latency,” with Android’s developers noting up to 4× speed-ups and up to 60% less battery than Gemma 3, and E2B running 3× faster than E4B. 2 3

The licensing shift from Google’s custom terms to Apache 2.0 is arguably the biggest change. Prior Gemma licenses had broad prohibited-use clauses and sticky obligations developers balked at; Apache 2.0 removes those constraints, enabling unrestricted commercial use and eliminating surprise term changes. Industry voices call it a “huge milestone” for open models. Weights are available now via Hugging Face, Kaggle, and Ollama; in Google AI Studio (31B, 26B) and AI Edge Gallery (E2B, E4B). Google also previews that Gemini Nano 4 for Android will be derived from E2B/E4B, with forward compatibility for agentic prototypes built today. 1 2

Why it matters: The combination of strong leaderboard results, big context windows for local use, agent-friendly function calling, and a truly permissive license resets the calculus for on-device and self-hosted deployments. It narrows the “closed vs open” gap for many real-world coding and multimodal tasks while removing legal friction that previously blocked enterprise adoption. 2 3

Microsoft’s MAI models: Transcribe, Voice, and Image/Video foundations

Microsoft introduces three foundational models from its MAI Superintelligence team: MAI-Transcribe-1 (speech-to-text across 25 languages), MAI-Voice-1 (audio generation), and MAI-Image-2 (a generative media model; described as video-generating by TechCrunch and a faster, more lifelike second-generation image model by CNET). Microsoft says Transcribe-1 is 2.5× faster than Azure Fast, Voice-1 can generate 60 seconds of audio in one second and supports custom voices, and MAI-Image-2 improves speed/realism with pricing of $5 per 1M tokens (text input) and $33 per 1M tokens (image output). Models are on Microsoft Foundry and MAI Playground, with MAI-Image-2 slated for Bing and PowerPoint. 4 5

Pricing signals a bid to undercut Google/OpenAI for enterprise workloads: MAI-Transcribe-1 starts at $0.36/hour, MAI-Voice-1 at $22 per 1M characters. Mustafa Suleyman frames the philosophy as “Humanist AI” — optimizing for human communication and practical use — and reiterates Microsoft’s ongoing partnership with OpenAI even as it builds its own stack. These releases broaden Microsoft’s platform beyond text LLMs and into latency-critical audio and high-throughput generative media. 4 5

For builders, the practical takeaway is coverage across common enterprise modalities (meetings, agents with voice, visuals). With lower list prices and deployment in Foundry/Playground, teams can test cost/performance trade-offs quickly, then graduate select capabilities into Microsoft products like Bing and PowerPoint for production reach. 4 5

NVIDIA Nemotron 3 Super: 120B hybrid MoE for agent throughput

NVIDIA launches Nemotron 3 Super, a 120-billion-parameter open-weight model with only 12B active parameters per token. It targets multi-agent systems’ two big pain points: “context explosion” and the “thinking tax.” It offers a native 1,000,000-token context window to retain full workflow state, claims up to 5× higher throughput and up to 2× higher accuracy than its predecessor, and introduces a hybrid architecture: Mamba state-space layers for efficient long-context processing plus Transformer layers for precise recall. Latent Mixture of Experts (MoE) and Multi-Token Prediction (predicting multiple next tokens at once) further push speed. On Blackwell GPUs, NVFP4 precision delivers up to 4× faster inference than FP8 on Hopper with no accuracy loss. 6

Benchmarks emphasize agent work: NVIDIA’s AI-Q agent tops DeepResearch Bench I/II; the model ranks highly on efficiency/openness at Artificial Analysis. Partners like Perplexity, Palantir, and Siemens integrate it for search, SWE agents, and industrial automation. Weights, data, and recipes (over 10T tokens of pre/post-training datasets and RL environments) are released, with deployment across Vertex AI, OCI, soon AWS Bedrock and Azure, NIM microservices, and inference providers like Baseten and Cloudflare. 6

Independent engineering write-ups highlight the practical gains: 1M-token context that actually works for whole-codebase reasoning, up to 2.2× higher throughput vs comparable open models, and a clear tradeoff profile versus frontier closed models. Related family models like Nemotron 3 Nano (30B total, ~3.2B active) show 223 tokens/s on H100 and sub-100 ms time-to-first-token in vLLM tests — a signal that the architecture’s throughput focus generalizes across sizes. Licensing is permissive but NVIDIA-specific, so enterprises should review terms. 7 8

Open Source & Repos

Gemma 4 availability wave (weights and tooling)

Gemma 4 models are immediately downloadable on Hugging Face, Kaggle, and Ollama, and runnable via Google AI Studio (31B, 26B) and AI Edge Gallery (E2B, E4B). This distribution pattern matters: teams can experiment locally, prototype with hosted UIs, and move to self-hosting without friction. The Apache 2.0 license removes prior compliance uncertainty, encouraging forks, fine-tunes, and redistribution at scale. 1 2

Because Gemma 4 borrows capabilities from Gemini 3 (reasoning, math, instruction following), the open-weight ecosystem can now inherit agent-friendly features like native function calling and structured outputs. That, combined with larger context windows (up to 256K) and multimodality, makes it feasible to build private, offline apps that previously required cloud-only frontier models. 2 3

In parallel, NVIDIA’s Nemotron 3 Super ships open weights plus “recipes” (datasets, RL environments) via NVIDIA NeMo and broad partner endpoints. Engineers can replicate training signals (10T+ tokens disclosed) and deploy with NIM microservices from on-prem to cloud. These releases collectively tighten the open model toolchain from experimentation to production. 6 7

Research Papers

AgentWatcher: A Rule-based Prompt Injection Monitor

The paper tackles prompt injection — where malicious instructions in the context hijack an agent — by first attributing the agent’s action to a small, causally influential subset of the context, then applying explicit, human-readable rules to that subset via a monitoring LLM. This two-step approach scales to long contexts and makes decisions explainable rather than opaque classifier scores. Evaluations on tool-use agent benchmarks and long-context datasets show effective detection while maintaining utility; code is open-sourced. 9

Why it matters: agent systems blur text and commands, so “what triggered this tool call?” is the central question. Industry guides now emphasize layered defenses (separate system prompts, tool-call validation, output filtering, least privilege, secondary judge models). Production detection must correlate behavior over an 8-stage attack chain — from poisoned data to data exfiltration — with signals across app, cloud, and kernel layers. AgentWatcher complements these practices by narrowing the evidence to what causally mattered. 10 11

Real-world developers report that agents can execute hidden instructions from web pages or PDFs, and sandboxing plus allow/deny prompts and whitelisting remain practical mitigations. A rules-plus-attribution monitor offers a principled way to reduce false positives in long contexts while preserving transparency — key for incident response and audits. 12

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

S0 Tuning adapts hybrid models (with recurrent state-space layers) by optimizing a single initial state matrix per recurrent layer, freezing all weights. With roughly 48 execution-verified HumanEval training solutions, S0 beats Low-Rank Adaptation (LoRA) by +10.8 percentage points on HumanEval (p < 0.001) — and does so with zero inference overhead. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 boosts greedy pass@1 by +23.6 ± 1.7 pp; on FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% ± 1.3 vs LoRA’s 71.4% ± 2.4. The tuned state is a ~48 MB file; no weight merging or model reload is needed to switch tasks. 13

Cross-domain transfer is significant on MATH-500 (+4.8 pp) and GSM8K (+2.8 pp) but not on Spider (text-to-SQL), consistent with a “trajectory steering” mechanism. A per-step state-offset variant yields +27.1 pp but incurs per-step cost. The upshot: for hybrid recurrent-attention LLMs, recurrent state initialization is a strong parameter-efficient fine-tuning surface when verified supervision is scarce, complementing adapter-based methods like LoRA/QLoRA. 13 14

Context: LoRA remains the go-to for broad transformer families due to maturity and tooling, yet S0’s zero-overhead runtime is compelling for latency-sensitive deployments. For practitioners, the decision hinges on architecture (hybrid vs pure transformer), supervision size, and whether swapping a ~48 MB state per task fits deployment constraints better than adapter stacks. 15 16

Community Pulse

Hacker News (873↑) — Enthusiasm centers on whether Gemma 4’s quality and footprint make it “good enough” to replace closed models for many tasks.

"If they pass what closed models today can do by much, they'll be "good enough" for what I want to do with them. I imagine that's true for many people."

"The real question is memory and compute: can the 26B MoE or 31B dense fit and run fast enough on consumer GPUs to be practical?"

Why It Matters

Open-weight releases are converging on three priorities: permissive licensing (Apache 2.0 for Gemma 4), edge efficiency (4× speed, up to 60% less battery), and agent readiness (function calling, long context, multimodality). In parallel, NVIDIA’s 1M-token, hybrid MoE architecture targets the operational bottlenecks of multi-agent systems: keeping full state and thinking fast. Together, they expand what can run locally and privately without giving up too much capability. 2 6

For teams, the calculus shifts from “Can we?” to “Should we self-host?” With Apache-licensed Gemma 4 and open Nemotron recipes, you can prototype on laptops, graduate to a single 80GB H100, and scale in cloud/on-prem while keeping data in your control — a path that was much harder a year ago. 1 7

Sources 15

Helpful?

Comments (0)