NVIDIA’s Nemotron 3 Super pairs 1M-token context with latent MoE and MTP to push agentic throughput
A hybrid Mamba-Transformer MoE with native 4‑bit pretraining and multi-token prediction lands—plus fresh results in computer-use agents and compact multimodal reasoning.
One-Line Summary
NVIDIA debuts a long-context, hybrid Mamba‑Transformer MoE built for agents, while compact multimodal models hit new highs in computer use and document intelligence—and fresh studies dissect where LLM reasoning breaks.
LLM & SOTA Models
Nemotron 3 Super: A Hybrid Mamba‑Transformer MoE for Agentic Reasoning
Multi-agent apps can multiply token usage up to 15x per session; Nemotron 3 Super tackles this with a native 1,000,000‑token window and a hybrid backbone that mixes Mamba‑2 layers (linear-time sequence handling) with interleaved Transformer attention for precise recall. It is a Mixture‑of‑Experts (MoE) model with 120B total parameters but only 12B active, yielding over 5x the throughput of the prior Super and aimed squarely at reducing the “thinking tax” of large reasoning models. 1
Two standout design choices: latent MoE compresses tokens before routing so the model can consult 4x more experts at the same cost, and multi‑token prediction (MTP) forecasts several tokens per forward pass, providing built‑in speculative decoding with reported wall‑clock gains of up to 3x on structured generation such as code and tool calls. Native NVFP4 pretraining, optimized for Blackwell, cuts memory and speeds inference ~4x on B200 versus FP8 on H100 while preserving accuracy. 1
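The MTP-style speculative decoding described above can be sketched as a draft-and-verify loop. This is a minimal illustration under stated assumptions, not Nemotron's implementation: `draft_tokens` and `verify` are hypothetical stand-ins for the model's MTP heads and main forward pass, and the 0.8 acceptance rate is assumed for the simulation.

```python
import random

random.seed(0)  # deterministic simulation

def draft_tokens(prefix, k=4):
    # Hypothetical stand-in for MTP heads: propose k future tokens
    # in a single forward pass.
    return [f"tok{len(prefix) + i}" for i in range(k)]

def verify(prefix, drafts):
    # Hypothetical stand-in for the main model's forward pass: accept
    # the longest prefix of drafted tokens that matches its own choices,
    # then emit one corrected token of its own.
    accepted = []
    for t in drafts:
        if random.random() < 0.8:  # assumed per-token acceptance rate
            accepted.append(t)
        else:
            break
    return accepted, f"fix{len(prefix) + len(accepted)}"

def generate(n_tokens, k=4):
    out = []
    verify_passes = 0
    while len(out) < n_tokens:
        drafts = draft_tokens(out, k)
        accepted, corrected = verify(out, drafts)
        out.extend(accepted)
        out.append(corrected)  # the verifier always yields one token
        verify_passes += 1     # one expensive pass per round
    return out[:n_tokens], verify_passes

tokens, passes = generate(100)
print(len(tokens), passes)  # typically far fewer verify passes than tokens
```

Each verify pass emits at least one token, so wall-clock speedup tracks the average number of accepted drafts per round.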
Training spans a 25T‑token pretraining corpus (10T unique), ~7M SFT examples, and reinforcement learning across 21 environments with >1.2M rollouts via NeMo Gym/RL to align with multi‑step agent behavior. On PinchBench (the OpenClaw agent‑brain test), Super posts 85.6%, topping its open peers. Weights, datasets, and recipes are open, enabling on‑prem deployment for “sovereign AI.” 1
Why enterprises care: with open weights, firms can keep code and PII inside firewalls—Scalexa dubs this “Sovereign Intelligence,” advocating on‑prem NIMs/Ollama and Trusted Execution Environments to protect IP. This directly addresses lingering privacy fears around cloud training reuse. 2
Nemotron‑Cascade 2: Small‑active MoE Chases Olympiad‑level Reasoning
Nemotron‑Cascade 2 is a 30B MoE with just 3B active parameters that approaches frontier open models on math and coding—reportedly earning Gold‑medal‑level performance on the 2025 IMO, IOI, and ICPC Finals. The recipe extends “Cascade RL” across more reasoning/agentic domains after broad SFT, then uses multi‑domain on‑policy distillation from strong intermediate teachers to stabilize gains and recover regressions. 3
Its SFT corpus is unusually comprehensive: long‑context (up to 256K packed), STEM with tool‑using Python traces (1.8M) and non‑tool traces (1.9M), proof generation/verification (≈810K), 125K agentic SWE trajectories plus 389K agentless code‑repair/test data, and a 490K “Terminal Agent” dataset generated in Docker via execution‑feedback loops. This blends “thinking” and “non‑thinking” modes depending on task. 3
RL stages include instruction‑following without a reward model (to maximize objective verifiability), multi‑domain RL (≈55% MCQA, 30% agentic tool calling, 15% structured output), RLHF with a Generative Reward Model (Qwen3‑235B variants), plus domain‑targeted long‑context and code RL. The on‑policy distillation objective applies a reverse‑KL token‑level advantage with truncated importance weighting to align student and domain teachers. 3
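The distillation objective above can be sketched at the token level: for tokens sampled from the student (on-policy), the teacher–student log-probability gap acts as a reverse-KL advantage, scaled by a truncated importance weight. This is a schematic reading of the described objective, not the paper's exact formulation; the clip value is an assumed hyperparameter.

```python
import math

def distill_advantages(student_logps, teacher_logps, clip=2.0):
    """Token-level reverse-KL advantages with truncated importance
    weights, for tokens sampled from the student (on-policy).

    student_logps / teacher_logps: per-token log-probabilities each
    policy assigns to the sampled token. `clip` is an assumed
    hyperparameter, not taken from the paper.
    """
    advantages = []
    for ls, lt in zip(student_logps, teacher_logps):
        # Reverse-KL signal: push the student toward tokens the
        # teacher rates higher than the student currently does.
        adv = lt - ls
        # Truncating the importance ratio bounds gradient variance
        # when student and teacher disagree strongly.
        w = min(math.exp(lt - ls), clip)
        advantages.append(w * adv)
    return advantages

# Token 2: the teacher strongly prefers it, so the advantage is large
# and positive, but the importance weight is capped at `clip`.
advs = distill_advantages([-1.0, -0.5, -3.0], [-1.2, -0.4, -0.2])
```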
Holo3: Computer‑Use Agent Hits 78.85% on OSWorld‑Verified with 10B Active Params
H Company unveils Holo3, a computer-use specialist scoring 78.85% on OSWorld‑Verified while activating just 10B of its 122B parameters. It’s trained with an “agentic flywheel”: synthetic navigation data, out‑of‑domain augmentation, and curated reinforcement learning tuned to real enterprise workflows. A Synthetic Environment Factory generates websites and end‑to‑end verifiable tasks, and H Corporate Benchmarks cover 486 multi‑step tasks across e‑commerce, business software, collaboration, and multi‑app setups. 4
The key pitch is production readiness: higher success rates than larger baselines at comparable grounding/localization, and open weights for Holo3‑35B‑A3B under Apache‑2.0, with all models available via an inference API (free tier included). The roadmap targets “Adaptive Agency,” i.e., learning novel enterprise UIs in real time. 4
Granite 4.0 3B Vision: A 3B VLM for Tables, Charts, and KVPs
IBM’s Granite 4.0 3B Vision is a compact VLM built as a LoRA adapter atop Granite 4.0 Micro, focused on enterprise document understanding: table extraction, chart reasoning, and semantic key‑value pairs (KVP). A million‑scale ChartNet dataset (1.7M charts across 24 types, 6 plotting libraries) aligns plotting code, rendered images, data tables, summaries, and QA, teaching models what a chart means, not just how it looks. 5
A DeepStack Injection architecture routes abstract visual features earlier and high‑res spatial features later, preserving layout precision. Results: Chart2Summary 86.4% (best), Chart2CSV 62.1% (2nd to Qwen3.5‑9B). For tables, it leads on PubTables‑v2 cropped 92.1 and full‑page 79.3, OmniDocBench 64.0, TableVQA‑extract 88.1; VAREX KVP hits 85.5% EM zero‑shot. Modular deployment falls back to text‑only Granite when vision isn’t needed. 5
Falcon Perception: 0.6B Early‑Fusion Grounding and Segmentation
TII releases Falcon Perception, a 0.6B early‑fusion Transformer that processes image patches and text in one sequence via a hybrid attention mask. It achieves 68.0 Macro‑F1 on SA‑Co (vs 62.3 for SAM 3), with a remaining gap in presence calibration (MCC 0.64 vs 0.82). A “Chain‑of‑Perception” interface predicts coordinate → size → segmentation, using Fourier features and a lightweight mask head. 6
Falcon also introduces PBench to diagnose capability gaps (attributes, OCR‑guided disambiguation, spatial relations, crowded scenes). On complex prompts, Falcon significantly outperforms SAM 3, and it surpasses generalist VLMs on dense scenes. A companion Falcon OCR (0.3B) posts 80.3 on olmOCR and 88.6 on OmniDocBench with best‑in‑class throughput among open OCRs. 6
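The coordinate step of the Chain-of-Perception interface relies on Fourier features for positions. A generic sketch of such a mapping is below; the band count and frequency scaling are assumptions, not Falcon Perception's exact parameterization.

```python
import math

def fourier_features(x, y, num_bands=4):
    """Map a normalized (x, y) coordinate in [0, 1] to a
    higher-dimensional embedding via sin/cos at geometrically
    increasing frequencies, so nearby points get distinguishable
    codes at multiple scales. A generic Fourier-feature mapping;
    band count and scaling are illustrative assumptions."""
    feats = []
    for coord in (x, y):
        for band in range(num_bands):
            freq = (2.0 ** band) * math.pi
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats

emb = fourier_features(0.25, 0.75)
# 2 coords * 4 bands * (sin, cos) = 16 dimensions
```

A lightweight head can then regress size and a mask conditioned on this embedding, which is the general shape of the coordinate → size → segmentation chain described above.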
Reka Model Lineup: Fully Multimodal from 2B to 67B
Reka details a family from Spark (2B) to Core (67B), all fully multimodal (images, audio, video, text) with 128K context. The models handle OCR, charts and tables, and 5‑minute video understanding (longer via streaming), and natively support instruction chaining, coding, function calling, and tool use. Audio output is supported via generated audio tokens. 7
Positioning is by deployment tier: on‑device (2B), laptops/desktops (7B), on‑prem/private cloud (21B), and complex use cases or distillation (67B). The pitch emphasizes “action bots” that can accomplish tasks, not just chat—aligning with the broader turn toward agentic systems. 7
Open Source & Repos
Open Multi‑Agent: Orchestrating Teams with DAG Scheduling
Open Multi‑Agent is a production‑grade orchestration framework for AI agent teams. It’s model‑agnostic (Claude, GPT, local via Ollama/vLLM/LM Studio), provides a shared message bus and memory, and schedules dependent tasks with a DAG so independent steps run in parallel while respecting dependencies. It targets multi‑role setups (planner, implementer, reviewer) common in software and ops. 8
The framework abstracts team collaboration, inter‑agent messaging, and tool access—key plumbing usually rebuilt in each project. For practitioners, this reduces glue code and makes it easier to scale from one agent to coordinated specialists. 8
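DAG scheduling of this kind can be sketched with the standard library's `graphlib.TopologicalSorter`, which exposes ready tasks for parallel dispatch. The task names below are hypothetical examples of the planner/implementer/reviewer pattern, not Open Multi-Agent's API.

```python
from graphlib import TopologicalSorter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical agent tasks: each key depends on the tasks in its set.
graph = {
    "plan": set(),
    "implement": {"plan"},
    "write_tests": {"plan"},            # independent of "implement"
    "review": {"implement", "write_tests"},
}

def run_task(name):
    # Stand-in for dispatching work to an agent.
    return f"{name}: done"

order, outputs = [], {}
ts = TopologicalSorter(graph)
ts.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while ts.is_active():
        ready = ts.get_ready()  # every task whose dependencies are met
        # Independent steps (implement, write_tests) run in parallel.
        for name, result in zip(ready, pool.map(run_task, ready)):
            outputs[name] = result
        order.extend(ready)
        for name in ready:
            ts.done(name)

print(order)  # "plan" first, "review" last
```

The same ready-set loop generalizes to async dispatch; the key property is that `get_ready` only releases a task once all of its predecessors have been marked `done`.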
Open Agent SDK (TypeScript): In‑Process Agent Loop, No CLI
Open Agent SDK runs the full agent loop in‑process—no subprocesses or CLI required—and deploys anywhere (cloud, serverless, Docker, CI/CD). It supports streaming and blocking prompts, sub‑agents, 34 built‑in tools (Bash, Read, Write, Grep, WebSearch), Model Context Protocol (MCP) servers, structured output, budgets, and sandboxing. Environment variables configure third‑party providers (e.g., OpenRouter). 9
Developers can define tools with Zod schemas, spin up MCP servers, persist sessions, fork branches, and instrument lifecycle hooks. A built‑in web chat UI aids testing. For teams standardizing on one SDK across providers, the model catalog and routing simplify multi‑model ops. 9
Claude Code Any: Claude‑Style Coding Agent with Any LLM
Claude Code Any reimplements Claude Code’s coding‑agent CLI but swaps in any backend: OpenAI, DeepSeek, Anthropic, Groq, Together, OpenRouter, or local (Ollama, LM Studio, vLLM). It keeps the full toolchain—file edit, bash, grep, glob, multi‑file planning—and adds smart routing profiles (best/balanced/cheap/private) to send planning vs. bug‑fix vs. search to different models automatically. 10
It detects existing API keys, integrates with OpenClaw workflows, exposes diagnostics (model metadata for 4,108 models from models.dev), and supports Docker/binary installs. For orgs with mixed providers or strict privacy, this is a practical bridge to reuse Claude‑like flows on local or preferred APIs. 10
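Routing profiles of the kind described can be sketched as a small lookup table mapping task types to backends per profile. The model names and task types below are illustrative assumptions, not Claude Code Any's actual configuration.

```python
# Hypothetical routing table in the spirit of the smart routing
# profiles above; names are illustrative only.
PROFILES = {
    "best":    {"planning": "gpt-large",   "bugfix": "deepseek-coder", "search": "gpt-large"},
    "cheap":   {"planning": "local-small", "bugfix": "local-small",    "search": "local-small"},
    "private": {"planning": "ollama/llama", "bugfix": "ollama/llama",  "search": "ollama/llama"},
}

def route(task_type, profile="best"):
    """Pick a backend model for a task type, falling back to the
    profile's planning model for unknown task types."""
    table = PROFILES[profile]
    return table.get(task_type, table["planning"])

model = route("bugfix")           # "deepseek-coder" under "best"
local = route("search", "private")  # everything stays local
```

The "private" profile illustrates the privacy pitch: every task type resolves to a local backend.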
Research Papers
Phi‑4‑reasoning‑vision‑15B: Lessons from Training a Compact Multimodal Reasoner
Microsoft introduces a 15B open‑weight VLM that pushes the accuracy–compute Pareto frontier using careful architecture and data curation. It adopts a mid‑fusion design with a SigLIP‑2 Naflex dynamic‑resolution encoder, showing that dynamic encoders and higher token caps (≈3600) materially boost high‑res tasks like ScreenSpot‑Pro compared to S2/multi‑crop variants. Training uses far fewer multimodal tokens (≈200B) than many peers. 11
The model excels in math/science reasoning and UI grounding, outperforming similarly fast models and rivaling much slower ones. Takeaway: with stronger data filtering/fixing and targeted synthetic augmentation, small VLMs can match or beat larger systems on practical tasks without bloated token budgets. 11
Reasoning Shift: Context Can Silently Shorten LLM Reasoning
A new study finds that the same problem yields up to 50% shorter reasoning traces when surrounded by long irrelevant text, split into turns, or embedded as a subtask—coinciding with drops in self‑verification (e.g., double‑checking). Performance on easy items holds, but harder items may suffer. It’s a reminder that context management in agents is not just about retrieval—it changes test‑time scaling behavior. 12
Complementary coverage describes a “brain activity collapse” under hard questions: internal activation diversity/magnitude contracts as difficulty rises, suggesting fallback to simpler heuristics instead of deeper multi‑step reasoning. This offers a mechanistic signal to detect overconfidence and steer training/eval beyond output accuracy. 13
GSM1k vs GSM8k: Mind the Overfitting Gap
Scale AI’s GSM1k, matched in style to GSM8k, reveals accuracy drops of up to 13% for some model families (e.g., Phi, Mistral), evidence of partial memorization or Goodharting, while frontier models degrade minimally. A positive correlation between a model’s likelihood of generating GSM8k examples and its GSM8k–GSM1k gap underscores that “high score” ≠ “robust reasoning.” Overfit models can still reason, but their scores overstate capability. 14
MiroEval: Measuring Deep Research Agents in Process and Outcome
MiroEval proposes 100 live‑refreshable tasks (70 text, 30 multimodal) grounded in real user needs, then scores systems along three axes: adaptive synthesis quality, agentic factuality (active retrieval + reasoning over web and attachments), and process‑centric audits of how the agent searched and revised. Process quality predicts outcomes and exposes weaknesses missed by output‑only grading; multimodal items are 3–10 points harder. 15
ReCQR: Conversational Query Rewriting for Multimodal Retrieval
ReCQR inserts a “translator” that rewrites vague, context‑dependent user queries into self‑contained prompts before retrieval, boosting recall versus encoding the whole dialogue. The 7K‑dialogue dataset spans text‑only and multi‑image settings, with LLM‑as‑judge filtering and human verification. Models trained with image history significantly outperform text‑only on multi‑image tasks, reducing ambiguity like “that scene.” 16
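The translator step amounts to a pre-retrieval rewrite: resolve references like "that scene" against the dialogue history before embedding the query. A minimal sketch follows; `llm_rewrite` is a hypothetical stand-in for the trained rewriter, and the prompt wording is an assumption.

```python
def rewrite_query(dialogue, query, llm_rewrite):
    """Turn a context-dependent query into a self-contained one
    before retrieval. `llm_rewrite` is a stand-in for the trained
    rewriter model; the prompt format is illustrative."""
    context = "\n".join(f"{turn['role']}: {turn['text']}" for turn in dialogue)
    prompt = (
        "Rewrite the final query so it is fully self-contained, "
        "resolving all references to earlier turns.\n"
        f"Dialogue:\n{context}\nQuery: {query}\nRewritten:"
    )
    return llm_rewrite(prompt)

# Toy rewriter standing in for the actual model.
def toy_rewriter(prompt):
    return "photo of the beach at sunset from the first message"

dialogue = [
    {"role": "user", "text": "I took a photo of the beach at sunset."},
    {"role": "assistant", "text": "Nice shot!"},
]
standalone = rewrite_query(dialogue, "find more like that scene", toy_rewriter)
```

Only the rewritten, self-contained query is then embedded for retrieval, which is why this beats encoding the whole dialogue.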
Human‑in‑the‑Loop Reasoning Verification
A recent NSF‑archived paper explores interactive explanation interfaces to improve how humans verify LLM reasoning, highlighting UI choices that help people catch mistakes in chains of thought—relevant as we push agents to expose process traces. 17
Community Pulse
Hacker News (93↑) — Mixed: excitement about small models matching big benchmarks locally, skepticism about true understanding.
"I'm very happy to read about this progress but I don't find it particularly surprising. The big labs optimize for accuracy/high scores on benchmarks first; I automatically expect that (with some research effort) a model with 100x few parameters can achieve the same scores."
"yeah i know lol, that’s kind of my point. impressive that it runs on your gpu, but it still can’t tell you what happens if you tilt a glass... A four year old can do this and we’re just barely on step one and a half."
Why It Matters
Today’s releases converge on a theme: agent‑ready efficiency. Nemotron 3 Super and Holo3 target long‑horizon, tool‑using workflows with architectural changes (Mamba, latent MoE, MTP) and training flywheels to tame token costs and context sprawl. Meanwhile, compact VLMs (Phi‑4‑RV, Granite 3B, Falcon 0.6B) show that smart data and fusion choices can beat bigger models where it counts—screens and documents. 1 4 11 5 6
But research warns that context can change how models “think,” scores can Goodhart, and internal activations may collapse under strain—arguing for process‑aware evals like MiroEval and tooling that detects shallow reasoning. Expect the next wave to blend efficient architectures with robustness checks that look beyond final answers. 12 14 15 13