NVIDIA’s Nemotron 3 Super bets on Mamba-Transformer MoE and 1M context to power autonomous agents
A 12B-active hybrid MoE with latent routing, multi-token prediction, and native 4-bit pretraining lands — plus a vLLM-backed retriever update and a 7B on-device web agent. Benchmarks are up; so are the system questions.
One-Line Summary
NVIDIA ships a 120B open hybrid MoE built for agents, Microsoft releases a 7B computer-use agent, and new papers probe agent safety and multimodal process evaluation.
LLM & SOTA Models
Nemotron 3 Super: A Hybrid Mamba–Transformer MoE for Agentic Reasoning
NVIDIA releases Nemotron 3 Super, a fully open hybrid Mixture-of-Experts (MoE) model with 120B total and 12B active parameters, built for multi-agent workloads that can generate up to 15x more tokens than chat. It pairs a native 1M-token context window with more than 5x the throughput of the prior Super, aiming to reduce the multi-agent "thinking tax" and long-context "goal drift." Mamba-2 state-space layers provide linear-time sequence handling, interleaved Transformer attention layers handle precise recall, and the model posts 85.6% on the PinchBench OpenClaw-agent test, the best score in its open class. Weights, datasets, and training recipes are open for customization. 1
Key ingredients: Latent MoE compresses token representations so Super can consult 4x more experts at the same inference cost; Multi-Token Prediction (MTP) forecasts several future tokens per pass, enabling built-in speculative decoding for up to 3x wall-clock speedups; and native 4-bit NVFP4 pretraining (optimized for Blackwell) cuts memory and yields 4x faster inference on B200 versus FP8 on H100 while maintaining accuracy. Training spans a 25T-token pretrain, ~7M SFT samples, and reinforcement learning across 21 NeMo Gym environments with 1.2M+ rollouts for robust multi-step behavior. 1
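The MTP-style speculative decoding described above can be sketched with toy stand-ins: an imperfect draft head guesses several future tokens per pass, and the base model accepts the longest agreeing prefix, so each verification pass can yield more than one token. Everything here (`mtp_draft`, `base_next`, the toy token rule) is illustrative, not Nemotron's API.

```python
# Toy sketch of speculative decoding via multi-token prediction (MTP).
# A real MTP head predicts k future tokens in one forward pass; the base
# model then verifies them and keeps the longest agreeing prefix.

def mtp_draft(context, k=3):
    """Hypothetical MTP head: guess k future tokens in one pass."""
    guesses = [context[-1] + i + 1 for i in range(k)]
    if len(context) % 2 == 0:
        guesses[-1] += 10  # simulate an occasional wrong long-range guess
    return guesses

def base_next(context):
    """Hypothetical base model: the 'ground truth' next token."""
    return context[-1] + 1

def speculative_step(context, k=3):
    """One decode step: draft k tokens, verify, accept the matching prefix."""
    draft = mtp_draft(context, k)
    accepted = []
    for tok in draft:
        if tok == base_next(context + accepted):
            accepted.append(tok)
        else:
            break
    if not accepted:
        # Guarantee progress: fall back to the verifier's own next token.
        accepted.append(base_next(context))
    return context + accepted, len(accepted)

ctx, produced, passes = [0], 0, 0
while produced < 9:
    ctx, n = speculative_step(ctx)
    produced += n
    passes += 1
print(ctx, passes)  # fewer verification passes than tokens produced
```

The speedup comes entirely from how often the draft head agrees with the base model, which is why MTP heads trained jointly with the base model are attractive: no separate draft model to serve.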
In practice, Super is positioned to orchestrate complex plans while smaller peers (Nemotron 3 Nano) execute targeted steps—the “Super + Nano” pattern—especially for software development and cybersecurity triage. A hands-on report shows the 120B-class model loading on a consumer RTX 4060 Ti 16GB via LM Studio by dialing down active experts and GPU offload, sustaining roughly 6 tokens/second—underscoring MoE’s partial-activation advantages for local experimentation despite constrained VRAM. 2
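Some back-of-envelope arithmetic shows why the 120B-total/12B-active split matters for local runs like the RTX 4060 Ti report: weight storage scales with total parameters, while per-token compute scales only with active ones. The 2-FLOPs-per-parameter rule of thumb and the byte math below are standard estimates, not NVIDIA figures.

```python
# Back-of-envelope sketch: why a 120B-total / 12B-active MoE can be
# worth trying on modest hardware. Numbers are illustrative only.

TOTAL_PARAMS = 120e9    # all experts, stored across VRAM/RAM with offload
ACTIVE_PARAMS = 12e9    # parameters actually touched per token

def weight_gib(params, bits):
    """Weight footprint in GiB at the given bit width."""
    return params * bits / 8 / 2**30

# Storage: every weight must live somewhere, even if rarely used.
print(f"4-bit total weights: {weight_gib(TOTAL_PARAMS, 4):.0f} GiB")

# Compute: per-token FLOPs scale with *active* params (~2 FLOPs/param).
flops_per_token = 2 * ACTIVE_PARAMS
print(f"per-token compute: {flops_per_token / 1e9:.0f} GFLOPs "
      f"(vs {2 * TOTAL_PARAMS / 1e9:.0f} for a dense 120B)")
```

Roughly 56 GiB of 4-bit weights will not fit in 16 GB of VRAM, but with CPU offload the per-token compute of a 12B-active model remains tractable, which matches the ~6 tokens/second consumer-GPU experience described above.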
Zooming out, ecosystem pieces around agentic compute are forming: commentary highlights NVIDIA’s Agent Toolkit with Nemotron models, AI-Q routing to balance cost/quality, and OpenShell sandboxing to curb agent risk—paired with the company’s Rubin platform for agentic inference bandwidth/latency demands. It argues open software atop CUDA creates a sticky platform, projecting up to 50% query cost cuts through intelligent routing while noting persistent risks from LLM unpredictability. 3
Open Source & Repos
Fara-7B: An Efficient Agentic Model for Computer Use
Microsoft Research introduces Fara-7B, an open-weight 7B-parameter computer-use agent that sees screenshots and takes actions (scroll, click, type) by predicting coordinates, with no accessibility-tree parsing or separate screen parser required. It is small enough to run on-device (Copilot+ PCs), reducing latency and preserving privacy. The team built a synthetic data pipeline producing 145,000 trajectories (~1M steps), each verified by multiple agents, to train multi-step web-task competence. 4
On benchmarks, Fara-7B reaches 73.5% on WebVoyager, 34.1% on Online-Mind2Web, 26.2% on DeepShop, and 38.4% on the new WebTailBench—outperforming prior 7B computer-use agents and approaching much larger multi-model orchestration systems that are costlier to run. Crucially, it averages ~16 steps per task versus ~41 for UI-TARS-1.5-7B, suggesting stronger step-efficiency—key for agentic reliability and cost. The model is available under MIT license on Foundry and Hugging Face with a quantized, silicon-optimized build for Windows devices. 4
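The screenshot-in, coordinates-out loop that Fara-7B is described as using can be sketched as below. Every name here (`predict_action`, the `Action` schema, the scripted trace) is hypothetical; the real model's action space and harness will differ.

```python
# Minimal sketch of a coordinate-predicting computer-use loop in the
# style Fara-7B's description implies. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click" | "type" | "scroll" | "done"
    x: int = 0
    y: int = 0
    text: str = ""

def predict_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Stand-in for the model call: screenshot pixels in, one action out.
    No accessibility tree is consulted, only the image and the goal."""
    script = [Action("click", 640, 120),
              Action("type", text="mechanical keyboard"),
              Action("done")]
    return script[min(step, len(script) - 1)]

def run_task(goal: str, max_steps: int = 16) -> list:
    trace = []
    for step in range(max_steps):       # step budget ~ Fara-7B's ~16 avg
        shot = b"..."                   # placeholder screenshot capture
        act = predict_action(shot, goal, step)
        trace.append(act)
        if act.kind == "done":
            break
        # a real harness would dispatch the click/type to the OS here
    return trace

trace = run_task("search for a mechanical keyboard")
print([a.kind for a in trace])  # ['click', 'type', 'done']
```

The step budget is the interesting design lever: an agent that finishes in ~16 steps instead of ~41 makes fewer model calls, accumulates less context, and has fewer chances to go off-policy.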
Related enterprise agent orchestration is maturing across Microsoft’s stack: reports describe multi-model Copilot Researcher sessions (adding Anthropic Claude), Copilot Cowork for long-running tasks, and Copilot Studio’s multi-agent orchestration, Prompt Builder, and governance—with Agent-to-Agent (A2A) protocols and Fabric integration—signaling a shift from single-model chat to layered, auditable workflows. While not specific to Fara-7B, this context explains where compact, local CUA models fit in larger systems. 5 6
NeMo-Retriever: vLLM-backed Nemotron Parse and GPU Scheduling Heuristics
A major NeMo-Retriever update replaces the older OCR pipeline with a new vLLM-backed Nemotron Parse path, including a 635-line nemotron_parse module, a 470-line postprocessor, and a chat-completions client, aimed at faster, more scalable batching and KV-cache handling. The executor automatically enlarges the batch size for NemotronParse actors and reserves exclusive GPUs where needed, reflecting vLLM's scheduling requirements. The total change spans +1449/-576 lines across 17 files. 7
Ingest graphs now render PDF pages to images for Nemotron Parse, propagate model/endpoint configuration, and route parse actors through GPU-aware operators with cluster resource heuristics. A monkey-patch workaround addresses a vLLM NemotronParseProcessor duplicate-kwargs bug to stabilize production inference. For builders, this means more robust document parsing as a microservice for downstream generative apps. 7
Practically, the switch centralizes batching in the vLLM engine, lifts throughput via continuous batching, and removes Transformers cache incompatibilities noted with newer versions—useful for long-context, image-to-structured-text extraction pipelines that feed agent reasoning with structured documents. 7
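The duplicate-kwargs monkey-patch mentioned above follows a familiar pattern: replace the offending method at startup with a version that drops the colliding keyword before delegating. The `Processor` class below is a stand-in for illustration, not the actual vLLM NemotronParseProcessor internals.

```python
# General shape of a duplicate-kwargs monkey-patch workaround.
# `Processor` and its bug are illustrative stand-ins.

class Processor:
    def preprocess(self, image, **kwargs):
        # Buggy upstream behavior: forwards return_tensors positionally
        # AND via **kwargs when the caller also supplied it, raising
        # "got multiple values for keyword argument 'return_tensors'".
        return self._run(image, return_tensors="pt", **kwargs)

    def _run(self, image, return_tensors="pt", **kwargs):
        return {"image": image, "return_tensors": return_tensors}

def _patched_preprocess(self, image, **kwargs):
    # Workaround: drop the colliding kwarg so it is passed exactly once.
    kwargs.pop("return_tensors", None)
    return self._run(image, return_tensors="pt", **kwargs)

# Applied once at import/startup time, before any inference traffic.
Processor.preprocess = _patched_preprocess

# Caller code that previously crashed now works:
out = Processor().preprocess("page.png", return_tensors="pt")
print(out["return_tensors"])  # pt
```

Monkey-patches like this are fragile across upstream releases, which is presumably why the PR flags it as a workaround rather than a fix; pinning the vLLM version until the bug is resolved upstream is the usual companion step.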
Research Papers
AgentHazard: Measuring Harm Emergence in Computer-Use Agents
AgentHazard introduces a 2,653-instance benchmark exposing how harmful outcomes can arise from sequences of individually plausible steps in computer-use agents, testing whether systems can recognize and interrupt harm that emerges from accumulated context, repeated tool use, and step dependencies. Evaluations on Claude Code, OpenClaw, and IFlow (with open or openly deployable Qwen3/Kimi/GLM/DeepSeek models) show high vulnerability; notably, Claude Code powered by Qwen3-Coder hits a 73.63% attack success rate, showing that model alignment alone isn't sufficient for safety. 8
A developer report echoes this gap: probing a LangGraph ReAct agent (Groq Llama-3.3-70B) reveals two critical tool-misuse issues—SQL injection passing through tool arguments despite the LLM “warning,” and path traversal reading /etc/passwd—pinpointing the unguarded 200ms “tool execution gap” between LLM decision and function execution. New probes in agent-probe v0.6.0 target input validation, SSRF via tool params, and chained exfiltration. 9
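Closing that tool-execution gap means validating arguments after the model decides and before the function runs. Below is a minimal sketch with illustrative checks and tool names, not agent-probe's actual probes.

```python
# Sketch of guarding the "tool execution gap": re-validate every tool
# argument in the harness, regardless of what the LLM "intended".
import os
import re

ALLOWED_ROOT = os.path.realpath("/srv/agent-data")  # illustrative sandbox

def guard_sql(query: str) -> str:
    # Reject stacked statements and comment-based injection attempts.
    if ";" in query or re.search(r"--|/\*", query):
        raise ValueError("rejected: suspicious SQL in tool argument")
    return query

def guard_path(path: str) -> str:
    # Resolve symlinks and '..', then confine reads to the allowed root.
    real = os.path.realpath(os.path.join(ALLOWED_ROOT, path))
    if not real.startswith(ALLOWED_ROOT + os.sep):
        raise ValueError("rejected: path escapes sandbox")
    return real

def execute_tool(name, args, registry, guards):
    # The gap lives here: the model already "approved" the call, so the
    # harness must independently re-check each argument.
    checked = {k: guards.get((name, k), lambda v: v)(v)
               for k, v in args.items()}
    return registry[name](**checked)

blocked = []
for name, args in [("read_file", {"path": "../../etc/passwd"}),
                   ("run_sql", {"query": "SELECT 1; DROP TABLE users"})]:
    try:
        execute_tool(name, args,
                     registry={"read_file": lambda path: path,
                               "run_sql": lambda query: query},
                     guards={("read_file", "path"): guard_path,
                             ("run_sql", "query"): guard_sql})
    except ValueError:
        blocked.append(name)
print(blocked)  # ['read_file', 'run_sql']
```

The point of the sketch is placement, not the specific regexes: validation that lives in the harness fires deterministically, whereas an LLM "warning" about its own tool call, as in the report above, does not stop execution.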
Community curation around leaked Claude Code materials highlights growing interest in dissecting agent memory, tool-calling, and safety scaffolding—reinforcing AgentHazard’s focus on process-level defenses rather than solely LLM prompt-level safeguards. 10
Agentic-MME: Process-Verified Multimodal Agent Evaluation
Agentic-MME proposes a process-verified benchmark for Multimodal Large Language Models (MLLMs) that act as agents via visual and knowledge expansion. It comprises 418 real-world tasks across 6 domains and 3 difficulty levels with over 2,000 stepwise checkpoints (10+ human-hours annotation per task), auditing intermediate states rather than only final answers—plus an “overthinking” metric to quantify efficiency versus human trajectories. 11
Results set a sobering baseline: the top model, Gemini 3 Pro, reaches 56.3% overall and drops to 23.0% on the hardest Level-3 tasks, underscoring the difficulty of real multimodal problem solving. The framework also supports sandboxed code/APIs and evaluates whether tools were actually invoked and applied correctly—closing a common eval blind spot. 11
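Process-level scoring in the spirit of Agentic-MME can be sketched as predicates over the agent's step log, plus a simple steps-versus-human ratio for overthinking. The metric definitions below are our own simplification, not the paper's formulas.

```python
# Toy sketch of process-verified evaluation: credit intermediate
# checkpoints, not just the final answer, and measure "overthinking"
# as agent steps relative to a reference human trajectory.

def score_trajectory(agent_steps, checkpoints, human_len):
    hit = sum(1 for cp in checkpoints if cp(agent_steps))
    process_score = hit / len(checkpoints)
    final_ok = bool(checkpoints[-1](agent_steps))  # last checkpoint = answer
    overthinking = len(agent_steps) / human_len    # >1 means extra steps
    return {"process": process_score, "final": final_ok,
            "overthinking": round(overthinking, 2)}

# Checkpoints are predicates over the step log: was the right tool
# called, was it applied correctly, is the answer present?
checkpoints = [
    lambda s: any(step.startswith("tool:search") for step in s),
    lambda s: any(step.startswith("tool:calc") for step in s),
    lambda s: "answer:42" in s,
]
steps = ["tool:search q=...", "tool:search q=...",  # redundant retry
         "tool:calc 6*7", "answer:42"]
print(score_trajectory(steps, checkpoints, human_len=3))
```

Even this toy version captures the eval blind spot the paper targets: a trajectory that guesses "answer:42" without ever calling the calculator tool would score on the final checkpoint but fail the process ones.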
Industry writeups on agentic models (e.g., Qwen 3.6 Plus) emphasize 1M-token contexts and always-on reasoning for stability across steps, but also note that scores vary by scaffolding on agentic benchmarks like Terminal-Bench and MCPMark—supporting Agentic-MME’s premise that process and orchestration matter as much as base model quality. 12 13
CoME-VL: Fusing Contrastive and Self-Supervised Vision Encoders
CoME-VL explores fusing complementary visual representations—contrastive (CLIP-style) and self-supervised (DINO)—for vision-language models via entropy-guided multi-layer aggregation with orthogonality constraints and RoPE-enhanced cross-attention to align heterogeneous token grids. Injected into a decoder-only LLM with minimal pipeline changes, the fused tokens improve visual understanding by an average 4.9% and grounding by 5.4% across benchmarks, achieving state-of-the-art RefCOCO detection. 14
Ablations on layer merging, non-redundant feature mixing, and fusion capacity show that contrastive and self-supervised signals contribute different, additive strengths—suggesting future VLMs should treat encoders as a modular ensemble rather than a single choice. For application builders, this argues for representation-level fusion to gain robustness without heavy architectural overhauls. 14
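One illustrative reading of entropy-guided aggregation: weight each encoder layer by how decisive (low-entropy) its feature distributions are. This toy sketch is our own simplification of the idea, not CoME-VL's formulation.

```python
# Toy sketch of entropy-guided layer weighting: lower-entropy (more
# decisive) feature distributions receive larger aggregation weights.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def entropy(dist):
    return -sum(p * math.log(p + 1e-12) for p in dist)

# Two encoder "layers", each a score vector over 4 feature channels for
# one token (toy numbers): layer A is decisive, layer B is diffuse.
layer_feats = {"A": [4.0, 0.1, 0.0, 0.2], "B": [1.0, 0.9, 1.1, 1.0]}
ents = {k: entropy(softmax(v)) for k, v in layer_feats.items()}

# Negate entropies so that low entropy maps to high weight.
names = list(ents)
weights = softmax([-ents[k] for k in names])
fused = [sum(w * f for w, f in zip(weights, col))
         for col in zip(*(layer_feats[k] for k in names))]

print({k: round(ents[k], 2) for k in names})
print([round(w, 2) for w in weights])
```

In a real VLM the same weighting would apply per token across many layers of two encoders, with orthogonality constraints keeping the fused features non-redundant; the sketch only shows the weighting mechanism itself.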
Decipher-MR: A 3D MRI Vision-Language Foundation Model
Decipher-MR trains a 3D MRI-specific vision-language foundation model on 200,000 MRI series from 22,000+ studies, combining self-supervised vision learning with report-guided text supervision. It supports a modular design where lightweight decoders are tuned on top of a frozen encoder to address disease classification, demographic prediction, anatomical localization, and cross-modal retrieval—consistently outperforming task-specific and general FM baselines on reported datasets. 15
While weights and proprietary data aren’t open due to privacy/IP constraints, the methods reference open frameworks (DINOv2, OpenCLIP, HF Trainer) and provide enough detail for reproduction attempts. The work reflects a broader shift toward domain-specific multimodal FMs where paired clinical text guides robust 3D representations for downstream tasks. 15
Why It Matters
Agentic AI is moving from demos to engineering: Nemotron 3 Super targets the compute and context realities of long-running agents with MoE, Mamba, and native 4-bit training; compact agents like Fara-7B make on-device computer-use practical; and new benchmarks (AgentHazard, Agentic-MME) raise the bar by checking process safety and stepwise tool use. The throughline is clear: orchestration, safety at the tool boundary, and long-context efficiency are now first-class design goals. 1 4 8 11
For practitioners, expect two practical shifts: platform lock-in pressures as agent stacks standardize around specific toolchains and hardware (CUDA/Rubin), and rising evaluation overhead to verify not just answers, but the journey—inputs, tool calls, and guardrails. Investing in harnesses, validators, and multi-model orchestration will often beat chasing the next raw model upgrade. 3 9