NVIDIA’s Nemotron 3 Super bets on Mamba-Transformer MoE and 1M context to power autonomous agents
A 12B-active hybrid MoE with latent routing, multi-token prediction, and native 4-bit pretraining lands — plus a vLLM-backed retriever update and a 7B on-device web agent. Benchmarks are up; so are the system questions.
One-Line Summary
NVIDIA ships a 120B open hybrid MoE built for agents, Microsoft releases a 7B computer-use agent, and new papers probe agent safety and multimodal process evaluation.
LLM & SOTA Models
Nemotron 3 Super: A Hybrid Mamba–Transformer MoE for Agentic Reasoning
NVIDIA releases Nemotron 3 Super, a fully open hybrid Mixture-of-Experts (MoE) model with 120B total and 12B active parameters, built for multi-agent workloads that can generate up to 15x more tokens than chat. It pairs a native 1M-token context window with more than 5x the throughput of the prior Super, aiming to reduce the multi-agent "thinking tax" and long-context "goal drift." Mamba-2 state-space layers provide linear-time sequence handling, interleaved Transformer attention layers handle precise recall, and the model posts 85.6% on the PinchBench OpenClaw-agent test, the best score in its open class. Weights, datasets, and training recipes are open for customization. 1
Key ingredients: Latent MoE compresses token representations so Super can consult 4x more experts at the same inference cost; Multi-Token Prediction (MTP) forecasts several future tokens per pass, enabling built-in speculative decoding for up to 3x wall-clock speedups; and native 4-bit NVFP4 pretraining (optimized for Blackwell) cuts memory and yields 4x faster inference on B200 versus FP8 on H100 while maintaining accuracy. Training spans a 25T-token pretrain, ~7M SFT samples, and reinforcement learning across 21 NeMo Gym environments with 1.2M+ rollouts for robust multi-step behavior. 1
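The MTP-style speculative decoding described above can be sketched with toy stand-ins: an imperfect draft head guesses several future tokens per pass, and the base model accepts the longest agreeing prefix, so each verification pass can yield more than one token. Everything here (`mtp_draft`, `base_next`, the toy token rule) is illustrative, not Nemotron's API.

```python
# Toy sketch of speculative decoding via multi-token prediction (MTP).
# A real MTP head predicts k future tokens in one forward pass; the base
# model then verifies them and keeps the longest agreeing prefix.

def mtp_draft(context, k=3):
    """Hypothetical MTP head: guess k future tokens in one pass."""
    guesses = [context[-1] + i + 1 for i in range(k)]
    if len(context) % 2 == 0:
        guesses[-1] += 10  # simulate an occasional wrong long-range guess
    return guesses

def base_next(context):
    """Hypothetical base model: the 'ground truth' next token."""
    return context[-1] + 1

def speculative_step(context, k=3):
    """One decode step: draft k tokens, verify, accept the matching prefix."""
    draft = mtp_draft(context, k)
    accepted = []
    for tok in draft:
        if tok == base_next(context + accepted):
            accepted.append(tok)
        else:
            break
    if not accepted:
        # Guarantee progress: fall back to the verifier's own next token.
        accepted.append(base_next(context))
    return context + accepted, len(accepted)

ctx, produced, passes = [0], 0, 0
while produced < 9:
    ctx, n = speculative_step(ctx)
    produced += n
    passes += 1
print(ctx, passes)  # fewer verification passes than tokens produced
```

The speedup comes entirely from how often the draft head agrees with the base model, which is why MTP heads trained jointly with the base model are attractive: no separate draft model to serve.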
In practice, Super is positioned to orchestrate complex plans while smaller peers (Nemotron 3 Nano) execute targeted steps—the “Super + Nano” pattern—especially for software development and cybersecurity triage. A hands-on report shows the 120B-class model loading on a consumer RTX 4060 Ti 16GB via LM Studio by dialing down active experts and GPU offload, sustaining roughly 6 tokens/second—underscoring MoE’s partial-activation advantages for local experimentation despite constrained VRAM. 2
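Some back-of-envelope arithmetic shows why the 120B-total/12B-active split matters for local runs like the RTX 4060 Ti report: weight storage scales with total parameters, while per-token compute scales only with active ones. The 2-FLOPs-per-parameter rule of thumb and the byte math below are standard estimates, not NVIDIA figures.

```python
# Back-of-envelope sketch: why a 120B-total / 12B-active MoE can be
# worth trying on modest hardware. Numbers are illustrative only.

TOTAL_PARAMS = 120e9    # all experts, stored across VRAM/RAM with offload
ACTIVE_PARAMS = 12e9    # parameters actually touched per token

def weight_gib(params, bits):
    """Weight footprint in GiB at the given bit width."""
    return params * bits / 8 / 2**30

# Storage: every weight must live somewhere, even if rarely used.
print(f"4-bit total weights: {weight_gib(TOTAL_PARAMS, 4):.0f} GiB")

# Compute: per-token FLOPs scale with *active* params (~2 FLOPs/param).
flops_per_token = 2 * ACTIVE_PARAMS
print(f"per-token compute: {flops_per_token / 1e9:.0f} GFLOPs "
      f"(vs {2 * TOTAL_PARAMS / 1e9:.0f} for a dense 120B)")
```

Roughly 56 GiB of 4-bit weights will not fit in 16 GB of VRAM, but with CPU offload the per-token compute of a 12B-active model remains tractable, which matches the ~6 tokens/second consumer-GPU experience described above.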
Zooming out, ecosystem pieces around agentic compute are forming: commentary highlights NVIDIA’s Agent Toolkit with Nemotron models, AI-Q routing to balance cost/quality, and OpenShell sandboxing to curb agent risk—paired with the company’s Rubin platform for agentic inference bandwidth/latency demands. It argues open software atop CUDA creates a sticky platform, projecting up to 50% query cost cuts through intelligent routing while noting persistent risks from LLM unpredictability. 3
Open Source & Repos
Fara-7B: An Efficient Agentic Model for Computer Use
Microsoft Research introduces Fara-7B, an open-weight 7B-parameter computer-use agent that sees screenshots and takes actions (scroll, click, type) by predicting coordinates, with no accessibility-tree parsing or separate screen parser required. It is small enough to run on-device (Copilot+ PCs), reducing latency and preserving privacy. The team built a synthetic data pipeline producing 145,000 trajectories (~1M steps), each verified by multiple agents, to train multi-step web-task competence. 4
On benchmarks, Fara-7B reaches 73.5% on WebVoyager, 34.1% on Online-Mind2Web, 26.2% on DeepShop, and 38.4% on the new WebTailBench—outperforming prior 7B computer-use agents and approaching much larger multi-model orchestration systems that are costlier to run. Crucially, it averages ~16 steps per task versus ~41 for UI-TARS-1.5-7B, suggesting stronger step-efficiency—key for agentic reliability and cost. The model is available under MIT license on Foundry and Hugging Face with a quantized, silicon-optimized build for Windows devices. 4
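The screenshot-in, coordinates-out loop that Fara-7B is described as using can be sketched as below. Every name here (`predict_action`, the `Action` schema, the scripted trace) is hypothetical; the real model's action space and harness will differ.

```python
# Minimal sketch of a coordinate-predicting computer-use loop in the
# style Fara-7B's description implies. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click" | "type" | "scroll" | "done"
    x: int = 0
    y: int = 0
    text: str = ""

def predict_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Stand-in for the model call: screenshot pixels in, one action out.
    No accessibility tree is consulted, only the image and the goal."""
    script = [Action("click", 640, 120),
              Action("type", text="mechanical keyboard"),
              Action("done")]
    return script[min(step, len(script) - 1)]

def run_task(goal: str, max_steps: int = 16) -> list:
    trace = []
    for step in range(max_steps):       # step budget ~ Fara-7B's ~16 avg
        shot = b"..."                   # placeholder screenshot capture
        act = predict_action(shot, goal, step)
        trace.append(act)
        if act.kind == "done":
            break
        # a real harness would dispatch the click/type to the OS here
    return trace

trace = run_task("search for a mechanical keyboard")
print([a.kind for a in trace])  # ['click', 'type', 'done']
```

The step budget is the interesting design lever: an agent that finishes in ~16 steps instead of ~41 makes fewer model calls, accumulates less context, and has fewer chances to go off-policy.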
Related enterprise agent orchestration is maturing across Microsoft’s stack: reports describe multi-model Copilot Researcher sessions (adding Anthropic Claude), Copilot Cowork for long-running tasks, and Copilot Studio’s multi-agent orchestration, Prompt Builder, and governance—with Agent-to-Agent (A2A) protocols and Fabric integration—signaling a shift from single-model chat to layered, auditable workflows. While not specific to Fara-7B, this context explains where compact, local CUA models fit in larger systems. 5 6
NeMo-Retriever: vLLM-backed Nemotron Parse and GPU Scheduling Heuristics
A major NeMo-Retriever update replaces the older OCR pipeline with a new vLLM-backed Nemotron Parse path, including a 635-line nemotron_parse module, a 470-line postprocessor, and a chat-completions client, aimed at faster, more scalable batching and KV-cache handling. The executor automatically enlarges the batch size for NemotronParse actors and reserves exclusive GPUs where needed, reflecting vLLM's scheduling requirements. The total change spans +1449/-576 lines across 17 files. 7
Ingest graphs now render PDF pages to images for Nemotron Parse, propagate model/endpoint configuration, and route parse actors through GPU-aware operators with cluster resource heuristics. A monkey-patch workaround addresses a vLLM NemotronParseProcessor duplicate-kwargs bug to stabilize production inference. For builders, this means more robust document parsing as a microservice for downstream generative apps. 7
Practically, the switch centralizes batching in the vLLM engine, lifts throughput via continuous batching, and removes Transformers cache incompatibilities noted with newer versions—useful for long-context, image-to-structured-text extraction pipelines that feed agent reasoning with structured documents. 7
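The duplicate-kwargs monkey-patch mentioned above follows a familiar pattern: replace the offending method at startup with a version that drops the colliding keyword before delegating. The `Processor` class below is a stand-in for illustration, not the actual vLLM NemotronParseProcessor internals.

```python
# General shape of a duplicate-kwargs monkey-patch workaround.
# `Processor` and its bug are illustrative stand-ins.

class Processor:
    def preprocess(self, image, **kwargs):
        # Buggy upstream behavior: forwards return_tensors positionally
        # AND via **kwargs when the caller also supplied it, raising
        # "got multiple values for keyword argument 'return_tensors'".
        return self._run(image, return_tensors="pt", **kwargs)

    def _run(self, image, return_tensors="pt", **kwargs):
        return {"image": image, "return_tensors": return_tensors}

def _patched_preprocess(self, image, **kwargs):
    # Workaround: drop the colliding kwarg so it is passed exactly once.
    kwargs.pop("return_tensors", None)
    return self._run(image, return_tensors="pt", **kwargs)

# Applied once at import/startup time, before any inference traffic.
Processor.preprocess = _patched_preprocess

# Caller code that previously crashed now works:
out = Processor().preprocess("page.png", return_tensors="pt")
print(out["return_tensors"])  # pt
```

Monkey-patches like this are fragile across upstream releases, which is presumably why the PR flags it as a workaround rather than a fix; pinning the vLLM version until the bug is resolved upstream is the usual companion step.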
Research Papers
AgentHazard: Measuring Harm Emergence in Computer-Use Agents
AgentHazard introduces a 2,653-instance benchmark exposing how harmful outcomes can arise from sequences of individually plausible steps in computer-use agents, testing whether systems can recognize and interrupt harm that emerges from accumulated context, repeated tool use, and step dependencies. Evaluations on Claude Code, OpenClaw, and IFlow (with open or openly deployable Qwen3/Kimi/GLM/DeepSeek models) show high vulnerability; notably, Claude Code powered by Qwen3-Coder hits a 73.63% attack success rate, showing that model alignment alone isn't sufficient for safety. 8
A developer report echoes this gap: probing a LangGraph ReAct agent (Groq Llama-3.3-70B) reveals two critical tool-misuse issues—SQL injection passing through tool arguments despite the LLM “warning,” and path traversal reading /etc/passwd—pinpointing the unguarded 200ms “tool execution gap” between LLM decision and function execution. New probes in agent-probe v0.6.0 target input validation, SSRF via tool params, and chained exfiltration. 9
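Closing that tool-execution gap means validating arguments after the model decides and before the function runs. Below is a minimal sketch with illustrative checks and tool names, not agent-probe's actual probes.

```python
# Sketch of guarding the "tool execution gap": re-validate every tool
# argument in the harness, regardless of what the LLM "intended".
import os
import re

ALLOWED_ROOT = os.path.realpath("/srv/agent-data")  # illustrative sandbox

def guard_sql(query: str) -> str:
    # Reject stacked statements and comment-based injection attempts.
    if ";" in query or re.search(r"--|/\*", query):
        raise ValueError("rejected: suspicious SQL in tool argument")
    return query

def guard_path(path: str) -> str:
    # Resolve symlinks and '..', then confine reads to the allowed root.
    real = os.path.realpath(os.path.join(ALLOWED_ROOT, path))
    if not real.startswith(ALLOWED_ROOT + os.sep):
        raise ValueError("rejected: path escapes sandbox")
    return real

def execute_tool(name, args, registry, guards):
    # The gap lives here: the model already "approved" the call, so the
    # harness must independently re-check each argument.
    checked = {k: guards.get((name, k), lambda v: v)(v)
               for k, v in args.items()}
    return registry[name](**checked)

blocked = []
for name, args in [("read_file", {"path": "../../etc/passwd"}),
                   ("run_sql", {"query": "SELECT 1; DROP TABLE users"})]:
    try:
        execute_tool(name, args,
                     registry={"read_file": lambda path: path,
                               "run_sql": lambda query: query},
                     guards={("read_file", "path"): guard_path,
                             ("run_sql", "query"): guard_sql})
    except ValueError:
        blocked.append(name)
print(blocked)  # ['read_file', 'run_sql']
```

The point of the sketch is placement, not the specific regexes: validation that lives in the harness fires deterministically, whereas an LLM "warning" about its own tool call, as in the report above, does not stop execution.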
Community curation around leaked Claude Code materials highlights growing interest in dissecting agent memory, tool-calling, and safety scaffolding—reinforcing AgentHazard’s focus on process-level defenses rather than solely LLM prompt-level safeguards. 10
Agentic-MME: Process-Verified Multimodal Agent Evaluation
Agentic-MME proposes a process-verified benchmark for Multimodal Large Language Models (MLLMs) that act as agents via visual and knowledge expansion. It comprises 418 real-world tasks across 6 domains and 3 difficulty levels with over 2,000 stepwise checkpoints (10+ human-hours annotation per task), auditing intermediate states rather than only final answers—plus an “overthinking” metric to quantify efficiency versus human trajectories. 11
Results set a sobering baseline: the top model, Gemini 3 Pro, reaches 56.3% overall and drops to 23.0% on the hardest Level-3 tasks, underscoring the difficulty of real multimodal problem solving. The framework also supports sandboxed code/APIs and evaluates whether tools were actually invoked and applied correctly—closing a common eval blind spot. 11
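Process-level scoring in the spirit of Agentic-MME can be sketched as predicates over the agent's step log, plus a simple steps-versus-human ratio for overthinking. The metric definitions below are our own simplification, not the paper's formulas.

```python
# Toy sketch of process-verified evaluation: credit intermediate
# checkpoints, not just the final answer, and measure "overthinking"
# as agent steps relative to a reference human trajectory.

def score_trajectory(agent_steps, checkpoints, human_len):
    hit = sum(1 for cp in checkpoints if cp(agent_steps))
    process_score = hit / len(checkpoints)
    final_ok = bool(checkpoints[-1](agent_steps))  # last checkpoint = answer
    overthinking = len(agent_steps) / human_len    # >1 means extra steps
    return {"process": process_score, "final": final_ok,
            "overthinking": round(overthinking, 2)}

# Checkpoints are predicates over the step log: was the right tool
# called, was it applied correctly, is the answer present?
checkpoints = [
    lambda s: any(step.startswith("tool:search") for step in s),
    lambda s: any(step.startswith("tool:calc") for step in s),
    lambda s: "answer:42" in s,
]
steps = ["tool:search q=...", "tool:search q=...",  # redundant retry
         "tool:calc 6*7", "answer:42"]
print(score_trajectory(steps, checkpoints, human_len=3))
```

Even this toy version captures the eval blind spot the paper targets: a trajectory that guesses "answer:42" without ever calling the calculator tool would score on the final checkpoint but fail the process ones.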
Industry writeups on agentic models (e.g., Qwen 3.6 Plus) emphasize 1M-token contexts and always-on reasoning for stability across steps, but also note that scores vary by scaffolding on agentic benchmarks like Terminal-Bench and MCPMark—supporting Agentic-MME’s premise that process and orchestration matter as much as base model quality. 12 13
CoME-VL: Fusing Contrastive and Self-Supervised Vision Encoders
CoME-VL explores fusing complementary visual representations—contrastive (CLIP-style) and self-supervised (DINO)—for vision-language models via entropy-guided multi-layer aggregation with orthogonality constraints and RoPE-enhanced cross-attention to align heterogeneous token grids. Injected into a decoder-only LLM with minimal pipeline changes, the fused tokens improve visual understanding by an average 4.9% and grounding by 5.4% across benchmarks, achieving state-of-the-art RefCOCO detection. 14
Ablations on layer merging, non-redundant feature mixing, and fusion capacity show that contrastive and self-supervised signals contribute different, additive strengths—suggesting future VLMs should treat encoders as a modular ensemble rather than a single choice. For application builders, this argues for representation-level fusion to gain robustness without heavy architectural overhauls. 14
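One illustrative reading of entropy-guided aggregation: weight each encoder layer by how decisive (low-entropy) its feature distributions are. This toy sketch is our own simplification of the idea, not CoME-VL's formulation.

```python
# Toy sketch of entropy-guided layer weighting: lower-entropy (more
# decisive) feature distributions receive larger aggregation weights.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def entropy(dist):
    return -sum(p * math.log(p + 1e-12) for p in dist)

# Two encoder "layers", each a score vector over 4 feature channels for
# one token (toy numbers): layer A is decisive, layer B is diffuse.
layer_feats = {"A": [4.0, 0.1, 0.0, 0.2], "B": [1.0, 0.9, 1.1, 1.0]}
ents = {k: entropy(softmax(v)) for k, v in layer_feats.items()}

# Negate entropies so that low entropy maps to high weight.
names = list(ents)
weights = softmax([-ents[k] for k in names])
fused = [sum(w * f for w, f in zip(weights, col))
         for col in zip(*(layer_feats[k] for k in names))]

print({k: round(ents[k], 2) for k in names})
print([round(w, 2) for w in weights])
```

In a real VLM the same weighting would apply per token across many layers of two encoders, with orthogonality constraints keeping the fused features non-redundant; the sketch only shows the weighting mechanism itself.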
Decipher-MR: A 3D MRI Vision-Language Foundation Model
Decipher-MR trains a 3D MRI-specific vision-language foundation model on 200,000 MRI series from 22,000+ studies, combining self-supervised vision learning with report-guided text supervision. It supports a modular design where lightweight decoders are tuned on top of a frozen encoder to address disease classification, demographic prediction, anatomical localization, and cross-modal retrieval—consistently outperforming task-specific and general FM baselines on reported datasets. 15
While weights and proprietary data aren’t open due to privacy/IP constraints, the methods reference open frameworks (DINOv2, OpenCLIP, HF Trainer) and provide enough detail for reproduction attempts. The work reflects a broader shift toward domain-specific multimodal FMs where paired clinical text guides robust 3D representations for downstream tasks. 15
Why It Matters
Agentic AI is moving from demos to engineering: Nemotron 3 Super targets the compute and context realities of long-running agents with MoE, Mamba, and native 4-bit training; compact agents like Fara-7B make on-device computer-use practical; and new benchmarks (AgentHazard, Agentic-MME) raise the bar by checking process safety and stepwise tool use. The throughline is clear: orchestration, safety at the tool boundary, and long-context efficiency are now first-class design goals. 1 4 8 11
For practitioners, expect two practical shifts: platform lock-in pressures as agent stacks standardize around specific toolchains and hardware (CUDA/Rubin), and rising evaluation overhead to verify not just answers, but the journey—inputs, tool calls, and guardrails. Investing in harnesses, validators, and multi-model orchestration will often beat chasing the next raw model upgrade. 3 9