NVIDIA’s Nemotron 3 Super fuses Mamba, MoE, and NVFP4 to push long‑context agentic LLMs
A 120B open-weight hybrid that runs like 12B, a single-stream AV generator that beats open baselines, a 560B MoE prover with agentic RL, and a new 4D world-model benchmark—today’s drops reset the bar for efficiency and evaluation.
One-Line Summary
NVIDIA opens a 120B-parameter hybrid Mamba–Transformer MoE that thinks with 12B active parameters over a 1M-token window, while new papers push faster audio–video generation, stronger formal math provers, and interaction-centric world-model evaluation.
LLM & SOTA Models
NVIDIA Nemotron 3 Super
Think of long-running AI agents that must remember entire codebases or weeks of chat—Nemotron 3 Super is built for that. It’s a 120B total-parameter model that activates only 12B parameters per token via a Mixture of Experts (MoE), paired with a 1M-token context window so multi-agent systems don’t lose the plot as histories balloon 15× versus normal chats. Compared to the prior Nemotron Super, NVIDIA reports over 5× higher throughput and an 85.6% score on PinchBench, a benchmark for models acting as an OpenClaw agent’s “brain.” 1
Under the hood are three levers. First, a hybrid backbone interleaves Mamba-2 state space layers (linear-time sequence processing) with Transformer attention (precise associative recall), making the 1M-token window practical and accurate. Second, a latent MoE compresses token embeddings before routing, letting the model consult 4× more expert specialists at the cost of one—raising specialization without spiking latency. Third, multi-token prediction (MTP) forecasts several future tokens per forward pass, both strengthening long-range reasoning during training and enabling built-in speculative decoding for up to 3× generation speedups in code and tool calls. 1
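The latent-MoE idea (compress the token representation, route and mix experts in the smaller latent space, then project back) can be sketched in a few lines of NumPy. Everything below, including the dimensions, router, and expert shapes, is illustrative rather than Nemotron's actual implementation:

```python
import numpy as np

def latent_moe_forward(x, W_down, W_up, router_W, experts, k=4):
    """Illustrative latent-MoE step: compress, route, mix top-k experts.

    x        : (d_model,) token representation
    W_down   : (d_model, d_latent) compression applied before routing
    W_up     : (d_latent, d_model) projection back to model width
    router_W : (d_latent, n_experts) router producing expert logits
    experts  : list of (d_latent, d_latent) expert weight matrices
    """
    z = x @ W_down                            # compress to latent space
    logits = z @ router_W                     # routing happens on z, not x
    topk = np.argsort(logits)[-k:]            # k highest-scoring experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                      # softmax over selected experts
    mixed = sum(g * (z @ experts[i]) for g, i in zip(gates, topk))
    return mixed @ W_up

rng = np.random.default_rng(0)
d_model, d_latent, n_experts = 64, 16, 8
x = rng.standard_normal(d_model)
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
router_W = rng.standard_normal((d_latent, n_experts))
experts = [rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent)
           for _ in range(n_experts)]
y = latent_moe_forward(x, W_down, W_up, router_W, experts, k=4)
print(y.shape)  # (64,)
```

Because the router and experts operate at d_latent rather than d_model, their per-token cost shrinks with the compression ratio, which is roughly what lets a model consult several times more experts for the price of one.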
On the training side, Super is NVFP4-native (NVIDIA’s 4‑bit floating-point format) on Blackwell: most multiply–accumulate ops run in 4‑bit throughout pretraining on 25 trillion tokens (10T unique curated), followed by supervised fine-tuning on ~7M samples drawn from a 40M post-training corpus, then reinforcement learning across 21 environments with 1.2M+ rollouts in NeMo Gym. NVFP4 trims memory and lifts inference speed up to 4× on B200 versus FP8 on H100, and external reports show 478 tokens/s on B200—roughly 7.5× Qwen3.5‑122B—placing Super near the top of the Artificial Analysis index for sub‑250B open‑weight models. 1 2 3
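A toy version of block-scaled 4-bit float quantization shows the core mechanism. Real NVFP4 stores E2M1 values with a shared FP8 scale per 16-element block; this sketch keeps the scale in full precision and only illustrates the round-to-nearest snap onto the FP4 grid:

```python
import numpy as np

# Magnitudes representable in E2M1, the 4-bit float grid NVFP4 builds on
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x, block=16):
    """Toy block-scaled FP4 quantize/dequantize round trip."""
    x = x.reshape(-1, block)
    # one scale per block so the block max lands on the top grid value
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = x / scales
    # round each value to the nearest representable signed FP4 magnitude
    cand = np.sign(scaled)[..., None] * FP4_GRID        # (blocks, block, 8)
    idx = np.abs(scaled[..., None] - cand).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scales).ravel()                         # dequantized copy

rng = np.random.default_rng(1)
w = rng.standard_normal(64).astype(np.float32)
w_hat = quantize_fp4_blockwise(w)
print(w_hat.shape)  # (64,)
```

The hard part of 4-bit training is keeping scales and accumulations healthy; the per-block scale factors are what make such a narrow value grid usable across a 25T-token run.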
Deployment details matter. Community runbooks show Super’s 120.6B/12.7B-active config with 512 experts (22 active per token), NoPE (no positional embeddings), and safe single‑GPU ceilings: Q4 GGUF fits on a single H100‑80GB (~64–72GB VRAM), 8‑bit on 2× H100, BF16 on 8× H100, with a recommended max context of 262K on a single H100 to avoid OOM. Practical guidance covers llama.cpp, vLLM, and TensorRT‑LLM paths; the takeaway is a 120B‑class reasoning model running in under 30 minutes on a single H100 at 4‑bit. 3 4
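The quoted VRAM ceilings are easy to sanity-check with a weight-only back-of-envelope estimate. The overhead factor and bits-per-weight values below are assumptions; KV cache and activations come on top, which is why long contexts can still OOM even when the weights fit:

```python
def weight_vram_gb(total_params_b, bits_per_param, overhead=1.10):
    """Weight-only VRAM estimate in GiB, with ~10% assumed overhead for
    higher-precision embeddings, buffers, and runtime allocations."""
    bytes_total = total_params_b * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1024**3

# ~4.5 bits/weight is typical for Q4 GGUF once block scales are counted
for name, bits in [("Q4 GGUF", 4.5), ("8-bit", 8.5), ("BF16", 16)]:
    print(f"{name}: ~{weight_vram_gb(120.6, bits):.0f} GiB")
```

The Q4 figure lands inside the quoted 64–72GB single-H100 band, 8-bit overflows one 80GB card (hence 2× H100), and BF16 needs most of an 8× H100 node; the remaining headroom goes to KV cache, which is why the single-GPU context ceiling sits at 262K rather than 1M.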
Nemotron 3 Content Safety and VoiceChat
Agentic systems need guardrails and natural voice. Nemotron 3 Content Safety is a compact 4B-parameter multimodal classifier (Gemma‑3‑4B backbone + adapter head) that flags unsafe content across text and images in 12 languages, reaching about 84% accuracy on multimodal, multilingual harmful-content benchmarks while keeping latency low for in‑line moderation. It supports a 23‑class taxonomy (hate, harassment, violence, sexual content, etc.), with a toggle between fast binary decisions and granular categories. 2
For real-time talk, Nemotron 3 VoiceChat is a 12B end‑to‑end, full‑duplex speech model (ASR+LLM+TTS in one) targeting sub‑300ms end‑to‑end latency, streaming 80ms audio chunks faster than real time. Early-access results place it in the “most attractive” quadrant on Artificial Analysis’s speech-to-speech leaderboard, signaling both responsive turn-taking and solid speech reasoning for assistants that must sound natural and stay on task. 2
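As a rough illustration of why 80ms chunks make a sub-300ms target plausible: the dominant terms are chunk buffering, compute at some real-time factor, and fixed overhead. All numbers below except the 80ms chunk size from the article are hypothetical:

```python
def turn_latency_ms(chunk_ms, rtf, overhead_ms):
    """Per-response latency for chunked streaming speech: buffer one
    chunk, process it at `rtf` x real time, plus fixed overhead."""
    return chunk_ms + rtf * chunk_ms + overhead_ms

# 80 ms chunks (from the article); rtf and overhead are made-up values
lat = turn_latency_ms(chunk_ms=80, rtf=0.5, overhead_ms=40)
print(lat, lat < 300)  # 160.0 True
```

The point is that "faster than real time" (rtf < 1) on small chunks leaves most of the 300ms budget for network and turn-taking logic.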
Nemotron 3 Nano Omni, Embeddings, and Reranking
To ground retrieval beyond plain text, NVIDIA previews Llama Nemotron Embed VL (1.7B dense) and Rerank VL (1.7B cross-encoder). On the ViDoRe V3/MTEB Pareto curve (accuracy vs tokens/s on one H100), Embed VL sits on the frontier—competitive accuracy at high throughput—supporting Matryoshka embeddings and millisecond-latency search for visual documents in standard vector DBs. Nano Omni (coming soon) targets production-grade “omni-understanding” across video, audio, documents, and GUIs via Conv3D and efficient video sampling. 2
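Matryoshka embeddings can be shortened at query time by truncating and re-normalizing, trading a little accuracy for a much smaller index; a minimal sketch (the 1024-dim vectors here are stand-ins, not the actual Embed VL output size):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` dimensions and re-normalize: the standard
    way Matryoshka-trained embeddings are shortened at query time."""
    v = emb[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
full = rng.standard_normal((3, 1024))
full /= np.linalg.norm(full, axis=-1, keepdims=True)
short = truncate_matryoshka(full, 256)   # 4x smaller vectors to index
print(short.shape)  # (3, 256)
```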
Open Source & Repos
Open weights + NeMo tooling + community runbooks
Nemotron 3 Super lands fully open—weights, datasets, and training “recipes”—with NVIDIA NeMo Evaluator for reproducible, agent-aware benchmarking and the NeMo Agent Toolkit for profiling multi-agent latency, token costs, and orchestration overhead (LangChain, AutoGen, AWS Strands plug right in). This matters because multi-agent traces can be 15× longer than chats—developers need visibility and knobs like a configurable “thinking budget” to cap chain-of-thought costs. 1 2
Third-party guides distill hard requirements: on Blackwell, NVFP4 delivers up to 4× speed vs FP8 on Hopper; on Hopper/H200 you fall back to FP8 or quantized GGUF. Practical serving recipes span llama.cpp (single H100 Q4), vLLM (tensor parallel for 8-bit), and TensorRT‑LLM (highest QPS) with notes like “keep ctx ≤262K on single H100.” These details convert an impressive paper model into a dependable production endpoint. 3
External dashboards emphasize throughput: 478 tok/s on B200 and roughly 7.5× higher than Qwen3.5‑122B, plus 2.2× over GPT‑OSS‑120B in some reports—signaling that hybrid Mamba+MoE isn’t just elegant, it’s economically relevant when billed per token. Expect downward pressure on per‑token API pricing as these efficiency deltas harden. 3 4
Research Papers
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
Most audio–video generators juggle multiple streams and cross-attention. This paper shows you can simplify: daVinci‑MagiHuman uses a single-stream Transformer to process text, video, and audio in one token sequence via self-attention only. The result is synchronized, human-centric generation—expressive faces, body motion, and tight lip‑sync—without architectural sprawl. It supports multilingual speech (Mandarin/Cantonese, English, Japanese, Korean, German, French). 5 6
Efficiency tricks make it practical: distillation cuts denoising to 8 steps, latent-space super-resolution upsamples cleanly, and a Turbo VAE speeds decoding. On a single H100, it generates a 5‑second 256p clip in about 2 seconds (reportedly as low as 1.6 s), and 1080p in 38.4 s. Automated evals report best visual quality and text alignment among open models and a 14.60% word error rate for speech intelligibility—the lowest in its set—plus 80.0%/60.9% human preference wins over Ovi 1.1 and LTX 2.3 across 2,000 comparisons. 5 7
The trade-off of a single stream is less modality-specific specialization than multi-stream designs, and current tests focus on short (5 s) clips. Still, the message is clear: a simpler backbone can deliver high human-centric quality at speed, which lowers serving complexity for real products. 8
LongCat-Flash-Prover: Native Formal Reasoning via Agentic Tool-Integrated RL
Formal theorem proving demands exact syntax and verified logic, not just plausible math text. LongCat‑Flash‑Prover pairs a massive 560B-parameter MoE model with agentic tool‑integrated reasoning in Lean4, decomposing the task into auto‑formalization, sketching, and proving. A Hybrid‑Experts Iteration Framework grows high‑quality trajectories, and a new Hierarchical Importance Sampling Policy Optimization (HisPO) with gradient masking stabilizes long‑horizon MoE RL. 9 10
Results set a new open-weight SOTA: 97.1% pass on MiniF2F‑Test with an inference budget of only 72 attempts per problem, 70.8% on ProverBench, and 41.5% on PutnamBench with ≤220 attempts—significantly above prior open baselines. The system also employs theorem consistency and legality detectors to prevent “reward hacking,” i.e., passing broken proofs. The main caveat is scale: 560B MoE models and tool-integrated RL are compute-hungry. 9 11
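Budgeted results like "pass within ≤220 attempts" are commonly summarized with the standard unbiased pass@k estimator (n samples per problem, c verified correct), though the paper's exact protocol may differ; a minimal version with illustrative numbers:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k attempts drawn
    from n samples (c of them kernel-verified correct) succeeds."""
    if n - c < k:
        return 1.0          # too few failures to fill k all-wrong draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 256 sampled proofs, 8 pass the Lean4 checker
print(round(pass_at_k(256, 8, 220), 6))
```

Because the Lean4 kernel verifies each candidate, "correct" here is machine-checked rather than judged, which is what makes the metric trustworthy at these attempt counts.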
Why it matters: native formal reasoning is inching from demos to utility. Open models that can auto‑formalize, plan lemma chains, and produce kernel‑verified proofs broaden access to verified math and software, and provide a testbed for safer reasoning policies. 12
Omni-WorldBench: Evaluating Interactive Response in 4D World Models
Video world models are shifting from pretty visuals to interactive, 4D generation—where actions cause credible state changes over time. Omni‑WorldBench fills a gap by measuring “interaction effect fidelity” alongside video quality and camera/object controllability. It bundles Omni‑WorldSuite (1,000+ prompts across domains and interaction levels) and Omni‑Metrics (agent-based measures like InterStab‑L/N for stability, InterCov for object‑level causal faithfulness, and InterOrder for temporal event ordering). 13 14
Across 18 models (text‑to‑video, image‑to‑video, camera‑controlled), the study shows strong visual smoothness doesn’t guarantee causal consistency—models often falter when interactions or camera schedules get complex. Image‑to‑video models tend to fare better interactively thanks to richer conditioning, but trade-offs persist between controllability and scene coherence. Wan2.2 tops the overall AgenticScore at 75.92% in reported summaries. 13 15
The broader point: if we want planners and agents that learn from videos, we must evaluate whether actions actually drive plausible outcomes—not just whether frames look good. Expect Omni‑WorldBench to become a reference for 4D, action‑conditioned generation. 16
Why It Matters
Agentic AI is about systems that plan, act, and adapt over long horizons. Today’s drops line up the stack: a long‑context, high‑throughput reasoning core (Nemotron 3 Super), safety and voice layers for real-time interaction, and evaluation frameworks and specialized models (AV generation, formal math, world modeling) that close gaps between pretty demos and reliable tools. The common thread is efficiency: MoE routing, Mamba sequences, NVFP4 precision, single‑stream AV backbones—all trading brute force for smarter computation. 1 2
If these pieces hold up in production—1M‑token memory without drift, sub‑300ms full‑duplex voice, safety that keeps up with multimodal prompts—we get agents that are cheaper, faster, and more aligned. That reshapes not just benchmarks but who can afford to deploy serious AI beyond a chatbox. 3