NVIDIA’s Nemotron 3 Super marries Mamba, latent-MoE, and MTP to crush agentic context and latency
A 1M-token hybrid Mamba-Transformer MoE with native 4-bit pretraining and multi-token prediction lands—plus a 754B MoE from Z.AI and a new diffusion decoding method that pushes parallel generation further.
One-Line Summary
NVIDIA ships an open hybrid MoE model built for long-horizon agents as new papers push faster decoding, embodied perception-action, and the infrastructure (memory, transport, UX) agents actually need.
LLM & SOTA Models
Nemotron 3 Super
NVIDIA introduces Nemotron 3 Super, a fully open-weight model with 120B total and 12B active parameters using a hybrid **Mamba–Transformer Mixture-of-Experts (MoE)** backbone to cut the "thinking tax" in multi-agent apps. It targets agentic workflows with a native **1M-token context window**, posting **85.6% on PinchBench** as the best open model in its class; throughput is claimed **>5×** vs the prior Super. 1
Under the hood: **latent MoE** compresses tokens before routing so the model can consult **4× more experts** at the same inference cost; **multi-token prediction (MTP)** forecasts several future tokens per pass for up to **3× wall-clock speedups** in structured generation (and acts like built-in speculative decoding). **Mamba-2** layers give linear-time sequence handling for the 1M window, while interleaved attention layers keep precise associative recall. 1
Training-wise, Super is pretrained on 25T tokens using **NVFP4** (NVIDIA's 4-bit float) from the first gradient step, which NVIDIA says yields **4× faster inference** on B200 vs FP8 on H100 while maintaining accuracy. Post-training uses **~7M SFT samples**, then multi-environment **reinforcement learning** across 21 configs with **>1.2M environment rollouts** via NeMo Gym/RL to align behavior to multi-step agent tasks. 1
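To make the 4-bit format concrete: FP4 in the E2M1 layout can represent only eight magnitudes, so usable precision depends on a shared scale per block of values. A hedged sketch of block-scaled 4-bit quantization in that spirit (the grid follows public FP4 descriptions; the block handling here is an illustration, not NVIDIA's kernel):

```python
# Block-scaled 4-bit quantization sketch: one scale per block maps the
# block's max magnitude onto the FP4 (E2M1) grid, then values snap to it.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(values):
    # shared scale so the block's largest value lands on the grid's max (6.0)
    scale = max(abs(v) for v in values) / max(E2M1_GRID) or 1.0
    def nearest(x):
        m = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
        return m if x >= 0 else -m
    return scale, [nearest(v / scale) for v in values]

def dequantize(scale, codes):
    return [scale * c for c in codes]

s, codes = quantize_block([0.1, -2.4, 0.75, 3.0])
print(codes)               # [0.0, -4.0, 1.5, 6.0] — snapped to the signed grid
print(dequantize(s, codes))  # [0.0, -2.0, 0.75, 3.0]
```

The small value (0.1) collapses to zero while the large ones survive, which is exactly why per-block scaling (and training in the format from step one, as NVIDIA does) matters for accuracy.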
NVIDIA pitches a "Super + Nano" pattern: use the smaller Nemotron 3 Nano for routine steps and switch up to Super for complex planning and long-context reasoning—particularly in software engineering and cybersecurity triage where multi-agent systems can emit up to 15× more tokens than standard chats. The release includes open weights, datasets, and recipes for customization. 1
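The "Super + Nano" pattern is ultimately a routing decision made per step. A minimal sketch of such a router, where the model names are real but the thresholds and planning heuristics are illustrative placeholders, not an official API:

```python
# Toy "Super + Nano" router: cheap model for routine steps, escalate to the
# large model for planning or long-context reasoning.

LONG_CONTEXT_TOKENS = 200_000          # assumed escalation threshold
PLANNING_HINTS = ("plan", "architect", "root-cause", "triage")

def pick_model(step: str, context_tokens: int) -> str:
    if context_tokens > LONG_CONTEXT_TOKENS:
        return "nemotron-3-super"      # 1M-token window handles long horizons
    if any(hint in step.lower() for hint in PLANNING_HINTS):
        return "nemotron-3-super"
    return "nemotron-3-nano"           # cheap default for routine steps

print(pick_model("rename a variable", 4_000))               # nemotron-3-nano
print(pick_model("plan the refactor across repos", 4_000))  # nemotron-3-super
```

In a real deployment the escalation signal would come from task metadata or a learned classifier rather than keyword matching, but the cost structure — most tokens through the small model — is the point.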
GLM-5.1 (Z.AI)
Z.AI unveils GLM-5.1, an open-weight **754B-parameter MoE** with **200K context** and **128K max output tokens**, designed for long-horizon agentic engineering. It reports **SWE-Bench Pro 58.4** (SOTA), outscoring GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, plus broad gains on AIME 2026 (95.3), GPQA-Diamond (86.2), and a suite of agent/tool-use benchmarks. 2
Architecturally it pairs MoE with **DSA (Dynamic Sparse Activation)** and introduces **asynchronous reinforcement learning** to decouple generation from training, claiming better long-horizon learning. The system sustains **8-hour autonomous execution** on single tasks, demonstrating 178-round iterations on a vector DB task and an automated CUDA kernel optimization improving from **2.6× to 35.7× speedup**. 2
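"Decoupling generation from training" typically means actors keep producing rollouts under a slightly stale policy while the learner updates, with a shared buffer in between. A toy version of that loop, illustrative only and not Z.AI's system:

```python
# Async-RL sketch: actors and learner share a rollout buffer, so generation
# never blocks on a weight update; rollouts carry their policy version so
# the learner can see (and correct for) staleness.
import collections

buffer = collections.deque(maxlen=8)    # rollout buffer shared by both sides
policy_version = 0

def actor_step(version):
    # actors tag rollouts with the policy version they were generated under
    buffer.append({"version": version, "reward": 1.0})

def learner_step():
    global policy_version
    staleness = [policy_version - r["version"] for r in buffer]
    policy_version += 1                  # update happens without blocking actors
    return staleness

actor_step(policy_version); actor_step(policy_version)
print(learner_step())   # [0, 0] — fresh rollouts on the first update
actor_step(policy_version)
print(learner_step())   # [1, 1, 0] — stale and fresh rollouts mixed
```

The staleness bookkeeping is what lets off-policy corrections (importance weighting, clipping) keep long-horizon learning stable.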
GLM-5.1 targets both local and API deployment, listing support in SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+), with MIT-licensed weights—a nod to teams considering self-hosted agent stacks. 2
Open Source & Repos
vLLM Roadmap and Issues (Q1 2026)
The vLLM tracker highlights active performance and correctness work that matters in production agents: FP8 checkpoint quirks (Gemma 4 31B) causing repetitive outputs, prefix-caching non-determinism at greedy T=0, DP replica underutilization on Qwen3-8B with mixed tensor/data parallelism, and an RFC for a **pre-Hopper-usable 4-bit KV cache**—all details that directly impact reliability and throughput. 3
Operational issues like CUBLAS allocation failures and S3 streamer crashes reflect the reality that serving MoE and long-context models is as much systems engineering as modeling; getting p2p/GPU interconnect detection right and ironing out ROCm corner cases can unlock multi-GPU scaling or block it. The Q1 2026 roadmap threads these into a coherent push for stability at scale. 3
For teams adopting Nemotron/GLM-class models, the message is clear: watch serving stacks as closely as training code—issues like KV precision, prefix caching, and tool-call tokenization determine whether you see the promised 2–3× speedups from techniques like speculation or get sunk by edge cases. 3
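One cheap safeguard against the prefix-caching class of bugs: at temperature 0, repeated identical requests should produce byte-identical output, so a determinism regression check catches cache-dependent drift before users do. A sketch where `generate` is any callable wrapping your serving stack at T=0 (the stub below stands in for a real client call):

```python
# Greedy-determinism regression check: N identical T=0 requests must yield
# exactly one distinct output string.

def check_greedy_determinism(generate, prompt, runs=5):
    """`generate` is any callable wrapping your serving stack at T=0."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1            # True iff every run matched exactly

# Deterministic stub stands in for a real client call here.
print(check_greedy_determinism(lambda p: p.upper(), "hello agents"))  # True
```

Running the same check with and without prefix caching enabled isolates whether the cache, rather than the model, is the source of divergence.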
DMax Code Release
The authors of DMax release code showing how to push diffusion language models (dLLMs) into more aggressive parallel decoding while preserving quality. Building on a reformulation from mask-to-token steps into self-refinement in embedding space, DMax unifies masked and uniform dLLMs via **On-Policy Uniform Training**, enabling recovery from both masked inputs and the model's own errors. 4
The accompanying implementation demonstrates **Soft Parallel Decoding**, where intermediate decoding states interpolate between predicted-token and mask embeddings, allowing iterative self-revision. On **two H200 GPUs**, they report an average **1,338 tokens/sec (TPS)** at batch size 1, with **TPF (tokens per forward)** on GSM8K rising from **2.04 to 5.47** vs LLaDA-2.0-mini while keeping accuracy. 4
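The interpolation at the heart of Soft Parallel Decoding can be sketched in a few lines: instead of a position being either a hard token or a mask, its state is a confidence-weighted blend of the two embeddings. Pure-Python vectors keep the illustration dependency-free; this is a sketch of the idea, not the DMax implementation:

```python
# Soft decoding state: interpolate between the predicted token's embedding
# and the mask embedding, weighted by model confidence at that position.

def soft_state(token_emb, mask_emb, confidence):
    # confidence 1.0 → committed token; 0.0 → still fully masked
    return [confidence * t + (1 - confidence) * m
            for t, m in zip(token_emb, mask_emb)]

mask = [0.0, 0.0]
tok  = [1.0, -2.0]
print(soft_state(tok, mask, 0.75))   # [0.75, -1.5] — mostly committed
print(soft_state(tok, mask, 0.0))    # [0.0, 0.0]  — unchanged mask state
```

Because low-confidence positions stay close to the mask embedding, the next forward pass can revise them — which is what makes aggressive parallelism survivable.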
In context with mainstream speculative decoding (now standard in vLLM and TensorRT-LLM), DMax offers a path to parallelism without a separate draft model—complementing production wins like **3.6× throughput** on H200 and typical **2–3× speedups** with draft-based systems. 5
HY-Embodied-0.5 (Tencent Hunyuan)
Tencent’s HY-Embodied-0.5 family targets real-world embodied agents with a **Mixture-of-Transformers (MoT)** backbone and latent tokens for fine-grained visual perception. Two variants ship: an edge-friendly **2B activated-parameter** model and a **32B** model for heavier reasoning—both open-sourced with code and models. 6
The suite pairs perception with iterative, self-evolving post-training and on-policy distillation to transfer capability from the 32B to the 2B variant. Evaluated on **22 benchmarks** across perception, spatial reasoning, and embodied understanding, **MoT-2B** leads similarly sized peers on **16 benchmarks**; the **32B** model reports performance comparable to Gemini 3.0 Pro, with downstream **Vision-Language-Action (VLA)** control validated on real robots. 6
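On-policy distillation differs from classic distillation in that the student learns on its *own* samples, with the teacher scoring them — a reverse-KL-style objective. A tiny hand-rolled sketch of that loss on toy distributions (not Tencent's pipeline):

```python
# Reverse-KL distillation sketch: KL(student || teacher), evaluated on the
# student's own support — the objective shape behind on-policy distillation.
import math

def reverse_kl(student_probs, teacher_probs):
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(round(reverse_kl(student, teacher), 4))  # 0.0268
```

Because the expectation is taken under the student, the 2B model is penalized exactly where its own behavior diverges from the 32B teacher — which is why this transfers capability "without catastrophic loss" better than matching the teacher on teacher-generated data.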
The broader embodied-AI space is also probing physical assistance via **generative EMS** (electrical muscle stimulation), using multimodal AI to produce context-aware muscle instructions constrained by joint limits—an indication that perception-to-action pipelines are maturing beyond labs to user studies and CHI’26 recognition. 7
Research Papers
DMax: Aggressive Parallel Decoding for dLLMs
DMax tackles the core problem of error accumulation in parallel decoding by reframing generation as progressive self-refinement in embedding space rather than discrete mask-to-token jumps. The training recipe, On-Policy Uniform Training, prepares the model to clean up both masked inputs and its own missteps, making aggressive parallelism feasible without collapse. 4
The decoding side’s **Soft Parallel Decoding** interpolates between token and mask embeddings during intermediate steps, enabling iterative, parallel self-revision. Results show strong throughput gains: on GSM8K, **TPF increases from 2.04 → 5.47**, and on MBPP from **2.71 → 5.86**, at comparable accuracy; **1,338 TPS** is reported on dual H200s at batch 1. 4
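The TPF numbers translate directly into forward-pass budgets: passes needed is just answer length divided by TPF, rounded up. A quick sanity calculation on an illustrative 500-token GSM8K-style answer, using the paper's reported values:

```python
# Forward-pass budget implied by a TPF (tokens per forward) figure.
import math

def forward_passes(answer_tokens, tpf):
    return math.ceil(answer_tokens / tpf)

print(forward_passes(500, 2.04))  # 246 passes at the baseline TPF
print(forward_passes(500, 5.47))  # 92 passes at DMax's reported TPF
```

A ~2.7× cut in forward passes at equal accuracy is where the wall-clock gain comes from, before any batching effects.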
Positioned against production speculative decoding—now baked into vLLM and TensorRT-LLM with frequent **2–3× latency cuts** and **>3× throughput** in favorable cases—DMax hints at a unification path where multi-token forecasts and error-aware refinement live inside one model, reducing draft-model complexity. 5
HY-Embodied-0.5: Embodied Foundation Models
The HY-Embodied-0.5 paper argues that embodied intelligence needs stronger spatial-temporal perception and reasoning than generic VLMs. Its MoT design allocates modality-specific computation and augments with **latent tokens** to densify perceptual representation—yielding a **2B edge model** that wins **16/22 benchmarks** vs peers and a **32B model** competitive with Gemini 3.0 Pro. 6
A key systems contribution is the training pipeline: an iterative self-evolving post-training process plus on-policy distillation to compress capabilities into the 2B variant without catastrophic loss. They further ground the models by training a **VLA controller** on top and validating on physical tasks, moving beyond static tests to end-to-end actuation. 6
Parallel work on generative EMS shows another track for embodied assistance: multimodal AI that yields muscle-stimulation sequences respecting human constraints, with user studies across 12 tasks and a CHI’26 Best Paper—evidence the field is converging on safe, context-aware assistance. 7
Externalization in LLM Agents: Memory, Skills, Protocols, Harness
This systems-level review argues that modern agents gain capability less by changing weights and more by reorganizing runtime via externalized memory, reusable skills, interaction protocols, and a unifying "harness" that coordinates them. It frames progress as a shift from weights → context → harness, and analyzes trade-offs between parametric vs externalized capability. 8
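The "weights → context → harness" shift can be made concrete with a minimal harness skeleton: memory and skills live outside the model, and the harness — not the weights — decides what gets injected into each call. All names here are illustrative; the review describes the pattern, not this API:

```python
# Minimal agent harness: externalized memory plus a retrieval policy that
# lives in the harness, so capability accrues outside the model weights.

class Harness:
    def __init__(self, llm):
        self.llm = llm
        self.memory = []                 # externalized episodic memory
        self.skills = {}                 # reusable named procedures

    def step(self, task):
        context = self.memory[-3:]       # retrieval policy lives in the harness
        result = self.llm(task, context)
        self.memory.append(result)       # each step enriches future context
        return result

h = Harness(lambda task, ctx: f"{task} (saw {len(ctx)} memories)")
print(h.step("triage bug"))      # triage bug (saw 0 memories)
print(h.step("write fix"))       # write fix (saw 1 memories)
```

Swapping the retrieval policy, compaction rule, or skill library changes agent capability with zero gradient updates — which is exactly the review's thesis about where progress now comes from.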
Complementary essays emphasize why memory is becoming the bottleneck as horizons grow: agents fail not on single-step plausibility but by losing the thread across hours—mis-compression, stale plans, or wrong retrieval. They highlight OpenAI’s 1M context + compaction, Anthropic’s harness design, and METR’s time-horizon framing: as horizons increase, memory machinery and recovery loops dominate reliability. 9
Meta’s reported "HyperAgents" work adds a striking datapoint: self-referential agents that evolve their own harnesses—persistent memory, performance tracking, multi-stage verification—suggesting that harness components are a convergent architecture agents rediscover when optimizing themselves. NVIDIA-focused overviews underscore the hardware-software stack (HBM bandwidth, Transformer Engine) as an enabler for large-context recall and fast retrieval. 10 11
KnowU-Bench: Personalized, Proactive Mobile Agents
KnowU-Bench tests whether agents can infer hidden user preferences and calibrate proactive assistance in a live Android GUI emulator. It includes 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks, hiding the user profile and exposing only behavioral logs so agents must elicit and infer preferences—not just read context. 12
Results show a sharp drop: models that excel at explicit instructions fall below 50% on vague or preference-dependent tasks, even for frontier systems like Claude Sonnet 4.6. The bottlenecks are less navigation and more preference acquisition and intervention calibration (e.g., when to act, ask consent, or stand down). 12
Related work on software testing agents finds that generating tests (SWT-Bench) is a powerful lens on code agents: fail-to-pass and coverage metrics improve diagnosis, and generated tests can double SWE-Agent precision—hinting that explicit, external artifacts (tests, policies) are key for trustworthy autonomy. Essays on "Agent Experience (AX)" and transport benchmarks show that stateful continuations (e.g., WebSocket mode) cut client-sent bytes by **~82–86%** and improve end-to-end time **15–29%**, reinforcing that infrastructure choices are now first-order for agent UX. 13 14 15
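The payload savings from stateful transport fall out of simple arithmetic: a stateless API resends the whole conversation each turn, while a session-based (WebSocket-style) connection sends only the new delta. A back-of-envelope illustration with made-up message sizes — note the toy trace lands right in the benchmark's ~82–86% range:

```python
# Stateless vs stateful client-sent bytes over a multi-turn agent session.

def stateless_bytes(turn_sizes):
    # each turn resends everything so far (prompt grows every turn)
    return sum(sum(turn_sizes[:i + 1]) for i in range(len(turn_sizes)))

def stateful_bytes(turn_sizes):
    return sum(turn_sizes)               # only deltas cross the wire

turns = [1000] * 10                       # ten 1 KB turns
saved = 1 - stateful_bytes(turns) / stateless_bytes(turns)
print(f"{saved:.0%} fewer client-sent bytes")   # 82% for this toy trace
```

The savings grow quadratically with conversation length, which is why transport choice becomes first-order precisely for long-horizon agents.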
Why It Matters
Agent capability is now a systems story: hybrid backbones (Mamba–Transformer), MoE routing, multi-token prediction, and native 4-bit training raise the ceiling; parallel decoding and speculation unlock the throughput; embodied models and proactive benchmarks push into real-world action; and harness, memory, transport, and AX decide whether long runs stay coherent. Together, these moves target the same pain points: context explosion, goal drift, and latency under tool-heavy loops. 1 5 12
For practitioners, the numbers are actionable: 1M-token contexts, **>5×** throughput jumps, **2–3×** decoding speedups, **82–86%** payload reductions from stateful transport, and SOTA scores on real engineering benchmarks. The edge is going to teams that co-design model, inference, and runtime infrastructure so agents don’t just think better—they sustain progress over hours. 1 2 15