NVIDIA's Nemotron 3 Super pushes open, faster long-context reasoning with a hybrid MoE design
A 120B-parameter hybrid Mamba-Transformer activates just 12B per token, serves 1M context, and claims up to 7.5x higher throughput than rivals — with weights and datasets on Hugging Face. It lands amid a broader MoE wave spanning text-to-image and open LLMs.
One-Line Summary
Open Mixture-of-Experts designs move from labs to practice — NVIDIA’s Nemotron 3 Super, Google’s Gemma 4, and new robotics and diffusion models show faster, longer-context reasoning with weights and tooling that teams can actually deploy.
LLM & SOTA Models
Nemotron 3 Super: Open, efficient long-context reasoning for agents
NVIDIA’s Nemotron 3 Super is a large AI model built to think through multi-step tasks faster while keeping costs in check — it does this by activating only a small portion of its 120B parameters per token and by supporting up to a 1M-token context window. In plain terms: it can hold very long instructions or histories in one go, and still run quickly. The paper reports a 120B total (12B active) hybrid Mamba-Attention Mixture-of-Experts with new LatentMoE layers, native speculative decoding (MTP), pretraining in NVFP4 on 25T tokens, and post-training via SFT and RL. The model reaches benchmark accuracy comparable to peers while claiming up to 2.2x higher inference throughput than GPT-OSS-120B and up to 7.5x higher than Qwen3.5-122B, with base, post-trained, and quantized checkpoints plus datasets released on Hugging Face. 1
Under the hood, the hybrid design combines Mamba-style sequence modeling for speed with Transformer attention for global coherence, while the LatentMoE routing selectively activates experts to improve “accuracy per FLOP” and “accuracy per parameter.” The team also adds MTP layers for native speculative decoding, which accelerates generation without external helpers. Taken together, these choices aim to make a very large model feel small at serving time, particularly on long-context agent workflows that accumulate many tool calls or steps. 1
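The core of the "accuracy per FLOP" argument is that a sparse MoE layer only runs the few experts its router selects, so most parameters sit idle on any given token. Below is a minimal toy sketch of top-k expert routing in plain Python — the dimensions, gating scheme, and random weights are illustrative assumptions, not Nemotron's actual LatentMoE design.

```python
import math
import random

random.seed(0)

N_EXPERTS, D, TOP_K = 8, 16, 2   # 8 experts, 16-dim state, 2 active per token

# Illustrative parameters (not real model weights): one router vector per
# expert, and one D x D weight matrix per expert.
router = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N_EXPERTS)]
experts = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
           for _ in range(N_EXPERTS)]

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def moe_layer(x):
    """Route one token through only the TOP_K highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(r, x)) for r in router]
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    z = [math.exp(logits[i]) for i in top]
    gates = [g / sum(z) for g in z]               # renormalized softmax gate
    # Only the selected experts' parameters are ever touched.
    outs = [matvec(experts[i], x) for i in top]
    return [sum(g * o[j] for g, o in zip(gates, outs)) for j in range(D)]

token = [random.gauss(0, 1) for _ in range(D)]
out = moe_layer(token)
print(len(out), f"{TOP_K / N_EXPERTS:.0%} of expert weights active")
```

With 2 of 8 experts active, only a quarter of the expert weights participate per token — the same logic by which 120B total parameters can cost roughly 12B active at serving time.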
Why it matters: open weights plus quantized checkpoints make it easier for enterprises to pilot long-context agents without vendor lock-in. Early commentary also highlights practical runs on modest hardware using pruning and quantization, though such community claims vary by setup. The directional signal is clear: more organizations can test frontier-scale reasoning locally, with Nemotron 3 Super positioned for agentic tasks that benefit from million-token memory. 2
As context, MoE efficiency is spreading beyond text. Nucleus-Image, a 17B text-to-image diffusion model that activates only ~2B parameters per pass via sparse MoE, matches or beats Qwen-Image, GPT Image 1, Seedream 3.0, and Imagen 4 on GenEval (0.87), DPG-Bench (88.79), and OneIG-Bench (0.522) — with weights, training code, and the recipe publicly released. This underscores the broader pattern: MoE is no longer LLM-only and can lower serving costs while raising quality when engineered carefully. 3
Gemma 4: Google’s most capable open-weight models under Apache 2.0
Gemma 4 is Google DeepMind’s newest family of open models built for advanced reasoning and agent workflows — released in four sizes (Effective 2B, Effective 4B, 26B MoE, 31B Dense) with commercially permissive Apache 2.0 licensing. The 31B ranks #3 among open models on Arena AI’s text leaderboard, and the 26B MoE ranks #6, while smaller E2B/E4B target on-device use with multimodal input and 128K context; the larger models support up to 256K context. In short, Gemma 4 aims to deliver “intelligence-per-parameter”: strong quality on accessible hardware with native function calling and structured outputs for agents. 4
Beyond headline rankings, Google emphasizes agent-ready features: tool calling, JSON output, system instructions, vision and audio (E2B/E4B for audio), and long context for passing repositories or long documents in one prompt. Developers can get weights from Hugging Face, Kaggle, and Ollama, and run on popular stacks like vLLM, llama.cpp, MLX, and NVIDIA NIM, with deployment paths across Android, workstations, and Google Cloud. The Apache 2.0 license removes commercial restrictions and MAU caps, simplifying enterprise adoption compared to custom open-model licenses. 4
Analyses note large quality jumps over Gemma 3 on math and coding benchmarks and highlight the 26B MoE’s efficiency — activating about 3.8–4B parameters per token while competing with much larger models. Community and ecosystem posts also flag practical trade-offs like KV-cache memory at 256K context, which can raise VRAM needs on consumer GPUs. Still, the net effect is a strong, permissively licensed open baseline for agentic apps across devices. 5
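The KV-cache trade-off flagged above is easy to quantify with back-of-envelope arithmetic: cache memory grows linearly with context length. The sketch below uses hypothetical architecture numbers (layers, KV heads, head dimension, fp16), not Gemma 4's published config, to show why 256K contexts strain consumer VRAM.

```python
# Back-of-envelope KV-cache sizing for long context. The architecture
# numbers below are illustrative assumptions, not Gemma 4's actual config.
def kv_cache_gib(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                 dtype_bytes=2):
    # Per token per layer: one K and one V vector for each KV head.
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len
    return bytes_total / 2**30

for ctx in (8_192, 131_072, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):6.1f} GiB")
```

Under these assumptions the cache alone runs to tens of GiB at 256K tokens — before model weights — which is why grouped-query attention, cache quantization, or offloading matter for long-context serving on consumer GPUs.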
Gemini Robotics-ER 1.6: Spatial logic and “instrument reading” for real-world robots
Gemini Robotics-ER 1.6 is a reasoning-first model that helps robots understand space and complete tasks more reliably — think pointing to precise locations, planning steps, and knowing when a task is actually finished. This update improves spatial reasoning and multi-view success detection (merging overhead and wrist cameras), and introduces “instrument reading,” enabling robots to read gauges and sight glasses — a capability developed with Boston Dynamics for facility inspection scenarios. It’s available now via the Gemini API and Google AI Studio. 6
Practically, “pointing” is used as an intermediate step: the model identifies counts, grasp points, or motion paths via points before acting. Success detection is the autonomy engine — the model evaluates outcomes to decide whether to retry or proceed, especially in challenging conditions like occlusion or poor lighting. For industrial settings, accurate gauge reading combines zooming, code execution for measurement, and world knowledge to interpret units and thresholds into usable decisions. 7
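The "code execution for measurement" step can be pictured as simple geometry: once the model localizes a gauge needle, a small computation maps its angle onto the dial's value range. The sketch below is a hypothetical example of that conversion — the dial geometry and value range are made up, and this is not DeepMind's actual pipeline.

```python
# Hypothetical gauge-reading step: linearly interpolate a detected needle
# angle into the dial's value range. Sweep angles and range are made up.
def gauge_value(needle_deg, min_deg=-135.0, max_deg=135.0,
                min_val=0.0, max_val=10.0):
    """Map a needle angle (degrees) to a reading on a linear dial."""
    frac = (needle_deg - min_deg) / (max_deg - min_deg)
    return min_val + frac * (max_val - min_val)

print(gauge_value(0.0))     # needle straight up on a symmetric dial -> 5.0
print(gauge_value(67.5))    # three-quarters of the sweep -> 7.5
```

World knowledge then turns the raw number into a decision — for instance, comparing the reading against a unit-aware safety threshold before proceeding.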
DeepMind frames 1.6 as its safest robotics model yet, with better adherence to spatial safety constraints and improved hazard detection in text and video tests over baseline Gemini 3.0 Flash. Developers can start with provided Colab examples and integrate the ER model as the high-level planner that calls tools like vision-language-action modules or search. 6
Open Source & Repos
NVIDIA Ising: Open AI models for quantum calibration and error correction
NVIDIA Ising is a family of open models aimed at two blockers in quantum computing: calibrating quantum processors, and decoding errors fast enough to be useful. In everyday terms: it’s AI that helps tune fragile quantum machines more quickly and catch their inevitable mistakes. NVIDIA describes Ising Calibration (a vision-language model) that can interpret processor measurements and automate calibration steps, and Ising Decoding (3D CNN variants) for real-time error correction — reporting up to 2.5x faster and 3x more accurate performance than pyMatching on stated tasks. 8
The company says leading labs and companies are already adopting the models, and it’s shipping a cookbook of workflows, data, and NIM microservices so teams can adapt to their specific hardware while retaining control of data and infra. Ising complements CUDA-Q for hybrid quantum-classical setups and NVQLink for QPU–GPU interconnects aimed at real-time control. For a field still wrestling with qubit instability, the promise is shaving calibration from days to hours and making error correction more practical. 9
Markets noticed: coverage reports a rally in Asian IT and quantum stocks following the debut, reflecting expectations that better tooling can accelerate timelines to useful applications. For practitioners, the near-term value is clearer iteration cycles and shared baselines that reduce duplicated effort across hardware stacks. 10
Research Papers
ClawGUI: A full-stack framework to train, test, and ship GUI agents
ClawGUI is an open-source framework that tackles a simple reality: many AI agents can’t reliably click through real apps. It unifies three missing pieces — online reinforcement learning training that works on both virtual environments and real devices, standardized evaluation across 6 benchmarks, and deployment to Android/HarmonyOS/iOS via 12+ chat platforms with hybrid CLI–GUI control. In this pipeline, the 2B-parameter ClawGUI-2B hits a 17.1% success rate on MobileWorld GUI-Only, beating an equal-size baseline by 6.0%. 11
Technically, ClawGUI-RL integrates GiGPO with a Process Reward Model for dense step-level feedback, while ClawGUI-Eval enforces a reproducible evaluation protocol (95.8% reproduction vs official baselines). ClawGUI-Agent focuses on shipping: persistent personalized memory and cross-OS support put trained agents in users’ hands — a step many research systems skip. The authors argue that infrastructure, not raw model size, is the main bottleneck for GUI agents. 11
Analyst coverage frames it as “missing infrastructure” now bundled in one harness, reducing the glue work teams typically rebuild. The bet is that consistent RL training loops, apples-to-apples evaluation, and real-device deployment will convert today’s lab demos into reliable workflows across everyday apps. 12
Community Pulse
Hacker News (207↑) — Gemini Robotics-ER 1.6 draws praise for ambition, with concerns about power needs and privacy when adding cameras to public spaces.
"It's a fascinating proposition and no doubt they'll get bigger models in there, and likely be able to cluster multiple models for mega MOE. One thing that would really be great is if they could take the power requirements down -- the chip requires 2.5KW, which is modest in terms of what the big boys use but would be an issue on a battery powered robot." — Hacker News
"I wonder how the municipal employees would react to cameras suddenly appearing on the meters around my house." — Hacker News
Hacker News (77↑) — ClawGUI lands amid uncertainty over agent frameworks; builders weigh Autogen, LangChain, LangGraph, or rolling their own.
"Finding it quite difficult to decide which platform to bet on. Autogen langchain and langgraph seem to be main contenders. And then people seem to custom roll them too" — Hacker News
Why It Matters
Open MoE and reasoning-first designs are converging on the same promise: higher quality at lower serving cost, with long context so agents and robots can keep more state in mind. Nemotron 3 Super and Gemma 4 show that open-weight options now reach into workflows — on-device, on-prem, and cloud — without legal friction, while NVIDIA Ising and Gemini Robotics-ER 1.6 target the physical frontiers of quantum labs and factory floors. 1 4 9 6
Try This Week
- Hugging Face demo: Generate images with Nucleus-Image MoE and enable the one-line Text KV cache in diffusers for faster sampling. https://huggingface.co/NucleusAI/NucleusMoE-Image
- Google AI Studio: Test Gemini Robotics-ER 1.6’s pointing and success detection on your own images or videos. https://aistudio.google.com/prompts/new_chat?model=gemini-robotics-er-1.6-preview