News

Nvidia Nemotron 3 Super

Solving the agentic AI efficiency problem.

Jon Peddie

Nvidia just solved two of the biggest headaches holding back agentic AI—runaway token costs and agents losing track of what they were supposed to do. Nemotron 3 Super packs 120 billion parameters but only activates 12 billion at a time, keeping inference fast and affordable. A 1 million-token memory window keeps agents on task across complex, long-running workflows. The result: autonomous AI agents that can finally run reliably in real production environments.

Multi-agent AI systems generate up to 15× the tokens of standard LLM interactions—resending history, tool outputs, and reasoning steps at every turn. Two compounding problems emerge: context explosion, where agents lose alignment with original objectives over long tasks, and the thinking tax, where routing every sub-task through massive reasoning models makes agentic applications too expensive and slow for production deployment.
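The compounding effect is easy to see with a little arithmetic. The sketch below (all token counts are hypothetical, and this is an illustration of the general mechanism, not Nvidia's measurement methodology) models an agent loop that resends its full accumulated history on every turn; total tokens processed grow roughly quadratically with turn count, which is how a modest workflow balloons past a single-shot call.

```python
def tokens_processed(turns, prompt=500, per_turn=300):
    """Total tokens the model reads across `turns` agent steps when the
    full history is resent each turn (all numbers hypothetical)."""
    total = 0
    history = prompt
    for _ in range(turns):
        total += history      # model re-reads everything accumulated so far
        history += per_turn   # new reasoning/tool output is appended
    return total

single_shot = tokens_processed(1)   # one plain LLM call
agent_run = tokens_processed(12)    # a 12-turn agent loop
print(agent_run / single_shot)      # multiplier over the single call
```

Even at these small, made-up per-turn sizes, a 12-turn loop processes tens of multiples of a single call's tokens, which is why context handling, not raw model quality, becomes the production bottleneck.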

Nvidia’s Nemotron 3 Super addresses both directly. The 120 billion-total-parameter, 12 billion-active-parameter model uses a hybrid MoE architecture that activates 4× more experts at the same inference cost by compressing tokens before expert routing. A native 1 million-token context window eliminates goal drift by giving agents persistent long-term memory across extended task sequences. Multi-token prediction generates multiple tokens per forward pass, cutting generation time for long sequences and enabling built-in speculative decoding. A hybrid Mamba-Transformer backbone delivers 4× memory and compute efficiency gains, while native NVFP4 pretraining on Blackwell achieves 4× inference speedup on B200 versus FP8 on H100.
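The active-versus-total-parameter distinction is the core of the MoE idea, and a toy top-k router makes it concrete. This is a generic sketch of sparse expert routing, not Nemotron's actual gating design; the expert count, top-k value, and scalar "experts" are all stand-ins chosen so that 1-of-10 active mirrors the 12B-of-120B ratio in the article.

```python
import math
import random

NUM_EXPERTS = 10   # hypothetical expert count
TOP_K = 1          # experts run per token: 1 of 10 active, like 12B of 120B

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

# Each "expert" is a scalar function standing in for a full FFN block.
experts = [lambda x, w=w: w * x for w in range(1, NUM_EXPERTS + 1)]

def moe_forward(x, gate_scores):
    """Score all experts, but run only the top-k and mix their outputs
    by renormalized gate weight; compute cost tracks TOP_K, not NUM_EXPERTS."""
    probs = softmax(gate_scores)
    top = sorted(range(NUM_EXPERTS), key=probs.__getitem__, reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return sum(probs[i] / norm * experts[i](x) for i in top), top

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
y, active = moe_forward(2.0, scores)
print(f"active experts: {active}  (compute ~{TOP_K / NUM_EXPERTS:.0%} of total)")
```

The gate evaluates cheaply over all experts, but only the selected experts' weights touch the token, which is why a 120B-parameter model can price inference like a 12B one.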

Post-training used RL across 21 environment configurations via NeMo Gym, with more than 1.2 million rollouts targeting agent-specific workflows. On PinchBench—the benchmark evaluating LLM performance as the reasoning core of an OpenClaw agent—Nemotron 3 Super scores 85.6%, leading all open models. Target applications span software engineering, cybersecurity triage, life sciences research, and enterprise IT service management. The model ships with open weights, datasets, and recipes via build.nvidia.com, Hugging Face, and partners including Together AI.

What do we think?

Nemotron 3 Super directly attacks the two constraints that have kept agentic AI in proof-of-concept deployments—token economics and context coherence. The 12B-active-parameter design at 120B total capacity is the key architectural insight: It delivers frontier reasoning performance at a fraction of the inference cost. Combined with the 1M-token context window, this makes sustained, multi-step autonomous agent workflows economically viable at enterprise scale for the first time.

Nemotron 3 Super marks an inflection point in agentic AI deployment—not in model capability alone, but in the economics that determine whether agents run in production or remain in labs. That inflection arrives when token cost and context coherence no longer constrain multi-agent system design, and Nvidia’s MoE efficiency architecture and 1M-token window cross both thresholds simultaneously. For semiconductor vendors, the inference compute implication is direct: Production-grade agentic workloads sustain dramatically higher continuous inference loads than conversational AI—accelerating demand for Blackwell-class NPU and GPU silicon at the edge and in the data center.
