NVIDIA Releases Nemotron 3 Ultra: 550B-Parameter MoE Model at 300+ Tokens/Second

NVIDIA launched Nemotron 3 Ultra on June 4, 2026, at Computex 2026, marking the company's most ambitious open-weight model release to date. The 550-billion-parameter model achieves over 300 output tokens per second through a hybrid Mamba-Transformer architecture with mixture-of-experts sparsity, positioning NVIDIA as a major player in both AI infrastructure and foundation models.

Hybrid Architecture Enables 10x Parameter Sparsity

Nemotron 3 Ultra employs a mixture-of-experts design that activates only 55 billion of its 550 billion total parameters per token, so each token uses one-tenth of the weights (a 10x sparsity ratio). The architecture interleaves Mamba-2 state-space layers with selective attention layers, combining the efficiency of structured state-space models with the expressiveness of transformer attention. This hybrid approach enables the model to process a 1-million-token context window while maintaining throughput exceeding 300 tokens per second.
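
The arithmetic behind those figures can be sketched in a few lines of Python. The parameter counts and the 300 tokens-per-second figure come from the numbers above; the function names, the toy router, and its four-expert example are illustrative assumptions, not NVIDIA's implementation, and the real expert count and routing scheme are not stated here.

```python
# Figures quoted in the article.
TOTAL_PARAMS = 550_000_000_000   # total parameters
ACTIVE_PARAMS = 55_000_000_000   # parameters activated per token
TOKENS_PER_SECOND = 300          # reported lower bound on output throughput


def sparsity_ratio(total: int, active: int) -> float:
    """How many times larger the full model is than the per-token active slice."""
    return total / active


def active_fraction(total: int, active: int) -> float:
    """Share of the weights that participate in computing one token."""
    return active / total


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time at a constant output rate (ignores prompt processing)."""
    return num_tokens / tokens_per_second


def top_k_experts(router_scores: list[float], k: int) -> list[int]:
    """Toy mixture-of-experts router (hypothetical): pick the k best-scoring experts.

    Only the selected experts run for this token, which is what keeps the
    active parameter count far below the total.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    print(sparsity_ratio(TOTAL_PARAMS, ACTIVE_PARAMS))
    print(active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS))
    print(seconds_to_generate(3000, TOKENS_PER_SECOND))
    # Hypothetical router scores for four experts; two are selected.
    print(top_k_experts([0.10, 0.70, 0.05, 0.15], k=2))
```

At the quoted rate, a 3,000-token response would take about 10 seconds to decode, assuming throughput stays constant for the whole generation.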

Highest Performance Among US Open-Weight Models

Nemotron 3 Ultra scored 48 on the Artificial Analysis Intelligence Index, making it the highest-performing US-built open-weight model at launch. However, NVIDIA acknowledged that the model still trails Chinese frontier models in benchmark performance. The company made the model available through Hugging Face, OpenRouter, and NVIDIA NIM, enabling broad access for developers and enterprises.

Enterprise Adoption and Integration

Glean added support for Nemotron 3 Ultra shortly after launch, signaling enterprise interest in the model's capabilities. AWS made it available on SageMaker JumpStart for cloud deployment. Industry reports suggested NVIDIA could form a major alliance with Apple following the launch, potentially integrating Nemotron 3 Ultra into Siri for improved natural language understanding.

Key Takeaways

NVIDIA's Nemotron 3 Ultra features 550 billion total parameters, with 55 billion active per token through a mixture-of-experts architecture
The model achieves over 300 output tokens per second with a 1-million-token context window
Nemotron 3 Ultra scored 48 on the Artificial Analysis Intelligence Index, the highest of any US-built open-weight model
The hybrid Mamba-Transformer architecture interleaves Mamba-2 state-space layers with selective attention mechanisms
Enterprise platforms including Glean and AWS SageMaker JumpStart added support for the model shortly after launch
