Nvidia released Cosmos 3 on May 31, 2026, a physical AI foundation model that unifies capabilities previously split across separate models. It combines world understanding and action generation in a single architecture, and it accepts and produces text, image, video, audio, and action data for robotics, autonomous systems, and simulation applications.
Cosmos 3 Features Mixture-of-Transformers Architecture With Dual Towers
Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture comprising two integrated towers: a Reasoner Tower, which is an autoregressive vision-language model, and a Generator Tower, which is a diffusion-based model that produces physics-aware video and action sequences. This unified approach eliminates the need for separate reasoning and generation models.
The model comes in two configurations: Cosmos 3 Nano, with 16 billion parameters, optimized for real-time inference on workstation GPUs such as the RTX PRO 6000, and Cosmos 3 Super, with 64 billion parameters, designed for datacenter deployment on Hopper and Blackwell GPUs.
Nvidia Provides Comprehensive Open-Source Resources and Synthetic Datasets
Nvidia released model checkpoints on Hugging Face, open-source training and post-training scripts on GitHub, and six synthetic datasets covering robotics, physics, spatial reasoning, human motion, autonomous driving, and warehouse operations. The company also provides NVIDIA NIM microservices for production deployment.
Performance optimizations include support for BF16, FP8, and NVFP4 precision, with NVFP4 achieving up to a 2x speedup. The system integrates with vLLM for continuous batching and tensor parallelism, and it features Efficient Video Sampling to reduce the number of video tokens. Post-training workflows support supervised fine-tuning and action post-training for dynamics and policy generation, along with further customization options.
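To make the precision options concrete, the following is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at each precision, and how tensor parallelism would divide that across GPUs. It is not taken from Nvidia's materials. The function name `weight_memory_gb` is hypothetical, the 4.5 bits per weight for NVFP4 assumes 4-bit values plus one 8-bit scale per 16-value block, and the estimate assumes every weight is stored at the chosen precision while ignoring activations, the KV cache, and runtime overhead.

```python
# Approximate bits stored per weight for each precision.
# ASSUMPTION: NVFP4 is modeled as 4-bit values plus one 8-bit scale factor
# per block of 16 values (4 + 8/16 = 4.5 bits); per-tensor scales are ignored.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "NVFP4": 4.5}

# Parameter counts as reported in the article.
MODELS = {"Cosmos 3 Nano": 16e9, "Cosmos 3 Super": 64e9}


def weight_memory_gb(num_params, precision, tensor_parallel=1):
    """Estimate weight memory per GPU in GB (10^9 bytes), weights only.

    Hypothetical helper: assumes all weights use `precision` and that
    tensor parallelism splits them evenly across `tensor_parallel` GPUs.
    """
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be at least 1")
    total_bytes = num_params * BITS_PER_WEIGHT[precision] / 8
    return total_bytes / tensor_parallel / 1e9


# Print a small table of estimates for both model sizes on a single GPU.
for name, params in MODELS.items():
    for precision in BITS_PER_WEIGHT:
        gb = weight_memory_gb(params, precision)
        print(f"{name:15s} {precision:6s} {gb:6.1f} GB")
```

Under these assumptions, the Nano weights come to about 32 GB in BF16, 16 GB in FP8, and 9 GB in NVFP4, while the Super weights come to about 128 GB, 64 GB, and 36 GB. Splitting the Super model's BF16 weights across four GPUs with tensor parallelism would leave roughly 32 GB per GPU.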
Cosmos 3 Leads Multiple Benchmarks Including VANTAGE-Bench and R-Bench
Cosmos 3 demonstrates state-of-the-art performance across multiple evaluation frameworks:
- Tops VANTAGE-Bench for reasoning on real-world fixed-camera footage
- Leads Traffic Anomaly Reasoning (TAR) in the AI City Challenge 2026
- Ranks first on the Artificial Analysis leaderboard for open-source text-to-image and image-to-video models
- Achieves top scores on R-Bench, PAI-Bench, Physics-IQ, and RoboLab benchmarks
Nvidia also introduced the Cosmos Human Evaluation (HUE) framework for evaluating video generation through atomic binary verification across semantic alignment, physical laws, geometric reasoning, and visual integrity, addressing limitations of automated leaderboards.
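As an illustration of what atomic binary verification could look like in practice, the sketch below aggregates yes/no judgments into a pass rate for each of the four dimensions named above. This is not the HUE implementation. The function `score_video`, the dimension keys, the sample judgments, and the choice of a simple per-dimension pass rate are all assumptions made for the example.

```python
from collections import defaultdict

# The four evaluation dimensions named in the article, written as
# hypothetical identifier-style keys.
DIMENSIONS = (
    "semantic_alignment",
    "physical_laws",
    "geometric_reasoning",
    "visual_integrity",
)


def score_video(checks):
    """Return the fraction of passed checks for each dimension.

    `checks` is an iterable of (dimension, passed) pairs, one pair per
    atomic yes/no question answered by a human rater.
    ASSUMPTION: a plain pass rate per dimension; the actual HUE
    aggregation scheme is not described in the article.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in checks:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        total[dimension] += 1
        passed[dimension] += int(bool(ok))
    # Only report dimensions that received at least one check.
    return {d: passed[d] / total[d] for d in DIMENSIONS if total[d]}


# Made-up rater judgments for a single generated video.
example_checks = [
    ("semantic_alignment", True),   # e.g. "Does the robot arm appear?"
    ("semantic_alignment", True),   # e.g. "Is the setting a warehouse?"
    ("physical_laws", True),        # e.g. "Does the dropped box fall?"
    ("physical_laws", False),       # e.g. "Does the box stay rigid?"
    ("geometric_reasoning", True),  # e.g. "Is occlusion order consistent?"
    ("visual_integrity", False),    # e.g. "Is the video free of flicker?"
]

scores = score_video(example_checks)
print(scores)
```

Breaking the evaluation into binary questions keeps each judgment simple for a rater and makes it possible to see which dimension a video fails on, rather than collapsing everything into one overall preference score.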
Key Takeaways
- Nvidia Cosmos 3 unifies world understanding and action generation in a single physical AI model, released May 31, 2026
- Available in two sizes: Cosmos 3 Nano (16B parameters) for workstation GPUs and Cosmos 3 Super (64B parameters) for datacenter deployment
- Nvidia provides complete open-source infrastructure including model checkpoints, training scripts, and six synthetic datasets covering robotics and autonomous systems
- NVFP4 quantization delivers up to 2x speedup while maintaining performance across benchmarks
- Cosmos 3 leads multiple industry benchmarks including VANTAGE-Bench, Traffic Anomaly Reasoning, and the Artificial Analysis leaderboard for open-source models