Xiaomi Breaks 1000 Tokens/Second Barrier With Trillion-Parameter Model
Xiaomi MiMo has released MiMo-V2.5-Pro-UltraSpeed, a 1-trillion-parameter language model that decodes at approximately 1000-1200 tokens per second during text generation. The model, developed in collaboration with TileRT and released on June 8, 2026, is the first of this scale to break the 1000 tokens/second decode threshold on commodity GPUs.
FP4 Quantization Preserves Performance While Enabling Speed
The model employs selective FP4 quantization applied only to Mixture-of-Experts (MoE) components while maintaining original precision in other modules. This approach preserves reasoning and coding capabilities while dramatically reducing computational requirements. The architecture runs on a single standard 8-GPU commodity node, making extreme performance accessible without specialized hardware.
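To make the trade-off concrete, the following back-of-the-envelope sketch estimates weight storage when only the MoE parameters are stored in 4 bits. The share of parameters that sits in MoE experts (`MOE_FRACTION`) and the original precision (`ORIGINAL_BITS`) are illustrative assumptions, not figures published by Xiaomi, and the estimate ignores quantization scale factors and activation memory.

```python
def weight_footprint_gb(total_params, moe_fraction, moe_bits, other_bits):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization scales."""
    moe_bytes = total_params * moe_fraction * moe_bits / 8
    other_bytes = total_params * (1.0 - moe_fraction) * other_bits / 8
    return (moe_bytes + other_bytes) / 1e9


TOTAL_PARAMS = 1e12   # from the article: 1 trillion parameters
MOE_FRACTION = 0.95   # ASSUMPTION: share of parameters in MoE experts (hypothetical)
ORIGINAL_BITS = 16    # ASSUMPTION: original precision of all modules (hypothetical)
NUM_GPUS = 8          # from the article: a single 8-GPU node

# Everything kept at the assumed original precision.
baseline_gb = weight_footprint_gb(TOTAL_PARAMS, MOE_FRACTION, ORIGINAL_BITS, ORIGINAL_BITS)
# MoE experts in FP4, all other modules left at the assumed original precision.
selective_gb = weight_footprint_gb(TOTAL_PARAMS, MOE_FRACTION, 4, ORIGINAL_BITS)

print(f"baseline:      {baseline_gb:.0f} GB total, {baseline_gb / NUM_GPUS:.1f} GB per GPU")
print(f"selective FP4: {selective_gb:.0f} GB total, {selective_gb / NUM_GPUS:.1f} GB per GPU")
```

Under these assumptions most of the savings come from the expert weights, which is why quantizing only the MoE components can shrink the model substantially while leaving the remaining modules untouched.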
DFlash Speculative Decoding Eliminates Serial Constraints
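MiMo-V2.5-Pro-UltraSpeed introduces DFlash, a speculative decoding method based on block-level masked parallel prediction that removes the serial constraint of token-by-token autoregressive generation. The average acceptance length, meaning the average number of tokens accepted per draft-and-verify cycle, varies by task type: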
- Coding tasks: 6.30 average acceptance length
- Math and reasoning: 5.56 average acceptance length
- Agent tasks: 4.29 average acceptance length
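The acceptance lengths above can be related to the reported decode speeds with simple arithmetic. The sketch below assumes that acceptance length equals the number of tokens emitted per draft-and-verify cycle and that the 1000-1200 tokens/second range applies to every task type; neither assumption is confirmed by the announcement, and `cycle_budget_ms` is a hypothetical helper name.

```python
# Average acceptance lengths reported in the article.
ACCEPTANCE_LENGTH = {
    "coding": 6.30,
    "math_reasoning": 5.56,
    "agent": 4.29,
}


def cycle_budget_ms(tokens_per_second, acceptance_length):
    """Time available for one draft-and-verify cycle, in milliseconds.

    ASSUMPTION: each cycle emits `acceptance_length` tokens on average,
    so throughput = acceptance_length / cycle_time.
    """
    return acceptance_length / tokens_per_second * 1000.0


for task, length in ACCEPTANCE_LENGTH.items():
    slow = cycle_budget_ms(1000, length)  # lower end of the reported speed range
    fast = cycle_budget_ms(1200, length)  # upper end of the reported speed range
    print(f"{task}: {fast:.2f}-{slow:.2f} ms per cycle")
```

Read this way, a higher acceptance length leaves more time per cycle for the same throughput, so tasks with shorter acceptance lengths, such as agent workloads, depend more heavily on low per-cycle latency.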
The TileRT inference system provides ultra-low-latency inference through a persistent-kernel architecture, which eliminates traditional operator launch overhead and enables hardware-software co-optimization at microsecond scale.
Limited Trial Access With Pricing Premium
Xiaomi is offering API-only access through a limited application process during a trial period from June 9-23, 2026 (Beijing Time). The service costs approximately 3× the standard MiMo-V2.5 Pro pricing while delivering roughly 10× faster generation. Approved trial users also receive free chat access at ultraspeed.xiaomimimo.com, subject to a maximum of 10 queue entries per account per day, a 30-minute cap per session, and automatic release after 5 minutes of idle time.
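The price and speed multipliers can be compared per request with a short calculation. In the sketch below, the standard decode speed and the standard price per million output tokens are hypothetical placeholders (the announcement gives only the ratios), and `compare_request` is an illustrative helper, not part of any Xiaomi API.

```python
def compare_request(output_tokens, base_tokens_per_second, base_price_per_million,
                    speedup=10.0, price_multiplier=3.0):
    """Return (seconds, cost) for the standard and UltraSpeed tiers.

    The default multipliers are the approximate ratios from the article:
    roughly 10x faster generation at roughly 3x the price.
    """
    base_seconds = output_tokens / base_tokens_per_second
    base_cost = output_tokens / 1e6 * base_price_per_million
    return {
        "standard": (base_seconds, base_cost),
        "ultraspeed": (base_seconds / speedup, base_cost * price_multiplier),
    }


# ASSUMPTIONS: 100 tokens/second and a price of 1.0 (arbitrary currency unit)
# per million output tokens for the standard tier; both are placeholders.
result = compare_request(output_tokens=20_000,
                         base_tokens_per_second=100.0,
                         base_price_per_million=1.0)

for tier, (seconds, cost) in result.items():
    print(f"{tier}: {seconds:.0f} s, cost {cost:.3f}")
```

Whatever the actual base figures are, the ratios stay fixed: each request costs about three times as much and finishes in about a tenth of the generation time.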
The company has released FP4-quantized weights and DFlash parameters as open source on HuggingFace under the MiMo-V2.5-Pro-FP4-DFlash checkpoint, enabling researchers to experiment with the technology.
Key Takeaways
- MiMo-V2.5-Pro-UltraSpeed decodes at 1000-1200 tokens/second, making it the first 1-trillion-parameter model to break the 1000 tokens/second barrier on commodity GPUs
- Selective FP4 quantization applies only to MoE components and keeps all other modules at their original precision, preserving reasoning and coding capabilities
- DFlash speculative decoding achieves 6.30 average acceptance length on coding tasks through block-level masked parallel prediction
- The model runs on a single standard 8-GPU commodity node, and API access costs approximately 3× standard MiMo-V2.5 Pro pricing for roughly 10× faster generation
- FP4-quantized weights and DFlash parameters are available as open source on HuggingFace under the MiMo-V2.5-Pro-FP4-DFlash checkpoint