Xiaomi MiMo-V2.5-Pro-UltraSpeed Achieves 1000+ Tokens/Second on 1T-Parameter Model

Xiaomi Breaks 1000 Tokens/Second Barrier With Trillion-Parameter Model

Xiaomi MiMo has released MiMo-V2.5-Pro-UltraSpeed, a 1-trillion-parameter language model that generates text at roughly 1000-1200 tokens per second. The model, developed in collaboration with TileRT and released on June 8, 2026, is the first of this scale to exceed 1000 tokens per second in decode speed on commodity GPUs.

FP4 Quantization Preserves Performance While Enabling Speed

The model applies FP4 quantization selectively: only the Mixture-of-Experts (MoE) components are quantized, while all other modules keep their original precision. This approach preserves reasoning and coding capabilities while reducing the memory and bandwidth cost of the expert weights. The model runs on a single standard 8-GPU commodity node, so these speeds do not require specialized hardware.
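
To make the idea of selective quantization concrete, here is a minimal sketch in plain Python. It is not Xiaomi's implementation. It assumes the common E2M1 FP4 value grid with one scale per small block of weights, and the parameter names, the ".experts." naming pattern, and the block size are all hypothetical. It "fake-quantizes" (rounds to the FP4 grid and converts back) only the parameters that belong to experts and leaves the rest untouched.

```python
from typing import Callable, Dict, List

# Magnitudes representable in the E2M1 FP4 format (assumed here; the
# article does not say which FP4 variant the model uses).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(weights: List[float], block_size: int = 4) -> List[float]:
    """Round each block of weights to the FP4 grid using a per-block scale."""
    out: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block onto the largest FP4 value.
        scale = max_abs / 6.0 if max_abs > 0 else 1.0
        for w in block:
            target = abs(w) / scale
            mag = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(target - g))
            out.append((-mag if w < 0 else mag) * scale)
    return out


def quantize_selected(
    params: Dict[str, List[float]],
    is_expert: Callable[[str], bool],
) -> Dict[str, List[float]]:
    """Quantize expert parameters only; copy every other parameter as-is."""
    return {
        name: fake_quantize_fp4(w) if is_expert(name) else list(w)
        for name, w in params.items()
    }


# Hypothetical parameter names and values, for illustration only.
params = {
    "layers.0.attn.q_proj": [0.11, -0.42, 0.07, 0.9],
    "layers.0.moe.experts.3.up_proj": [0.11, -0.42, 0.07, 0.9],
}
quantized = quantize_selected(params, is_expert=lambda name: ".experts." in name)
```

In this toy example the attention weights come back unchanged, while the expert weights snap to the nearest grid point times the block scale. In an MoE model the experts typically hold most of the parameters, which is why quantizing only them can still shrink the bulk of the weights.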

DFlash Speculative Decoding Eliminates Serial Constraints

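MiMo-V2.5-Pro-UltraSpeed introduces DFlash, a block-level masked parallel prediction method for speculative decoding that predicts a block of tokens in parallel instead of strictly one token at a time. Its average acceptance length, the average number of tokens accepted per verification step, varies by task type: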

Coding tasks: 6.30 average acceptance length
Math and reasoning: 5.56 average acceptance length
Agent tasks: 4.29 average acceptance length
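
The acceptance-length figures above are easier to read with the verification step in mind. The sketch below shows generic greedy speculative-decoding verification, not DFlash itself, whose internals are not described here. The toy `target_next` and `draft_block` functions are hypothetical stand-ins for the large model and the block drafter. A real system scores all drafted positions in one parallel forward pass; the loop here calls the target sequentially only for readability.

```python
from typing import Callable, List, Tuple


def target_next(context: List[int]) -> int:
    """Toy target model: counts upward and wraps around at 10."""
    return (context[-1] + 1) % 10


def draft_block(context: List[int], k: int) -> List[int]:
    """Toy drafter: proposes k tokens at once but never wraps, so it errs at 10."""
    return [context[-1] + i for i in range(1, k + 1)]


def verify_block(
    context: List[int],
    draft: List[int],
    target: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # whole draft accepted: one bonus token
    return accepted


def generate(prefix: List[int], n_tokens: int, k: int) -> Tuple[List[int], int]:
    """Generate n_tokens with block drafting; return tokens and verification steps."""
    context = list(prefix)
    steps = 0
    while len(context) - len(prefix) < n_tokens:
        context.extend(verify_block(context, draft_block(context, k), target_next))
        steps += 1
    return context[len(prefix):len(prefix) + n_tokens], steps


tokens, steps = generate(prefix=[0], n_tokens=10, k=3)
avg_acceptance_length = len(tokens) / steps
```

The output is identical to what the target would produce one token at a time, but it takes fewer target steps. By the same logic, an average acceptance length of 6.30 on coding tasks means roughly one verification step per 6.3 generated tokens, ignoring the cost of drafting.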

The TileRT inference system provides ultra-low-latency inference through a persistent-kernel architecture, which eliminates per-operator launch overhead and allows hardware-software co-optimization at microsecond scale.

Limited Trial Access With Pricing Premium

Xiaomi is offering API-only access through a limited application process during a trial period from June 9 to June 23, 2026 (Beijing Time). The service costs approximately 3× the standard MiMo-V2.5 Pro price while delivering roughly 10× faster generation. Approved trial users also receive free chat access at ultraspeed.xiaomimimo.com, subject to limits: a maximum of 10 queue entries per account per day, a 30-minute cap per session, and automatic release after 5 minutes of idle time.
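
The price and speed multipliers can be put side by side with a little arithmetic. This is a rough sketch only: the baseline price and baseline speed below are hypothetical placeholders, not published figures, and only the 3× and 10× multipliers come from the announcement.

```python
from typing import Dict


def compare_job(
    tokens: int,
    base_price_per_mtok: float,   # hypothetical baseline price per million tokens
    base_tokens_per_s: float,     # hypothetical baseline decode speed
    price_multiplier: float = 3.0,   # approx. 3x price, per the announcement
    speed_multiplier: float = 10.0,  # approx. 10x speed, per the announcement
) -> Dict[str, float]:
    """Compare cost and generation time for one job on both service tiers."""
    base_cost = tokens / 1_000_000 * base_price_per_mtok
    base_seconds = tokens / base_tokens_per_s
    return {
        "base_cost": base_cost,
        "fast_cost": base_cost * price_multiplier,
        "base_seconds": base_seconds,
        "fast_seconds": base_seconds / speed_multiplier,
    }


# Placeholder inputs: 200,000 output tokens, 1.0 currency unit per million
# tokens, and 100 tokens/second at the baseline tier.
job = compare_job(tokens=200_000, base_price_per_mtok=1.0, base_tokens_per_s=100.0)
```

Under these placeholder inputs the job costs three times as much but finishes in a tenth of the time, so the premium makes sense mainly where latency matters more than cost per token.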

The company has released FP4-quantized weights and DFlash parameters as open source on HuggingFace under the MiMo-V2.5-Pro-FP4-DFlash checkpoint, enabling researchers to experiment with the technology.

Key Takeaways

MiMo-V2.5-Pro-UltraSpeed achieves 1000-1200 tokens/second on a 1-trillion-parameter model, the first to break this barrier on commodity GPUs
Selective FP4 quantization applies only to MoE components and keeps all other modules at their original precision, preserving reasoning and coding capabilities
DFlash speculative decoding achieves 6.30 average acceptance length on coding tasks through block-level masked parallel prediction
The model runs on a single standard 8-GPU commodity node and costs 3× standard pricing for approximately 10× faster generation
Open source weights and parameters are available on HuggingFace for research purposes
