Researchers have developed DECO (Dense-Comparable Performance on End-Side Devices), a sparse Mixture-of-Experts architecture that matches dense Transformer performance while activating only 20% of its experts and delivering a 3× speedup on real hardware. The research, published on arXiv on May 11, 2026, by Chenyang Song, Weilin Zhao, and colleagues, addresses critical deployment challenges for large language models on resource-constrained devices.
Solving MoE's Storage and Memory-Access Bottlenecks
While Mixture-of-Experts architectures scale model capacity without proportionally increasing computation, their massive parameter footprints create significant storage and memory-access bottlenecks. These challenges hinder efficient deployment on edge devices such as smartphones, which demand high performance, low computational cost, and small storage overhead all at once.
DECO achieves dense-comparable performance under identical total parameter budgets and training tokens, making it viable for real-world edge deployment scenarios where storage and memory constraints are critical.
Three Key Architectural Innovations
First, the architecture introduces ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. This approach enables more efficient expert utilization than traditional gating mechanisms.
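The sketch below illustrates the routing idea in PyTorch. The class name, the expert_scale parameter, and the overall structure are illustrative assumptions based on the description above, not the paper's released code.

```python
import torch
import torch.nn as nn

class ReLURouter(nn.Module):
    """Illustrative ReLU-based router with learnable expert-wise scaling.

    Instead of softmax/top-k gating, each expert's routing weight is
    relu(x @ W)[e] * s[e], where s is a learnable per-expert scale that
    lets the model rebalance routed experts against the shared expert.
    """

    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.expert_scale = nn.Parameter(torch.ones(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                           # (batch, num_experts)
        weights = torch.relu(logits) * self.expert_scale
        return weights                                    # zero weight => expert skipped
```

Because ReLU zeroes out negative logits, any expert whose routing weight is exactly zero can be skipped entirely at inference time, which is where the sparsity savings come from.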
Second, DECO implements a NormSiLU activation function that normalizes inputs before applying the SiLU operator. This yields a more stable trend in routed-expert activation ratios and higher intrinsic sparsity, allowing fewer experts to be activated while maintaining performance.
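A minimal sketch of the activation follows, assuming an RMS-style normalization over the feature dimension; the summary only states that inputs are normalized before SiLU, so the exact normalization scheme here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSiLU(nn.Module):
    """Sketch of a NormSiLU activation: normalize, then apply SiLU.

    The RMS-style normalization below is an assumption; the source only
    says inputs are normalized prior to the SiLU operator.
    """

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale pre-activations to unit RMS over the last dimension,
        # then apply SiLU (x * sigmoid(x)).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return F.silu(x / rms)
```

Plausibly, holding the scale of pre-activations steady keeps the fraction of strongly activated units more consistent across tokens and layers, which matches the stability claim above.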
Third, the architecture uses non-gated MLP experts together with ReLU-based routing, suggesting that MoE architectures can be simplified further. Dropping the gating branch reduces computational overhead while maintaining competitive performance.
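For contrast, here is a sketch of a non-gated expert; the names and dimensions are illustrative. A gated (SwiGLU-style) expert computes down(silu(gate(x)) * up(x)) with three weight matrices, while a non-gated expert drops the gate branch.

```python
import torch
import torch.nn as nn

class NonGatedExpert(nn.Module):
    """Illustrative non-gated MLP expert: up-project, activate, down-project.

    Unlike a SwiGLU-style gated expert, there is no separate gate
    projection, so the expert needs two weight matrices instead of three.
    DECO's experts would presumably use the NormSiLU activation sketched
    above; plain SiLU is used here to keep this block self-contained.
    """

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, hidden_dim, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))
```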
Real Hardware Performance with 20% Expert Activation
DECO activates only 20% of its experts while matching dense Transformer performance and outperforming established MoE baselines. A specialized acceleration kernel delivers a 3.00× speedup on real hardware compared with dense inference, making the architecture practical for smartphone and edge device deployment.
The research team states that code and checkpoints will be released, enabling broader adoption and further research. This represents significant progress toward deploying large language models on resource-constrained devices without giving up competitive performance.
Key Takeaways
- DECO sparse MoE architecture matches dense Transformer performance while activating only 20% of experts, published on arXiv May 11, 2026
- Achieves 3.00× speedup on real hardware compared with dense inference using specialized acceleration kernels
- Introduces ReLU-based routing with learnable expert-wise scaling that adaptively balances routed and shared expert contributions
- NormSiLU activation function produces more stable routed-expert activation ratios and higher intrinsic sparsity
- Designed specifically for edge device deployment where storage, memory access, and computational efficiency are critical constraints