The first systematic comparison of optimizers for tabular deep learning reveals that the Muon optimizer consistently outperforms the widely used AdamW. The research, published April 16, 2026, on arXiv, benchmarked 15 optimizers across 17 tabular datasets to provide practitioners with evidence-based optimizer recommendations for MLP-based models.
Muon Demonstrates Consistent Performance Gains
The study found that Muon (MomentUm Orthogonalized by Newton-Schulz) consistently outperforms AdamW for both plain MLPs and modern MLP-based architectures. Muon takes the updates generated by SGD with momentum and post-processes each one with a Newton-Schulz iteration before it is applied to the parameters. This orthogonalization step provides better conditioning for optimization.
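As a rough illustration of this mechanism, the sketch below implements the quintic Newton-Schulz iteration used in public Muon reference implementations (the polynomial coefficients come from there) and a simplified Muon-style parameter step. Function names, hyperparameter values, and the step structure are illustrative, not taken from the paper itself:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that approximately maps the update
    # matrix G to the nearest semi-orthogonal matrix (singular values
    # pushed toward 1). Coefficients follow the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    # Simplified Muon-style update: ordinary SGD-momentum, with the
    # momentum buffer orthogonalized before it touches the parameters.
    momentum *= beta
    momentum += grad
    param -= lr * newton_schulz_orthogonalize(momentum)
```

After a handful of iterations the singular values of the post-processed update all sit near 1, which is the "better conditioning" the study attributes to Muon; the extra matrix multiplications per step are also where its training-time overhead comes from.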
The researchers conducted their evaluation in the standard supervised learning setting under a shared experiment protocol, ensuring fair comparisons across all 15 optimizers tested.
Performance Comes With Training Efficiency Trade-offs
While Muon delivers better accuracy, it trains more slowly because of the additional Newton-Schulz iteration applied to each update. The research team concludes that Muon is a strong and practical choice for practitioners and researchers when this training overhead is affordable, which means accuracy gains must be weighed against longer training times.
Exponential Moving Average Improves AdamW on Vanilla MLPs
The study uncovered an additional finding: exponential moving average (EMA) of model weights is a simple yet effective technique that improves AdamW on vanilla MLPs. However, the effect of EMA is less consistent across advanced MLP-based model variants, suggesting that the benefit is architecture-dependent.
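A weight EMA of this kind is straightforward to bolt onto any training loop. The minimal sketch below shows the standard technique; the class name, `decay` value, and calling convention are illustrative assumptions, not details from the paper:

```python
import numpy as np

class WeightEMA:
    # Exponential moving average of model weights. The shadow copies are
    # updated after each optimizer step and used in place of the live
    # weights at evaluation time.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p
```

In use, `update` is called once per optimizer step, and the model is evaluated with `shadow` instead of the live weights; the live weights themselves are never modified.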
Filling a Gap in Tabular Deep Learning Research
MLPs are heavily used backbones in modern deep learning architectures for supervised learning on tabular data, and AdamW has been the go-to optimizer. Unlike architecture design, however, the choice of optimizer for tabular deep learning has not been examined systematically, despite new optimizers showing promise in other domains. This study fills that gap with comprehensive benchmarking.
Practical Recommendations for Practitioners
Based on the findings, practitioners working with tabular data should:
- Consider Muon if accuracy is paramount and training time overhead is acceptable
- Try EMA with AdamW on vanilla MLPs as a simple improvement
- Base the optimizer choice on the specific model architecture, since effectiveness varies between vanilla MLPs and advanced variants
The research was authored by Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, and Artem Babenko. Code is available on GitHub at github.com/yandex-research/tabular-dl-optimizers, enabling practitioners to replicate results and apply the findings to their own tabular datasets.
Key Takeaways
- Muon optimizer consistently outperforms AdamW across 17 tabular datasets for both plain MLPs and modern MLP-based architectures
- The performance gain comes with slower training due to additional Newton-Schulz iteration overhead
- Exponential moving average (EMA) of model weights improves AdamW on vanilla MLPs but shows less consistent benefits on advanced variants
- This represents the first systematic benchmarking of optimizers specifically for tabular deep learning
- Code is publicly available on GitHub for practitioners to replicate and apply findings