The first systematic comparison of optimizers for tabular deep learning reveals that the Muon optimizer consistently outperforms the widely used AdamW. The research, published April 16, 2026, on arXiv, benchmarked 15 optimizers across 17 tabular datasets to provide practitioners with evidence-based optimizer recommendations for MLP-based models.
Muon Demonstrates Consistent Performance Gains
The study found that Muon (MomentUm Orthogonalized by Newton-Schulz) consistently outperforms AdamW for both plain MLPs and modern MLP-based architectures. Muon takes the updates generated by SGD with momentum and post-processes each one with a Newton-Schulz iteration before it is applied to the parameters. This orthogonalization step provides better conditioning for optimization.
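As a rough illustration of this mechanism, the sketch below implements the quintic Newton-Schulz iteration used in public Muon reference implementations (the polynomial coefficients come from there) and a simplified Muon-style parameter step. Function names, hyperparameter values, and the step structure are illustrative, not taken from the paper itself:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that approximately maps the update
    # matrix G to the nearest semi-orthogonal matrix (singular values
    # pushed toward 1). Coefficients follow the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    # Simplified Muon-style update: ordinary SGD-momentum, with the
    # momentum buffer orthogonalized before it touches the parameters.
    momentum *= beta
    momentum += grad
    param -= lr * newton_schulz_orthogonalize(momentum)
```

After a handful of iterations the singular values of the post-processed update all sit near 1, which is the "better conditioning" the study attributes to Muon; the extra matrix multiplications per step are also where its training-time overhead comes from.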
The researchers conducted their evaluation in the standard supervised learning setting under a shared experiment protocol, ensuring fair comparisons across all 15 optimizers tested.
Performance Comes With Training Efficiency Trade-offs
While Muon delivers better accuracy, it trains more slowly because of the additional Newton-Schulz iteration applied to each update. The research team concludes that Muon is a strong and practical choice for practitioners and researchers when this training overhead is affordable, which means accuracy gains must be weighed against longer training times.
Exponential Moving Average Improves AdamW on Vanilla MLPs
The study uncovered an additional finding: exponential moving average (EMA) of model weights is a simple yet effective technique that improves AdamW on vanilla MLPs. However, the effect of EMA is less consistent across advanced MLP-based model variants, suggesting that the benefit is architecture-dependent.
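A weight EMA of this kind is straightforward to bolt onto any training loop. The minimal sketch below shows the standard technique; the class name, `decay` value, and calling convention are illustrative assumptions, not details from the paper:

```python
import numpy as np

class WeightEMA:
    # Exponential moving average of model weights. The shadow copies are
    # updated after each optimizer step and used in place of the live
    # weights at evaluation time.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p
```

In use, `update` is called once per optimizer step, and the model is evaluated with `shadow` instead of the live weights; the live weights themselves are never modified.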
Filling a Gap in Tabular Deep Learning Research
MLPs are heavily used backbones in modern deep learning architectures for supervised learning on tabular data, and AdamW has been the go-to optimizer. Unlike architecture design, however, the choice of optimizer for tabular deep learning has not been examined systematically, despite new optimizers showing promise in other domains. This study fills that gap with comprehensive benchmarking.
Practical Recommendations for Practitioners
Based on the findings, practitioners working with tabular data should:
- Consider Muon if accuracy is paramount and training time overhead is acceptable
- Try EMA with AdamW on vanilla MLPs as a simple improvement
- Base the optimizer choice on the specific model architecture, since effectiveness varies between vanilla MLPs and advanced variants
The research was authored by Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, and Artem Babenko. Code is available on GitHub at github.com/yandex-research/tabular-dl-optimizers, enabling practitioners to replicate results and apply the findings to their own tabular datasets.
Key Takeaways
- Muon optimizer consistently outperforms AdamW across 17 tabular datasets for both plain MLPs and modern MLP-based architectures
- The performance gain comes with slower training due to additional Newton-Schulz iteration overhead
- Exponential moving average (EMA) of model weights improves AdamW on vanilla MLPs but shows less consistent benefits on advanced variants
- This represents the first systematic benchmarking of optimizers specifically for tabular deep learning
- Code is publicly available on GitHub for practitioners to replicate and apply findings