Benchmarking Optimizers for MLPs in Tabular Deep Learning

TL;DR

The Muon optimizer consistently outperforms AdamW for training MLP-based models on tabular data, and is recommended whenever its small training-time overhead is acceptable.

cs.LG · 2026-04-17
Yury Gorishniy Ivan Rubachev Dmitrii Feoktistov Artem Babenko
optimizers MLP tabular data deep learning benchmarking

Key Findings

Methodology

The study systematically compares 15 optimizers across 17 tabular datasets under a unified hyperparameter tuning and evaluation protocol. Special focus is given to the Muon optimizer, which has shown strong performance in various domains, including LLM training and information retrieval. Experiments involved standard ReLU MLPs and more advanced MLP variants like MLP† and TabM.

Key Results

  • Muon consistently outperforms AdamW across all MLP variants, with an average score improvement of 0.32; across the 17 datasets it wins in most cases, particularly for complex models such as TabM†.
  • An Exponential Moving Average (EMA) of model weights can further improve AdamW, most clearly on vanilla MLPs, but its effect is less consistent on more complex model variants.
  • Muon's training takes roughly 1.03× as long as AdamW's, a small overhead in exchange for notable gains in predictive performance, especially in noisy, finite-data settings.

Significance

This research fills a gap in systematic studies on optimizer choice in tabular deep learning, providing a comprehensive benchmark. The superior performance of the Muon optimizer offers a new option for researchers and practitioners, particularly in scenarios requiring high generalization capabilities. The study also highlights the potential value of EMA in simple MLPs, though its effects are inconsistent in complex models.

Technical Contribution

The technical contribution lies in the first systematic evaluation of various optimizers on tabular data, particularly highlighting the superiority of the Muon optimizer. The study provides new experimental data and analysis, demonstrating Muon's consistent performance across various MLP architectures and exploring the potential application of EMA.

Novelty

This is the first study to systematically compare multiple optimizers in tabular deep learning, particularly introducing the Muon optimizer to this field. Unlike previous studies that focused mainly on architecture design, this paper emphasizes the impact of optimizer choice on model performance.

Limitations

  • The study is limited to MLP architectures and does not cover other deep learning models like CNNs or GNNs, which limits the generalizability of the results.
  • Muon's slower training speed may not be suitable for scenarios with limited computational resources.
  • The study does not analyze per-dataset performance differences between optimizers; a more detailed breakdown is left for future work.

Future Work

Future research could extend to other types of deep learning models, such as CNNs and GNNs, to explore the potential applications of the Muon optimizer in different domains. Additionally, optimizing Muon's training efficiency could make it more competitive in resource-constrained environments.

AI Executive Summary

In modern deep learning, Multi-Layer Perceptrons (MLPs) are a crucial architecture for supervised learning on tabular data, with AdamW being the default optimizer for training these models. However, despite the emergence of promising new optimizers in other domains, the choice of optimizer in tabular deep learning has not been systematically studied.

This paper addresses this gap by benchmarking 15 optimizers across 17 tabular datasets. The study finds that the Muon optimizer consistently outperforms AdamW across all MLP variants, particularly in complex models like TabM†. Although Muon is slower in training, its improvements in predictive performance make it a strong choice.

The research also explores the role of Exponential Moving Average (EMA) in enhancing AdamW's performance. While EMA shows potential in simple MLPs, its effects are inconsistent in complex models, indicating that its application needs to be tailored to specific models.

The significance of this study lies in providing a comprehensive guide for optimizer selection in tabular deep learning, particularly introducing the Muon optimizer as a new tool for researchers and practitioners. The study also highlights the impact of optimizer choice on model generalization capabilities in noisy, finite-data environments.

However, the study has limitations. It is restricted to MLP architectures and does not cover other deep learning models. Additionally, Muon's slower training speed may not be suitable for scenarios with limited computational resources. Future research could extend to other types of deep learning models and explore the potential applications of the Muon optimizer in different domains.

Deep Analysis

Background

In deep learning, handling tabular data has been a significant research area. Multi-Layer Perceptrons (MLPs) serve as a foundational architecture widely used for supervised learning tasks on tabular data. While significant progress has been made in architectural design, optimizer choice remains largely dependent on AdamW, lacking systematic study. With new optimizers showing success in other domains, revisiting optimizer choice in tabular data becomes crucial.

Core Problem

The lack of systematic research on optimizer choice in tabular deep learning may lead to missed opportunities for performance improvement. While AdamW is the default choice, its optimality in tabular data remains unverified. Particularly in noisy, finite-data environments, optimizer choice significantly impacts model generalization capabilities.

Innovation

The innovations in this paper include:

1) A systematic evaluation of 15 optimizers on tabular data, filling a research gap in this area.

2) Introduction of the Muon optimizer, demonstrating its superiority across various MLP architectures.

3) Exploration of Exponential Moving Average (EMA) in enhancing AdamW's performance, providing new experimental data and analysis.

Methodology

The research methodology includes:

  • Selection of 15 optimizers, including Muon, AdamW, and its variants.
  • Experiments conducted on 17 tabular datasets, covering different task types and data scales.
  • Unified hyperparameter tuning and evaluation protocol to ensure comparability of results.
  • Independent tuning for each optimizer to ensure fair comparison.
  • Cross-validation and multiple experiments to ensure robustness of results.
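The protocol above can be sketched end-to-end. The following toy example is purely illustrative: it uses plain SGD and SGD-with-momentum as stand-ins for the paper's 15 optimizers (Muon is not reimplemented here), a synthetic regression task in place of the 17 datasets, and a small learning-rate grid in place of the full tuning procedure; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tabular" regression task: y = X @ w_true + noise.
X = rng.normal(size=(300, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=300)
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

# Optimizer registry: each entry builds a stateful update function.
def make_sgd(lr):
    def step(w, g, state):
        return w - lr * g, state
    return step, None

def make_momentum(lr, beta=0.9):
    def step(w, g, state):
        state = beta * state + g
        return w - lr * state, state
    return step, np.zeros(8)

OPTIMIZERS = {"sgd": make_sgd, "momentum": make_momentum}

def train(make_opt, lr, steps=200):
    w = np.zeros(8)
    step_fn, state = make_opt(lr)
    for _ in range(steps):
        w, state = step_fn(w, grad(w, X_tr, y_tr), state)
    return w

# Shared protocol: tune the learning rate independently per optimizer
# on the validation split, then report test MSE for the best setting.
results = {}
for name, make_opt in OPTIMIZERS.items():
    best = min(
        (train(make_opt, lr) for lr in [0.001, 0.01, 0.1]),
        key=lambda w: mse(w, X_val, y_val),
    )
    results[name] = mse(best, X_te, y_te)

print(results)
```

The property mirrored from the paper's protocol is that each optimizer gets its own independently tuned hyperparameters before the comparison is made on held-out data.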

Experiments

The experimental design includes:

  • Datasets: 17 tabular datasets, including standard academic and industrial datasets.
  • Models: Standard ReLU MLPs and more complex MLP variants like MLP† and TabM.
  • Evaluation metrics: Accuracy for classification tasks, RMSE for regression tasks.
  • Hyperparameter tuning: Conducted using Optuna, with independent tuning for each optimizer within their respective search spaces.
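The evaluation metrics listed above are standard; for concreteness, a minimal NumPy implementation:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of exactly matching class labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.75
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3)
```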

Results

Results analysis shows:

  • Muon consistently outperforms AdamW across all MLP variants, particularly in complex models like TabM†.
  • EMA can further enhance AdamW's performance in some cases, but its effect is less pronounced in complex models compared to Muon.
  • Muon is approximately 1.03 times slower than AdamW in training but offers significant predictive performance improvements, especially in noisy, finite-data environments.

Applications

Application scenarios include:

  • In tabular data tasks requiring high generalization capabilities, the Muon optimizer can significantly enhance model performance.
  • In noisy, finite-data environments, optimizer choice significantly impacts model generalization capabilities.
  • EMA's potential application in simple MLPs, suitable for scenarios requiring quick performance improvements.

Limitations & Outlook

Limitations & outlook:

  • The study is limited to MLP architectures and does not cover other deep learning models like CNNs or GNNs.
  • Muon's slower training speed may not be suitable for scenarios with limited computational resources.
  • Future research could extend to other types of deep learning models, exploring the potential applications of the Muon optimizer in different domains.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. You have a recipe (MLP model) and need to choose the right kitchen tools (optimizers) to complete the dish. AdamW is like your go-to pan, reliable but not the best for every dish. The Muon optimizer is like a new multi-functional pan, a bit more complex to use but can make the dish taste better. The study found that Muon performs better when handling complex dishes (complex models), while for simple dishes (simple models), using some tricks (like EMA) can also improve performance. Although Muon might take more time to use, it's definitely worth trying if you want to make a sumptuous dinner.

ELI14 (explained like you're 14)

Hey there! Did you know scientists are working on making computers smarter, especially when it comes to handling tables of data? It's like using different notebooks for different subjects at school. Scientists are looking for the best tool to train computers. AdamW is their usual tool, like your favorite notebook. But they've found a new tool called Muon, like a super notebook that helps computers learn better! However, this new tool is a bit slow, like writing slowly but neatly. Scientists also found some tricks to make the old tool work better. In the future, they'll keep researching these tools to see if they can make computers learn both quickly and well!

Glossary

Optimizer

An algorithm used to adjust model parameters to minimize the loss function. In this paper, optimizers are used to train MLP models.

The study compares 15 optimizers on tabular data performance.

MLP (Multi-Layer Perceptron)

A neural network architecture consisting of multiple fully connected layers, commonly used for processing tabular data.

MLP is the primary model architecture studied in this paper.

AdamW

A widely used optimizer that combines Adam optimizer with weight decay regularization.

AdamW is the default optimizer in tabular deep learning.
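The defining feature of AdamW, per Loshchilov & Hutter, is that weight decay is decoupled from the gradient-based update and applied directly to the weights. A single-parameter NumPy sketch (hyperparameter defaults are illustrative):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step: an Adam update plus *decoupled* weight decay,
    applied to the weights rather than folded into the gradient."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Minimize f(w) = (w - 3)^2; the gradient is 2(w - 3).
w, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 501):
    w, m, v = adamw_step(w, 2 * (w - 3.0), m, v, t)
print(float(w))  # close to 3 (pulled slightly toward 0 by weight decay)
```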

Muon

A novel optimizer that has shown strong performance in various domains.

The study finds Muon outperforms AdamW on tabular data.
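This summary does not describe Muon's mechanics, but in its original release Muon (MomentUm Orthogonalized by Newton-Schulz) replaces the raw momentum matrix of each weight matrix with an approximately orthogonalized version computed by a Newton-Schulz iteration. The sketch below uses the classical cubic iteration for clarity; Muon itself uses a tuned quintic polynomial, so treat this as an illustration of the idea rather than Muon's exact update:

```python
import numpy as np

def orthogonalize(G, steps=40):
    """Drive the singular values of G toward 1 with the classical
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    (Muon uses a tuned quintic variant of the same idea.)"""
    X = G / np.linalg.norm(G)  # Frobenius normalization: spectral norm < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))            # stand-in for a momentum matrix
O = orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False))  # all singular values near 1
```

The iteration only rescales singular values, so the result keeps the momentum matrix's singular directions while equalizing their magnitudes.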

EMA (Exponential Moving Average)

A technique that improves model performance by averaging model weights over time.

The study explores EMA's role in enhancing AdamW's performance.
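A minimal sketch of weight EMA: training updates the raw weights as usual, while evaluation uses the smoothed copy. The toy trajectory below is illustrative, not from the paper:

```python
import numpy as np

def ema_update(ema_w, w, decay=0.999):
    """Exponential moving average of model weights: evaluate with
    ema_w while training keeps updating the raw weights w."""
    return decay * ema_w + (1 - decay) * w

# Noisy weight trajectory oscillating around the "true" optimum 1.0.
rng = np.random.default_rng(0)
w = 1.0 + 0.5 * rng.normal(size=2000)   # raw (noisy) iterates
ema = w[0]
for wi in w[1:]:
    ema = ema_update(ema, wi, decay=0.99)
print(float(ema))  # much closer to 1.0 than a typical raw iterate
```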

TabM

A complex MLP-based model using parameter-efficient ensembling techniques.

TabM is one of the complex models evaluated in the study.

Optuna

A framework for hyperparameter optimization using Bayesian optimization techniques.

The study uses Optuna for hyperparameter tuning of optimizers.

Cross-Entropy Loss

A loss function used for classification tasks, measuring the difference between predicted and true probability distributions.

The study uses cross-entropy loss for classification task training.
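For reference, a numerically stable NumPy version of this loss (the max-shift is the standard log-sum-exp trick guarding against overflow):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Two samples, three classes; the first prediction is confident and correct,
# the second is uniform, costing ln(3) on its own.
logits = np.array([[4.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
print(cross_entropy(logits, np.array([0, 2])))
```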

RMSE (Root Mean Square Error)

An evaluation metric for regression tasks, measuring the difference between predicted and true values.

The study uses RMSE as an evaluation metric for regression tasks.

Hyperparameter Tuning

The process of adjusting model or optimizer parameters to optimize performance.

The study conducts independent hyperparameter tuning for each optimizer.

Open Questions (unanswered questions from this research)

  1. While the Muon optimizer shows superior performance on tabular data, its performance on other types of deep learning models remains unverified. Future research could extend to CNNs and GNNs to explore Muon's potential applications in more domains.
  2. The study does not delve into the performance differences of optimizers on specific datasets. More detailed analysis is needed in the future to understand the strengths and weaknesses of different optimizers under varying data characteristics.
  3. Although the study explores EMA's role in enhancing AdamW's performance, its effects are inconsistent in complex models. Future research could further explore EMA's potential applications across different model architectures.
  4. The study does not address the comparison of optimizers in terms of training efficiency and resource consumption. Future research could optimize Muon's training efficiency to make it more competitive in resource-constrained environments.
  5. While the study provides rich experimental data, it lacks theoretical analysis. Future research could explore why Muon performs well on tabular data from a theoretical perspective.

Applications

Immediate Applications

Tabular Data Analysis

The Muon optimizer can be used to enhance model performance in tabular data analysis tasks, particularly in scenarios requiring high generalization capabilities.

Financial Data Prediction

In financial data prediction, using the Muon optimizer can improve model accuracy, aiding financial institutions in better risk management.

Medical Data Analysis

In medical data analysis, the Muon optimizer can help improve diagnostic accuracy, providing more reliable support for medical decision-making.

Long-term Vision

Intelligent Decision Systems

The superior performance of the Muon optimizer can be applied to build more intelligent decision systems, enhancing the accuracy and efficiency of automated decisions.

Cross-Domain Applications

With further research on the Muon optimizer, its applications could expand to more domains like image recognition and natural language processing, driving advancements in these fields.

Abstract

MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimizer for tabular DL has not been examined systematically, despite new optimizers showing promise in other domains. To fill this gap, we benchmark 15 optimizers on 17 tabular datasets for training MLP-based models in the standard supervised learning setting under a shared experiment protocol. Our main finding is that the Muon optimizer consistently outperforms AdamW, and thus should be considered a strong and practical choice for practitioners and researchers, if the associated training efficiency overhead is affordable. Additionally, we find exponential moving average of model weights to be a simple yet effective technique that improves AdamW on vanilla MLPs, though its effect is less consistent across model variants.

