Generalization at the Edge of Stability
Introduces 'sharpness dimension' to explain improved generalization at the edge of stability.
Key Findings
Methodology
This study represents stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set rather than a point. Based on this, we introduce a novel notion of dimension, called the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results demonstrate that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work.
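To make the dimension idea concrete, the sketch below computes a Kaplan-Yorke-style dimension from the Hessian eigenvalues of the linearized gradient-descent step map θ ↦ θ − η∇L(θ), whose Jacobian at a critical point is I − ηH. This is only an illustrative assumption about how a Lyapunov-dimension-style quantity can be read off the Hessian spectrum; the paper's precise definition of the sharpness dimension may differ.

```python
import numpy as np

def kaplan_yorke_dimension(hessian_eigs, lr):
    """Kaplan-Yorke-style dimension of the linearized GD step map theta -> theta - lr * grad L(theta).

    The Jacobian at a critical point is I - lr * H, so the per-step (local) Lyapunov
    exponents are log|1 - lr * lambda_i|.  The running sums below are logs of partial
    products of |1 - lr * lambda_i|, i.e. (log-)partial determinants of that Jacobian.
    NOTE: illustrative construction only, not necessarily the paper's exact definition.
    """
    exps = np.sort(np.log(np.abs(1.0 - lr * np.asarray(hessian_eigs))))[::-1]  # descending
    partial_sums = np.cumsum(exps)
    k = int(np.sum(partial_sums >= 0))       # largest k whose top-k sum is still non-negative
    if k == 0:
        return 0.0                            # contracting in every direction: point attractor
    if k == len(exps):
        return float(len(exps))               # expanding overall: dimension saturates
    return k + partial_sums[k - 1] / abs(exps[k])

# Toy spectrum: one sharp direction past the 2/lr stability threshold, the rest flat.
eigs = [2.3, 1.9, 0.5, 0.1, 0.01, 0.001]
print(kaplan_yorke_dimension(eigs, lr=1.0))   # ~5.06: fractional, below the ambient dimension 6
```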
Key Results
- Result 1: Experiments across various MLPs and transformers validate the theory and provide new insights into the recently observed phenomenon of grokking.
- Result 2: By introducing the 'sharpness dimension', it is shown that generalization at the edge of stability is controlled by a provably lower-dimensional attractor.
- Result 3: Experiments indicate that training dynamics exhibit chaotic behavior at the edge of stability, with training trajectories displaying sensitive dependence on initialization.
Significance
This study provides a new perspective on understanding the generalization performance of neural networks at the edge of stability by introducing the 'sharpness dimension'. It reveals that in the chaotic regime, generalization performance does not depend merely on the properties of any single solution but rather on the geometric characteristics of the entire solution set that the optimizer explores in the long run. This finding has significant implications for both academia and industry, as it challenges traditional complexity measures and provides a theoretical foundation for understanding the generalization capabilities of overparameterized models.
Technical Contribution
Technical contributions include modeling stochastic optimizers as random dynamical systems, proposing the novel concept of 'sharpness dimension', and proving a generalization bound based on this dimension. Additionally, the study reveals the importance of the complete Hessian spectrum and the structure of its partial determinants in generalization, going beyond traditional analyses based on the trace or spectral norm. This provides new theoretical grounding, and potential practical tools, for understanding the generalization capabilities of overparameterized models.
Novelty
This study is the first to model stochastic optimizers as random dynamical systems and introduces the novel concept of 'sharpness dimension'. Unlike previous research, this study reveals that generalization performance in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants rather than the traditional trace or spectral norm. This innovation provides a new perspective for understanding the generalization capabilities of neural networks at the edge of stability.
Limitations
- Limitation 1: The method may face computational complexity issues when calculating the complete Hessian spectrum, especially when dealing with large-scale models.
- Limitation 2: Although the generalization bound is theoretically proven, how tight and useful it is in practice may require further validation.
- Limitation 3: The study primarily focuses on MLPs and transformers, and other types of neural networks may require additional research.
Future Work
Future research directions include: 1) validating the applicability of the 'sharpness dimension' on larger-scale neural networks; 2) exploring the generalization performance of other types of neural networks at the edge of stability; 3) developing more efficient algorithms to compute the complete Hessian spectrum to reduce computational complexity.
AI Executive Summary
In modern machine learning, understanding why large, overparameterized neural networks generalize remains a core problem. Traditional optimization theory suggests avoiding instability and chaotic behavior during training. However, recent studies show that neural networks exhibit improved generalization performance at the edge of stability.
This study proposes a novel approach by modeling stochastic optimizers as random dynamical systems and introducing the concept of 'sharpness dimension'. Through this method, the study reveals that in the chaotic regime, generalization performance does not depend merely on the properties of any single solution but rather on the geometric characteristics of the entire solution set that the optimizer explores in the long run.
Core technical principles include: 1) modeling stochastic optimizers as random dynamical systems; 2) defining the 'sharpness dimension' and its role in generalization; 3) analyzing the importance of the complete Hessian spectrum and the structure of its partial determinants in generalization. These principles provide a new perspective for understanding the generalization capabilities of overparameterized models.
Experimental results validate the theory on MLPs and transformers and provide new insights into the recently observed phenomenon of grokking. In particular, they indicate that generalization at the edge of stability is controlled by a provably lower-dimensional attractor.
This study has significant implications for both academia and industry as it challenges traditional complexity measures and provides a theoretical foundation for understanding the generalization capabilities of overparameterized models. However, the computational complexity of calculating the complete Hessian spectrum remains an issue, and future research will focus on larger-scale models and other types of neural networks.
Deep Analysis
Background
In recent years, with the rapid development of deep learning, understanding the generalization capabilities of neural networks has become an important research topic. Traditional optimization theory suggests avoiding instability and chaotic behavior during training to ensure the generalization capabilities of models. However, recent studies show that neural networks exhibit improved generalization performance at the edge of stability. This phenomenon has attracted widespread attention from researchers as it challenges traditional complexity measures and provides a new perspective for understanding the generalization capabilities of overparameterized models.
Core Problem
The core problem is how to explain the improved generalization performance of neural networks at the edge of stability. Traditional complexity measures, such as the trace or spectral norm of the Hessian, cannot capture the complexity of this phenomenon. Therefore, a new method is needed to understand generalization performance in the chaotic regime, which is important for improving the generalization capabilities of neural networks.
Innovation
The core innovations of this study include: 1) modeling stochastic optimizers as random dynamical systems, providing a new perspective for understanding generalization performance in the chaotic regime; 2) introducing the novel concept of 'sharpness dimension', which better explains generalization performance at the edge of stability; 3) revealing the importance of the complete Hessian spectrum and the structure of its partial determinants in generalization, surpassing traditional analyses based on trace or spectral norm.
Methodology
- Model stochastic optimizers as random dynamical systems to study their dynamic behavior at the edge of stability (a toy illustration follows this list).
- Introduce the novel concept of 'sharpness dimension', defined through Lyapunov dimension theory, and prove a generalization bound based on this dimension.
- Analyze the role of the complete Hessian spectrum and the structure of its partial determinants in generalization, emphasizing their complexity in the chaotic regime.
- Validate the theory through experiments on MLPs and transformers and provide new insights into the grokking phenomenon.
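The following toy sketch is our own illustration, not code from the paper: full-batch gradient descent on the scalar double-well loss f(x) = (x² − 1)²/4, viewed as an iterated map (a deterministic special case of the random dynamical systems above; minibatch noise would make the update an iterated random map). At lr = 2.0 the step size is well past the 2/sharpness threshold and the map is, up to rescaling, the chaotic cubic Chebyshev map, so two initializations 10⁻¹⁰ apart separate rapidly; at lr = 0.5 they collapse together.

```python
import numpy as np

def gd_step(x, lr):
    """One full-batch gradient-descent step on the double-well loss f(x) = (x^2 - 1)^2 / 4."""
    return x - lr * x * (x ** 2 - 1.0)

def separation(lr, x0=0.3, delta=1e-10, steps=40):
    """Final distance between two trajectories started `delta` apart under the same map."""
    a, b = x0, x0 + delta
    for _ in range(steps):
        a, b = gd_step(a, lr), gd_step(b, lr)
        # Guard against floating-point escape from the invariant interval [-sqrt(2), sqrt(2)].
        a, b = np.clip(a, -2 ** 0.5, 2 ** 0.5), np.clip(b, -2 ** 0.5, 2 ** 0.5)
    return abs(a - b)

for lr in (0.5, 2.0):  # below vs. beyond the 2/sharpness stability threshold at the minima
    print(f"lr={lr}: separation after 40 steps = {separation(lr):.2e}")
```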
Experiments
The experiments validate the theory on MLPs and transformers, sweeping learning rates and batch sizes to probe generalization at the edge of stability. The complete Hessian spectrum and the structure of its partial determinants are computed to analyze their role in generalization. The resulting data indicate that generalization at the edge of stability is controlled by a provably lower-dimensional attractor.
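The experiments require Hessian spectra of trained MLPs and transformers. The exact procedure is not detailed here, but the standard approach for large models is Lanczos over Hessian-vector products, so the Hessian is never formed explicitly (in the spirit of the randomized spectral methods of Halko et al. and Lin, Saad & Yang listed in the references). A minimal sketch, assuming PyTorch and, for simplicity, a single flat parameter vector:

```python
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

def top_hessian_eigenvalues(loss_fn, params, k=5):
    """Estimate the k largest-magnitude Hessian eigenvalues with Lanczos on
    Hessian-vector products (the full Hessian is never materialized)."""
    n = params.numel()
    loss = loss_fn(params)
    grad, = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(v):
        v_t = torch.as_tensor(np.asarray(v).ravel(), dtype=params.dtype)
        hv, = torch.autograd.grad(grad, params, grad_outputs=v_t, retain_graph=True)
        return hv.detach().numpy()

    op = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    return np.sort(eigsh(op, k=k, which="LM", return_eigenvectors=False))

# Toy usage on a quadratic loss with known spectrum (eigenvalues 1, 2, ..., 20).
d = 20
A = torch.diag(torch.arange(1.0, d + 1, dtype=torch.float64))
theta = torch.randn(d, dtype=torch.float64, requires_grad=True)
print(top_hessian_eigenvalues(lambda p: 0.5 * p @ A @ p, theta))  # ~[16, 17, 18, 19, 20]
```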
Results
The experiments validate the theory on MLPs and transformers and provide new insights into the recently observed phenomenon of grokking. The data indicate that generalization at the edge of stability is controlled by a provably lower-dimensional attractor, a finding significant for understanding the generalization capabilities of overparameterized models.
Applications
The application scenarios of this study include: 1) improving the generalization capabilities of neural networks, especially models at the edge of stability; 2) providing a theoretical foundation for understanding the generalization capabilities of overparameterized models; 3) helping develop more efficient deep learning models in the industry.
Limitations & Outlook
Although this study provides a new perspective for understanding the generalization capabilities of neural networks, computing the complete Hessian spectrum may be computationally expensive, especially for large-scale models. Additionally, although the generalization bound is theoretically proven, how tight and useful it is in practice may require further validation. Future research will focus on larger-scale models and other types of neural networks.
Plain Language (Accessible to non-experts)
Imagine you're in a complex maze where the walls keep changing. You need to find an exit without getting stuck in a dead end. Traditionally, you'd avoid unstable paths to ensure every step is safe. However, recent research suggests that sometimes walking on seemingly unstable paths can actually lead you to the exit faster. It's like finding a new way to navigate a chaotic maze.
In neural network training, traditional optimization methods are like carefully walking through the maze, avoiding any unstable paths. This study proposes a new method, like using the changing maze walls to find a better path. This method is called the 'sharpness dimension', helping us understand how to find a better exit in chaotic paths.
With this new method, we can better understand how neural networks perform in unstable states. It not only helps us find better solutions but also provides new directions for future research. Just like finding a new way to navigate a maze, we can find the exit faster and more efficiently.
ELI14 (Explained like you're 14)
Hey there! Did you know that when training neural networks, we usually want them to be like a well-behaved student, learning step by step without making mistakes? But sometimes, these networks are like mischievous kids, testing the limits and even getting a bit chaotic!
This research is like saying, hey, being mischievous can actually be good! When networks are at the edge of stability, they might learn better, just like how taking risks in a game can lead you to hidden treasures!
Researchers introduced a new concept called 'sharpness dimension' to help us understand how these networks perform in chaos. It's like finding a new way for mischievous kids to learn better through exploration!
So next time you see a mischievous kid, don't rush to criticize. Maybe they're learning in their own way! This study tells us that sometimes, chaos is a way of learning too!
Glossary
Edge of Stability
Refers to the training regime in which the step size sits at the boundary of what the local curvature allows: the largest Hessian eigenvalue (the sharpness) hovers near the stability threshold 2/η for learning rate η. In this regime, the optimization dynamics exhibit oscillatory and chaotic behavior.
Used in the paper to describe the training state of neural networks at high learning rates.
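A standard one-dimensional calculation (not specific to this paper) shows where the threshold sits: for a quadratic loss f(θ) = λθ²/2 with curvature λ, gradient descent with learning rate η contracts only while λ < 2/η.

```latex
\theta_{k+1} \;=\; \theta_k - \eta\,\lambda\,\theta_k \;=\; (1-\eta\lambda)\,\theta_k
\qquad\Longrightarrow\qquad
|\theta_k| \;=\; |1-\eta\lambda|^{\,k}\,|\theta_0| ,
```

which converges iff |1 − ηλ| < 1, i.e. λ < 2/η; "edge of stability" refers to the largest Hessian eigenvalue hovering near this 2/η threshold.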
Random Dynamical System
A mathematical model used to describe dynamical systems under random influences. It is often used to analyze the long-term behavior of complex systems.
Used to model the dynamic behavior of stochastic optimizers.
Sharpness Dimension
A novel notion of dimension, defined via Lyapunov dimension theory, that measures the intrinsic dimension of the attractor explored by the optimizer in the chaotic regime and enters the paper's generalization bound.
Used to explain generalization performance at the edge of stability.
Lyapunov Dimension
An estimate of the fractal dimension of a dynamical system's attractor, computed from its Lyapunov exponents (also known as the Kaplan-Yorke dimension).
Used to define the sharpness dimension.
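For reference, the standard Kaplan-Yorke (Lyapunov) dimension for Lyapunov exponents ordered as Λ₁ ≥ Λ₂ ≥ … is given below; the paper's sharpness dimension is stated to be inspired by this theory, though its exact definition may differ.

```latex
d_{\mathrm{KY}} \;=\; k \;+\; \frac{\sum_{i=1}^{k}\Lambda_i}{\lvert \Lambda_{k+1} \rvert},
\qquad
k \;=\; \max\Bigl\{\, j \;:\; \textstyle\sum_{i=1}^{j}\Lambda_i \;\ge\; 0 \Bigr\}.
```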
Hessian Spectrum
Refers to the set of eigenvalues of the Hessian matrix. It is used to describe the local curvature of the loss function.
Used to analyze the complexity of generalization performance.
Fractal Attractor
An attractor with a fractal structure in dynamical systems. It represents the set towards which the system tends in its long-term behavior.
Used to describe the convergence behavior of stochastic optimizers.
Grokking Phenomenon
Refers to the sudden improvement in a neural network's test (generalization) performance long after its training loss has plateaued near zero, i.e., after an extended period of apparent overfitting.
Used as an experimental phenomenon to validate the theory.
Spectral Norm
A matrix norm equal to the largest singular value of the matrix; for a symmetric matrix such as the Hessian, this is the largest absolute eigenvalue. Used to measure the size of the matrix.
Used in traditional complexity measures.
Trace
The sum of the diagonal elements of a matrix; for a symmetric matrix, this equals the sum of its eigenvalues. Used to measure the overall size of the matrix.
Used in traditional complexity measures.
Multilayer Perceptron
A type of feedforward neural network composed of multiple layers, each consisting of multiple neurons.
Used as an experimental model to validate the theory.
Open Questions (Unanswered questions from this research)
- 1 How can the complete Hessian spectrum be efficiently computed for large-scale neural networks? Current methods face challenges in computational complexity, especially when dealing with large-scale models. This requires the development of more efficient algorithms to reduce computational complexity.
- 2 Does the improved generalization performance at the edge of stability also apply to other types of neural networks? Current research primarily focuses on MLPs and transformers, and other types of networks may require additional research.
- 3 How can the 'sharpness dimension' be validated in practice? Although the generalization bound is theoretically proven, how tight and useful it is in realistic settings remains to be established, which requires experiments across different application scenarios.
- 4 Is the improvement in generalization performance in the chaotic regime universal? Current research shows that improved generalization performance in the chaotic regime is effective in some cases, but whether it is universal remains to be further studied.
- 5 How can the 'sharpness dimension' be applied to optimization problems in other fields? This concept has been successfully applied in neural networks, but its application in other fields remains to be explored.
Applications
Immediate Applications
Improving Neural Network Generalization
By applying the 'sharpness dimension', the generalization capabilities of neural networks, especially large-scale models, can be improved at the edge of stability.
Optimizing Deep Learning Models
In the industry, this research can be used to develop more efficient deep learning models, improving model performance and stability.
Understanding Overparameterized Model Generalization
Provides a theoretical foundation for academia, helping researchers better understand the generalization capabilities of overparameterized models.
Long-term Vision
Developing More Efficient Optimization Algorithms
By further studying the 'sharpness dimension', more efficient optimization algorithms can be developed and applied to a wider range of fields.
Advancing Artificial Intelligence
This research provides a new theoretical foundation for the development of artificial intelligence, potentially driving further breakthroughs in AI technology in the future.
Abstract
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the 'sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
References (20)
Language Models are Unsupervised Multitask Learners
Alec Radford, Jeff Wu, R. Child et al.
Hausdorff dimension, heavy tails, and generalization in neural networks
Umut Simsekli, Ozan Sener, George Deligiannidis et al.
Random attractors
H. Crauel, A. Debussche, F. Flandoli
Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms
R. Andreeva, Benjamin Dupuis, Rik Sarkar et al.
Uniform Generalization Bounds on Data-Dependent Hypothesis Sets via PAC-Bayesian Theory on Random Sets
Benjamin Dupuis, Paul Viallard, George Deligiannidis et al.
Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks
Tolga Birdal, Aaron Lou, L. Guibas et al.
Decoupled Weight Decay Regularization
I. Loshchilov, F. Hutter
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
N. Halko, P. Martinsson, J. Tropp
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Jeremy M. Cohen, Simran Kaur, Yuanzhi Li et al.
Optimization on multifractal loss landscapes explains a diverse range of geometrical and dynamical properties of deep learning
Andrew Ly, Pulin Gong
Approximating Spectral Densities of Large Matrices
Lin Lin, Y. Saad, Chao Yang
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal et al.
Adversarial Weight Perturbation Helps Robust Generalization
Dongxian Wu, Shutao Xia, Yisen Wang
Measure theory
Oliver Fest
Understanding Edge-of-Stability Training Dynamics with a Minimalist Example
Xingyu Zhu, Zixuan Wang, Xiang Wang et al.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harrison Edwards et al.
Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, H. Mobahi et al.
Generalisation under gradient descent via deterministic PAC-Bayes
Eugenio Clerico, Tyler Farghly, George Deligiannidis et al.
Unique Properties of Flat Minima in Deep Networks
Rotem Mulayoff, T. Michaeli