2025-11-21T08:19:15.669983

Convergence of optimizers implies eigenvalues filtering at equilibrium

Bolte, Le, Pauwels

Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.

academic

Convergence of optimizers implies eigenvalues filtering at equilibrium

Basic Information

Paper ID: 2510.09034
Title: Convergence of optimizers implies eigenvalues filtering at equilibrium
Authors: Jérôme Bolte, Quoc-Tung Le, Edouard Pauwels
Categories: cs.LG math.DS math.OC
Publication Date: October 13, 2025
Paper Link: https://arxiv.org/abs/2510.09034

Abstract

Extensive empirical evidence from deep neural network training demonstrates that various optimizers tend to find solutions near global optima. This paper adopts a reverse perspective, assuming convergence to arbitrary points rather than proving convergence, and focuses on the consequences of this assumption. From this viewpoint, combined with recent advances in marginal stability phenomena, the authors argue that different optimizers effectively function as eigenvalue filters determined by their hyperparameters. Specifically, standard gradient descent methods inherently avoid the sharpest minima, while Sharpness Aware Minimization (SAM) algorithms further actively prefer wider basins. Based on these insights, the authors propose two novel algorithms demonstrating enhanced eigenvalue filtering capabilities that effectively promote wider minima. The theoretical analysis leverages the generalized Hadamard-Perron stable manifold theorem, applicable to general semi-algebraic C² functions without requiring additional non-degeneracy conditions or global Lipschitz boundedness assumptions.

Research Background and Motivation

Core Problem

This research addresses the fundamental question of understanding convergence behavior of optimization algorithms in deep learning, particularly how they select specific minima in the complex loss landscape. While traditional research focuses on proving convergence, this paper adopts a "reverse" perspective: assuming convergence has occurred, analyzing what this convergence implies about the geometric properties (especially Hessian eigenvalues) of the reached point.

Significance

Connection between stability and generalization: Stable training correlates with wide attraction basins and flat minima, properties closely related to generalization performance
Marginal stability phenomena: Empirical observations suggest standard training typically operates near stability boundaries
Practical implications: Understanding optimizers' implicit preferences aids in designing better training algorithms

Limitations of Existing Approaches

Existing theory typically requires stringent assumptions (e.g., global Lipschitz bounds, non-degeneracy conditions)
Lack of unified framework for understanding eigenvalue filtering behavior across different optimizers
Limited theoretical understanding of SAM-class algorithms

Research Motivation

Over the past decade, successful training of deep learning models has become the norm, prompting a shift in research perspective from "when does convergence occur" to "why does convergence succeed and how do hyperparameters enable it."

Core Contributions

Unified theoretical framework: Proposes a unified analysis framework based on the generalized Hadamard-Perron stable manifold theorem, applicable to a broad class of optimization algorithms
Eigenvalue filtering theory: Proves that successful convergence of optimizers necessarily imposes constraints on Hessian eigenvalues at reached points, creating an "eigenvalue filtering" effect
Algorithm analysis: Systematically analyzes eigenvalue filtering properties of gradient descent, heavy ball method, Nesterov accelerated gradient, and USAM
Novel algorithms: Designs Two-step USAM and Hessian USAM, demonstrating stronger eigenvalue filtering capabilities
Theoretical extension: Extends existing results to more general semi-algebraic function classes, removing abstract non-degeneracy assumptions

Methodology Details

Problem Formulation

Consider the general form of iterative optimization algorithms: $x_{k+1} = G_\alpha(x_k) = Dx_k - \alpha g(x_k), \quad k = 0, 1, 2, \ldots$

where:

$D \in \mathbb{R}^{m \times m}$ is an invertible matrix
$g: \mathbb{R}^m \to \mathbb{R}^m$ is a continuously differentiable semi-algebraic mapping
$\alpha > 0$ is the step size parameter

Core Theoretical Results

Main Theorem (Eigenvalue Filtering)

Theorem 1.1: Let $D \in \mathbb{R}^{m \times m}$ be an invertible matrix and $g: \mathbb{R}^m \to \mathbb{R}^m$ be a $C^1$ semi-algebraic mapping. For almost all $x_0 \in \mathbb{R}^m$ and $\alpha > 0$ , if the sequence $(x_k)_{k \in \mathbb{N}}$ converges to some point $\bar{x}$ , then the spectral radius of the Jacobian of $D - \alpha g$ at $\bar{x}$ is at most 1: $\rho(\text{Jac}G_\alpha(\bar{x})) \leq 1$

Extended Stable Manifold Theorem

Theorem 2.1: There exists $\Lambda \subset \mathbb{R}_+$ whose complement is finite, such that for any $\alpha \in \Lambda$ , the set $W_\alpha = \{x_0 \in \mathbb{R}^m | \exists \bar{x} \text{ s.t. } G_\alpha(\bar{x}) = \bar{x}, \rho(\text{Jac}G_\alpha(\bar{x})) > 1, x_k \to \bar{x}\}$ is contained in a countable union of $C^1$ submanifolds of dimension at most $m-1$ .

Technical Innovations

Semi-algebraic assumption: Uses semi-algebraic function class as sufficient condition, encompassing virtually all common functions in deep learning
No global conditions required: Does not require global Lipschitz bounds or non-degeneracy assumptions
Unified analysis framework: Through unified matrix form $D$ and mapping $g$ , covers multiple optimization algorithms

Specific Algorithm Analysis

Gradient Descent

Proposition 3.1: For gradient descent $x_{k+1} = x_k - \alpha \nabla f(x_k)$ , if convergent to $\bar{x}$ , all eigenvalues $\lambda$ of $\nabla^2f(\bar{x})$ satisfy: $0 \leq \lambda \leq \frac{2}{\alpha}$

Heavy Ball Method

Proposition 3.2: For the heavy ball method, eigenvalue constraints are: $0 \leq \lambda \leq \frac{2(1+\beta)}{\alpha}$

USAM Algorithm

Proposition 3.4: For USAM algorithm $x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla f(x_k))$ , eigenvalues $\lambda$ satisfy: $0 \leq \lambda(1 + \rho\lambda) \leq \frac{2(1+\beta)}{\alpha}$

Equivalently: $0 \leq \lambda \leq \frac{\sqrt{1 + 8(1+\beta)\rho/\alpha} - 1}{2\rho}$

Novel Algorithm Designs

Two-step USAM

Update rule: $x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla f(x_k) + \rho \nabla f(x_k + \rho \nabla f(x_k)))$

Eigenvalue constraints: $0 \leq \lambda(1 + \rho\lambda)^2 \leq \frac{2(1+\beta)}{\alpha}$

Hessian USAM

Update rule: $x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla^2f(x_k)\nabla f(x_k))$

Eigenvalue constraints: $0 \leq \lambda(1 + \rho\lambda^2) \leq \frac{2(1+\beta)}{\alpha}$

Experimental Setup

Datasets

MNIST + MLP: Hidden layer dimensions {128, 64, 10, 10}, ReLU activation, cross-entropy loss
Fashion-MNIST + MLP: Same configuration as above
CIFAR10 + WideResNet-16-8: WideResNet architecture without batch normalization layers

Experimental Configuration

Batch size: 128
Learning rate: $\alpha = 0.01$
Weight decay: $5 \times 10^{-4}$
Momentum: $\beta \in \{0, 0.9\}$
SAM parameter: $\rho$ selected via grid search

Evaluation Metrics

Test accuracy
Top three largest eigenvalues of Hessian matrix

Experimental Results

Main Findings

Eigenvalue filtering verification: Experimental results align highly with theoretical predictions; USAM, Two-step USAM, and Hessian USAM indeed find flatter minima
Algorithm comparison:
- Standard gradient descent: baseline performance
- USAM: significantly reduces Hessian eigenvalues
- Two-step USAM: further improves eigenvalue filtering
- Hessian USAM: similar improvement effects
Architecture dependency:
- MLP architecture: high alignment between theoretical predictions and experimental results
- WideResNet: smaller discrepancies, possibly due to increased training difficulty

Experimental Observations

Stability requirements: Two-step USAM and Hessian USAM require smaller $\rho$ values to avoid training failure, consistent with theoretical predictions of stricter curvature constraints
Batch normalization effects: In architectures using batch normalization, the flattening effects of SAM-class algorithms are less pronounced, which does not contradict theory as batch normalization alters algorithm dynamics

Stable Manifold Theorems

Classical results by Hadamard (1901), Perron (1929)
Applications in modern optimization: Lee et al. (2016), Panageas & Piliouras (2017), Ahn et al. (2022)

Marginal Stability Phenomena

Cohen et al. (2021, 2022): marginal stability of gradient descent and adaptive methods
Andreyev & Beneventano (2024): extensions to stochastic algorithms

Sharpness Aware Minimization

Foret et al. (2021): original SAM algorithm
Andriushchenko & Flammarion (2022): USAM variants
Subsequent theoretical analyses: Zhou et al. (2025), Marion & Chizat (2024)

Conclusions and Discussion

Main Conclusions

Unified perspective: Successful optimizer training is essentially an eigenvalue filtering process, with different algorithms achieving varying degrees of filtering through hyperparameters
Theoretical extension: The generalized stable manifold theorem provides a powerful theoretical tool for understanding optimization algorithms
Practical guidance: Theoretical results provide principled guidance for designing new optimization algorithms

Limitations

Semi-algebraic assumption: While broadly applicable, still has certain limitations
Computational cost of new algorithms: Two-step USAM and Hessian USAM incur higher per-iteration computational costs
Batch normalization compatibility: The theoretical framework has not yet covered batch normalization operations

Future Directions

Extension to more general function classes: Explore theoretical extensions without semi-algebraic assumptions
Batch normalization theory: Extend the theoretical framework to architectures with batch normalization
Practical algorithm optimization: Reduce computational costs of new algorithms while maintaining theoretical advantages

In-Depth Evaluation

Strengths

Theoretical innovation: Provides a novel perspective for understanding optimization algorithms, shifting from "convergence proofs" to "convergence consequence analysis"
Unified framework: First to provide a unified theoretical framework for analyzing eigenvalue filtering behavior across multiple optimization algorithms
Practical value: Theoretical results directly guide new algorithm design and are validated experimentally
Technical rigor: Mathematical derivations are rigorous with clearly stated and reasonable assumptions

Weaknesses

Limited experimental scale: Experiments primarily conducted on relatively simple architectures and datasets; large-scale experimental validation is insufficient
New algorithm evaluation: Comprehensive performance evaluation (including generalization ability) of Two-step USAM and Hessian USAM requires further work
Theory-practice gap: Actual performance of SAM algorithms shows some discrepancy with theoretical predictions (e.g., strict saddle point issues)

Impact

Theoretical contribution: Provides new analytical tools and perspectives for optimization theory
Practical value: Offers principled guidance for optimization algorithm design
Cross-disciplinary significance: Connects dynamical systems theory with machine learning practice

Applicable Scenarios

Deep learning optimization: Particularly suitable for understanding and improving neural network training algorithms
Non-convex optimization: Provides new analytical tools for general non-convex optimization problems
Algorithm design: Guides design and analysis of novel optimization algorithms

References

This paper cites extensive related work, primarily including:

Classical dynamical systems theory literature
Modern optimization theory advances
Stability and generalization research in deep learning
Works on sharpness aware minimization
Theoretical and experimental studies on marginal stability phenomena

Overall Assessment: This is an excellent paper combining theoretical depth with practical value, providing new theoretical tools for understanding optimization phenomena in deep learning and demonstrating successful cases of theory-guided algorithm design. While there is room for improvement in large-scale experimental validation, its theoretical contributions and innovative perspective constitute important progress in optimization theory.