2025-11-21T08:19:15.669983

Convergence of optimizers implies eigenvalues filtering at equilibrium

Bolte, Le, Pauwels
Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
academic

Convergence of optimizers implies eigenvalues filtering at equilibrium

Basic Information

  • Paper ID: 2510.09034
  • Title: Convergence of optimizers implies eigenvalues filtering at equilibrium
  • Authors: Jérôme Bolte, Quoc-Tung Le, Edouard Pauwels
  • Categories: cs.LG math.DS math.OC
  • Publication Date: October 13, 2025
  • Paper Link: https://arxiv.org/abs/2510.09034

Abstract

Extensive empirical evidence from deep neural network training demonstrates that various optimizers tend to find solutions near global optima. This paper adopts a reverse perspective, assuming convergence to arbitrary points rather than proving convergence, and focuses on the consequences of this assumption. From this viewpoint, combined with recent advances in marginal stability phenomena, the authors argue that different optimizers effectively function as eigenvalue filters determined by their hyperparameters. Specifically, standard gradient descent methods inherently avoid the sharpest minima, while Sharpness Aware Minimization (SAM) algorithms further actively prefer wider basins. Based on these insights, the authors propose two novel algorithms demonstrating enhanced eigenvalue filtering capabilities that effectively promote wider minima. The theoretical analysis leverages the generalized Hadamard-Perron stable manifold theorem, applicable to general semi-algebraic C² functions without requiring additional non-degeneracy conditions or global Lipschitz boundedness assumptions.

Research Background and Motivation

Core Problem

This research addresses the fundamental question of understanding convergence behavior of optimization algorithms in deep learning, particularly how they select specific minima in the complex loss landscape. While traditional research focuses on proving convergence, this paper adopts a "reverse" perspective: assuming convergence has occurred, analyzing what this convergence implies about the geometric properties (especially Hessian eigenvalues) of the reached point.

Significance

  1. Connection between stability and generalization: Stable training correlates with wide attraction basins and flat minima, properties closely related to generalization performance
  2. Marginal stability phenomena: Empirical observations suggest standard training typically operates near stability boundaries
  3. Practical implications: Understanding optimizers' implicit preferences aids in designing better training algorithms

Limitations of Existing Approaches

  • Existing theory typically requires stringent assumptions (e.g., global Lipschitz bounds, non-degeneracy conditions)
  • Lack of unified framework for understanding eigenvalue filtering behavior across different optimizers
  • Limited theoretical understanding of SAM-class algorithms

Research Motivation

Over the past decade, successful training of deep learning models has become the norm, prompting a shift in research perspective from "when does convergence occur" to "why does convergence succeed and how do hyperparameters enable it."

Core Contributions

  1. Unified theoretical framework: Proposes a unified analysis framework based on the generalized Hadamard-Perron stable manifold theorem, applicable to a broad class of optimization algorithms
  2. Eigenvalue filtering theory: Proves that successful convergence of optimizers necessarily imposes constraints on Hessian eigenvalues at reached points, creating an "eigenvalue filtering" effect
  3. Algorithm analysis: Systematically analyzes eigenvalue filtering properties of gradient descent, heavy ball method, Nesterov accelerated gradient, and USAM
  4. Novel algorithms: Designs Two-step USAM and Hessian USAM, demonstrating stronger eigenvalue filtering capabilities
  5. Theoretical extension: Extends existing results to more general semi-algebraic function classes, removing abstract non-degeneracy assumptions

Methodology Details

Problem Formulation

Consider the general form of iterative optimization algorithms: xk+1=Gα(xk)=Dxkαg(xk),k=0,1,2,x_{k+1} = G_\alpha(x_k) = Dx_k - \alpha g(x_k), \quad k = 0, 1, 2, \ldots

where:

  • DRm×mD \in \mathbb{R}^{m \times m} is an invertible matrix
  • g:RmRmg: \mathbb{R}^m \to \mathbb{R}^m is a continuously differentiable semi-algebraic mapping
  • α>0\alpha > 0 is the step size parameter

Core Theoretical Results

Main Theorem (Eigenvalue Filtering)

Theorem 1.1: Let DRm×mD \in \mathbb{R}^{m \times m} be an invertible matrix and g:RmRmg: \mathbb{R}^m \to \mathbb{R}^m be a C1C^1 semi-algebraic mapping. For almost all x0Rmx_0 \in \mathbb{R}^m and α>0\alpha > 0, if the sequence (xk)kN(x_k)_{k \in \mathbb{N}} converges to some point xˉ\bar{x}, then the spectral radius of the Jacobian of DαgD - \alpha g at xˉ\bar{x} is at most 1: ρ(JacGα(xˉ))1\rho(\text{Jac}G_\alpha(\bar{x})) \leq 1

Extended Stable Manifold Theorem

Theorem 2.1: There exists ΛR+\Lambda \subset \mathbb{R}_+ whose complement is finite, such that for any αΛ\alpha \in \Lambda, the set Wα={x0Rmxˉ s.t. Gα(xˉ)=xˉ,ρ(JacGα(xˉ))>1,xkxˉ}W_\alpha = \{x_0 \in \mathbb{R}^m | \exists \bar{x} \text{ s.t. } G_\alpha(\bar{x}) = \bar{x}, \rho(\text{Jac}G_\alpha(\bar{x})) > 1, x_k \to \bar{x}\} is contained in a countable union of C1C^1 submanifolds of dimension at most m1m-1.

Technical Innovations

  1. Semi-algebraic assumption: Uses semi-algebraic function class as sufficient condition, encompassing virtually all common functions in deep learning
  2. No global conditions required: Does not require global Lipschitz bounds or non-degeneracy assumptions
  3. Unified analysis framework: Through unified matrix form DD and mapping gg, covers multiple optimization algorithms

Specific Algorithm Analysis

Gradient Descent

Proposition 3.1: For gradient descent xk+1=xkαf(xk)x_{k+1} = x_k - \alpha \nabla f(x_k), if convergent to xˉ\bar{x}, all eigenvalues λ\lambda of 2f(xˉ)\nabla^2f(\bar{x}) satisfy: 0λ2α0 \leq \lambda \leq \frac{2}{\alpha}

Heavy Ball Method

Proposition 3.2: For the heavy ball method, eigenvalue constraints are: 0λ2(1+β)α0 \leq \lambda \leq \frac{2(1+\beta)}{\alpha}

USAM Algorithm

Proposition 3.4: For USAM algorithm xk+1=xkαf(xk+ρf(xk))x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla f(x_k)), eigenvalues λ\lambda satisfy: 0λ(1+ρλ)2(1+β)α0 \leq \lambda(1 + \rho\lambda) \leq \frac{2(1+\beta)}{\alpha}

Equivalently: 0λ1+8(1+β)ρ/α12ρ0 \leq \lambda \leq \frac{\sqrt{1 + 8(1+\beta)\rho/\alpha} - 1}{2\rho}

Novel Algorithm Designs

Two-step USAM

Update rule: xk+1=xkαf(xk+ρf(xk)+ρf(xk+ρf(xk)))x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla f(x_k) + \rho \nabla f(x_k + \rho \nabla f(x_k)))

Eigenvalue constraints: 0λ(1+ρλ)22(1+β)α0 \leq \lambda(1 + \rho\lambda)^2 \leq \frac{2(1+\beta)}{\alpha}

Hessian USAM

Update rule: xk+1=xkαf(xk+ρ2f(xk)f(xk))x_{k+1} = x_k - \alpha \nabla f(x_k + \rho \nabla^2f(x_k)\nabla f(x_k))

Eigenvalue constraints: 0λ(1+ρλ2)2(1+β)α0 \leq \lambda(1 + \rho\lambda^2) \leq \frac{2(1+\beta)}{\alpha}

Experimental Setup

Datasets

  1. MNIST + MLP: Hidden layer dimensions {128, 64, 10, 10}, ReLU activation, cross-entropy loss
  2. Fashion-MNIST + MLP: Same configuration as above
  3. CIFAR10 + WideResNet-16-8: WideResNet architecture without batch normalization layers

Experimental Configuration

  • Batch size: 128
  • Learning rate: α=0.01\alpha = 0.01
  • Weight decay: 5×1045 \times 10^{-4}
  • Momentum: β{0,0.9}\beta \in \{0, 0.9\}
  • SAM parameter: ρ\rho selected via grid search

Evaluation Metrics

  • Test accuracy
  • Top three largest eigenvalues of Hessian matrix

Experimental Results

Main Findings

  1. Eigenvalue filtering verification: Experimental results align highly with theoretical predictions; USAM, Two-step USAM, and Hessian USAM indeed find flatter minima
  2. Algorithm comparison:
    • Standard gradient descent: baseline performance
    • USAM: significantly reduces Hessian eigenvalues
    • Two-step USAM: further improves eigenvalue filtering
    • Hessian USAM: similar improvement effects
  3. Architecture dependency:
    • MLP architecture: high alignment between theoretical predictions and experimental results
    • WideResNet: smaller discrepancies, possibly due to increased training difficulty

Experimental Observations

  1. Stability requirements: Two-step USAM and Hessian USAM require smaller ρ\rho values to avoid training failure, consistent with theoretical predictions of stricter curvature constraints
  2. Batch normalization effects: In architectures using batch normalization, the flattening effects of SAM-class algorithms are less pronounced, which does not contradict theory as batch normalization alters algorithm dynamics

Stable Manifold Theorems

  • Classical results by Hadamard (1901), Perron (1929)
  • Applications in modern optimization: Lee et al. (2016), Panageas & Piliouras (2017), Ahn et al. (2022)

Marginal Stability Phenomena

  • Cohen et al. (2021, 2022): marginal stability of gradient descent and adaptive methods
  • Andreyev & Beneventano (2024): extensions to stochastic algorithms

Sharpness Aware Minimization

  • Foret et al. (2021): original SAM algorithm
  • Andriushchenko & Flammarion (2022): USAM variants
  • Subsequent theoretical analyses: Zhou et al. (2025), Marion & Chizat (2024)

Conclusions and Discussion

Main Conclusions

  1. Unified perspective: Successful optimizer training is essentially an eigenvalue filtering process, with different algorithms achieving varying degrees of filtering through hyperparameters
  2. Theoretical extension: The generalized stable manifold theorem provides a powerful theoretical tool for understanding optimization algorithms
  3. Practical guidance: Theoretical results provide principled guidance for designing new optimization algorithms

Limitations

  1. Semi-algebraic assumption: While broadly applicable, still has certain limitations
  2. Computational cost of new algorithms: Two-step USAM and Hessian USAM incur higher per-iteration computational costs
  3. Batch normalization compatibility: The theoretical framework has not yet covered batch normalization operations

Future Directions

  1. Extension to more general function classes: Explore theoretical extensions without semi-algebraic assumptions
  2. Batch normalization theory: Extend the theoretical framework to architectures with batch normalization
  3. Practical algorithm optimization: Reduce computational costs of new algorithms while maintaining theoretical advantages

In-Depth Evaluation

Strengths

  1. Theoretical innovation: Provides a novel perspective for understanding optimization algorithms, shifting from "convergence proofs" to "convergence consequence analysis"
  2. Unified framework: First to provide a unified theoretical framework for analyzing eigenvalue filtering behavior across multiple optimization algorithms
  3. Practical value: Theoretical results directly guide new algorithm design and are validated experimentally
  4. Technical rigor: Mathematical derivations are rigorous with clearly stated and reasonable assumptions

Weaknesses

  1. Limited experimental scale: Experiments primarily conducted on relatively simple architectures and datasets; large-scale experimental validation is insufficient
  2. New algorithm evaluation: Comprehensive performance evaluation (including generalization ability) of Two-step USAM and Hessian USAM requires further work
  3. Theory-practice gap: Actual performance of SAM algorithms shows some discrepancy with theoretical predictions (e.g., strict saddle point issues)

Impact

  1. Theoretical contribution: Provides new analytical tools and perspectives for optimization theory
  2. Practical value: Offers principled guidance for optimization algorithm design
  3. Cross-disciplinary significance: Connects dynamical systems theory with machine learning practice

Applicable Scenarios

  1. Deep learning optimization: Particularly suitable for understanding and improving neural network training algorithms
  2. Non-convex optimization: Provides new analytical tools for general non-convex optimization problems
  3. Algorithm design: Guides design and analysis of novel optimization algorithms

References

This paper cites extensive related work, primarily including:

  • Classical dynamical systems theory literature
  • Modern optimization theory advances
  • Stability and generalization research in deep learning
  • Works on sharpness aware minimization
  • Theoretical and experimental studies on marginal stability phenomena

Overall Assessment: This is an excellent paper combining theoretical depth with practical value, providing new theoretical tools for understanding optimization phenomena in deep learning and demonstrating successful cases of theory-guided algorithm design. While there is room for improvement in large-scale experimental validation, its theoretical contributions and innovative perspective constitute important progress in optimization theory.