Convergence of optimizers implies eigenvalues filtering at equilibrium
Bolte, Le, Pauwels
Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
academic
Convergence of optimizers implies eigenvalues filtering at equilibrium
Extensive empirical evidence from deep neural network training demonstrates that various optimizers tend to find solutions near global optima. This paper adopts a reverse perspective, assuming convergence to arbitrary points rather than proving convergence, and focuses on the consequences of this assumption. From this viewpoint, combined with recent advances in marginal stability phenomena, the authors argue that different optimizers effectively function as eigenvalue filters determined by their hyperparameters. Specifically, standard gradient descent methods inherently avoid the sharpest minima, while Sharpness Aware Minimization (SAM) algorithms further actively prefer wider basins. Based on these insights, the authors propose two novel algorithms demonstrating enhanced eigenvalue filtering capabilities that effectively promote wider minima. The theoretical analysis leverages the generalized Hadamard-Perron stable manifold theorem, applicable to general semi-algebraic C² functions without requiring additional non-degeneracy conditions or global Lipschitz boundedness assumptions.
This research addresses the fundamental question of understanding convergence behavior of optimization algorithms in deep learning, particularly how they select specific minima in the complex loss landscape. While traditional research focuses on proving convergence, this paper adopts a "reverse" perspective: assuming convergence has occurred, analyzing what this convergence implies about the geometric properties (especially Hessian eigenvalues) of the reached point.
Connection between stability and generalization: Stable training correlates with wide attraction basins and flat minima, properties closely related to generalization performance
Marginal stability phenomena: Empirical observations suggest standard training typically operates near stability boundaries
Practical implications: Understanding optimizers' implicit preferences aids in designing better training algorithms
Over the past decade, successful training of deep learning models has become the norm, prompting a shift in research perspective from "when does convergence occur" to "why does convergence succeed and how do hyperparameters enable it."
Unified theoretical framework: Proposes a unified analysis framework based on the generalized Hadamard-Perron stable manifold theorem, applicable to a broad class of optimization algorithms
Eigenvalue filtering theory: Proves that successful convergence of optimizers necessarily imposes constraints on Hessian eigenvalues at reached points, creating an "eigenvalue filtering" effect
Algorithm analysis: Systematically analyzes eigenvalue filtering properties of gradient descent, heavy ball method, Nesterov accelerated gradient, and USAM
Theorem 1.1: Let D∈Rm×m be an invertible matrix and g:Rm→Rm be a C1 semi-algebraic mapping. For almost all x0∈Rm and α>0, if the sequence (xk)k∈N converges to some point xˉ, then the spectral radius of the Jacobian of D−αg at xˉ is at most 1:
ρ(JacGα(xˉ))≤1
Theorem 2.1: There exists Λ⊂R+ whose complement is finite, such that for any α∈Λ, the set
Wα={x0∈Rm∣∃xˉ s.t. Gα(xˉ)=xˉ,ρ(JacGα(xˉ))>1,xk→xˉ}
is contained in a countable union of C1 submanifolds of dimension at most m−1.
Stability requirements: Two-step USAM and Hessian USAM require smaller ρ values to avoid training failure, consistent with theoretical predictions of stricter curvature constraints
Batch normalization effects: In architectures using batch normalization, the flattening effects of SAM-class algorithms are less pronounced, which does not contradict theory as batch normalization alters algorithm dynamics
Unified perspective: Successful optimizer training is essentially an eigenvalue filtering process, with different algorithms achieving varying degrees of filtering through hyperparameters
Theoretical extension: The generalized stable manifold theorem provides a powerful theoretical tool for understanding optimization algorithms
Practical guidance: Theoretical results provide principled guidance for designing new optimization algorithms
Theoretical innovation: Provides a novel perspective for understanding optimization algorithms, shifting from "convergence proofs" to "convergence consequence analysis"
Unified framework: First to provide a unified theoretical framework for analyzing eigenvalue filtering behavior across multiple optimization algorithms
Practical value: Theoretical results directly guide new algorithm design and are validated experimentally
Technical rigor: Mathematical derivations are rigorous with clearly stated and reasonable assumptions
Limited experimental scale: Experiments primarily conducted on relatively simple architectures and datasets; large-scale experimental validation is insufficient
New algorithm evaluation: Comprehensive performance evaluation (including generalization ability) of Two-step USAM and Hessian USAM requires further work
Theory-practice gap: Actual performance of SAM algorithms shows some discrepancy with theoretical predictions (e.g., strict saddle point issues)
This paper cites extensive related work, primarily including:
Classical dynamical systems theory literature
Modern optimization theory advances
Stability and generalization research in deep learning
Works on sharpness aware minimization
Theoretical and experimental studies on marginal stability phenomena
Overall Assessment: This is an excellent paper combining theoretical depth with practical value, providing new theoretical tools for understanding optimization phenomena in deep learning and demonstrating successful cases of theory-guided algorithm design. While there is room for improvement in large-scale experimental validation, its theoretical contributions and innovative perspective constitute important progress in optimization theory.