A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Xie, Wang, Wu et al.
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
academic
A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Title: A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Authors: Shuo Xie (Toyota Technological Institute at Chicago), Tianhao Wang (UC San Diego), Beining Wu (University of Chicago), Zhiyuan Li (Toyota Technological Institute at Chicago)
This paper systematically investigates the fundamental differences between two families of algorithms—adaptive optimizers (e.g., Adam, Shampoo) and normalized steepest descent (NSD, e.g., Lion, Muon)—in exploiting non-Euclidean geometric structures. The study reveals that while these two families can be equivalent when exponential moving averages (EMA) are disabled, their theoretical analyses rely on different geometric assumptions: adaptive optimizers require stronger "adaptive smoothness," while NSD requires only standard smoothness. The paper extends adaptive smoothness theory to the non-convex setting and proves that it precisely characterizes the convergence of adaptive optimizers. Importantly, the research demonstrates that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve acceleration (O(T⁻²)) in the convex setting, whereas standard smoothness cannot guarantee this under certain non-Euclidean geometries. Additionally, the paper introduces the concept of "adaptive gradient variance," proving that it provides dimension-free convergence guarantees for NSD, which is unattainable under standard gradient variance assumptions.
This paper aims to answer two fundamental questions:
Q1: Do adaptive methods (e.g., Adam, Shampoo) and corresponding non-Euclidean descent methods (e.g., Lion, Muon) exploit the non-Euclidean geometry of loss functions in the same manner?
Q2: Do the stronger smoothness assumptions in adaptive methods provide actual optimization benefits?
Practical Value: Adaptive optimizers like Adam are indispensable in large-scale machine learning model training (e.g., LLaMA, DeepSeek), yet recent simple NSD methods like Lion and Muon have demonstrated surprising effectiveness, prompting investigation into the fundamental differences between these two classes.
Theoretical Gap: Although Bernstein & Newhouse (2024) noted that the two classes are equivalent when EMA is disabled (e.g., Adam is equivalent to ℓ∞-NSD, Shampoo is equivalent to spectral norm NSD), systematic theoretical characterization is lacking.
Geometric Perspective: Both classes' superior performance relates to exploiting non-Euclidean geometry of loss functions, yet their theoretical analyses depend on fundamentally different geometric assumptions.
Incomplete Convergence Theory: Adaptive smoothness theory has only been established in the convex setting (Xie et al., 2025b); the non-convex case lacks characterization.
Unclear Assumption Strength: Adaptive smoothness is always at least as strong as standard smoothness, but whether this stronger assumption provides practical benefits remains unproven.
Dimension Dependence Issues: NSD suffers from dimension dependence in stochastic optimization (e.g., √d factor in SignGD), lacking finer noise assumptions.
Non-convex Convergence Theory: Extends adaptive smoothness theory to the non-convex setting, proving that it precisely characterizes the convergence rate of adaptive optimizers (Theorems C.2, C.7, C.8), achieving optimal Õ(T⁻¹/⁴) rate.
Accelerated Convergence Guarantees: Proves that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve Õ(T⁻²) acceleration rate in the convex setting (Theorem 4.4), whereas any optimizer under standard ℓ∞ smoothness can only achieve Ω(T⁻¹) (Guzmán & Nemirovski, 2015).
Adaptive Gradient Variance: Introduces the concept of adaptive gradient variance (Definition 4.1), proving that it provides dimension-free convergence guarantees for momentum-based NSD (Theorem 4.6), and establishes through lower bounds (Theorem 4.9) that dimension dependence is inevitable under standard gradient variance.
Unified Analysis Framework: Provides a unified analysis framework covering a broad range of adaptive methods including AdaGrad, AdaGrad-Norm, and one-sided Shampoo, with core technical contributions being new matrix inequalities (Lemmas 3.3, 3.4) for handling non-commutative preconditioners.
Theoretical Separation: Systematically establishes quantitative separations between the two geometric assumptions (standard vs. adaptive) along both smoothness and noise dimensions, deepening theoretical understanding of adaptivity.
where f:Rd→R may be non-convex. In the stochastic setting, the objective function is accessed through stochastic gradients ∇ft(x) satisfying E[∇ft(x)]=∇f(x).
For a well-structured preconditioner set H, define the induced norm:
∥x∥H:=supH∈H,Tr(H)≤1∥x∥H=supH∈H,Tr(H)≤1x⊤Hx
Lemma 2.2: The dual norm satisfies
∥x∥H,∗=infH∈H,Tr(H)≤1∥x∥H−1
This duality is key to understanding the two geometries: ∥⋅∥H is the pointwise supremum of all ∥⋅∥H, while ∥⋅∥H,∗ is the pointwise infimum of corresponding dual norms.
Intuition: Adaptive variance requires uniform noise control across all geometries induced by preconditioners, which is stronger than controlling noise under a fixed norm.
For adaptive optimizers with Nesterov momentum (Algorithm 2), on convex functions with αt=t+22 and η=D:
E[f(xˉT)−f(x∗)]=O~(T2ΛH(f)D2log2d+T2dϵD+TσHDlogd)
Comparison:
Under adaptive smoothness: O(T−2) acceleration rate (deterministic part)
Under standard ℓ∞ smoothness: Guzmán & Nemirovski (2015) proved any first-order method achieves only Ω(T−1)
Significance: Demonstrates the substantial advantage of adaptive smoothness—enabling acceleration, which standard smoothness cannot provide.
For NSD (Algorithm 3) under adaptive gradient variance σH:
E[T1∑t=0T−1∥∇f(xt)∥H,∗]≤ηTΔ0+α2ηL∥⋅∥H(f)+αT2σH+2σHα
With optimal choices α=σHTΔ0L∥⋅∥H(f) and η=L∥⋅∥H(f)1/4σH1/2Δ03/4T−3/4:
Rate=O(T1/4(Δ0L∥⋅∥H(f))1/4σH)
Dimension-Free: Compared to Pethick et al. (2025)'s O~(ρd/T1/4) (where ρ=supx∥x∥2∥x∥H,∗ can reach Θ(d)), this result completely eliminates dimension dependence.
Under standard ℓ₁ variance assumption E[∥∇ft(x)−∇f(x)∥12]≤σ2, for SignGD (ℓ∞-NSD) there exists a hard instance such that:
E[mint∈[T]∥∇f(xt)∥1]=min{e−25−41(dLΔ0σ2)1/4T−1/2,e−25−21σ}
Geometric Duality: Although adaptive optimizers and NSD both exploit non-Euclidean geometry, they depend on fundamentally different geometric assumptions:
NSD: Require only standard smoothness L∥⋅∥H(f), but need pre-specified norms
Value of Adaptivity: Stronger adaptive assumptions yield substantial benefits:
Acceleration: Achieves O(T⁻²) in convex case vs. Ω(T⁻¹) under standard assumptions
Dimension-Free: Eliminates dimension dependence in stochastic case
Unified Theoretical Framework: First to establish non-convex convergence theory for a broad class of adaptive methods including one-sided Shampoo, with core technique being new matrix inequalities for non-commutative preconditioners.
Tightness: Lower bounds demonstrate:
Dimension dependence Ω(d^{1/2}) is unavoidable under standard variance (Theorem 4.9)
Superiority of adaptive variance is not merely a technical artifact but a fundamental distinction
Xie et al. (2025b): "Structured Preconditioners in Adaptive Optimization: A Unified Analysis" - Foundation for convex case
Guzmán & Nemirovski (2015): "On lower complexity bounds for large-scale smooth convex optimization" - Lower bounds under ℓ∞ smoothness
Pethick et al. (2025): "Training deep learning models with norm-constrained lmos" - Recent NSD analysis
Kovalev (2025a): "SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration" - Parallel work
Bernstein & Newhouse (2024): "Old optimizer, new norm: An anthology" - Equivalence between Adam and NSD
Gupta et al. (2017): "A unified approach to adaptive regularization" - Adaptive optimizer framework
Lieb (1973): "Convex trace functions and the wigner-yanase-dyson conjecture" - Concavity basis for Lemma A.7
Summary: This paper represents significant progress in adaptive optimization theory, systematically revealing fundamental differences between adaptive methods and NSD in their geometric assumptions, and rigorously proving the substantial value of adaptivity through theoretical analysis. Despite lacking experimental validation, its theoretical depth and technical innovations make it an important reference in the field. The core contribution lies in establishing a complete theoretical system for "two geometries," providing new perspectives for understanding and designing adaptive optimization algorithms.