2025-11-27T08:46:18.590812

A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Xie, Wang, Wu et al.
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
academic

A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Basic Information

  • Paper ID: 2511.20584
  • Title: A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
  • Authors: Shuo Xie (Toyota Technological Institute at Chicago), Tianhao Wang (UC San Diego), Beining Wu (University of Chicago), Zhiyuan Li (Toyota Technological Institute at Chicago)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: November 25, 2025 (arXiv v1)
  • Paper Link: https://arxiv.org/abs/2511.20584

Abstract

This paper systematically investigates the fundamental differences between two families of algorithms—adaptive optimizers (e.g., Adam, Shampoo) and normalized steepest descent (NSD, e.g., Lion, Muon)—in exploiting non-Euclidean geometric structures. The study reveals that while these two families can be equivalent when exponential moving averages (EMA) are disabled, their theoretical analyses rely on different geometric assumptions: adaptive optimizers require stronger "adaptive smoothness," while NSD requires only standard smoothness. The paper extends adaptive smoothness theory to the non-convex setting and proves that it precisely characterizes the convergence of adaptive optimizers. Importantly, the research demonstrates that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve acceleration (O(T⁻²)) in the convex setting, whereas standard smoothness cannot guarantee this under certain non-Euclidean geometries. Additionally, the paper introduces the concept of "adaptive gradient variance," proving that it provides dimension-free convergence guarantees for NSD, which is unattainable under standard gradient variance assumptions.

Research Background and Motivation

Core Questions

This paper aims to answer two fundamental questions:

  1. Q1: Do adaptive methods (e.g., Adam, Shampoo) and corresponding non-Euclidean descent methods (e.g., Lion, Muon) exploit the non-Euclidean geometry of loss functions in the same manner?
  2. Q2: Do the stronger smoothness assumptions in adaptive methods provide actual optimization benefits?

Research Significance

  • Practical Value: Adaptive optimizers like Adam are indispensable in large-scale machine learning model training (e.g., LLaMA, DeepSeek), yet recent simple NSD methods like Lion and Muon have demonstrated surprising effectiveness, prompting investigation into the fundamental differences between these two classes.
  • Theoretical Gap: Although Bernstein & Newhouse (2024) noted that the two classes are equivalent when EMA is disabled (e.g., Adam is equivalent to ℓ∞-NSD, Shampoo is equivalent to spectral norm NSD), systematic theoretical characterization is lacking.
  • Geometric Perspective: Both classes' superior performance relates to exploiting non-Euclidean geometry of loss functions, yet their theoretical analyses depend on fundamentally different geometric assumptions.

Limitations of Existing Methods

  1. Incomplete Convergence Theory: Adaptive smoothness theory has only been established in the convex setting (Xie et al., 2025b); the non-convex case lacks characterization.
  2. Unclear Assumption Strength: Adaptive smoothness is always at least as strong as standard smoothness, but whether this stronger assumption provides practical benefits remains unproven.
  3. Dimension Dependence Issues: NSD suffers from dimension dependence in stochastic optimization (e.g., √d factor in SignGD), lacking finer noise assumptions.

Core Contributions

  1. Non-convex Convergence Theory: Extends adaptive smoothness theory to the non-convex setting, proving that it precisely characterizes the convergence rate of adaptive optimizers (Theorems C.2, C.7, C.8), achieving optimal Õ(T⁻¹/⁴) rate.
  2. Accelerated Convergence Guarantees: Proves that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve Õ(T⁻²) acceleration rate in the convex setting (Theorem 4.4), whereas any optimizer under standard ℓ∞ smoothness can only achieve Ω(T⁻¹) (Guzmán & Nemirovski, 2015).
  3. Adaptive Gradient Variance: Introduces the concept of adaptive gradient variance (Definition 4.1), proving that it provides dimension-free convergence guarantees for momentum-based NSD (Theorem 4.6), and establishes through lower bounds (Theorem 4.9) that dimension dependence is inevitable under standard gradient variance.
  4. Unified Analysis Framework: Provides a unified analysis framework covering a broad range of adaptive methods including AdaGrad, AdaGrad-Norm, and one-sided Shampoo, with core technical contributions being new matrix inequalities (Lemmas 3.3, 3.4) for handling non-commutative preconditioners.
  5. Theoretical Separation: Systematically establishes quantitative separations between the two geometric assumptions (standard vs. adaptive) along both smoothness and noise dimensions, deepening theoretical understanding of adaptivity.

Detailed Methodology

Problem Formulation

Consider the optimization problem: minxRdf(x)\min_{x \in \mathbb{R}^d} f(x)

where f:RdRf: \mathbb{R}^d \to \mathbb{R} may be non-convex. In the stochastic setting, the objective function is accessed through stochastic gradients ft(x)\nabla f_t(x) satisfying E[ft(x)]=f(x)\mathbb{E}[\nabla f_t(x)] = \nabla f(x).

Core Concepts

1. Well-Structured Preconditioner Set

Definition 2.1: HS+d\mathcal{H} \subseteq \mathbb{S}_+^d is called a well-structured preconditioner set if H=S+dK\mathcal{H} = \mathbb{S}_+^d \cap \mathcal{K}, where KMd\mathcal{K} \subseteq \mathbb{M}^d is a matrix subalgebra containing the identity matrix.

Examples:

  • Diagonal matrices D+d\mathcal{D}_+^d (corresponding to Adam)
  • Full PSD matrices S+d\mathbb{S}_+^d (corresponding to full-matrix AdaGrad)
  • Scalar matrices {cId:c>0}\{cI_d: c>0\} (corresponding to AdaGrad-Norm)
  • Kronecker product structure SdL+IdR\mathbb{S}_{d_L}^+ \otimes I_{d_R} (corresponding to one-sided Shampoo)

2. Induced Norms and Dual Norms

For a well-structured preconditioner set H\mathcal{H}, define the induced norm: xH:=supHH,Tr(H)1xH=supHH,Tr(H)1xHx\|x\|_{\mathcal{H}} := \sup_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \|x\|_H = \sup_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \sqrt{x^\top H x}

Lemma 2.2: The dual norm satisfies xH,=infHH,Tr(H)1xH1\|x\|_{\mathcal{H},*} = \inf_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \|x\|_{H^{-1}}

This duality is key to understanding the two geometries: H\|\cdot\|_{\mathcal{H}} is the pointwise supremum of all H\|\cdot\|_H, while H,\|\cdot\|_{\mathcal{H},*} is the pointwise infimum of corresponding dual norms.

3. Two Classes of Smoothness

Standard Smoothness (Definition 2.3): L(f):=min{L:f(x)f(y)Lxy,x,y}L_{\|\cdot\|}(f) := \min\{L: \|\nabla f(x) - \nabla f(y)\|_* \leq L\|x-y\|, \forall x,y\}

Adaptive Smoothness (Definition 2.4): ΛH(f):=minHH,Tr(H)1LH(f)=minHH,x:H2f(x)HTr(H)\Lambda_{\mathcal{H}}(f) := \min_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} L_{\|\cdot\|_H}(f) = \min_{H \in \mathcal{H}, \forall x: -H \preceq \nabla^2 f(x) \preceq H} \text{Tr}(H)

Relationship (Proposition 2.5): LH(f)ΛH(f)dLH(f)L_{\|\cdot\|_{\mathcal{H}}}(f) \leq \Lambda_{\mathcal{H}}(f) \leq d \cdot L_{\|\cdot\|_{\mathcal{H}}}(f)

Adaptive smoothness is always at least as strong as standard smoothness, but differs by at most a dimension factor dd.

Unified Adaptive Optimizer Framework (Algorithm 1)

Algorithm Structure:

Input: Initial point x₀, learning rate η, preconditioner set H, stability constant ϵ
Initialize: M₋₁ = 0
For t = 0, 1, ..., T-1:
    gₜ ← ∇fₜ(xₜ)
    Mₜ ← Accumulation(Mₜ₋₁, gₜ)  // Three variants
    Vₜ ← argmin_{H∈H} ⟨Mₜ + ϵI, H⁻¹⟩ + Tr(H)
    xₜ₊₁ ← xₜ - ηVₜ⁻¹gₜ
Return x_T

Three Accumulation Variants:

  1. Cumulative: Mt=Mt1+gtgtM_t = M_{t-1} + g_t g_t^\top (AdaGrad)
  2. EMA: Mt=βMt1+(1β)gtgtM_t = \beta M_{t-1} + (1-\beta)g_t g_t^\top (Adam)
  3. Weighted: Mt=βMt1+gtgtM_t = \beta M_{t-1} + g_t g_t^\top (for unified analysis)

Key Observation: Vt=PH(Mt+ϵI)V_t = \mathcal{P}_{\mathcal{H}}(M_t + \epsilon I), where PH(M)2\mathcal{P}_{\mathcal{H}}(M)^2 is the projection of MM onto H\mathcal{H} (Lemma A.4).

Technical Innovations

1. New Matrix Inequality (Lemma 3.4)

For positive definite matrices X,YX, Y satisfying YXY \preceq X, for any 0cC0 \leq c \leq C: X1/2(XY)X1/23(logClogc)π2(logXlogY)+(12cdπ2λmin(X)2+12C1dπ2)Tr(XY)IX^{-1/2}(X-Y)X^{-1/2} \preceq \frac{3(\log C - \log c)}{\pi^2}(\log X - \log Y) + \left(\frac{12cd}{\pi^2\lambda_{\min}(X)^2} + \frac{12C^{-1}d}{\pi^2}\right)\text{Tr}(X-Y) \cdot I

Significance:

  • When matrices commute, logarithmic telescoping yields tight bounds
  • In the non-commutative case, the second term quantifies the "non-commutativity cost," introducing a logd\log d factor
  • By carefully choosing c,Cc, C, the cost is controlled to logd\log d in Lemma 3.3

2. Second-Order Term Control (Lemma 3.3)

For the weighted algorithm, define ST=t=0T1Vt1(Vt2βVt12)Vt1S_T = \sum_{t=0}^{T-1} V_t^{-1}(V_t^2 - \beta V_{t-1}^2)V_t^{-1}, then: t=0T1Vt1gtH2Tr(H)STop\sum_{t=0}^{T-1} \|V_t^{-1}g_t\|_H^2 \leq \text{Tr}(H) \|S_T\|_{\text{op}}

and there exist constants C1,C2C_1, C_2 such that: STopC1(1+log(1+dϵt=0T1gt22+d2(1β)T))((1β)Tβ+logVT12/ϵop)+C2\|S_T\|_{\text{op}} \leq C_1\left(1 + \log\left(1 + \frac{d}{\epsilon}\sum_{t=0}^{T-1}\|g_t\|_2^2 + d^2(1-\beta)T\right)\right)\left(\frac{(1-\beta)T}{\beta} + \log\|V_{T-1}^2/\epsilon\|_{\text{op}}\right) + C_2

Special Case: When H\mathcal{H} is commutative (e.g., diagonal matrices), this improves to STop(1β)T+logVT12/ϵop\|S_T\|_{\text{op}} \leq (1-\beta)T + \log\|V_{T-1}^2/\epsilon\|_{\text{op}}.

3. Adaptive Gradient Variance (Definition 4.1)

σH({ft})2:=minHH,Tr(H)1supt,xE[ft(x)E[ft(x)]H12]\sigma_{\mathcal{H}}(\{f_t\})^2 := \min_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \sup_{t, x} \mathbb{E}[\|\nabla f_t(x) - \mathbb{E}[\nabla f_t(x)]\|_{H^{-1}}^2]

Relationship (Proposition 4.2): σH,({ft})2σH({ft})2dσH,({ft})2\sigma_{\|\cdot\|_{\mathcal{H},*}}(\{f_t\})^2 \leq \sigma_{\mathcal{H}}(\{f_t\})^2 \leq d \cdot \sigma_{\|\cdot\|_{\mathcal{H},*}}(\{f_t\})^2

Intuition: Adaptive variance requires uniform noise control across all geometries induced by preconditioners, which is stronger than controlling noise under a fixed norm.

Experimental Setup

Note: This is a purely theoretical paper without experimental sections. All results are theoretical convergence rates and lower bound proofs.

Theoretical Analysis Setup

Assumptions

  1. Smoothness:
    • Standard smoothness: f(x)f(y)H,LH(f)xyH\|\nabla f(x) - \nabla f(y)\|_{\mathcal{H},*} \leq L_{\|\cdot\|_{\mathcal{H}}}(f)\|x-y\|_{\mathcal{H}}
    • Adaptive smoothness: ΛH(f)=minHH,Tr(H)1LH(f)\Lambda_{\mathcal{H}}(f) = \min_{H \in \mathcal{H}, \text{Tr}(H)\leq 1} L_{\|\cdot\|_H}(f)
  2. Noise Assumption (Assumption C.1):
    • E[ft(x)]=f(x)\mathbb{E}[\nabla f_t(x)] = \nabla f(x)
    • There exists Σ0\Sigma \succeq 0 such that Σf(x)f(x)ft(x)ft(x)Σ-\Sigma \preceq \nabla f(x)\nabla f(x)^\top - \nabla f_t(x)\nabla f_t(x)^\top \preceq \Sigma
  3. Convexity: Some results (acceleration) require ff to be convex

Analysis Methods

  • Descent Lemma: Establish single-step descent relations using smoothness
  • Telescoping Sums: Perform telescoping summation on accumulated terms
  • Matrix Inequalities: Control second-order terms from preconditioner changes
  • Probabilistic Methods: Handle stochastic noise through conditional expectation and variance decomposition
  • Constructive Lower Bounds: Prove tightness through carefully designed hard instances

Experimental Results

Main Theoretical Results

1. Non-convex Convergence Rate (Theorem 3.2)

For cumulative adaptive optimizers (AdaGrad-type) on deterministic non-convex functions: 1Tt=0T1f(xt)H,1T(ξ+dϵ1/4ξ)\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x_t)\|_{\mathcal{H},*} \leq \frac{1}{\sqrt{T}}\left(\xi + \sqrt{d}\epsilon^{1/4}\sqrt{\xi}\right)

where: ξ=O~(Δ0η+ηΛH(f)log2d)\xi = \tilde{O}\left(\frac{\Delta_0}{\eta} + \eta \cdot \Lambda_{\mathcal{H}}(f) \log^2 d\right)

Choosing η=Δ0ΛH(f)log2d\eta = \sqrt{\frac{\Delta_0}{\Lambda_{\mathcal{H}}(f)\log^2 d}} achieves O~(Δ0ΛH(f)logdT)\tilde{O}\left(\frac{\sqrt{\Delta_0 \Lambda_{\mathcal{H}}(f)}\log d}{\sqrt{T}}\right).

Key Points:

  • Convergence rate depends on adaptive smoothness ΛH(f)\Lambda_{\mathcal{H}}(f) rather than standard smoothness
  • Diagonal preconditioners (e.g., Adam) have no logd\log d factor
  • General well-structured preconditioners introduce a logd\log d factor (non-commutativity cost)

2. Accelerated Convergence Rate (Theorem 4.4)

For adaptive optimizers with Nesterov momentum (Algorithm 2), on convex functions with αt=2t+2\alpha_t = \frac{2}{t+2} and η=D\eta = D: E[f(xˉT)f(x)]=O~(ΛH(f)D2log2dT2+dϵDT2+σHDlogdT)\mathbb{E}[f(\bar{x}_T) - f(x^*)] = \tilde{O}\left(\frac{\Lambda_{\mathcal{H}}(f)D^2\log^2 d}{T^2} + \frac{d\sqrt{\epsilon}D}{T^2} + \frac{\sigma_{\mathcal{H}}D\log d}{\sqrt{T}}\right)

Comparison:

  • Under adaptive smoothness: O(T2)O(T^{-2}) acceleration rate (deterministic part)
  • Under standard ℓ∞ smoothness: Guzmán & Nemirovski (2015) proved any first-order method achieves only Ω(T1)\Omega(T^{-1})

Significance: Demonstrates the substantial advantage of adaptive smoothness—enabling acceleration, which standard smoothness cannot provide.

3. Dimension-Free Convergence Rate (Theorem 4.6)

For NSD (Algorithm 3) under adaptive gradient variance σH\sigma_{\mathcal{H}}: E[1Tt=0T1f(xt)H,]Δ0ηT+2ηαLH(f)+2σHαT+2σHα\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla f(x_t)\|_{\mathcal{H},*}\right] \leq \frac{\Delta_0}{\eta T} + \frac{2\eta}{\alpha}L_{\|\cdot\|_{\mathcal{H}}}(f) + \frac{2\sigma_{\mathcal{H}}}{\alpha T} + 2\sigma_{\mathcal{H}}\sqrt{\alpha}

With optimal choices α=Δ0LH(f)σHT\alpha = \frac{\sqrt{\Delta_0 L_{\|\cdot\|_{\mathcal{H}}}(f)}}{\sigma_{\mathcal{H}}\sqrt{T}} and η=Δ03/4LH(f)1/4σH1/2T3/4\eta = \frac{\Delta_0^{3/4}}{L_{\|\cdot\|_{\mathcal{H}}}(f)^{1/4}\sigma_{\mathcal{H}}^{1/2}}T^{-3/4}: Rate=O((Δ0LH(f))1/4σHT1/4)\text{Rate} = O\left(\frac{(\Delta_0 L_{\|\cdot\|_{\mathcal{H}}}(f))^{1/4}\sqrt{\sigma_{\mathcal{H}}}}{T^{1/4}}\right)

Dimension-Free: Compared to Pethick et al. (2025)'s O~(ρd/T1/4)\tilde{O}(\rho\sqrt{d}/T^{1/4}) (where ρ=supxxH,x2\rho = \sup_x \frac{\|x\|_{\mathcal{H},*}}{\|x\|_2} can reach Θ(d)\Theta(\sqrt{d})), this result completely eliminates dimension dependence.

4. Dimension Dependence Lower Bound (Theorem 4.9)

Under standard ℓ₁ variance assumption E[ft(x)f(x)12]σ2\mathbb{E}[\|\nabla f_t(x) - \nabla f(x)\|_1^2] \leq \sigma^2, for SignGD (ℓ∞-NSD) there exists a hard instance such that: E[mint[T]f(xt)1]=min{e2514(dLΔ0σ2)1/4T1/2,e2512σ}\mathbb{E}\left[\min_{t \in [T]}\|\nabla f(x_t)\|_1\right] = \min\left\{e^{-25-\frac{1}{4}}(dL\Delta_0\sigma^2)^{1/4}T^{-1/2}, e^{-25-\frac{1}{2}}\sigma\right\}

Significance:

  • Achieving error ϵ<e251/2σ\epsilon < e^{-25-1/2}\sigma requires T=Ω(ϵ2(dLΔ0σ2)1/2)T = \Omega(\epsilon^{-2}(dL\Delta_0\sigma^2)^{1/2}) steps
  • Dimension dependence Ω(d1/2)\Omega(d^{1/2}) is unavoidable under standard variance assumptions
  • Contrasts with Theorem 4.6's dimension-free upper bound, proving the fundamental superiority of adaptive variance

Key Insights

1. Quantification of Geometric Separation

Relations between the two smoothness and variance classes:

  • Smoothness: LH(f)ΛH(f)dLH(f)L_{\|\cdot\|_{\mathcal{H}}}(f) \leq \Lambda_{\mathcal{H}}(f) \leq d \cdot L_{\|\cdot\|_{\mathcal{H}}}(f)
  • Variance: σH,2σH2dσH,2\sigma_{\|\cdot\|_{\mathcal{H},*}}^2 \leq \sigma_{\mathcal{H}}^2 \leq d \cdot \sigma_{\|\cdot\|_{\mathcal{H},*}}^2

The gap is at most dimension dd, but is tight in some cases (e.g., diagonal matrices vs. full matrices).

2. Ineffectiveness of Averaging

Under non-Euclidean geometry, averaging cannot effectively reduce norms:

  • ℓ₂: 1ni=1nxi2=O(1/n)\|\frac{1}{n}\sum_{i=1}^n x_i\|_2 = O(1/\sqrt{n}) (effective)
  • ℓ₁: 1ni=1nxi1=O(d/n)\|\frac{1}{n}\sum_{i=1}^n x_i\|_1 = O(\sqrt{d}/\sqrt{n}) (dimension-dependent)

This explains why:

  • Acceleration requires stronger adaptive smoothness
  • Momentum cannot eliminate dimension dependence under standard variance

3. Non-Commutativity Cost

General well-structured preconditioners (e.g., one-sided Shampoo) introduce a logd\log d factor, stemming from:

  • Matrix non-commutativity preventing direct telescoping
  • Non-commutative term in Lemma 3.4: 12cdπ2λmin2Tr(XY)I\frac{12cd}{\pi^2\lambda_{\min}^2}\text{Tr}(X-Y) \cdot I
  • Careful parameter selection controls the cost to logd\log d

Adaptive Optimizers

  1. Matrix Preconditioning: Shampoo (Gupta et al., 2018) and variants (SOAP, PolarGrad, AdaMuon)
  2. Diagonal Preconditioning: AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2014)
  3. Convergence Theory:
    • Convex case: Xie et al. (2025b) establish adaptive smoothness theory
    • Diagonal case: Maladkar et al. (2024), Xie et al. (2025a)
  4. Variance Adaptation: Frans et al. (2025) discuss matrix whitening perspective

Normalized Steepest Descent (NSD)

  1. Convergence Analysis:
    • Cutkosky & Mehta (2020): O(T⁻³·⁵) non-convex rate
    • Pethick et al. (2025), Kovalev (2025b): analysis under general norms
  2. Equivalence:
    • Lion = ℓ∞-NSD (Sfyraki & Wang, 2025)
    • Muon = spectral norm-NSD (Chen et al., 2025)
  3. Dimension Dependence: Jiang et al. (2025) improvements for SignGD

Acceleration Methods

  1. Mirror Descent Perspective: Kelner et al. (2014), Allen-Zhu & Orecchia (2014)
  2. Adaptive Acceleration: Cutkosky (2019), Kavis et al. (2019), Joulani et al. (2020) for diagonal/coordinate adaptivity
  3. Lower Bounds: Guzmán & Nemirovski (2015) prove Ω(T⁻¹) lower bound under ℓ∞ smoothness

Comparison with This Work

  • vs Xie et al. (2025b): Extends to non-convex + stochastic settings
  • vs Kovalev (2025a): Weaker assumptions (avoiding Assumption 4), broader preconditioner classes
  • vs Pethick et al. (2025): Eliminates dimension dependence through adaptive variance
  • vs Existing Acceleration Methods: First to prove acceleration under general well-structured preconditioners

Conclusions and Discussion

Main Conclusions

  1. Geometric Duality: Although adaptive optimizers and NSD both exploit non-Euclidean geometry, they depend on fundamentally different geometric assumptions:
    • Adaptive Optimizers: Require adaptive smoothness ΛH(f)\Lambda_{\mathcal{H}}(f), automatically adapting to optimal preconditioners
    • NSD: Require only standard smoothness LH(f)L_{\|\cdot\|_{\mathcal{H}}}(f), but need pre-specified norms
  2. Value of Adaptivity: Stronger adaptive assumptions yield substantial benefits:
    • Acceleration: Achieves O(T⁻²) in convex case vs. Ω(T⁻¹) under standard assumptions
    • Dimension-Free: Eliminates dimension dependence in stochastic case
  3. Unified Theoretical Framework: First to establish non-convex convergence theory for a broad class of adaptive methods including one-sided Shampoo, with core technique being new matrix inequalities for non-commutative preconditioners.
  4. Tightness: Lower bounds demonstrate:
    • Dimension dependence Ω(d^{1/2}) is unavoidable under standard variance (Theorem 4.9)
    • Superiority of adaptive variance is not merely a technical artifact but a fundamental distinction

Limitations

  1. Purely Theoretical: Lacks experimental validation of theoretical predictions (e.g., actual convergence behavior under different geometries)
  2. Constant Factors:
    • Non-diagonal preconditioners introduce logd\log d factor (may be insignificant in practice)
    • Acceleration algorithm requires knowing D=maxtxtxHD = \max_t \|x_t - x^*\|_{\mathcal{H}} (mitigated through projection version)
  3. Assumption Conditions:
    • Assumption C.1 (pointwise covariance upper bound) is stronger than standard expectation assumptions
    • Acceleration results require convexity
  4. Applicability:
    • How to verify adaptive variance assumption in practice?
    • Which real problems satisfy adaptive smoothness?
  5. EMA Analysis: EMA variant requires β=1Θ(logdT)\beta = 1 - \Theta(\frac{\log d}{T}), while practice typically uses fixed β\beta (e.g., 0.9, 0.999)

Future Directions

  1. Experimental Validation:
    • Verify adaptive smoothness assumptions on real deep learning tasks
    • Compare empirical convergence behavior under different geometries
  2. Relaxing Assumptions:
    • Explore weaker noise assumptions (e.g., only expectation bounds)
    • Possibility of acceleration in non-convex case
  3. Algorithm Improvements:
    • Adaptive selection of preconditioner structure H\mathcal{H}
    • New optimization algorithms leveraging adaptive smoothness
  4. Other Geometries:
    • Extension to Bregman divergences, Riemannian geometry
    • Other structured preconditioners (sparse, low-rank)
  5. Lower Bound Improvements:
    • Lower bounds under adaptive smoothness (currently only under standard smoothness)
    • Tighter bounds in non-convex case

In-Depth Evaluation

Strengths

  1. Theoretical Depth:
    • First systematic quantitative separation of two geometric assumption classes
    • Core matrix inequality (Lemma 3.4) has independent value, potentially applicable to other matrix analysis problems
    • Sophisticated proof techniques, particularly for handling non-commutativity
  2. Unification:
    • Covers broad range of methods: AdaGrad, Adam, Shampoo, etc.
    • Clear equivalence analysis of three accumulation modes (cumulative, EMA, weighted)
    • Parallel treatment of smoothness and variance reveals underlying structure
  3. Completeness:
    • Upper bounds + lower bounds prove tightness
    • Deterministic + stochastic, convex + non-convex coverage
    • Detailed technical appendix (48 pages) ensures reproducibility
  4. Insightfulness:
    • "Averaging ineffectiveness" explains root of dimension dependence
    • Geometric intuition for duality (supremum vs. infimum)
    • Precise quantification of non-commutativity cost
  5. Writing Quality:
    • Clear structure, introducing concepts through Adam/SignGD examples
    • Figure 1 intuitively illustrates duality
    • Good balance between technical details and intuition

Weaknesses

  1. Practical Relevance:
    • Lacks experimental validation of theoretical predictions
    • Universality of adaptive smoothness in real problems unknown
    • Is logd\log d factor significant in practice?
  2. Assumption Strength:
    • Assumption C.1 stronger than standard assumptions (almost everywhere)
    • Acceleration requires convexity and known DD
    • EMA requires β=1Θ(logd/T)\beta = 1 - \Theta(\log d / T), inconsistent with practice
  3. Technical Limitations:
    • Can the gap between diagonal and general cases (factor logd\log d) be eliminated?
    • Impossibility of non-convex acceleration not proven
    • Lower bounds for adaptive smoothness missing
  4. Presentation Details:
    • Õ notation hides logarithmic dependence on multiple parameters (not just dd)
    • Some constants (e.g., C1,C2C_1, C_2) not explicitly stated
    • Parameter selection strategy in Lemma 3.4 could be more explicit
  5. Related Work:
    • Distinction from parallel work (Kovalev & Borodich, 2025) could be more detailed
    • Connection to classical mirror descent theory could be deepened

Impact

  1. Theoretical Contribution:
    • Provides new perspective for adaptive optimization theory (hierarchy of geometric assumptions)
    • Matrix inequality techniques may influence related fields (matrix analysis, quantum information)
    • Unified framework may become standard for future analysis
  2. Practical Value:
    • Guides optimizer selection: when to use adaptive methods vs. NSD?
    • Inspires new algorithm designs (adaptive H\mathcal{H} selection)
    • Provides theoretical basis for hyperparameter tuning (e.g., β\beta selection)
  3. Reproducibility:
    • Purely theoretical work with verifiable results
    • Detailed proof techniques enable extension to other settings
    • Clear definitions facilitate citation in future research
  4. Limitations:
    • Lack of experiments limits immediate impact
    • Assumption verification requires follow-up work
    • Theory-practice gap needs bridging

Applicable Scenarios

  1. Theoretical Research:
    • Convergence analysis of optimization algorithms
    • Optimization theory under non-Euclidean geometry
    • Theoretical foundations of adaptive methods
  2. Algorithm Design:
    • Design guidance for new adaptive optimizers
    • Preconditioner structure selection
    • Acceleration method improvements
  3. Practical Applications:
    • Optimizer selection in large-scale machine learning
    • Understanding success of Adam and similar methods
    • Theoretical basis for debugging convergence issues
  4. Education:
    • Advanced material for optimization theory courses
    • Case study for non-Euclidean optimization
    • Applications of matrix analysis techniques

Selected Key References

  1. Xie et al. (2025b): "Structured Preconditioners in Adaptive Optimization: A Unified Analysis" - Foundation for convex case
  2. Guzmán & Nemirovski (2015): "On lower complexity bounds for large-scale smooth convex optimization" - Lower bounds under ℓ∞ smoothness
  3. Pethick et al. (2025): "Training deep learning models with norm-constrained lmos" - Recent NSD analysis
  4. Kovalev (2025a): "SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration" - Parallel work
  5. Bernstein & Newhouse (2024): "Old optimizer, new norm: An anthology" - Equivalence between Adam and NSD
  6. Gupta et al. (2017): "A unified approach to adaptive regularization" - Adaptive optimizer framework
  7. Lieb (1973): "Convex trace functions and the wigner-yanase-dyson conjecture" - Concavity basis for Lemma A.7

Summary: This paper represents significant progress in adaptive optimization theory, systematically revealing fundamental differences between adaptive methods and NSD in their geometric assumptions, and rigorously proving the substantial value of adaptivity through theoretical analysis. Despite lacking experimental validation, its theoretical depth and technical innovations make it an important reference in the field. The core contribution lies in establishing a complete theoretical system for "two geometries," providing new perspectives for understanding and designing adaptive optimization algorithms.