2025-11-27T08:46:18.590812

A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Xie, Wang, Wu et al.

Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.

academic

A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Basic Information

Paper ID: 2511.20584
Title: A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Authors: Shuo Xie (Toyota Technological Institute at Chicago), Tianhao Wang (UC San Diego), Beining Wu (University of Chicago), Zhiyuan Li (Toyota Technological Institute at Chicago)
Classification: cs.LG (Machine Learning)
Publication Date: November 25, 2025 (arXiv v1)
Paper Link: https://arxiv.org/abs/2511.20584

Abstract

This paper systematically investigates the fundamental differences between two families of algorithms—adaptive optimizers (e.g., Adam, Shampoo) and normalized steepest descent (NSD, e.g., Lion, Muon)—in exploiting non-Euclidean geometric structures. The study reveals that while these two families can be equivalent when exponential moving averages (EMA) are disabled, their theoretical analyses rely on different geometric assumptions: adaptive optimizers require stronger "adaptive smoothness," while NSD requires only standard smoothness. The paper extends adaptive smoothness theory to the non-convex setting and proves that it precisely characterizes the convergence of adaptive optimizers. Importantly, the research demonstrates that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve acceleration (O(T⁻²)) in the convex setting, whereas standard smoothness cannot guarantee this under certain non-Euclidean geometries. Additionally, the paper introduces the concept of "adaptive gradient variance," proving that it provides dimension-free convergence guarantees for NSD, which is unattainable under standard gradient variance assumptions.

Research Background and Motivation

Core Questions

This paper aims to answer two fundamental questions:

Q1: Do adaptive methods (e.g., Adam, Shampoo) and corresponding non-Euclidean descent methods (e.g., Lion, Muon) exploit the non-Euclidean geometry of loss functions in the same manner?
Q2: Do the stronger smoothness assumptions in adaptive methods provide actual optimization benefits?

Research Significance

Practical Value: Adaptive optimizers like Adam are indispensable in large-scale machine learning model training (e.g., LLaMA, DeepSeek), yet recent simple NSD methods like Lion and Muon have demonstrated surprising effectiveness, prompting investigation into the fundamental differences between these two classes.
Theoretical Gap: Although Bernstein & Newhouse (2024) noted that the two classes are equivalent when EMA is disabled (e.g., Adam is equivalent to ℓ∞-NSD, Shampoo is equivalent to spectral norm NSD), systematic theoretical characterization is lacking.
Geometric Perspective: Both classes' superior performance relates to exploiting non-Euclidean geometry of loss functions, yet their theoretical analyses depend on fundamentally different geometric assumptions.

Limitations of Existing Methods

Incomplete Convergence Theory: Adaptive smoothness theory has only been established in the convex setting (Xie et al., 2025b); the non-convex case lacks characterization.
Unclear Assumption Strength: Adaptive smoothness is always at least as strong as standard smoothness, but whether this stronger assumption provides practical benefits remains unproven.
Dimension Dependence Issues: NSD suffers from dimension dependence in stochastic optimization (e.g., √d factor in SignGD), lacking finer noise assumptions.

Core Contributions

Non-convex Convergence Theory: Extends adaptive smoothness theory to the non-convex setting, proving that it precisely characterizes the convergence rate of adaptive optimizers (Theorems C.2, C.7, C.8), achieving optimal Õ(T⁻¹/⁴) rate.
Accelerated Convergence Guarantees: Proves that adaptive smoothness enables adaptive optimizers with Nesterov momentum to achieve Õ(T⁻²) acceleration rate in the convex setting (Theorem 4.4), whereas any optimizer under standard ℓ∞ smoothness can only achieve Ω(T⁻¹) (Guzmán & Nemirovski, 2015).
Adaptive Gradient Variance: Introduces the concept of adaptive gradient variance (Definition 4.1), proving that it provides dimension-free convergence guarantees for momentum-based NSD (Theorem 4.6), and establishes through lower bounds (Theorem 4.9) that dimension dependence is inevitable under standard gradient variance.
Unified Analysis Framework: Provides a unified analysis framework covering a broad range of adaptive methods including AdaGrad, AdaGrad-Norm, and one-sided Shampoo, with core technical contributions being new matrix inequalities (Lemmas 3.3, 3.4) for handling non-commutative preconditioners.
Theoretical Separation: Systematically establishes quantitative separations between the two geometric assumptions (standard vs. adaptive) along both smoothness and noise dimensions, deepening theoretical understanding of adaptivity.

Detailed Methodology

Problem Formulation

Consider the optimization problem: $\min_{x \in \mathbb{R}^d} f(x)$

where $f: \mathbb{R}^d \to \mathbb{R}$ may be non-convex. In the stochastic setting, the objective function is accessed through stochastic gradients $\nabla f_t(x)$ satisfying $\mathbb{E}[\nabla f_t(x)] = \nabla f(x)$ .

Core Concepts

1. Well-Structured Preconditioner Set

Definition 2.1: $\mathcal{H} \subseteq \mathbb{S}_+^d$ is called a well-structured preconditioner set if $\mathcal{H} = \mathbb{S}_+^d \cap \mathcal{K}$ , where $\mathcal{K} \subseteq \mathbb{M}^d$ is a matrix subalgebra containing the identity matrix.

Examples:

Diagonal matrices $\mathcal{D}_+^d$ (corresponding to Adam)
Full PSD matrices $\mathbb{S}_+^d$ (corresponding to full-matrix AdaGrad)
Scalar matrices $\{cI_d: c>0\}$ (corresponding to AdaGrad-Norm)
Kronecker product structure $\mathbb{S}_{d_L}^+ \otimes I_{d_R}$ (corresponding to one-sided Shampoo)

2. Induced Norms and Dual Norms

For a well-structured preconditioner set $\mathcal{H}$ , define the induced norm: $\|x\|_{\mathcal{H}} := \sup_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \|x\|_H = \sup_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \sqrt{x^\top H x}$

Lemma 2.2: The dual norm satisfies $\|x\|_{\mathcal{H},*} = \inf_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \|x\|_{H^{-1}}$

This duality is key to understanding the two geometries: $\|\cdot\|_{\mathcal{H}}$ is the pointwise supremum of all $\|\cdot\|_H$ , while $\|\cdot\|_{\mathcal{H},*}$ is the pointwise infimum of corresponding dual norms.

3. Two Classes of Smoothness

Standard Smoothness (Definition 2.3): $L_{\|\cdot\|}(f) := \min\{L: \|\nabla f(x) - \nabla f(y)\|_* \leq L\|x-y\|, \forall x,y\}$

Adaptive Smoothness (Definition 2.4): $\Lambda_{\mathcal{H}}(f) := \min_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} L_{\|\cdot\|_H}(f) = \min_{H \in \mathcal{H}, \forall x: -H \preceq \nabla^2 f(x) \preceq H} \text{Tr}(H)$

Relationship (Proposition 2.5): $L_{\|\cdot\|_{\mathcal{H}}}(f) \leq \Lambda_{\mathcal{H}}(f) \leq d \cdot L_{\|\cdot\|_{\mathcal{H}}}(f)$

Adaptive smoothness is always at least as strong as standard smoothness, but differs by at most a dimension factor $d$ .

Unified Adaptive Optimizer Framework (Algorithm 1)

Algorithm Structure:

Input: Initial point x₀, learning rate η, preconditioner set H, stability constant ϵ
Initialize: M₋₁ = 0
For t = 0, 1, ..., T-1:
    gₜ ← ∇fₜ(xₜ)
    Mₜ ← Accumulation(Mₜ₋₁, gₜ)  // Three variants
    Vₜ ← argmin_{H∈H} ⟨Mₜ + ϵI, H⁻¹⟩ + Tr(H)
    xₜ₊₁ ← xₜ - ηVₜ⁻¹gₜ
Return x_T

Three Accumulation Variants:

Cumulative: $M_t = M_{t-1} + g_t g_t^\top$ (AdaGrad)
EMA: $M_t = \beta M_{t-1} + (1-\beta)g_t g_t^\top$ (Adam)
Weighted: $M_t = \beta M_{t-1} + g_t g_t^\top$ (for unified analysis)

Key Observation: $V_t = \mathcal{P}_{\mathcal{H}}(M_t + \epsilon I)$ , where $\mathcal{P}_{\mathcal{H}}(M)^2$ is the projection of $M$ onto $\mathcal{H}$ (Lemma A.4).

Technical Innovations

1. New Matrix Inequality (Lemma 3.4)

For positive definite matrices $X, Y$ satisfying $Y \preceq X$ , for any $0 \leq c \leq C$ : $X^{-1/2}(X-Y)X^{-1/2} \preceq \frac{3(\log C - \log c)}{\pi^2}(\log X - \log Y) + \left(\frac{12cd}{\pi^2\lambda_{\min}(X)^2} + \frac{12C^{-1}d}{\pi^2}\right)\text{Tr}(X-Y) \cdot I$

Significance:

When matrices commute, logarithmic telescoping yields tight bounds
In the non-commutative case, the second term quantifies the "non-commutativity cost," introducing a $\log d$ factor
By carefully choosing $c, C$ , the cost is controlled to $\log d$ in Lemma 3.3

2. Second-Order Term Control (Lemma 3.3)

For the weighted algorithm, define $S_T = \sum_{t=0}^{T-1} V_t^{-1}(V_t^2 - \beta V_{t-1}^2)V_t^{-1}$ , then: $\sum_{t=0}^{T-1} \|V_t^{-1}g_t\|_H^2 \leq \text{Tr}(H) \|S_T\|_{\text{op}}$

and there exist constants $C_1, C_2$ such that: $\|S_T\|_{\text{op}} \leq C_1\left(1 + \log\left(1 + \frac{d}{\epsilon}\sum_{t=0}^{T-1}\|g_t\|_2^2 + d^2(1-\beta)T\right)\right)\left(\frac{(1-\beta)T}{\beta} + \log\|V_{T-1}^2/\epsilon\|_{\text{op}}\right) + C_2$

Special Case: When $\mathcal{H}$ is commutative (e.g., diagonal matrices), this improves to $\|S_T\|_{\text{op}} \leq (1-\beta)T + \log\|V_{T-1}^2/\epsilon\|_{\text{op}}$ .

3. Adaptive Gradient Variance (Definition 4.1)

$\sigma_{\mathcal{H}}(\{f_t\})^2 := \min_{H \in \mathcal{H}, \text{Tr}(H) \leq 1} \sup_{t, x} \mathbb{E}[\|\nabla f_t(x) - \mathbb{E}[\nabla f_t(x)]\|_{H^{-1}}^2]$

Relationship (Proposition 4.2): $\sigma_{\|\cdot\|_{\mathcal{H},*}}(\{f_t\})^2 \leq \sigma_{\mathcal{H}}(\{f_t\})^2 \leq d \cdot \sigma_{\|\cdot\|_{\mathcal{H},*}}(\{f_t\})^2$

Intuition: Adaptive variance requires uniform noise control across all geometries induced by preconditioners, which is stronger than controlling noise under a fixed norm.

Experimental Setup

Note: This is a purely theoretical paper without experimental sections. All results are theoretical convergence rates and lower bound proofs.

Theoretical Analysis Setup

Assumptions

Smoothness:
- Standard smoothness: $\|\nabla f(x) - \nabla f(y)\|_{\mathcal{H},*} \leq L_{\|\cdot\|_{\mathcal{H}}}(f)\|x-y\|_{\mathcal{H}}$
- Adaptive smoothness: $\Lambda_{\mathcal{H}}(f) = \min_{H \in \mathcal{H}, \text{Tr}(H)\leq 1} L_{\|\cdot\|_H}(f)$
Noise Assumption (Assumption C.1):
- $\mathbb{E}[\nabla f_t(x)] = \nabla f(x)$
- There exists $\Sigma \succeq 0$ such that $-\Sigma \preceq \nabla f(x)\nabla f(x)^\top - \nabla f_t(x)\nabla f_t(x)^\top \preceq \Sigma$
Convexity: Some results (acceleration) require $f$ to be convex

Analysis Methods

Descent Lemma: Establish single-step descent relations using smoothness
Telescoping Sums: Perform telescoping summation on accumulated terms
Matrix Inequalities: Control second-order terms from preconditioner changes
Probabilistic Methods: Handle stochastic noise through conditional expectation and variance decomposition
Constructive Lower Bounds: Prove tightness through carefully designed hard instances

Experimental Results

Main Theoretical Results

1. Non-convex Convergence Rate (Theorem 3.2)

For cumulative adaptive optimizers (AdaGrad-type) on deterministic non-convex functions: $\frac{1}{T}\sum_{t=0}^{T-1} \|\nabla f(x_t)\|_{\mathcal{H},*} \leq \frac{1}{\sqrt{T}}\left(\xi + \sqrt{d}\epsilon^{1/4}\sqrt{\xi}\right)$

where: $\xi = \tilde{O}\left(\frac{\Delta_0}{\eta} + \eta \cdot \Lambda_{\mathcal{H}}(f) \log^2 d\right)$

Choosing $\eta = \sqrt{\frac{\Delta_0}{\Lambda_{\mathcal{H}}(f)\log^2 d}}$ achieves $\tilde{O}\left(\frac{\sqrt{\Delta_0 \Lambda_{\mathcal{H}}(f)}\log d}{\sqrt{T}}\right)$ .

Key Points:

Convergence rate depends on adaptive smoothness $\Lambda_{\mathcal{H}}(f)$ rather than standard smoothness
Diagonal preconditioners (e.g., Adam) have no $\log d$ factor
General well-structured preconditioners introduce a $\log d$ factor (non-commutativity cost)

2. Accelerated Convergence Rate (Theorem 4.4)

For adaptive optimizers with Nesterov momentum (Algorithm 2), on convex functions with $\alpha_t = \frac{2}{t+2}$ and $\eta = D$ : $\mathbb{E}[f(\bar{x}_T) - f(x^*)] = \tilde{O}\left(\frac{\Lambda_{\mathcal{H}}(f)D^2\log^2 d}{T^2} + \frac{d\sqrt{\epsilon}D}{T^2} + \frac{\sigma_{\mathcal{H}}D\log d}{\sqrt{T}}\right)$

Comparison:

Under adaptive smoothness: $O(T^{-2})$ acceleration rate (deterministic part)
Under standard ℓ∞ smoothness: Guzmán & Nemirovski (2015) proved any first-order method achieves only $\Omega(T^{-1})$

Significance: Demonstrates the substantial advantage of adaptive smoothness—enabling acceleration, which standard smoothness cannot provide.

3. Dimension-Free Convergence Rate (Theorem 4.6)

For NSD (Algorithm 3) under adaptive gradient variance $\sigma_{\mathcal{H}}$ : $\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla f(x_t)\|_{\mathcal{H},*}\right] \leq \frac{\Delta_0}{\eta T} + \frac{2\eta}{\alpha}L_{\|\cdot\|_{\mathcal{H}}}(f) + \frac{2\sigma_{\mathcal{H}}}{\alpha T} + 2\sigma_{\mathcal{H}}\sqrt{\alpha}$

With optimal choices $\alpha = \frac{\sqrt{\Delta_0 L_{\|\cdot\|_{\mathcal{H}}}(f)}}{\sigma_{\mathcal{H}}\sqrt{T}}$ and $\eta = \frac{\Delta_0^{3/4}}{L_{\|\cdot\|_{\mathcal{H}}}(f)^{1/4}\sigma_{\mathcal{H}}^{1/2}}T^{-3/4}$ : $\text{Rate} = O\left(\frac{(\Delta_0 L_{\|\cdot\|_{\mathcal{H}}}(f))^{1/4}\sqrt{\sigma_{\mathcal{H}}}}{T^{1/4}}\right)$

Dimension-Free: Compared to Pethick et al. (2025)'s $\tilde{O}(\rho\sqrt{d}/T^{1/4})$ (where $\rho = \sup_x \frac{\|x\|_{\mathcal{H},*}}{\|x\|_2}$ can reach $\Theta(\sqrt{d})$ ), this result completely eliminates dimension dependence.

4. Dimension Dependence Lower Bound (Theorem 4.9)

Under standard ℓ₁ variance assumption $\mathbb{E}[\|\nabla f_t(x) - \nabla f(x)\|_1^2] \leq \sigma^2$ , for SignGD (ℓ∞-NSD) there exists a hard instance such that: $\mathbb{E}\left[\min_{t \in [T]}\|\nabla f(x_t)\|_1\right] = \min\left\{e^{-25-\frac{1}{4}}(dL\Delta_0\sigma^2)^{1/4}T^{-1/2}, e^{-25-\frac{1}{2}}\sigma\right\}$

Significance:

Achieving error $\epsilon < e^{-25-1/2}\sigma$ requires $T = \Omega(\epsilon^{-2}(dL\Delta_0\sigma^2)^{1/2})$ steps
Dimension dependence $\Omega(d^{1/2})$ is unavoidable under standard variance assumptions
Contrasts with Theorem 4.6's dimension-free upper bound, proving the fundamental superiority of adaptive variance

Key Insights

1. Quantification of Geometric Separation

Relations between the two smoothness and variance classes:

Smoothness: $L_{\|\cdot\|_{\mathcal{H}}}(f) \leq \Lambda_{\mathcal{H}}(f) \leq d \cdot L_{\|\cdot\|_{\mathcal{H}}}(f)$
Variance: $\sigma_{\|\cdot\|_{\mathcal{H},*}}^2 \leq \sigma_{\mathcal{H}}^2 \leq d \cdot \sigma_{\|\cdot\|_{\mathcal{H},*}}^2$

The gap is at most dimension $d$ , but is tight in some cases (e.g., diagonal matrices vs. full matrices).

2. Ineffectiveness of Averaging

Under non-Euclidean geometry, averaging cannot effectively reduce norms:

ℓ₂: $\|\frac{1}{n}\sum_{i=1}^n x_i\|_2 = O(1/\sqrt{n})$ (effective)
ℓ₁: $\|\frac{1}{n}\sum_{i=1}^n x_i\|_1 = O(\sqrt{d}/\sqrt{n})$ (dimension-dependent)

This explains why:

Acceleration requires stronger adaptive smoothness
Momentum cannot eliminate dimension dependence under standard variance

3. Non-Commutativity Cost

General well-structured preconditioners (e.g., one-sided Shampoo) introduce a $\log d$ factor, stemming from:

Matrix non-commutativity preventing direct telescoping
Non-commutative term in Lemma 3.4: $\frac{12cd}{\pi^2\lambda_{\min}^2}\text{Tr}(X-Y) \cdot I$
Careful parameter selection controls the cost to $\log d$

Adaptive Optimizers

Matrix Preconditioning: Shampoo (Gupta et al., 2018) and variants (SOAP, PolarGrad, AdaMuon)
Diagonal Preconditioning: AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2014)
Convergence Theory:
- Convex case: Xie et al. (2025b) establish adaptive smoothness theory
- Diagonal case: Maladkar et al. (2024), Xie et al. (2025a)
Variance Adaptation: Frans et al. (2025) discuss matrix whitening perspective

Normalized Steepest Descent (NSD)

Convergence Analysis:
- Cutkosky & Mehta (2020): O(T⁻³·⁵) non-convex rate
- Pethick et al. (2025), Kovalev (2025b): analysis under general norms
Equivalence:
- Lion = ℓ∞-NSD (Sfyraki & Wang, 2025)
- Muon = spectral norm-NSD (Chen et al., 2025)
Dimension Dependence: Jiang et al. (2025) improvements for SignGD

Acceleration Methods

Mirror Descent Perspective: Kelner et al. (2014), Allen-Zhu & Orecchia (2014)
Adaptive Acceleration: Cutkosky (2019), Kavis et al. (2019), Joulani et al. (2020) for diagonal/coordinate adaptivity
Lower Bounds: Guzmán & Nemirovski (2015) prove Ω(T⁻¹) lower bound under ℓ∞ smoothness

Comparison with This Work

vs Xie et al. (2025b): Extends to non-convex + stochastic settings
vs Kovalev (2025a): Weaker assumptions (avoiding Assumption 4), broader preconditioner classes
vs Pethick et al. (2025): Eliminates dimension dependence through adaptive variance
vs Existing Acceleration Methods: First to prove acceleration under general well-structured preconditioners

Conclusions and Discussion

Main Conclusions

Geometric Duality: Although adaptive optimizers and NSD both exploit non-Euclidean geometry, they depend on fundamentally different geometric assumptions:
- Adaptive Optimizers: Require adaptive smoothness $\Lambda_{\mathcal{H}}(f)$ , automatically adapting to optimal preconditioners
- NSD: Require only standard smoothness $L_{\|\cdot\|_{\mathcal{H}}}(f)$ , but need pre-specified norms
Value of Adaptivity: Stronger adaptive assumptions yield substantial benefits:
- Acceleration: Achieves O(T⁻²) in convex case vs. Ω(T⁻¹) under standard assumptions
- Dimension-Free: Eliminates dimension dependence in stochastic case
Unified Theoretical Framework: First to establish non-convex convergence theory for a broad class of adaptive methods including one-sided Shampoo, with core technique being new matrix inequalities for non-commutative preconditioners.
Tightness: Lower bounds demonstrate:
- Dimension dependence Ω(d^{1/2}) is unavoidable under standard variance (Theorem 4.9)
- Superiority of adaptive variance is not merely a technical artifact but a fundamental distinction

Limitations

Purely Theoretical: Lacks experimental validation of theoretical predictions (e.g., actual convergence behavior under different geometries)
Constant Factors:
- Non-diagonal preconditioners introduce $\log d$ factor (may be insignificant in practice)
- Acceleration algorithm requires knowing $D = \max_t \|x_t - x^*\|_{\mathcal{H}}$ (mitigated through projection version)
Assumption Conditions:
- Assumption C.1 (pointwise covariance upper bound) is stronger than standard expectation assumptions
- Acceleration results require convexity
Applicability:
- How to verify adaptive variance assumption in practice?
- Which real problems satisfy adaptive smoothness?
EMA Analysis: EMA variant requires $\beta = 1 - \Theta(\frac{\log d}{T})$ , while practice typically uses fixed $\beta$ (e.g., 0.9, 0.999)

Future Directions

Experimental Validation:
- Verify adaptive smoothness assumptions on real deep learning tasks
- Compare empirical convergence behavior under different geometries
Relaxing Assumptions:
- Explore weaker noise assumptions (e.g., only expectation bounds)
- Possibility of acceleration in non-convex case
Algorithm Improvements:
- Adaptive selection of preconditioner structure $\mathcal{H}$
- New optimization algorithms leveraging adaptive smoothness
Other Geometries:
- Extension to Bregman divergences, Riemannian geometry
- Other structured preconditioners (sparse, low-rank)
Lower Bound Improvements:
- Lower bounds under adaptive smoothness (currently only under standard smoothness)
- Tighter bounds in non-convex case

In-Depth Evaluation

Strengths

Theoretical Depth:
- First systematic quantitative separation of two geometric assumption classes
- Core matrix inequality (Lemma 3.4) has independent value, potentially applicable to other matrix analysis problems
- Sophisticated proof techniques, particularly for handling non-commutativity
Unification:
- Covers broad range of methods: AdaGrad, Adam, Shampoo, etc.
- Clear equivalence analysis of three accumulation modes (cumulative, EMA, weighted)
- Parallel treatment of smoothness and variance reveals underlying structure
Completeness:
- Upper bounds + lower bounds prove tightness
- Deterministic + stochastic, convex + non-convex coverage
- Detailed technical appendix (48 pages) ensures reproducibility
Insightfulness:
- "Averaging ineffectiveness" explains root of dimension dependence
- Geometric intuition for duality (supremum vs. infimum)
- Precise quantification of non-commutativity cost
Writing Quality:
- Clear structure, introducing concepts through Adam/SignGD examples
- Figure 1 intuitively illustrates duality
- Good balance between technical details and intuition

Weaknesses

Practical Relevance:
- Lacks experimental validation of theoretical predictions
- Universality of adaptive smoothness in real problems unknown
- Is $\log d$ factor significant in practice?
Assumption Strength:
- Assumption C.1 stronger than standard assumptions (almost everywhere)
- Acceleration requires convexity and known $D$
- EMA requires $\beta = 1 - \Theta(\log d / T)$ , inconsistent with practice
Technical Limitations:
- Can the gap between diagonal and general cases (factor $\log d$ ) be eliminated?
- Impossibility of non-convex acceleration not proven
- Lower bounds for adaptive smoothness missing
Presentation Details:
- Õ notation hides logarithmic dependence on multiple parameters (not just $d$ )
- Some constants (e.g., $C_1, C_2$ ) not explicitly stated
- Parameter selection strategy in Lemma 3.4 could be more explicit
Related Work:
- Distinction from parallel work (Kovalev & Borodich, 2025) could be more detailed
- Connection to classical mirror descent theory could be deepened

Impact

Theoretical Contribution:
- Provides new perspective for adaptive optimization theory (hierarchy of geometric assumptions)
- Matrix inequality techniques may influence related fields (matrix analysis, quantum information)
- Unified framework may become standard for future analysis
Practical Value:
- Guides optimizer selection: when to use adaptive methods vs. NSD?
- Inspires new algorithm designs (adaptive $\mathcal{H}$ selection)
- Provides theoretical basis for hyperparameter tuning (e.g., $\beta$ selection)
Reproducibility:
- Purely theoretical work with verifiable results
- Detailed proof techniques enable extension to other settings
- Clear definitions facilitate citation in future research
Limitations:
- Lack of experiments limits immediate impact
- Assumption verification requires follow-up work
- Theory-practice gap needs bridging

Applicable Scenarios

Theoretical Research:
- Convergence analysis of optimization algorithms
- Optimization theory under non-Euclidean geometry
- Theoretical foundations of adaptive methods
Algorithm Design:
- Design guidance for new adaptive optimizers
- Preconditioner structure selection
- Acceleration method improvements
Practical Applications:
- Optimizer selection in large-scale machine learning
- Understanding success of Adam and similar methods
- Theoretical basis for debugging convergence issues
Education:
- Advanced material for optimization theory courses
- Case study for non-Euclidean optimization
- Applications of matrix analysis techniques

Selected Key References

Xie et al. (2025b): "Structured Preconditioners in Adaptive Optimization: A Unified Analysis" - Foundation for convex case
Guzmán & Nemirovski (2015): "On lower complexity bounds for large-scale smooth convex optimization" - Lower bounds under ℓ∞ smoothness
Pethick et al. (2025): "Training deep learning models with norm-constrained lmos" - Recent NSD analysis
Kovalev (2025a): "SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration" - Parallel work
Bernstein & Newhouse (2024): "Old optimizer, new norm: An anthology" - Equivalence between Adam and NSD
Gupta et al. (2017): "A unified approach to adaptive regularization" - Adaptive optimizer framework
Lieb (1973): "Convex trace functions and the wigner-yanase-dyson conjecture" - Concavity basis for Lemma A.7

Summary: This paper represents significant progress in adaptive optimization theory, systematically revealing fundamental differences between adaptive methods and NSD in their geometric assumptions, and rigorously proving the substantial value of adaptivity through theoretical analysis. Despite lacking experimental validation, its theoretical depth and technical innovations make it an important reference in the field. The core contribution lies in establishing a complete theoretical system for "two geometries," providing new perspectives for understanding and designing adaptive optimization algorithms.