Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revisit this belief and explore whether gradient normalization can serve as an effective alternative or complement. We prove that, under individual smoothness assumptions, gradient normalization alone is sufficient to guarantee convergence of the nonconvex SGD. Moreover, when combined with clipping, it yields far better rates of convergence under more challenging noise distributions. We provide a unifying theory describing normalization-only, clipping-only, and combined approaches. Moving forward, we investigate existing variance-reduced algorithms, establishing that, in such a setting, normalization alone is sufficient for convergence. Finally, we present an accelerated variant that under second-order smoothness improves convergence. Our results provide theoretical insights and practical guidance for using normalization and clipping in nonconvex optimization with heavy-tailed noise.
Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration Paper ID : 2410.16561Title : Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and AccelerationAuthors : Tao Sun (National University of Defense Technology), Xinwang Liu (National University of Defense Technology), Kun Yuan (Peking University)Classification : cs.LG, math.OC, stat.MLPublication Date/Venue : Journal of Machine Learning Research 26 (2025) 1-42, Submitted 11/24; Revised 9/25; Published 11/25Paper Link : https://arxiv.org/abs/2410.16561v4 This paper revisits the necessity of gradient clipping in convergence guarantees for stochastic gradient descent (SGD) under heavy-tailed noise. The conventional wisdom suggests that gradient clipping is essential for handling heavy-tailed gradient noise. However, this paper proves that under individual smoothness assumptions, gradient normalization alone suffices to guarantee convergence of nonconvex SGD . Furthermore, when normalization is combined with clipping under more challenging noise distributions, improved convergence rates are achieved. The paper provides a unified theoretical framework describing the performance of normalization-only, clipping-only, and combined approaches. The research extends to variance-reduced algorithms, proving that normalization alone is sufficient for convergence, and proposes accelerated variants that improve convergence under second-order smoothness assumptions.
In machine learning optimization, SGD is the primary algorithm for solving nonconvex optimization problems:
min w ∈ R d f ( w ) : = E ξ ∼ D [ f ( w ; ξ ) ] \min_{w \in \mathbb{R}^d} f(w) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(w; \xi)] min w ∈ R d f ( w ) := E ξ ∼ D [ f ( w ; ξ )]
Traditional SGD analysis assumes gradient noise has bounded variance : E ∥ g t − ∇ f ( w t ) ∥ 2 ≤ σ 2 \mathbb{E}\|g_t - \nabla f(w_t)\|^2 \leq \sigma^2 E ∥ g t − ∇ f ( w t ) ∥ 2 ≤ σ 2 . However, recent research (Zhang et al., 2020; Nguyen et al., 2019) reveals that when training neural networks (particularly language models), this assumption is unrealistic. In practice, gradient noise exhibits heavy-tailed distribution characteristics.
Assumption 1 (Heavy-tailed Noise) : There exist constants σ > 0 \sigma > 0 σ > 0 and p ∈ ( 1 , 2 ] p \in (1, 2] p ∈ ( 1 , 2 ] such that:
sup w ∈ R d { E ξ ∼ D ∥ ∇ f ( w ; ξ ) − ∇ f ( w ) ∥ p } ≤ σ p \sup_{w \in \mathbb{R}^d} \{\mathbb{E}_{\xi \sim \mathcal{D}}\|\nabla f(w; \xi) - \nabla f(w)\|^p\} \leq \sigma^p sup w ∈ R d { E ξ ∼ D ∥∇ f ( w ; ξ ) − ∇ f ( w ) ∥ p } ≤ σ p
When p = 2 p = 2 p = 2 , this degenerates to the standard bounded variance assumption. When 1 < p < 2 1 < p < 2 1 < p < 2 , Zhang et al. (2020) proved that standard SGD fails to converge , highlighting the severity of the problem.
Mainstream Solutions :
SGDC (Zhang et al., 2020): Uses gradient clipping Clip h ( w ) : = min { 1 , h ∥ w ∥ } w \text{Clip}_h(w) := \min\{1, \frac{h}{\|w\|}\}w Clip h ( w ) := min { 1 , ∥ w ∥ h } w NSGDC (Cutkosky & Mehta, 2021): Combines gradient normalization with clippingNSGDC-VR (Liu et al., 2023): Variance-reduced versionLimitations :
Necessity of gradient clipping insufficiently questioned : All existing methods employ clipping, but is it truly necessary?Advantages of combined methods unclear : NSGDC achieves the same convergence rate as SGDC (Liu et al., 2023), failing to demonstrate theoretical advantages of combinationComplex hyperparameter tuning : Clipping introduces additional hyperparameter h h h , increasing tuning burdenThis paper poses three fundamental questions (Q1-Q3):
Q1 : Is gradient clipping truly indispensable? Can gradient normalization alone guarantee convergence?
Q2 : Does combining normalization with clipping outperform using either technique alone?
Q3 : Can NSGDC achieve accelerated convergence under heavy-tailed noise?
The main contributions of this paper include:
Proving Sufficiency of Gradient Normalization (Answering Q1) :Proves that gradient normalization used alone guarantees SGD convergence under individual Lipschitz assumptions Proposes NSGD and NSGD-VR algorithms without clipping hyperparameters Improving Convergence Rates of NSGDC/NSGDC-VR (Answering Q2) :Eliminates logarithmic factors ln T \ln T ln T from previous results Proves combined methods significantly outperform clipping-only methods when σ → 0 \sigma \to 0 σ → 0 Achieves optimal convergence rate in expectation: O ( T − p − 1 3 p − 2 ) O(T^{-\frac{p-1}{3p-2}}) O ( T − 3 p − 2 p − 1 ) Proposing Accelerated Algorithms (Answering Q3) :Designs A-NSGDC algorithm leveraging second-order smoothness Improves convergence rate from O ( T − p − 1 3 p − 2 ) O(T^{-\frac{p-1}{3p-2}}) O ( T − 3 p − 2 p − 1 ) to O ( T − 2 p − 2 4 p − 1 ) O(T^{-\frac{2p-2}{4p-1}}) O ( T − 4 p − 1 2 p − 2 ) Unified Theoretical Framework :Provides unified analysis covering normalization, clipping, and combined methods Clarifies applicable scenarios and performance boundaries for each method No Mini-batch Requirements :All results require no large batch assumptions, beneficial for generalization Optimization Problem :
min w ∈ R d f ( w ) = E ξ ∼ D [ f ( w ; ξ ) ] \min_{w \in \mathbb{R}^d} f(w) = \mathbb{E}_{\xi \sim \mathcal{D}}[f(w; \xi)] min w ∈ R d f ( w ) = E ξ ∼ D [ f ( w ; ξ )]
Objective : Under heavy-tailed noise (Assumption 1), find an ϵ \epsilon ϵ -approximate first-order stationary point, i.e., ∥ ∇ f ( w ) ∥ ≤ ϵ \|\nabla f(w)\| \leq \epsilon ∥∇ f ( w ) ∥ ≤ ϵ .
Convergence Metric : 1 T ∑ t = 1 T E ∥ ∇ f ( w t ) ∥ \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| T 1 ∑ t = 1 T E ∥∇ f ( w t ) ∥
Algorithm 4 (NSGD) :
Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
Sample ξₜ ~ D
mₜ = θmₜ₋₁ + (1-θ)∇f(wₜ; ξₜ)
wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖
Key Features :
Controls step size through normalization m t ∥ m t ∥ \frac{m_t}{\|m_t\|} ∥ m t ∥ m t No clipping hyperparameter h h h required Momentum parameter θ \theta θ smooths gradient estimates Algorithm 5 (NSGD-VR) :
Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
Sample ξₜ ~ D
mₜ = θmₜ₋₁ + ∇f(wₜ; ξₜ) - θ∇f(wₜ₋₁; ξₜ)
wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖
Variance Reduction Mechanism :
Uses the same sample ξ t \xi_t ξ t to compute ∇ f ( w t ; ξ t ) \nabla f(w_t; \xi_t) ∇ f ( w t ; ξ t ) and ∇ f ( w t − 1 ; ξ t ) \nabla f(w_{t-1}; \xi_t) ∇ f ( w t − 1 ; ξ t ) Difference term ∇ f ( w t ; ξ t ) − θ ∇ f ( w t − 1 ; ξ t ) \nabla f(w_t; \xi_t) - \theta\nabla f(w_{t-1}; \xi_t) ∇ f ( w t ; ξ t ) − θ ∇ f ( w t − 1 ; ξ t ) reduces variance Algorithm 2 (NSGDC) :
Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
Sample unbiased stochastic gradient gₜ
mₜ = θmₜ₋₁ + (1-θ)Clipₕ(gₜ)
wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖
Clipping Function : Clip h ( w ) = min { 1 , h ∥ w ∥ } w \text{Clip}_h(w) = \min\{1, \frac{h}{\|w\|}\}w Clip h ( w ) = min { 1 , ∥ w ∥ h } w
Algorithm 6 (A-NSGDC) :
Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
vₜ = wₜ + ζ(wₜ - wₜ₋₁) # Extrapolation step
Sample gₜ such that 𝔼gₜ = ∇f(vₜ)
mₜ = θmₜ₋₁ + (1-θ)Clipₕ(gₜ)
wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖
Acceleration Mechanism :
Extrapolation point v t v_t v t leverages momentum ζ = θ 1 − θ \zeta = \frac{\theta}{1-\theta} ζ = 1 − θ θ Requires second-order Lipschitz assumption (Hessian continuity) Lemma 7 (Control of Clipped Gradients): If h ≥ 2 ( ∥ ∇ f ( w 0 ) ∥ + L γ T ) h \geq 2(\|\nabla f(w_0)\| + L\gamma T) h ≥ 2 ( ∥∇ f ( w 0 ) ∥ + L γ T ) , then:
E ∥ Clip h ( g t ) − E Clip h ( g t ) ∥ 2 ≤ 10 h 2 − p σ p \mathbb{E}\|\text{Clip}_h(g_t) - \mathbb{E}\text{Clip}_h(g_t)\|^2 \leq 10h^{2-p}\sigma^p E ∥ Clip h ( g t ) − E Clip h ( g t ) ∥ 2 ≤ 10 h 2 − p σ p ∥ E Clip h ( g t ) − ∇ f ( w t ) ∥ ≤ 2 σ p h − ( p − 1 ) \|\mathbb{E}\text{Clip}_h(g_t) - \nabla f(w_t)\| \leq 2\sigma^p h^{-(p-1)} ∥ E Clip h ( g t ) − ∇ f ( w t ) ∥ ≤ 2 σ p h − ( p − 1 )
Lemma 8 (Control of Normalized Gradients): Under individual Lipschitz:
E ξ t ∥ ∇ f ( w t ; ξ t ) − ∇ f ( w t ) ∥ 2 ≤ 4 ( B + L γ T ) 2 − p σ p \mathbb{E}_{\xi_t}\|\nabla f(w_t; \xi_t) - \nabla f(w_t)\|^2 \leq 4(B + L\gamma T)^{2-p}\sigma^p E ξ t ∥∇ f ( w t ; ξ t ) − ∇ f ( w t ) ∥ 2 ≤ 4 ( B + L γ T ) 2 − p σ p
where B = sup ξ ∥ ∇ f ( w 0 ; ξ ) ∥ B = \sup_{\xi}\|\nabla f(w_0; \xi)\| B = sup ξ ∥∇ f ( w 0 ; ξ ) ∥ (gradient bound at initialization).
Difficulty with Traditional Methods : Directly controlling E ∥ Clip h ( g t ) − ∇ f ( w t ) ∥ 2 \mathbb{E}\|\text{Clip}_h(g_t) - \nabla f(w_t)\|^2 E ∥ Clip h ( g t ) − ∇ f ( w t ) ∥ 2 is extremely complex, leading to high-probability analysis and logarithmic factors.
Breakthrough in This Work :
Leverages implicit bound from normalization: ∥ ∇ f ( w t ) ∥ ≤ ∥ ∇ f ( w 0 ) ∥ + L γ T \|\nabla f(w_t)\| \leq \|\nabla f(w_0)\| + L\gamma T ∥∇ f ( w t ) ∥ ≤ ∥∇ f ( w 0 ) ∥ + L γ T Sets h ≥ 2 ( ∥ ∇ f ( w 0 ) ∥ + L γ T ) h \geq 2(\|\nabla f(w_0)\| + L\gamma T) h ≥ 2 ( ∥∇ f ( w 0 ) ∥ + L γ T ) to ensure ∥ ∇ f ( w t ) ∥ ≤ h 2 \|\nabla f(w_t)\| \leq \frac{h}{2} ∥∇ f ( w t ) ∥ ≤ 2 h Simplifies to expectation analysis, avoiding complex high-probability techniques Assumption 2 (Individual Lipschitz) :
∥ ∇ f ( y ; ξ ) − ∇ f ( x ; ξ ) ∥ ≤ L ∥ y − x ∥ , ∀ ξ \|\nabla f(y; \xi) - \nabla f(x; \xi)\| \leq L\|y - x\|, \quad \forall \xi ∥∇ f ( y ; ξ ) − ∇ f ( x ; ξ ) ∥ ≤ L ∥ y − x ∥ , ∀ ξ
Assumption 2' (Global Lipschitz) :
∥ ∇ f ( y ) − ∇ f ( x ) ∥ ≤ L ∥ y − x ∥ \|\nabla f(y) - \nabla f(x)\| \leq L\|y - x\| ∥∇ f ( y ) − ∇ f ( x ) ∥ ≤ L ∥ y − x ∥
Relationship : Individual Lipschitz ⇒ \Rightarrow ⇒ Global Lipschitz (converse does not hold)
Impact :
NSGD/NSGD-VR require individual Lipschitz (to bound ∥ ∇ f ( w t ; ξ t ) ∥ \|\nabla f(w_t; \xi_t)\| ∥∇ f ( w t ; ξ t ) ∥ ) NSGDC/A-NSGDC require only global Lipschitz (clipping provides additional control) Under Assumptions 1-2, with settings:
1 − θ = min { max { ( L Δ ) 1 / 2 , 1 } σ 4 p − 4 3 p − 2 T p 3 p − 2 , 1 } 1 - \theta = \min\{\frac{\max\{(L\Delta)^{1/2}, 1\}}{\sigma^{\frac{4p-4}{3p-2}}T^{\frac{p}{3p-2}}}, 1\} 1 − θ = min { σ 3 p − 2 4 p − 4 T 3 p − 2 p m a x {( L Δ ) 1/2 , 1 } , 1 } γ = Δ L 1 − θ T \gamma = \sqrt{\frac{\Delta}{L}}\frac{\sqrt{1-\theta}}{\sqrt{T}} γ = L Δ T 1 − θ Then:
1 T ∑ t = 1 T E ∥ ∇ f ( w t ) ∥ = O ( ( L Δ ) 1 / 4 σ 2 p − 2 3 p − 2 T p − 1 3 p − 2 + 1 T 1 / 2 ) \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{(L\Delta)^{1/4}\sigma^{\frac{2p-2}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}\right) T 1 ∑ t = 1 T E ∥∇ f ( w t ) ∥ = O ( T 3 p − 2 p − 1 ( L Δ ) 1/4 σ 3 p − 2 2 p − 2 + T 1/2 1 )
Key Insights :
Leading term O ( T − p − 1 3 p − 2 ) O(T^{-\frac{p-1}{3p-2}}) O ( T − 3 p − 2 p − 1 ) matches NSGDC Secondary term O ( T − 1 / 2 ) O(T^{-1/2}) O ( T − 1/2 ) recovers GD rate when σ = 0 \sigma = 0 σ = 0 No clipping hyperparameter required Under Assumptions 1-2, with settings:
1 − θ = min { 1 σ p 2 p − 1 T p 2 p − 1 , 1 } 1 - \theta = \min\{\frac{1}{\sigma^{\frac{p}{2p-1}}T^{\frac{p}{2p-1}}}, 1\} 1 − θ = min { σ 2 p − 1 p T 2 p − 1 p 1 , 1 } γ = 4 1 − θ L T \gamma = \frac{4\sqrt{1-\theta}}{L\sqrt{T}} γ = L T 4 1 − θ Then:
1 T ∑ t = 1 T E ∥ ∇ f ( w t ) ∥ = O ( σ p 2 p − 1 T p − 1 2 p − 1 + 1 T 1 / 2 ) \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{\sigma^{\frac{p}{2p-1}}}{T^{\frac{p-1}{2p-1}}} + \frac{1}{T^{1/2}}\right) T 1 ∑ t = 1 T E ∥∇ f ( w t ) ∥ = O ( T 2 p − 1 p − 1 σ 2 p − 1 p + T 1/2 1 )
Improvements :
Exponent p − 1 2 p − 1 > p − 1 3 p − 2 \frac{p-1}{2p-1} > \frac{p-1}{3p-2} 2 p − 1 p − 1 > 3 p − 2 p − 1 (variance reduction acceleration) When p = 2 p=2 p = 2 : 1 3 \frac{1}{3} 3 1 vs 1 4 \frac{1}{4} 4 1 (standard vs variance-reduced) Matches lower bounds (Arjevani et al., 2023) Under Assumptions 1, 2', with appropriate hyperparameter settings:
1 T ∑ t = 1 T E ∥ ∇ f ( w t ) ∥ = O ( ( L Δ ) p − 1 3 p − 2 σ p 3 p − 2 T p − 1 3 p − 2 + 1 T 1 / 2 ) \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{(L\Delta)^{\frac{p-1}{3p-2}}\sigma^{\frac{p}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}\right) T 1 ∑ t = 1 T E ∥∇ f ( w t ) ∥ = O ( T 3 p − 2 p − 1 ( L Δ ) 3 p − 2 p − 1 σ 3 p − 2 p + T 1/2 1 )
Comparison with Prior Work :
Eliminates logarithmic factors : Liu et al. (2023) has ln T \ln T ln T terms, this work does notImproved noise dependence : σ p 3 p − 2 \sigma^{\frac{p}{3p-2}} σ 3 p − 2 p vs σ \sigma σ (former is smaller when p < 2 p < 2 p < 2 )Recovers deterministic case : O ( T − 1 / 2 ) O(T^{-1/2}) O ( T − 1/2 ) when σ = 0 \sigma = 0 σ = 0 Under Assumptions 1, 2', 3 (second-order Lipschitz):
1 T ∑ t = 1 T E ∥ ∇ f ( w t ) ∥ = O ( σ 4 / 7 T 2 p − 2 4 p − 1 + 1 T 1 / 2 ) \frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{\sigma^{4/7}}{T^{\frac{2p-2}{4p-1}}} + \frac{1}{T^{1/2}}\right) T 1 ∑ t = 1 T E ∥∇ f ( w t ) ∥ = O ( T 4 p − 1 2 p − 2 σ 4/7 + T 1/2 1 )
Acceleration Effect :
Exponent 2 p − 2 4 p − 1 > p − 1 3 p − 2 \frac{2p-2}{4p-1} > \frac{p-1}{3p-2} 4 p − 1 2 p − 2 > 3 p − 2 p − 1 When p = 2 p=2 p = 2 : 2 7 \frac{2}{7} 7 2 vs 1 4 \frac{1}{4} 4 1 (accelerated vs standard) Requires Hessian Lipschitz continuity Algorithm Paper Convergence Rate Assumptions SGDC Zhang et al. (2020) O ( T − p − 1 3 p − 2 + T − 2 p − p 2 3 p − 2 σ 2 p 2 3 p − 2 ) O(T^{-\frac{p-1}{3p-2}} + T^{-\frac{2p-p^2}{3p-2}}\sigma^{\frac{2p^2}{3p-2}}) O ( T − 3 p − 2 p − 1 + T − 3 p − 2 2 p − p 2 σ 3 p − 2 2 p 2 ) GL NSGDC Liu et al. (2023) O ( max { σ ln T T p − 1 3 p − 2 , 1 T p − 1 3 p − 2 } ) O(\max\{\frac{\sigma \ln T}{T^{\frac{p-1}{3p-2}}}, \frac{1}{T^{\frac{p-1}{3p-2}}}\}) O ( max { T 3 p − 2 p − 1 σ l n T , T 3 p − 2 p − 1 1 }) GL NSGD This work Thm 2 O ( σ 2 p − 2 3 p − 2 T p − 1 3 p − 2 + 1 T 1 / 2 ) O(\frac{\sigma^{\frac{2p-2}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}) O ( T 3 p − 2 p − 1 σ 3 p − 2 2 p − 2 + T 1/2 1 ) IL NSGDC This work Thm 3 O ( σ p 3 p − 2 T p − 1 3 p − 2 + 1 T 1 / 2 ) O(\frac{\sigma^{\frac{p}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}) O ( T 3 p − 2 p − 1 σ 3 p − 2 p + T 1/2 1 ) GL
GL : Global Lipschitz, IL : Individual Lipschitz
Note : This is a purely theoretical work with no experimental section. All results are theoretical proofs.
Matching Lower Bounds : Proves convergence rates achieve known lower bounds (Carmon et al., 2020)Recovery of Special Cases :
Recovers standard SGD results when p = 2 p = 2 p = 2 Recovers gradient descent rate when σ = 0 \sigma = 0 σ = 0 Comparison with Existing Results : Demonstrates improvements through theoretical analysisConclusion : Clipping is not necessary but beneficial
Evidence :
Sufficiency : Theorem 1 proves normalization alone is sufficient (under IL)Acceleration : Theorem 3 proves combined methods improve noise dependenceTrade-off : Clipping adds hyperparameter but relaxes smoothness assumption (GL vs IL)Scenario Division :
Use normalization alone : Individual smoothness satisfied, no clipping parameter tuning neededUse combined approach : Only global smoothness satisfied, optimal noise dependence desiredKey Observation : Combined methods show significant advantages when σ \sigma σ is small
Quantitative Analysis (p = 1.5 p = 1.5 p = 1.5 example):
SGDC: O ( σ ) O(\sigma) O ( σ ) NSGDC: O ( σ 1 / 2 ) O(\sigma^{1/2}) O ( σ 1/2 ) Improvement factor: σ \sqrt{\sigma} σ (tends to infinity as σ → 0 \sigma \to 0 σ → 0 ) This Work's Results : No mini-batch assumption required
Comparison with Concurrent Work :
Hübler et al. (2024): Requires specific mini-batch size This work: Batch size = 1 suffices Practical Significance : Small batches benefit generalization (Keskar et al., 2017)
This Work's Choice : Expectation analysis
Advantages :
Avoids ln T \ln T ln T , ln ( 1 / δ ) \ln(1/\delta) ln ( 1/ δ ) factors Simpler proofs More flexible hyperparameter selection Limitations : High-probability guarantees are stronger (but incur logarithmic costs)
Zhang et al. (2020) : First proves SGDC convergence, rate O ( T − p − 1 3 p − 2 ) O(T^{-\frac{p-1}{3p-2}}) O ( T − 3 p − 2 p − 1 ) Cutkosky & Mehta (2021) : NSGDC high-probability results with ln T \ln T ln T factorsLiu et al. (2023) : NSGDC-VR, eliminates some logarithmic factorsNguyen et al. (2023) : Improved high-probability bounds for SGDCJohnson & Zhang (2013) : SVRG (convex case)Zhou et al. (2020) : Nested variance reduction (nonconvex)Cutkosky & Orabona (2019) : STORM algorithmFang et al. (2018) : SPIDER algorithmAllen-Zhu (2018) : Natasha 2Tripuraneni et al. (2018) : Stochastic cubic regularizationCutkosky & Mehta (2020b) : Normalized accelerationHübler et al. (2024) : Gradient normalization (requires mini-batch)Liu & Zhou (2024) : Gradient normalization + momentumDifferences from This Work :
No mini-batch requirement Unified framework (normalization, clipping, combination) Better noise dependence (in specific parameter ranges) Gradient clipping is not necessary : Normalization alone guarantees convergence (under individual smoothness)Combined methods have advantages : Improved noise dependence, eliminated logarithmic factorsVariance reduction compatible : Normalization alone suffices, clipping unnecessaryAcceleration feasible : Achieves O ( T − 2 p − 2 4 p − 1 ) O(T^{-\frac{2p-2}{4p-1}}) O ( T − 4 p − 1 2 p − 2 ) under second-order smoothnessUnified Perspective : Clarifies that clipping plays an "acceleration" rather than "necessity" roleTight Bound Analysis : Recovers deterministic case, proves analysis tightnessExpectation Framework : Simplifies proofs, provides clear hyperparameter guidanceTheoretical Work : Lacks experimental validation of actual performanceAssumption Restrictions :
NSGD requires individual Lipschitz (stronger) Acceleration requires second-order Lipschitz (even stronger) Bounded gradient at initialization (Assumption 2 condition (2)) Variance Reduction + Acceleration Unresolved : Cannot combine under second-order smoothnessHidden Constants : Constants in theoretical bounds may be largeExperimental Validation : Test theoretical predictions on real deep learning tasksRelaxing Assumptions : Explore weaker smoothness conditionsVariance Reduction + Acceleration : Overcome technical barriers for combinationAdaptive Methods : Automatically adjust θ \theta θ , γ \gamma γ and other parametersDistributed Settings : Extend to communication-limited scenariosQ : Can NSGD convergence be proven under global Lipschitz?
Concurrent work (Liu & Zhou, 2024) gives affirmative answer, but requires mini-batch Global Lipschitz result without mini-batch remains open Q : Can expectation bounds be converted to high-probability bounds without significant loss?
May require new concentration inequality techniques Complete Proofs : Appendix provides detailed proofs for all theorems (42 pages)Tight Bound Analysis : Verifies analysis tightness by recovering deterministic caseTechnical Innovation : Technique of simplifying high-probability analysis to expectation analysisSystematic Comparison : Table 1 clearly compares all methodsClear Applicable Scenarios : Trade-off between individual vs global LipschitzLogical Structure : Q1-Q3 questions guide the entire paperSimplified Implementation : NSGD requires no clipping parameter tuningNo Mini-batch Requirement : Beneficial for generalizationImproved Noise Dependence : Significant advantages when σ \sigma σ is smallClear Motivation : Three fundamental questions guide the narrativeTechnical Explanation : Section 2.2 concisely explains improvementsComprehensive Related Work : Detailed comparison with concurrent workPurely Theoretical : No validation on actual neural network trainingUnknown Constants : Hidden constants in theoretical bounds may affect practicalityParameter Sensitivity : Robustness of parameter selection not studiedStrong Individual Lipschitz : Many practical problems only satisfy global LipschitzInitialization Condition : B = sup ξ ∥ ∇ f ( w 0 ; ξ ) ∥ < ∞ B = \sup_{\xi}\|\nabla f(w_0; \xi)\| < \infty B = sup ξ ∥∇ f ( w 0 ; ξ ) ∥ < ∞ needs verificationRare Second-Order Smoothness : Hessian Lipschitz difficult to verify in practiceVariance Reduction + Acceleration Failure : Acknowledged inability to combine (Sec 5 end)Missing High-Probability Bounds : Expectation results weaker than high-probability guaranteesIncomplete Lower Bounds : Optimality of σ p 3 p − 2 \sigma^{\frac{p}{3p-2}} σ 3 p − 2 p dependence not provenLiu & Zhou (2024) : Proves NSGD under global Lipschitz, more generalHübler et al. (2024) : Provides high-probability bounds, strongerThis work's advantages mainly in no mini-batch requirement and specific noise dependence ranges Concept Clarification : Clarifies that clipping is "acceleration" rather than "necessity"Theoretical Tools : Expectation analysis framework may inspire subsequent workBenchmark Results : Provides detailed convergence rate comparisons (Table 1)Moderate : Theory guides practice, but lacks experimental validationHyperparameter Selection : Provides explicit parameter setting formulasAlgorithm Simplification : NSGD reduces tuning burdenTheory : Complete proofs, easy to verifyAlgorithms : Clear pseudocode (Algorithms 1-7)Implementation : No public code (purely theoretical work)Individual Lipschitz satisfied (e.g., finite-sum optimization) Unwilling to tune clipping parameters Small-batch training (generalization priority) Only global Lipschitz satisfied Noise level σ \sigma σ unknown or large Optimal noise dependence needed Individual Lipschitz satisfied Finite-sum problems (can compute individual gradients) Fastest convergence desired (O ( T − 1 / 3 ) O(T^{-1/3}) O ( T − 1/3 ) when p = 2 p=2 p = 2 ) Second-order Lipschitz satisfied Can afford additional computation (extrapolation step) Further acceleration needed Experimental Validation : Test on ImageNet, language models, etc.Relaxing Assumptions : Explore weaker smoothness (e.g., Hölder continuity)Adaptive Algorithms : Design parameter adjustment strategies without prior knowledgeTry NSGD First : Simple with theoretical guaranteesMonitor Gradient Norms : Verify ∥ ∇ f ( w t ; ξ t ) ∥ \|\nabla f(w_t; \xi_t)\| ∥∇ f ( w t ; ξ t ) ∥ boundednessSmall-Batch Training : Avoid large batches harming generalizationZhang et al. (2020) : "Adaptive Gradient Methods with Dynamic Bound of Learning Rate" - Original SGDC paperCutkosky & Mehta (2021) : "Momentum Improves Normalized SGD" - NSGDC high-probability analysisLiu et al. (2023) : "Breaking the Lower Bound with (Little) Structure" - NSGDC-VRArjevani et al. (2023) : "Lower Bounds for Non-Convex Stochastic Optimization" - Lower bound theoryCarmon et al. (2020) : "Lower Bounds for Finding Stationary Points I" - Lower bounds under individual smoothnessThis paper conducts in-depth theoretical research on gradient control techniques for SGD under heavy-tailed noise, with the core contribution being proof that gradient clipping is not necessary but beneficial . Through introducing a simplified expectation analysis framework, the authors improve existing results, eliminate logarithmic factors, and recover the deterministic case. Despite lacking experimental validation and having assumption restrictions, the unified theoretical perspective and clear scenario divisions provided by this work have important value for understanding and designing robust optimization algorithms. Notably, the simplicity and theoretical guarantees of the NSGD algorithm make it worth trying in practice. Future work should focus on experimental validation, assumption relaxation, and adaptive algorithm design.