2025-11-28T04:49:18.981607

Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration

Sun, Liu, Yuan

Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revisit this belief and explore whether gradient normalization can serve as an effective alternative or complement. We prove that, under individual smoothness assumptions, gradient normalization alone is sufficient to guarantee convergence of the nonconvex SGD. Moreover, when combined with clipping, it yields far better rates of convergence under more challenging noise distributions. We provide a unifying theory describing normalization-only, clipping-only, and combined approaches. Moving forward, we investigate existing variance-reduced algorithms, establishing that, in such a setting, normalization alone is sufficient for convergence. Finally, we present an accelerated variant that under second-order smoothness improves convergence. Our results provide theoretical insights and practical guidance for using normalization and clipping in nonconvex optimization with heavy-tailed noise.

academic

Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration

Basic Information

Paper ID: 2410.16561
Title: Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration
Authors: Tao Sun (National University of Defense Technology), Xinwang Liu (National University of Defense Technology), Kun Yuan (Peking University)
Classification: cs.LG, math.OC, stat.ML
Publication Date/Venue: Journal of Machine Learning Research 26 (2025) 1-42, Submitted 11/24; Revised 9/25; Published 11/25
Paper Link: https://arxiv.org/abs/2410.16561v4

Abstract

This paper revisits the necessity of gradient clipping in convergence guarantees for stochastic gradient descent (SGD) under heavy-tailed noise. The conventional wisdom suggests that gradient clipping is essential for handling heavy-tailed gradient noise. However, this paper proves that under individual smoothness assumptions, gradient normalization alone suffices to guarantee convergence of nonconvex SGD. Furthermore, when normalization is combined with clipping under more challenging noise distributions, improved convergence rates are achieved. The paper provides a unified theoretical framework describing the performance of normalization-only, clipping-only, and combined approaches. The research extends to variance-reduced algorithms, proving that normalization alone is sufficient for convergence, and proposes accelerated variants that improve convergence under second-order smoothness assumptions.

Research Background and Motivation

1. Core Problem to Address

In machine learning optimization, SGD is the primary algorithm for solving nonconvex optimization problems:

$\min_{w \in \mathbb{R}^d} f(w) := \mathbb{E}_{\xi \sim \mathcal{D}}[f(w; \xi)]$

Traditional SGD analysis assumes gradient noise has bounded variance: $\mathbb{E}\|g_t - \nabla f(w_t)\|^2 \leq \sigma^2$ . However, recent research (Zhang et al., 2020; Nguyen et al., 2019) reveals that when training neural networks (particularly language models), this assumption is unrealistic. In practice, gradient noise exhibits heavy-tailed distribution characteristics.

2. Mathematical Definition of Heavy-Tailed Noise

Assumption 1 (Heavy-tailed Noise): There exist constants $\sigma > 0$ and $p \in (1, 2]$ such that:

$\sup_{w \in \mathbb{R}^d} \{\mathbb{E}_{\xi \sim \mathcal{D}}\|\nabla f(w; \xi) - \nabla f(w)\|^p\} \leq \sigma^p$

When $p = 2$ , this degenerates to the standard bounded variance assumption. When $1 < p < 2$ , Zhang et al. (2020) proved that standard SGD fails to converge, highlighting the severity of the problem.

3. Existing Methods and Their Limitations

Mainstream Solutions:

SGDC (Zhang et al., 2020): Uses gradient clipping $\text{Clip}_h(w) := \min\{1, \frac{h}{\|w\|}\}w$
NSGDC (Cutkosky & Mehta, 2021): Combines gradient normalization with clipping
NSGDC-VR (Liu et al., 2023): Variance-reduced version

Limitations:

Necessity of gradient clipping insufficiently questioned: All existing methods employ clipping, but is it truly necessary?
Advantages of combined methods unclear: NSGDC achieves the same convergence rate as SGDC (Liu et al., 2023), failing to demonstrate theoretical advantages of combination
Complex hyperparameter tuning: Clipping introduces additional hyperparameter $h$ , increasing tuning burden

4. Research Motivation

This paper poses three fundamental questions (Q1-Q3):

Q1: Is gradient clipping truly indispensable? Can gradient normalization alone guarantee convergence?

Q2: Does combining normalization with clipping outperform using either technique alone?

Q3: Can NSGDC achieve accelerated convergence under heavy-tailed noise?

Core Contributions

The main contributions of this paper include:

Proving Sufficiency of Gradient Normalization (Answering Q1):
- Proves that gradient normalization used alone guarantees SGD convergence under individual Lipschitz assumptions
- Proposes NSGD and NSGD-VR algorithms without clipping hyperparameters
Improving Convergence Rates of NSGDC/NSGDC-VR (Answering Q2):
- Eliminates logarithmic factors $\ln T$ from previous results
- Proves combined methods significantly outperform clipping-only methods when $\sigma \to 0$
- Achieves optimal convergence rate in expectation: $O(T^{-\frac{p-1}{3p-2}})$
Proposing Accelerated Algorithms (Answering Q3):
- Designs A-NSGDC algorithm leveraging second-order smoothness
- Improves convergence rate from $O(T^{-\frac{p-1}{3p-2}})$ to $O(T^{-\frac{2p-2}{4p-1}})$
Unified Theoretical Framework:
- Provides unified analysis covering normalization, clipping, and combined methods
- Clarifies applicable scenarios and performance boundaries for each method
No Mini-batch Requirements:
- All results require no large batch assumptions, beneficial for generalization

Method Details

Problem Formulation

Optimization Problem: $\min_{w \in \mathbb{R}^d} f(w) = \mathbb{E}_{\xi \sim \mathcal{D}}[f(w; \xi)]$

Objective: Under heavy-tailed noise (Assumption 1), find an $\epsilon$ -approximate first-order stationary point, i.e., $\|\nabla f(w)\| \leq \epsilon$ .

Convergence Metric: $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\|$

Core Algorithms

1. NSGD (Normalization Only)

Algorithm 4 (NSGD):

Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
    Sample ξₜ ~ D
    mₜ = θmₜ₋₁ + (1-θ)∇f(wₜ; ξₜ)
    wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖

Key Features:

Controls step size through normalization $\frac{m_t}{\|m_t\|}$
No clipping hyperparameter $h$ required
Momentum parameter $\theta$ smooths gradient estimates

2. NSGD-VR (Variance-Reduced Version)

Algorithm 5 (NSGD-VR):

Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
    Sample ξₜ ~ D
    mₜ = θmₜ₋₁ + ∇f(wₜ; ξₜ) - θ∇f(wₜ₋₁; ξₜ)
    wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖

Variance Reduction Mechanism:

Uses the same sample $\xi_t$ to compute $\nabla f(w_t; \xi_t)$ and $\nabla f(w_{t-1}; \xi_t)$
Difference term $\nabla f(w_t; \xi_t) - \theta\nabla f(w_{t-1}; \xi_t)$ reduces variance

3. NSGDC (Normalization + Clipping)

Algorithm 2 (NSGDC):

Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
    Sample unbiased stochastic gradient gₜ
    mₜ = θmₜ₋₁ + (1-θ)Clipₕ(gₜ)
    wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖

Clipping Function: $\text{Clip}_h(w) = \min\{1, \frac{h}{\|w\|}\}w$

4. A-NSGDC (Accelerated Version)

Algorithm 6 (A-NSGDC):

Initialize: w₀ = w₁, m₀ = 0
For t = 1, 2, ...:
    vₜ = wₜ + ζ(wₜ - wₜ₋₁)  # Extrapolation step
    Sample gₜ such that 𝔼gₜ = ∇f(vₜ)
    mₜ = θmₜ₋₁ + (1-θ)Clipₕ(gₜ)
    wₜ₊₁ = wₜ - γ mₜ/‖mₜ‖

Acceleration Mechanism:

Extrapolation point $v_t$ leverages momentum $\zeta = \frac{\theta}{1-\theta}$
Requires second-order Lipschitz assumption (Hessian continuity)

Technical Innovations

1. Key Technical Lemmas

Lemma 7 (Control of Clipped Gradients): If $h \geq 2(\|\nabla f(w_0)\| + L\gamma T)$ , then: $\mathbb{E}\|\text{Clip}_h(g_t) - \mathbb{E}\text{Clip}_h(g_t)\|^2 \leq 10h^{2-p}\sigma^p$ $\|\mathbb{E}\text{Clip}_h(g_t) - \nabla f(w_t)\| \leq 2\sigma^p h^{-(p-1)}$

Lemma 8 (Control of Normalized Gradients): Under individual Lipschitz: $\mathbb{E}_{\xi_t}\|\nabla f(w_t; \xi_t) - \nabla f(w_t)\|^2 \leq 4(B + L\gamma T)^{2-p}\sigma^p$

where $B = \sup_{\xi}\|\nabla f(w_0; \xi)\|$ (gradient bound at initialization).

2. Proof Strategy Innovation

Difficulty with Traditional Methods: Directly controlling $\mathbb{E}\|\text{Clip}_h(g_t) - \nabla f(w_t)\|^2$ is extremely complex, leading to high-probability analysis and logarithmic factors.

Breakthrough in This Work:

Leverages implicit bound from normalization: $\|\nabla f(w_t)\| \leq \|\nabla f(w_0)\| + L\gamma T$
Sets $h \geq 2(\|\nabla f(w_0)\| + L\gamma T)$ to ensure $\|\nabla f(w_t)\| \leq \frac{h}{2}$
Simplifies to expectation analysis, avoiding complex high-probability techniques

3. Individual vs Global Lipschitz

Assumption 2 (Individual Lipschitz): $\|\nabla f(y; \xi) - \nabla f(x; \xi)\| \leq L\|y - x\|, \quad \forall \xi$

Assumption 2' (Global Lipschitz): $\|\nabla f(y) - \nabla f(x)\| \leq L\|y - x\|$

Relationship: Individual Lipschitz $\Rightarrow$ Global Lipschitz (converse does not hold)

Impact:

NSGD/NSGD-VR require individual Lipschitz (to bound $\|\nabla f(w_t; \xi_t)\|$ )
NSGDC/A-NSGDC require only global Lipschitz (clipping provides additional control)

Theoretical Results

Main Theorems

Theorem 1 (NSGD Convergence Rate)

Under Assumptions 1-2, with settings:

$1 - \theta = \min\{\frac{\max\{(L\Delta)^{1/2}, 1\}}{\sigma^{\frac{4p-4}{3p-2}}T^{\frac{p}{3p-2}}}, 1\}$
$\gamma = \sqrt{\frac{\Delta}{L}}\frac{\sqrt{1-\theta}}{\sqrt{T}}$

Then: $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{(L\Delta)^{1/4}\sigma^{\frac{2p-2}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}\right)$

Key Insights:

Leading term $O(T^{-\frac{p-1}{3p-2}})$ matches NSGDC
Secondary term $O(T^{-1/2})$ recovers GD rate when $\sigma = 0$
No clipping hyperparameter required

Theorem 2 (NSGD-VR Convergence Rate)

Under Assumptions 1-2, with settings:

$1 - \theta = \min\{\frac{1}{\sigma^{\frac{p}{2p-1}}T^{\frac{p}{2p-1}}}, 1\}$
$\gamma = \frac{4\sqrt{1-\theta}}{L\sqrt{T}}$

Then: $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{\sigma^{\frac{p}{2p-1}}}{T^{\frac{p-1}{2p-1}}} + \frac{1}{T^{1/2}}\right)$

Improvements:

Exponent $\frac{p-1}{2p-1} > \frac{p-1}{3p-2}$ (variance reduction acceleration)
When $p=2$ : $\frac{1}{3}$ vs $\frac{1}{4}$ (standard vs variance-reduced)
Matches lower bounds (Arjevani et al., 2023)

Theorem 3 (NSGDC Convergence Rate)

Under Assumptions 1, 2', with appropriate hyperparameter settings: $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{(L\Delta)^{\frac{p-1}{3p-2}}\sigma^{\frac{p}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}}\right)$

Comparison with Prior Work:

Eliminates logarithmic factors: Liu et al. (2023) has $\ln T$ terms, this work does not
Improved noise dependence: $\sigma^{\frac{p}{3p-2}}$ vs $\sigma$ (former is smaller when $p < 2$ )
Recovers deterministic case: $O(T^{-1/2})$ when $\sigma = 0$

Theorem 5 (A-NSGDC Accelerated Convergence)

Under Assumptions 1, 2', 3 (second-order Lipschitz): $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(w_t)\| = O\left(\frac{\sigma^{4/7}}{T^{\frac{2p-2}{4p-1}}} + \frac{1}{T^{1/2}}\right)$

Acceleration Effect:

Exponent $\frac{2p-2}{4p-1} > \frac{p-1}{3p-2}$
When $p=2$ : $\frac{2}{7}$ vs $\frac{1}{4}$ (accelerated vs standard)
Requires Hessian Lipschitz continuity

Comparative Analysis (Table 1 Summary)

Algorithm	Paper	Convergence Rate	Assumptions
SGDC	Zhang et al. (2020)	$O(T^{-\frac{p-1}{3p-2}} + T^{-\frac{2p-p^2}{3p-2}}\sigma^{\frac{2p^2}{3p-2}})$	GL
NSGDC	Liu et al. (2023)	$O(\max\{\frac{\sigma \ln T}{T^{\frac{p-1}{3p-2}}}, \frac{1}{T^{\frac{p-1}{3p-2}}}\})$	GL
NSGD	This work Thm 2	$O(\frac{\sigma^{\frac{2p-2}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}})$	IL
NSGDC	This work Thm 3	$O(\frac{\sigma^{\frac{p}{3p-2}}}{T^{\frac{p-1}{3p-2}}} + \frac{1}{T^{1/2}})$	GL

GL: Global Lipschitz, IL: Individual Lipschitz

Experimental Setup

Note: This is a purely theoretical work with no experimental section. All results are theoretical proofs.

Theoretical Verification Methods

Matching Lower Bounds: Proves convergence rates achieve known lower bounds (Carmon et al., 2020)
Recovery of Special Cases:
- Recovers standard SGD results when $p = 2$
- Recovers gradient descent rate when $\sigma = 0$
Comparison with Existing Results: Demonstrates improvements through theoretical analysis

Theoretical Analysis and Insights

1. Analysis of Clipping Necessity

Conclusion: Clipping is not necessary but beneficial

Evidence:

Sufficiency: Theorem 1 proves normalization alone is sufficient (under IL)
Acceleration: Theorem 3 proves combined methods improve noise dependence
Trade-off: Clipping adds hyperparameter but relaxes smoothness assumption (GL vs IL)

Scenario Division:

Use normalization alone: Individual smoothness satisfied, no clipping parameter tuning needed
Use combined approach: Only global smoothness satisfied, optimal noise dependence desired

2. Improvement in Noise Dependence

Key Observation: Combined methods show significant advantages when $\sigma$ is small

Quantitative Analysis ( $p = 1.5$ example):

SGDC: $O(\sigma)$
NSGDC: $O(\sigma^{1/2})$
Improvement factor: $\sqrt{\sigma}$ (tends to infinity as $\sigma \to 0$ )

3. Impact of Mini-batch

This Work's Results: No mini-batch assumption required

Comparison with Concurrent Work:

Hübler et al. (2024): Requires specific mini-batch size
This work: Batch size = 1 suffices

Practical Significance: Small batches benefit generalization (Keskar et al., 2017)

4. Expectation vs High Probability

This Work's Choice: Expectation analysis

Advantages:

Avoids $\ln T$ , $\ln(1/\delta)$ factors
Simpler proofs
More flexible hyperparameter selection

Limitations: High-probability guarantees are stronger (but incur logarithmic costs)

1. SGD under Heavy-Tailed Noise

Zhang et al. (2020): First proves SGDC convergence, rate $O(T^{-\frac{p-1}{3p-2}})$
Cutkosky & Mehta (2021): NSGDC high-probability results with $\ln T$ factors
Liu et al. (2023): NSGDC-VR, eliminates some logarithmic factors
Nguyen et al. (2023): Improved high-probability bounds for SGDC

2. Nonconvex Variance Reduction

Johnson & Zhang (2013): SVRG (convex case)
Zhou et al. (2020): Nested variance reduction (nonconvex)
Cutkosky & Orabona (2019): STORM algorithm
Fang et al. (2018): SPIDER algorithm

3. Second-Order Smoothness Acceleration

Allen-Zhu (2018): Natasha 2
Tripuraneni et al. (2018): Stochastic cubic regularization
Cutkosky & Mehta (2020b): Normalized acceleration

4. Concurrent Work

Hübler et al. (2024): Gradient normalization (requires mini-batch)
Liu & Zhou (2024): Gradient normalization + momentum

Differences from This Work:

No mini-batch requirement
Unified framework (normalization, clipping, combination)
Better noise dependence (in specific parameter ranges)

Conclusions and Discussion

Main Conclusions

Gradient clipping is not necessary: Normalization alone guarantees convergence (under individual smoothness)
Combined methods have advantages: Improved noise dependence, eliminated logarithmic factors
Variance reduction compatible: Normalization alone suffices, clipping unnecessary
Acceleration feasible: Achieves $O(T^{-\frac{2p-2}{4p-1}})$ under second-order smoothness

Theoretical Contributions

Unified Perspective: Clarifies that clipping plays an "acceleration" rather than "necessity" role
Tight Bound Analysis: Recovers deterministic case, proves analysis tightness
Expectation Framework: Simplifies proofs, provides clear hyperparameter guidance

Limitations

Theoretical Work: Lacks experimental validation of actual performance
Assumption Restrictions:
- NSGD requires individual Lipschitz (stronger)
- Acceleration requires second-order Lipschitz (even stronger)
- Bounded gradient at initialization (Assumption 2 condition (2))
Variance Reduction + Acceleration Unresolved: Cannot combine under second-order smoothness
Hidden Constants: Constants in theoretical bounds may be large

Future Directions

Experimental Validation: Test theoretical predictions on real deep learning tasks
Relaxing Assumptions: Explore weaker smoothness conditions
Variance Reduction + Acceleration: Overcome technical barriers for combination
Adaptive Methods: Automatically adjust $\theta$ , $\gamma$ and other parameters
Distributed Settings: Extend to communication-limited scenarios

Open Questions

Q: Can NSGD convergence be proven under global Lipschitz?

Concurrent work (Liu & Zhou, 2024) gives affirmative answer, but requires mini-batch
Global Lipschitz result without mini-batch remains open

Q: Can expectation bounds be converted to high-probability bounds without significant loss?

May require new concentration inequality techniques

In-Depth Evaluation

Strengths

1. Theoretical Rigor

Complete Proofs: Appendix provides detailed proofs for all theorems (42 pages)
Tight Bound Analysis: Verifies analysis tightness by recovering deterministic case
Technical Innovation: Technique of simplifying high-probability analysis to expectation analysis

2. Unified Framework

Systematic Comparison: Table 1 clearly compares all methods
Clear Applicable Scenarios: Trade-off between individual vs global Lipschitz
Logical Structure: Q1-Q3 questions guide the entire paper

3. Practical Significance

Simplified Implementation: NSGD requires no clipping parameter tuning
No Mini-batch Requirement: Beneficial for generalization
Improved Noise Dependence: Significant advantages when $\sigma$ is small

4. Writing Quality

Clear Motivation: Three fundamental questions guide the narrative
Technical Explanation: Section 2.2 concisely explains improvements
Comprehensive Related Work: Detailed comparison with concurrent work

Weaknesses

1. Lack of Experiments

Purely Theoretical: No validation on actual neural network training
Unknown Constants: Hidden constants in theoretical bounds may affect practicality
Parameter Sensitivity: Robustness of parameter selection not studied

2. Assumption Restrictions

Strong Individual Lipschitz: Many practical problems only satisfy global Lipschitz
Initialization Condition: $B = \sup_{\xi}\|\nabla f(w_0; \xi)\| < \infty$ needs verification
Rare Second-Order Smoothness: Hessian Lipschitz difficult to verify in practice

3. Technical Limitations

Variance Reduction + Acceleration Failure: Acknowledged inability to combine (Sec 5 end)
Missing High-Probability Bounds: Expectation results weaker than high-probability guarantees
Incomplete Lower Bounds: Optimality of $\sigma^{\frac{p}{3p-2}}$ dependence not proven

4. Competition with Concurrent Work

Liu & Zhou (2024): Proves NSGD under global Lipschitz, more general
Hübler et al. (2024): Provides high-probability bounds, stronger
This work's advantages mainly in no mini-batch requirement and specific noise dependence ranges

Impact Assessment

Contribution to the Field

Concept Clarification: Clarifies that clipping is "acceleration" rather than "necessity"
Theoretical Tools: Expectation analysis framework may inspire subsequent work
Benchmark Results: Provides detailed convergence rate comparisons (Table 1)

Practical Value

Moderate: Theory guides practice, but lacks experimental validation
Hyperparameter Selection: Provides explicit parameter setting formulas
Algorithm Simplification: NSGD reduces tuning burden

Reproducibility

Theory: Complete proofs, easy to verify
Algorithms: Clear pseudocode (Algorithms 1-7)
Implementation: No public code (purely theoretical work)

Applicable Scenarios

Recommended NSGD Use Cases

Individual Lipschitz satisfied (e.g., finite-sum optimization)
Unwilling to tune clipping parameters
Small-batch training (generalization priority)

Recommended NSGDC Use Cases

Only global Lipschitz satisfied
Noise level $\sigma$ unknown or large
Optimal noise dependence needed

Recommended NSGD-VR Use Cases

Individual Lipschitz satisfied
Finite-sum problems (can compute individual gradients)
Fastest convergence desired ( $O(T^{-1/3})$ when $p=2$ )

Recommended A-NSGDC Use Cases

Second-order Lipschitz satisfied
Can afford additional computation (extrapolation step)
Further acceleration needed

Suggestions for Future Research

For Researchers

Experimental Validation: Test on ImageNet, language models, etc.
Relaxing Assumptions: Explore weaker smoothness (e.g., Hölder continuity)
Adaptive Algorithms: Design parameter adjustment strategies without prior knowledge

For Practitioners

Try NSGD First: Simple with theoretical guarantees
Monitor Gradient Norms: Verify $\|\nabla f(w_t; \xi_t)\|$ boundedness
Small-Batch Training: Avoid large batches harming generalization

Selected References

Zhang et al. (2020): "Adaptive Gradient Methods with Dynamic Bound of Learning Rate" - Original SGDC paper
Cutkosky & Mehta (2021): "Momentum Improves Normalized SGD" - NSGDC high-probability analysis
Liu et al. (2023): "Breaking the Lower Bound with (Little) Structure" - NSGDC-VR
Arjevani et al. (2023): "Lower Bounds for Non-Convex Stochastic Optimization" - Lower bound theory
Carmon et al. (2020): "Lower Bounds for Finding Stationary Points I" - Lower bounds under individual smoothness

Summary

This paper conducts in-depth theoretical research on gradient control techniques for SGD under heavy-tailed noise, with the core contribution being proof that gradient clipping is not necessary but beneficial. Through introducing a simplified expectation analysis framework, the authors improve existing results, eliminate logarithmic factors, and recover the deterministic case. Despite lacking experimental validation and having assumption restrictions, the unified theoretical perspective and clear scenario divisions provided by this work have important value for understanding and designing robust optimization algorithms. Notably, the simplicity and theoretical guarantees of the NSGD algorithm make it worth trying in practice. Future work should focus on experimental validation, assumption relaxation, and adaptive algorithm design.