2025-11-20T19:04:15.290366

Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Kondo, Iiduka
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
academic

Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Basic Information

  • Paper ID: 2508.03105
  • Title: Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
  • Authors: Yuichi Kondo, Hideaki Iiduka (Meiji University)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: October 10, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2508.03105v2

Abstract

This paper analyzes the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel and simplified Lyapunov function. The research extends existing theoretical frameworks to cover three practical scheduling strategies commonly used in deep learning: constant batch size with decaying learning rate, increasing batch size with decaying learning rate, and simultaneously increasing both batch size and learning rate. The results reveal a clear convergence hierarchy: constant batch size cannot guarantee convergence of the expected gradient norm, while increasing batch size can, and simultaneously increasing both batch size and learning rate achieves provably faster decay. Experimental results validate the theory, demonstrating that SGDM with dynamic schedules significantly outperforms fixed hyperparameter counterparts in convergence speed.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is: how to theoretically guide dynamic scheduling of learning rate and batch size in SGDM to achieve better convergence performance.

Significance

  1. Practical Demand: Dynamic learning rate scheduling (e.g., cosine annealing) is widely adopted in deep learning training but lacks theoretical support
  2. Efficiency Improvement: Increasing batch size has been reported to improve mini-batch SGD efficiency, but theoretical analysis under the SGDM framework is limited
  3. Theoretical Gap: Existing SGDM theoretical analysis is primarily limited to fixed learning rates; a theoretical framework for dynamic scheduling is urgently needed

Limitations of Existing Methods

  1. Umeda and Iiduka (2025): Only analyzes dynamic scheduling for vanilla SGD, not momentum methods
  2. Kamo and Iiduka (2025): Studies SGDM convergence with constant learning rate and increasing batch size, but does not consider dynamic learning rate
  3. Liu et al. (2020): Analyzes NSHB under fixed learning rate, but extension to dynamic scheduling remains challenging

Research Motivation

To fill the theoretical analysis gap for SGDM dynamic learning rate scheduling and provide theoretical guidance for practical training.

Core Contributions

  1. Novel Lyapunov Function: Proposes a simplified Lyapunov function adapted to dynamic learning rate scheduling, more concise than existing methods
  2. Unified Theoretical Framework: Establishes a unified analysis framework covering both SHB and NSHB, applicable to various scheduling strategies
  3. Theoretical Extension: Extends the analysis of Kamo and Iiduka (2025) from constant learning rate to decaying learning rate, and studies the case of simultaneously increasing both learning rate and batch size
  4. Convergence Hierarchy: Theoretically proves the convergence performance ordering of four scheduling strategies and validates through experiments

Methodology Details

Task Definition

Studies the empirical risk minimization problem: minθRdf(θ)=1ni=1nfi(θ)\min_{\theta \in \mathbb{R}^d} f(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta), where fi(θ)=f(θ;(xi,yi))f_i(\theta) = f(\theta; (x_i, y_i)) is the loss function. The goal is to find a stationary point θRd\theta^* \in \mathbb{R}^d such that f(θ)=0\nabla f(\theta^*) = 0.

Theoretical Framework

Lyapunov Function Design

Proposes a new Lyapunov function:

f(\theta_t), & t = 0 \\ f(\theta_t) + A_{t-1}\|m_{t-1}\|^2, & t > 0 \end{cases}$$ where $A_t \geq 0$ is a deterministic scalar depending only on $t$. For the NSHB method: $$A_t := \frac{\eta_t - L(1-\beta)\eta_t^2}{2(1-\beta)}$$ #### Algorithm Description **NSHB Algorithm**: ``` m_t = βm_{t-1} + (1-β)∇f_{B_t}(θ_t) θ_{t+1} = θ_t - η_t m_t ``` **SHB Algorithm**: ``` m_t = βm_{t-1} + ∇f_{B_t}(θ_t) θ_{t+1} = θ_t - α_t m_t ``` ### Technical Innovations #### 1. Simplified Lyapunov Function Compared to existing methods (e.g., the complex form in Liu et al. 2020), this paper's Lyapunov function is concise in form and naturally adapts to dynamic learning rates. #### 2. Unified Analysis Framework By introducing the technical condition $\frac{\lambda_{t+1}}{\lambda_t} \leq c$ (where $1 \leq c < \frac{1}{\beta^2}$), simultaneously handles both decreasing and increasing learning rate schedules. #### 3. Cross-Term Elimination Technique Through careful selection of the definition of $A_t$, successfully eliminates the cross-term $E[\langle\nabla f(\theta_t), m_{t-1}\rangle]$ in the analysis, which is the key technical challenge of this analysis. ## Experimental Setup ### Datasets - **Dataset**: CIFAR-100 - **Model**: ResNet-18 - **Training Epochs**: 300 - **Momentum Coefficient**: β = 0.9 ### Hardware Environment - **CPU**: Dual Intel Xeon Silver 4316 - **GPU**: NVIDIA Tesla A100 80GB - **Software**: Python 3.8.2, CUDA 12.2, PyTorch 2.4.1 ### Scheduling Strategies Studies four training schedules: 1. **Constant Batch Size + Decaying Learning Rate**: Batch size fixed at 128 2. **Increasing Batch Size + Decaying Learning Rate**: Batch size doubles every 30 epochs (2³ to 2¹²) 3. **Increasing Batch Size + Increasing Learning Rate**: Both batch size and learning rate grow simultaneously 4. **Increasing Batch Size + Warm-up Learning Rate**: Learning rate schedule that increases then decreases ### Evaluation Metrics - Training loss - Test accuracy - Full gradient norm $\|\nabla f(\theta_e)\|$ ## Experimental Results ### Main Theoretical Results #### Theorem 1: Unified Convergence Bound Under the assumption conditions, for both NSHB and SHB: $$\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|^2] \leq 2C_{alg}(f(\theta_0) - f^*)B_T + \sigma^2 V_T$$ where: - $B_T = \frac{1}{\sum_{t=0}^{T-1}\lambda_t}$ - $V_T = \frac{1}{\sum_{t=0}^{T-1}\lambda_t}\sum_{t=0}^{T-1}\frac{\lambda_t}{b_t}$ - $C_{alg} = (1-\beta)^{-1}$ (NSHB), $C_{alg} = 1$ (SHB) #### Convergence Rate Analysis **Constant Batch Size Case**: $$\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\sqrt{\frac{1}{T} + \frac{1}{b}}\right)$$ **Increasing Batch Size Case**: $$\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\frac{1}{\sqrt{T}}\right)$$ **Simultaneously Increasing Batch Size and Learning Rate**: $$\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\frac{1}{\gamma^{M/2}}\right)$$ ### Experimental Validation #### Convergence Performance Ordering Experimental results completely validate the theoretically predicted convergence hierarchy: 1. **Worst**: Constant batch size + decaying learning rate 2. **Moderate**: Increasing batch size + decaying learning rate 3. **Better**: Increasing batch size + increasing learning rate 4. **Optimal**: Increasing batch size + warm-up learning rate #### Specific Numerical Results - NSHB and SHB exhibit the same ordering in gradient norm convergence - The warm-up strategy also achieves the best test accuracy performance - For SHB, high learning rate achieves faster gradient norm decay, but low learning rate achieves better test accuracy #### Comparison with Other Optimizers Under increasing batch size scheduling, SGD, NSHB, and SHB show rapid gradient norm decrease in early stages, but Adam achieves smaller gradient norms in later stages. ## Related Work ### Theoretical Analysis of Momentum Methods - **Liu et al. (2020)**: Pioneering work on NSHB with fixed learning rate - **Gadat et al. (2018), Mai and Johansson (2020)**: Convergence analysis based on Lyapunov functions - **Wilson et al. (2021), Defazio (2021)**: Theoretical analysis of accelerated methods ### Learning Rate and Batch Size Scheduling - **Umeda and Iiduka (2025)**: Dynamic scheduling analysis for vanilla SGD - **Kamo and Iiduka (2025)**: Analysis of SGDM with increasing batch size - **Smith et al. (2018)**: Effectiveness of batch size scheduling in practice ### Advantages of This Work Compared to existing work, this paper provides the first complete theoretical framework for SGDM dynamic learning rate scheduling, filling an important theoretical gap. ## Conclusions and Discussion ### Main Conclusions 1. **Theoretical Contribution**: Establishes a complete theoretical framework for SGDM dynamic scheduling 2. **Convergence Hierarchy**: Proves that increasing batch size outperforms constant batch size, with simultaneously increasing both being optimal 3. **Experimental Validation**: Theoretical predictions are highly consistent with experimental results ### Limitations 1. **Assumption Conditions**: Requires L-smoothness and bounded variance assumptions 2. **Learning Rate Constraints**: The technical condition $\frac{\lambda_{t+1}}{\lambda_t} \leq c < \frac{1}{\beta^2}$ limits the learning rate growth speed 3. **Experimental Scope**: Validation only on CIFAR-100 and ResNet-18, lacking large-scale experiments ### Future Directions 1. **Momentum Coefficient Scheduling**: Extension to dynamic scheduling of momentum coefficient $\beta$ 2. **Other Optimizers**: Extension of analysis to adaptive methods like Adam 3. **Practical Applications**: Validation on larger-scale deep learning tasks ## In-Depth Evaluation ### Strengths 1. **Theoretical Rigor**: Lyapunov function design is ingenious with rigorous mathematical derivations 2. **Practical Value**: Provides theoretical guidance for hyperparameter scheduling in practical training 3. **Unified Framework**: Simultaneously analyzes SHB and NSHB with good generality 4. **Sufficient Experiments**: High consistency between theoretical and experimental results, enhancing credibility ### Weaknesses 1. **Limited Innovation**: Primarily an extension of existing techniques with relatively limited core innovation 2. **Experimental Scale**: Experiments limited to medium-scale problems, lacking large-scale validation 3. **Practical Constraints**: Technical conditions in theoretical analysis may be difficult to strictly satisfy in practice 4. **Insufficient Comparison**: Lacks in-depth comparison with latest adaptive optimization methods ### Impact 1. **Theoretical Value**: Provides important theoretical foundation for SGDM dynamic scheduling 2. **Practical Significance**: Guides hyperparameter setting in practical deep learning training 3. **Reproducibility**: Code is publicly available, experiments are reproducible ### Applicable Scenarios 1. **Deep Learning Training**: Particularly suitable for scenarios requiring fine-grained learning rate and batch size scheduling 2. **Theoretical Research**: Provides foundation for further optimization theory research 3. **Engineering Practice**: Guides automatic hyperparameter adjustment in practical training systems ## References - Liu, Y., Gao, Y., and Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum - Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent - Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum - Smith, S. L., Kindermans, P.-J., and Le, Q. V. (2018). Don't decay the learning rate, increase the batch size --- **Overall Assessment**: This is a paper with solid theoretical contributions that successfully analyzes the dynamic scheduling problem of SGDM through a simplified Lyapunov function. While the innovation is relatively limited, it fills an important theoretical gap and provides valuable guidance for practical applications. The theoretical analysis is rigorous and experimental validation is sufficient, making it a beneficial contribution to the field of optimization theory.