2025-11-20T19:04:15.290366

Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Kondo, Iiduka

We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.

academic

Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Basic Information

Paper ID: 2508.03105
Title: Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
Authors: Yuichi Kondo, Hideaki Iiduka (Meiji University)
Classification: cs.LG (Machine Learning)
Publication Date: October 10, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2508.03105v2

Abstract

This paper analyzes the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel and simplified Lyapunov function. The research extends existing theoretical frameworks to cover three practical scheduling strategies commonly used in deep learning: constant batch size with decaying learning rate, increasing batch size with decaying learning rate, and simultaneously increasing both batch size and learning rate. The results reveal a clear convergence hierarchy: constant batch size cannot guarantee convergence of the expected gradient norm, while increasing batch size can, and simultaneously increasing both batch size and learning rate achieves provably faster decay. Experimental results validate the theory, demonstrating that SGDM with dynamic schedules significantly outperforms fixed hyperparameter counterparts in convergence speed.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is: how to theoretically guide dynamic scheduling of learning rate and batch size in SGDM to achieve better convergence performance.

Significance

Practical Demand: Dynamic learning rate scheduling (e.g., cosine annealing) is widely adopted in deep learning training but lacks theoretical support
Efficiency Improvement: Increasing batch size has been reported to improve mini-batch SGD efficiency, but theoretical analysis under the SGDM framework is limited
Theoretical Gap: Existing SGDM theoretical analysis is primarily limited to fixed learning rates; a theoretical framework for dynamic scheduling is urgently needed

Limitations of Existing Methods

Umeda and Iiduka (2025): Only analyzes dynamic scheduling for vanilla SGD, not momentum methods
Kamo and Iiduka (2025): Studies SGDM convergence with constant learning rate and increasing batch size, but does not consider dynamic learning rate
Liu et al. (2020): Analyzes NSHB under fixed learning rate, but extension to dynamic scheduling remains challenging

Research Motivation

To fill the theoretical analysis gap for SGDM dynamic learning rate scheduling and provide theoretical guidance for practical training.

Core Contributions

Novel Lyapunov Function: Proposes a simplified Lyapunov function adapted to dynamic learning rate scheduling, more concise than existing methods
Unified Theoretical Framework: Establishes a unified analysis framework covering both SHB and NSHB, applicable to various scheduling strategies
Theoretical Extension: Extends the analysis of Kamo and Iiduka (2025) from constant learning rate to decaying learning rate, and studies the case of simultaneously increasing both learning rate and batch size
Convergence Hierarchy: Theoretically proves the convergence performance ordering of four scheduling strategies and validates through experiments

Methodology Details

Task Definition

Studies the empirical risk minimization problem: $\min_{\theta \in \mathbb{R}^d} f(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$ , where $f_i(\theta) = f(\theta; (x_i, y_i))$ is the loss function. The goal is to find a stationary point $\theta^* \in \mathbb{R}^d$ such that $\nabla f(\theta^*) = 0$ .

Theoretical Framework

Lyapunov Function Design

Proposes a new Lyapunov function: $L_t := \begin{cases} f(\theta_t), & t = 0 \\ f(\theta_t) + A_{t-1}\|m_{t-1}\|^2, & t > 0 \end{cases}$

where $A_t \geq 0$ is a deterministic scalar depending only on $t$ . For the NSHB method: $A_t := \frac{\eta_t - L(1-\beta)\eta_t^2}{2(1-\beta)}$

Algorithm Description

NSHB Algorithm:

m_t = βm_{t-1} + (1-β)∇f_{B_t}(θ_t)
θ_{t+1} = θ_t - η_t m_t

SHB Algorithm:

m_t = βm_{t-1} + ∇f_{B_t}(θ_t)
θ_{t+1} = θ_t - α_t m_t

Technical Innovations

1. Simplified Lyapunov Function

Compared to existing methods (e.g., the complex form in Liu et al. 2020), this paper's Lyapunov function is concise in form and naturally adapts to dynamic learning rates.

2. Unified Analysis Framework

By introducing the technical condition $\frac{\lambda_{t+1}}{\lambda_t} \leq c$ (where $1 \leq c < \frac{1}{\beta^2}$ ), simultaneously handles both decreasing and increasing learning rate schedules.

3. Cross-Term Elimination Technique

Through careful selection of the definition of $A_t$ , successfully eliminates the cross-term $E[\langle\nabla f(\theta_t), m_{t-1}\rangle]$ in the analysis, which is the key technical challenge of this analysis.

Experimental Setup

Datasets

Dataset: CIFAR-100
Model: ResNet-18
Training Epochs: 300
Momentum Coefficient: β = 0.9

Hardware Environment

CPU: Dual Intel Xeon Silver 4316
GPU: NVIDIA Tesla A100 80GB
Software: Python 3.8.2, CUDA 12.2, PyTorch 2.4.1

Scheduling Strategies

Studies four training schedules:

Constant Batch Size + Decaying Learning Rate: Batch size fixed at 128
Increasing Batch Size + Decaying Learning Rate: Batch size doubles every 30 epochs (2³ to 2¹²)
Increasing Batch Size + Increasing Learning Rate: Both batch size and learning rate grow simultaneously
Increasing Batch Size + Warm-up Learning Rate: Learning rate schedule that increases then decreases

Evaluation Metrics

Training loss
Test accuracy
Full gradient norm $\|\nabla f(\theta_e)\|$

Experimental Results

Main Theoretical Results

Theorem 1: Unified Convergence Bound

Under the assumption conditions, for both NSHB and SHB: $\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|^2] \leq 2C_{alg}(f(\theta_0) - f^*)B_T + \sigma^2 V_T$

where:

$B_T = \frac{1}{\sum_{t=0}^{T-1}\lambda_t}$
$V_T = \frac{1}{\sum_{t=0}^{T-1}\lambda_t}\sum_{t=0}^{T-1}\frac{\lambda_t}{b_t}$
$C_{alg} = (1-\beta)^{-1}$ (NSHB), $C_{alg} = 1$ (SHB)

Convergence Rate Analysis

Constant Batch Size Case: $\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\sqrt{\frac{1}{T} + \frac{1}{b}}\right)$

Increasing Batch Size Case: $\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\frac{1}{\sqrt{T}}\right)$

Simultaneously Increasing Batch Size and Learning Rate: $\min_{0 \leq t \leq T-1} E[\|\nabla f(\theta_t)\|] = O\left(\frac{1}{\gamma^{M/2}}\right)$

Experimental Validation

Convergence Performance Ordering

Experimental results completely validate the theoretically predicted convergence hierarchy:

Worst: Constant batch size + decaying learning rate
Moderate: Increasing batch size + decaying learning rate
Better: Increasing batch size + increasing learning rate
Optimal: Increasing batch size + warm-up learning rate

Specific Numerical Results

NSHB and SHB exhibit the same ordering in gradient norm convergence
The warm-up strategy also achieves the best test accuracy performance
For SHB, high learning rate achieves faster gradient norm decay, but low learning rate achieves better test accuracy

Comparison with Other Optimizers

Under increasing batch size scheduling, SGD, NSHB, and SHB show rapid gradient norm decrease in early stages, but Adam achieves smaller gradient norms in later stages.

Theoretical Analysis of Momentum Methods

Liu et al. (2020): Pioneering work on NSHB with fixed learning rate
Gadat et al. (2018), Mai and Johansson (2020): Convergence analysis based on Lyapunov functions
Wilson et al. (2021), Defazio (2021): Theoretical analysis of accelerated methods

Learning Rate and Batch Size Scheduling

Umeda and Iiduka (2025): Dynamic scheduling analysis for vanilla SGD
Kamo and Iiduka (2025): Analysis of SGDM with increasing batch size
Smith et al. (2018): Effectiveness of batch size scheduling in practice

Advantages of This Work

Compared to existing work, this paper provides the first complete theoretical framework for SGDM dynamic learning rate scheduling, filling an important theoretical gap.

Conclusions and Discussion

Main Conclusions

Theoretical Contribution: Establishes a complete theoretical framework for SGDM dynamic scheduling
Convergence Hierarchy: Proves that increasing batch size outperforms constant batch size, with simultaneously increasing both being optimal
Experimental Validation: Theoretical predictions are highly consistent with experimental results

Limitations

Assumption Conditions: Requires L-smoothness and bounded variance assumptions
Learning Rate Constraints: The technical condition $\frac{\lambda_{t+1}}{\lambda_t} \leq c < \frac{1}{\beta^2}$ limits the learning rate growth speed
Experimental Scope: Validation only on CIFAR-100 and ResNet-18, lacking large-scale experiments

Future Directions

Momentum Coefficient Scheduling: Extension to dynamic scheduling of momentum coefficient $\beta$
Other Optimizers: Extension of analysis to adaptive methods like Adam
Practical Applications: Validation on larger-scale deep learning tasks

In-Depth Evaluation

Strengths

Theoretical Rigor: Lyapunov function design is ingenious with rigorous mathematical derivations
Practical Value: Provides theoretical guidance for hyperparameter scheduling in practical training
Unified Framework: Simultaneously analyzes SHB and NSHB with good generality
Sufficient Experiments: High consistency between theoretical and experimental results, enhancing credibility

Weaknesses

Limited Innovation: Primarily an extension of existing techniques with relatively limited core innovation
Experimental Scale: Experiments limited to medium-scale problems, lacking large-scale validation
Practical Constraints: Technical conditions in theoretical analysis may be difficult to strictly satisfy in practice
Insufficient Comparison: Lacks in-depth comparison with latest adaptive optimization methods

Impact

Theoretical Value: Provides important theoretical foundation for SGDM dynamic scheduling
Practical Significance: Guides hyperparameter setting in practical deep learning training
Reproducibility: Code is publicly available, experiments are reproducible

Applicable Scenarios

Deep Learning Training: Particularly suitable for scenarios requiring fine-grained learning rate and batch size scheduling
Theoretical Research: Provides foundation for further optimization theory research
Engineering Practice: Guides automatic hyperparameter adjustment in practical training systems

References

Liu, Y., Gao, Y., and Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum
Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent
Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum
Smith, S. L., Kindermans, P.-J., and Le, Q. V. (2018). Don't decay the learning rate, increase the batch size

Overall Assessment: This is a paper with solid theoretical contributions that successfully analyzes the dynamic scheduling problem of SGDM through a simplified Lyapunov function. While the innovation is relatively limited, it fills an important theoretical gap and provides valuable guidance for practical applications. The theoretical analysis is rigorous and experimental validation is sufficient, making it a beneficial contribution to the field of optimization theory.