Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
Kondo, Iiduka
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
This paper analyzes the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel and simplified Lyapunov function. The research extends existing theoretical frameworks to cover three practical scheduling strategies commonly used in deep learning: constant batch size with decaying learning rate, increasing batch size with decaying learning rate, and simultaneously increasing both batch size and learning rate. The results reveal a clear convergence hierarchy: constant batch size cannot guarantee convergence of the expected gradient norm, while increasing batch size can, and simultaneously increasing both batch size and learning rate achieves provably faster decay. Experimental results validate the theory, demonstrating that SGDM with dynamic schedules significantly outperforms fixed hyperparameter counterparts in convergence speed.
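The three scheduling strategies can be made concrete with a small sketch. The decay form (inverse square root), the growth factors (2 for batch size, 1.5 for learning rate), the update interval, and the names `exp_growth` and `schedule` are illustrative assumptions, not values or identifiers taken from the paper.

```python
import math

def exp_growth(initial, factor, epoch, every):
    """Multiply `initial` by `factor` once every `every` epochs."""
    return initial * factor ** (epoch // every)

def schedule(strategy, epoch, lr0=0.1, b0=8, every=10):
    """Return (learning_rate, batch_size) for a given epoch.

    All constants here are hypothetical defaults chosen for illustration.
    """
    if strategy == "const_batch_decay_lr":
        # Constant batch size, decaying learning rate.
        return lr0 / math.sqrt(epoch + 1), b0
    if strategy == "inc_batch_decay_lr":
        # Increasing batch size, decaying learning rate.
        return lr0 / math.sqrt(epoch + 1), exp_growth(b0, 2, epoch, every)
    if strategy == "inc_batch_inc_lr":
        # Batch size and learning rate both increase.
        return exp_growth(lr0, 1.5, epoch, every), exp_growth(b0, 2, epoch, every)
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    for name in ("const_batch_decay_lr", "inc_batch_decay_lr", "inc_batch_inc_lr"):
        print(name, [schedule(name, e) for e in (0, 10, 20)])
```

In this sketch the learning rate and batch size change only at epoch boundaries; how the growth rates of the two must be related for the faster decay to hold is determined by the paper's theorems and is not encoded here.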
The core question addressed in this research is how theory can guide the dynamic scheduling of learning rate and batch size in SGDM so as to achieve better convergence.
Practical Demand: Dynamic learning-rate schedules (e.g., cosine annealing) are widely adopted in deep learning training but lack theoretical support.
Efficiency Improvement: Increasing the batch size has been reported to improve the efficiency of mini-batch SGD, but theoretical analysis of this effect for SGDM is limited.
Theoretical Gap: Existing theoretical analysis of SGDM is largely restricted to fixed learning rates, so a framework covering dynamic schedules is needed.
Novel Lyapunov Function: Proposes a simplified Lyapunov function adapted to dynamic learning-rate schedules that is more concise than those used in existing analyses.
Unified Theoretical Framework: Establishes a single analysis framework that covers both the stochastic heavy ball method (SHB) and its normalized variant (NSHB) and applies to various scheduling strategies.
Theoretical Extension: Extends the analysis of Kamo and Iiduka (2025) from a constant learning rate to a decaying learning rate, and also studies the case in which both the learning rate and the batch size increase.
Convergence Hierarchy: Proves a convergence ordering among four scheduling strategies and validates the ordering experimentally.
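For readers unfamiliar with the two momentum variants, the following is a minimal sketch of one SHB step and one NSHB step. The update rules shown are the commonly used forms of these methods and are assumed here, since this summary does not spell them out; the function names and the toy quadratic objective are likewise hypothetical.

```python
import numpy as np

def shb_step(theta, m, grad, lr, beta):
    """Stochastic heavy ball (assumed form): m_t = beta * m_{t-1} + g_t."""
    m_new = beta * m + grad
    return theta - lr * m_new, m_new

def nshb_step(theta, m, grad, lr, beta):
    """Normalized SHB (assumed form): m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    m_new = beta * m + (1.0 - beta) * grad
    return theta - lr * m_new, m_new

def run(step, steps=200, lr=0.1, beta=0.9):
    """Toy run on f(theta) = 0.5 * ||theta||^2, whose exact gradient is theta."""
    theta = np.array([1.0, -2.0])
    m = np.zeros_like(theta)
    for _ in range(steps):
        theta, m = step(theta, m, theta, lr, beta)
    return np.linalg.norm(theta)

if __name__ == "__main__":
    print("SHB  final ||theta||:", run(shb_step))
    print("NSHB final ||theta||:", run(nshb_step))
```

The only difference between the two steps is the factor (1 - beta) on the gradient, which is why a single analysis can plausibly cover both. In a scheduled run, `lr` and the mini-batch used to compute `grad` would change over time instead of staying fixed as they do in this toy loop.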
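The paper studies the empirical risk minimization problem $\min_{\theta \in \mathbb{R}^d} f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$, where $f_i(\theta) = f(\theta; (x_i, y_i))$ is the loss function on the $i$-th sample. The goal is to find a stationary point $\theta^* \in \mathbb{R}^d$ such that $\nabla f(\theta^*) = 0$.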
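As an illustration of this setup, the sketch below builds an empirical risk from a least-squares loss, computes the full gradient and a mini-batch gradient estimate, and checks the stationarity condition at a minimizer. The least-squares loss, the synthetic data, sampling without replacement, and all function names are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def full_gradient(theta, X, y):
    """Gradient of f(theta) = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2."""
    return X.T @ (X @ theta - y) / len(y)

def minibatch_gradient(theta, X, y, batch_size, rng):
    """Gradient estimate from `batch_size` samples drawn without replacement."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return full_gradient(theta, X[idx], y[idx])

# Synthetic, noise-free data so that an exact stationary point exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# A minimizer of the least-squares risk is a stationary point of f.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

if __name__ == "__main__":
    theta = np.zeros(3)
    print("||grad f(theta*)||:", np.linalg.norm(full_gradient(theta_star, X, y)))
    for b in (4, 16, 64):
        g = minibatch_gradient(theta, X, y, b, rng)
        err = np.linalg.norm(g - full_gradient(theta, X, y))
        print(f"batch size {b}: estimation error {err:.4f}")
```

Because the batch is sampled without replacement here, a batch of size n reproduces the full gradient, which gives an intuition for why growing the batch size reduces the noise in the gradient estimate.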
Compared to existing methods (e.g., the complex form in Liu et al. 2020), this paper's Lyapunov function is concise in form and naturally adapts to dynamic learning rates.
By carefully choosing the definition of $A_t$, the analysis eliminates the cross-term $\mathbb{E}[\langle \nabla f(\theta_t), m_{t-1} \rangle]$, whose treatment is the key technical challenge.
Under increasing batch size scheduling, SGD, NSHB, and SHB show rapid gradient norm decrease in early stages, but Adam achieves smaller gradient norms in later stages.
Compared to existing work, this paper provides the first complete theoretical framework for SGDM dynamic learning rate scheduling, filling an important theoretical gap.
Liu, Y., Gao, Y., and Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum.
Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent.
Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum.
Smith, S. L., Kindermans, P.-J., and Le, Q. V. (2018). Don't decay the learning rate, increase the batch size.
Overall Assessment: This is a paper with solid theoretical contributions that successfully analyzes the dynamic scheduling problem of SGDM through a simplified Lyapunov function. While the innovation is relatively limited, it fills an important theoretical gap and provides valuable guidance for practical applications. The theoretical analysis is rigorous and experimental validation is sufficient, making it a beneficial contribution to the field of optimization theory.