This paper provides an in-depth investigation into the theoretical limitations of momentum in federated learning and decentralized optimization. Although recent research has explored using momentum in local methods to enhance distributed SGD, particularly in federated learning to mitigate the effects of statistical heterogeneity, it remains unclear whether momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios with partial client participation. Through theoretical analysis of cyclic client participation patterns, this paper proves that momentum is inevitably affected by statistical heterogeneity. Furthermore, decreasing step sizes do not help: any schedule decaying faster than Θ(1/t) leads to convergence to a constant value dependent on initialization and heterogeneity bounds. Numerical experiments and deep learning experiments validate the theoretical findings and their relevance to practical scenarios.
The core problem addressed in this paper is: Can classical momentum methods guarantee convergence under unbounded heterogeneity in decentralized learning scenarios with partial client participation?
Through rigorous theoretical analysis, clarify the fundamental limitations of classical momentum methods to provide theoretical guidance for federated learning algorithm design.
The main contributions of this paper include:
Consider a distributed learning system where a set of clients S collaborate to solve a learning problem, formalized as a finite-sum optimization problem:
Where:
To analyze momentum behavior under heterogeneity, the paper constructs the simplest scenario most favorable to momentum:
This setup satisfies:
Model the update rules of FedAvgM and FedCM as discrete-time linear systems:
z[t] = A[t]z[t-1] + Bu[t] \\ y[t] = Cz[t] \end{cases}$$ Where: - State: $z[t] = (\theta_t, \theta_{t-1})^T$ - Input: $u[t] = ((-1)^t q_t^{(a)} G)$ (heterogeneity-driven term) - Output: $y[t] = \theta_t$ - State matrix: $A[t] = \begin{pmatrix} p_t^{(a)} & -r_t^{(a)} \\ 1 & 0 \end{pmatrix}$ For single-step local updates ($J=1$), FedAvgM and FedCM have identical update rules: $$\theta_t = \theta_{t-1}(1 - \mu\tilde{\eta}_t + \beta) - \beta\theta_{t-2} + (-1)^t\tilde{\eta}_t G$$ Where $\tilde{\eta}_t = \eta_t(1-\beta)$. #### 3. System Response Decomposition By expanding the recursion, the system output can be decomposed as: $$y[t] = \underbrace{C\Psi(t,1)z[1]}_{\text{zero-input response}} + \underbrace{C\sum_{k=2}^t \Psi(t,k)Bu[k]}_{\text{zero-state response}}$$ Where the state transition matrix: $\Psi(t,k) := \prod_{s=k+1}^t A[s]$ **Physical Interpretation**: - **Zero-input response**: Corresponds to optimizing the shared objective $f_{hom}(\theta) = f(\theta)$, reflecting the influence of initial conditions - **Zero-state response**: Corresponds to heterogeneity terms $f_{het}(\theta) = \pm G\theta$ acting as external disturbances ### Technical Innovations #### 1. Systems Theory Perspective - First to model momentum algorithms in federated learning as discrete-time linear systems - Through zero-input/zero-state response decomposition, clearly reveals the mechanism of heterogeneity acting as a "disturbance signal" #### 2. Diagonalization Technique (Theorem III.6 Proof) For time-varying systems, decompose the state matrix as: $$A[t] = A_\infty + E[t]$$ Where $A_\infty$ corresponds to the limiting matrix as $\eta_t \to 0$, then through diagonalization: $$\bar{z}[t] = P^{-1}z[t] = (\Lambda + H[t])\bar{z}[t-1] + Wu[t]$$ Obtain eigenvalues $\lambda_1 = 1$ (marginally stable) and $\lambda_2 = \beta < 1$ (asymptotically stable) corresponding to decoupled directions. #### 3. Self-Consistent Ansatz Method For coupled systems, assume the asymptotic form of $\bar{z}_1[t]$ and verify whether the resulting $\bar{z}_2[t]$ leads to consistent conclusions. ## Main Theoretical Results ### Theorem III.5: Convergence Rate with Constant Step Size **Theorem Statement**: For any positive constants $G, \mu$, there exist $\mu$-strongly convex functions satisfying Assumption III.2 such that under appropriate constant step size $\eta$ and arbitrary momentum factor $\beta \in [0,1)$, FedCM and FedAvgM achieve asymptotic error under cyclic partial participation of: $$f(\theta_t) - f(\theta^*) = \Theta\left(\frac{G^2}{\mu T^2}\right)$$ **Key Insights**: 1. **Zero-input response**: Under eigenvalue condition $\eta \in (0, \frac{2(1+\beta)}{\mu(1-\beta)})$, converges at exponential rate 2. **Zero-state response**: Converges to a 2-periodic limit cycle with amplitude: $$|\theta_\infty| = \frac{\eta(1-\beta)G}{2(1+\beta) - \mu\eta(1-\beta)}$$ 3. **Step size constraint**: To control convergence error, must choose $\eta = \Theta(1/T)$, leading to linear convergence rate $O(1/T^2)$ **Physical Meaning**: Momentum cannot eliminate periodic oscillations caused by heterogeneity; the amplitude must be controlled by reducing step size. ### Theorem III.6: Convergence Rate with Decreasing Step Size **Theorem Statement**: For polynomial decreasing step size $\eta_t \sim O(1/t^\alpha)$, even when initialized at the optimal solution ($\theta_0 = \theta^*$), the error is: $$f(\theta_t) - f(\theta^*) = \begin{cases} \Theta\left(\frac{G^2}{\mu t^{2\alpha}}\right) & \text{if } 0 < \alpha < 1 \\ \Theta\left(\frac{G^2}{\mu t^{2\min(\mu\eta, 1)}}\right) & \text{if } \alpha = 1 \\ \Theta\left(\frac{G^2}{\mu}\right) & \text{if } \alpha > 1 \end{cases}$$ **Key Findings**: 1. **Slow Decay ($0 < \alpha < 1$)**: - Zero-input response decays at polynomial rate $O(t^{-\alpha})$ - Zero-state response still decays exponentially - Convergence rate $O(t^{-2\alpha})$ slower than constant step size's $O(T^{-2})$ 2. **Linear Decay ($\alpha = 1$)**: - Convergence rate depends on initial step size $\eta$ - When $\eta < 1/\mu$, initialization affects convergence rate $O(t^{-\mu\eta})$ - When $\eta \geq 1/\mu$, convergence rate is $O(t^{-1})$ 3. **Fast Decay ($\alpha > 1$)**: - **Cannot converge to optimal solution**, converges to constant $\Theta(G/\mu)$ - State transition matrix no longer decays to zero - Both zero-input and zero-state responses converge to constants dependent on $G$ and $\theta_0$ **Mathematical Intuition**: Through Lemmas B.2-B.9, establish asymptotic bounds on state transition matrices $\Psi_1(t,s,\alpha)$ and $\Psi_2(t,s,\alpha)$, precisely characterizing convergence behavior across different $\alpha$ ranges. ## Experimental Setup ### Theoretical Experiments **Objective Functions**: $f_1(\theta) = \frac{\mu}{2}\theta^2 + G\theta$, $f_2(\theta) = \frac{\mu}{2}\theta^2 - G\theta$ **Parameter Settings**: - $\mu = 1$ (strong convexity parameter) - $G \in \{0, 10, 100\}$ (heterogeneity levels) - $\theta_0 \in \{0, 10\}$ (initialization) - $\beta = 0.9$ (momentum factor) - $T = 10^6$ (iteration count) **Step Size Schedules**: 1. **Constant step size**: $\eta_t = \eta$ 2. **Polynomial decay**: $\eta_t = \eta/t^\alpha$, $\alpha \in \{0.1, 0.5, 1, 2\}$ 3. **Exponential decay**: $\eta_t = \eta\gamma^t$, $\gamma \in \{0.9999, 0.999, 0.99, 0.9\}$ ### Deep Learning Experiments **Dataset**: CIFAR-10 - Training set preprocessing: random cropping, random horizontal flipping, normalization - Number of clients: $|S| = 100$ - Data partitioning: Following [19], simulating highest heterogeneity level (Dirichlet distribution) **Model Architectures**: 1. **CNN**: LeNet-5-like architecture 2. **ResNet-20**: Using Group Normalization instead of Batch Normalization **Training Settings**: - Client sampling rate: $C = 10\%$ (cyclic sampling) - Local steps: $J = 1$ - Momentum factor: $\beta = 0.9$ - Repetitions: 5 independent runs **Hyperparameter Search**: - FedAvg: Server step size $\eta \in \{2, 1.5, 1, 0.5, 0.1\}$, local step size $\eta_l \in \{0.1, 0.05, 0.01, 0.005\}$ - FedCM: Similar search range ## Experimental Results ### Theoretical Experiment Results (Table I) #### Key Findings: 1. **Linear Effect of Heterogeneity**: - When $G = 100$: $\theta_t \approx 2.5 \times 10^{-5}$ (constant step size) - When $G = 10$: $\theta_t \approx 2.5 \times 10^{-6}$ (constant step size) - Proportional relationship validates theoretical prediction of $\Theta(G/\mu T)$ 2. **Effect of Initialization**: - For slow decay ($\alpha < 1$) and constant step size, final values are identical for $\theta_0 = 0$ and $\theta_0 = 10$ - Validates exponential decay property of zero-input response 3. **Harm of Fast Decay** ($\alpha = 2$): - $G = 100, \theta_0 = 0$: $\theta_t = 4.8 \times 10^1$ - $G = 100, \theta_0 = 10$: $\theta_t = 5.7 \times 10^1$ - Cannot converge to optimal solution $\theta^* = 0$, and depends on initialization 4. **Momentum vs. No Momentum Comparison**: - Convergence behavior with momentum (left) and without momentum (right) are similar - Proves momentum cannot fundamentally improve dependence on heterogeneity ### Step Size Effect Experiment (Table II) Validates theoretical predictions from Theorem III.6 with $\alpha = 1$: | Initial Step Size | $\theta_t$ ($\theta_0=0$) | $\theta_t$ ($\theta_0=10$) | |---------|--------------------------|---------------------------| | $\eta = \frac{1(1+\beta)}{\mu(1-\beta)} - \epsilon$ | $2.5 \times 10^{-6}$ | $2.5 \times 10^{-6}$ | | $\eta = \frac{1}{\mu} - \epsilon$ | $-3.9 \times 10^{-6}$ | $-1.2 \times 10^{-4}$ | When $\eta < 1/\mu$, final values depend on initialization, validating theoretical convergence rate of $O(t^{-\mu\eta})$. ### Deep Learning Experiment Results (Figure 1) **Experimental Setup**: CIFAR-10, cyclic client participation, high heterogeneity **Result Observations**: 1. **FedAvg vs FedCM (ResNet-20)**: - Test accuracy after 10000 rounds: approximately 60-70% - Centralized training reference accuracy: ≈89% - Significant performance gap indicates momentum cannot effectively mitigate heterogeneity 2. **FedAvg vs FedCM (CNN)**: - Test accuracy after 10000 rounds: approximately 50-60% - Centralized training reference accuracy: ≈86% - FedAvg and FedCM show similar performance with no clear advantage 3. **Key Insights**: - Under high heterogeneity and partial participation, classical momentum-based FL methods cannot provide substantial improvements - Experimental results align with theoretical analysis: momentum cannot eliminate fundamental effects of heterogeneity ## Related Work ### Finite-Sum Optimization and SGD Variants 1. **SGD and Random Shuffling Methods**: - [12] Safran & Shamir 2020: Performance of randomly shuffled SGD - [13] Koloskova et al. 2024: IGD convergence rates for non-convex smooth functions - [14] Liu & Zhou 2024: Last-iterate convergence of shuffled gradient methods 2. **SGD Lower Bounds**: - [15] Jentzen & von Wurstemberger 2020: Lower bounds for decreasing step sizes - [16] Nguyen et al. 2019: Dimension-independent lower bounds - [17] Kim et al. 2025: IGD analysis for ill-conditioned problems with small epochs **Key Difference**: None of the above works consider momentum, and all require heterogeneity bounds. This paper proves that even with momentum, this fundamental dependence persists. ### Momentum in Distributed Learning 1. **Momentum Algorithms in Federated Learning**: - [2] FedAvgM (Hsu et al. 2019): Server-side momentum - [4] FedCM (Xu et al. 2021): Client-side momentum - [5] FedADC: Accelerated drift control - [6-7] Multi-step inertial momentum methods 2. **Theoretical Progress**: - [8] Cheng et al. 2024: Proves momentum can converge under unbounded heterogeneity with full participation - [9] GHBM (Zaccone et al. 2025): Bypasses limitations through incremental aggregated gradient perspective - [10] SlowMo: Communication-efficient distributed SGD - [11] DiLoCo: Distributed low-communication language model training ### Unique Contribution of This Paper This is the **first work to systematically analyze fundamental limitations of classical momentum under partial participation**: - Definitively answers the open question "Can momentum eliminate heterogeneity effects under partial participation?" (Answer: No) - Provides complete theoretical analysis framework (linear systems perspective) - Proves GHBM [9] is currently the only momentum algorithm that bypasses this limitation ## Conclusions and Discussion ### Main Conclusions 1. **Fundamental Limitations of Momentum**: Under cyclic client participation, classical momentum (FedAvgM and FedCM) **cannot eliminate the effects of statistical heterogeneity**, with convergence rates still dependent on heterogeneity bound $G$ 2. **Negative Results on Decreasing Step Sizes**: - Decay slower than $\Theta(1/t)$: Slower convergence rate - Decay equal to $\Theta(1/t)$: Convergence rate depends on initial step size - Decay faster than $\Theta(1/t)$: **Cannot converge to optimal solution** 3. **Effect of Local Steps**: Increasing local steps $J$ worsens dependence on heterogeneity through client drift effects, but does not change asymptotic convergence rate 4. **Special Nature of GHBM**: GHBM [9] is currently the only known momentum algorithm that bypasses this limitation under partial participation ### Limitations 1. **Analysis Scope**: - Only analyzes cyclic client participation patterns - Random uniform sampling may exhibit different behavior (though [9]'s experiments show similar results) 2. **Problem Setting**: - Considers strongly convex functions - Actual deep learning is non-convex optimization; full applicability of theoretical results requires further investigation 3. **Simplified Scenarios**: - Two-client construction is most favorable to momentum - Real scenarios may be more complex, but theoretical lower bounds already reveal fundamental limitations 4. **Experimental Scale**: - Deep learning experiments only on CIFAR-10 - Validation on larger datasets and models remains to be done ### Future Directions 1. **Extension to Non-Convex Optimization**: Extend theoretical analysis to non-convex loss functions common in deep learning 2. **Random Sampling Analysis**: Analyze convergence properties under random uniform client sampling 3. **Improved Algorithm Design**: - Explore other momentum variants beyond GHBM that may bypass limitations - New methods combining adaptive learning rates and momentum 4. **Practical System Optimization**: Validate theory-guided algorithm design in actual federated learning systems ## In-Depth Evaluation ### Strengths #### 1. Depth of Theoretical Contribution - **Rigorous Mathematical Proofs**: Provides complete convergence analysis through discrete-time linear systems theory - **Precise Convergence Rate Bounds**: Not only provides asymptotic complexity but also analyzes constant factors - **Comprehensive Scenario Coverage**: Systematically analyzes four cases: constant step size, slow decay, linear decay, and fast decay #### 2. Innovation of Methodology - **Systems Theory Perspective**: First to model federated learning algorithms as linear systems, providing novel analysis framework - **Zero-Input/Zero-State Response Decomposition**: Clearly reveals interaction between shared objective optimization and heterogeneity disturbance - **Diagonalization Technique**: Elegantly handles analysis challenges of time-varying systems #### 3. Sufficiency of Experiments - **Complete Theoretical Validation**: Tables I and II precisely validate all key theoretical predictions - **Practical Relevance**: CIFAR-10 experiments demonstrate applicability of theoretical findings to actual deep learning - **Comprehensive Comparisons**: Simultaneously compares performance with/without momentum and different step size schedules #### 4. Clarity of Writing - **Progressive Development**: From problem construction to system modeling to theoretical analysis, logic is clear - **Intuitive Explanations**: Provides physical intuition and mathematical meaning for each theoretical result - **Detailed Appendix**: Complete proof details (25-page appendix) ensure reproducibility ### Weaknesses #### 1. Limitations of Theoretical Analysis - **Strong Convexity Assumption**: Actual deep learning is non-convex; generalizability of theoretical results is limited - **Simplified Scenarios**: Two-client, one-dimensional parameter setting is overly idealized - **Cyclic Sampling**: Real systems typically use random sampling; applicability of theoretical results needs further verification #### 2. Defects in Experimental Setup - **Single Dataset**: Only validates on CIFAR-10, lacking experiments on large-scale datasets like ImageNet - **Limited Model Scale**: ResNet-20 is relatively small; behavior on modern large models (e.g., Transformers) is unknown - **Insufficient Baselines**: No direct comparison with GHBM, unable to quantify performance gaps #### 3. Practical Considerations - **Negative Results**: Primarily proves "what doesn't work," providing limited guidance on "what does work" - **Hyperparameter Sensitivity**: Theory requires precise step size selection (e.g., $\eta = \Theta(1/T)$), difficult to determine $T$ in advance - **Communication Costs**: Does not consider tradeoff between communication rounds and computational cost #### 4. Analysis Depth - **Insufficient Analysis of Local Steps**: While mentioning $J > 1$ worsens dependence, lacks precise quantification - **Effect of Different Momentum Factors**: Theory treats $\beta$ as arbitrary but lacks detailed exploration of selection strategies - **Convergence Constants**: Asymptotic analysis hides constant factors; actual convergence speed may vary significantly ### Impact #### 1. Contribution to the Field - **Theoretical Foundation**: Provides rigorous theoretical basis for momentum use in federated learning - **Answering Open Questions**: Definitively answers community's concern "Can momentum overcome heterogeneity?" - **Research Direction**: Highlights importance of new momentum methods like GHBM #### 2. Practical Value - **Algorithm Design Guidance**: - Avoid step size schedules with too-fast decay ($\alpha > 1$) - Classical momentum may underperform in high heterogeneity scenarios - Consider using GHBM and alternative methods - **Hyperparameter Tuning**: - Step size should be $\Theta(1/T)$ magnitude - Momentum factor $\beta$ selection requires balancing convergence speed and stability #### 3. Reproducibility - **Excellent**: - Complete proof details (Appendices A-B) - Clear experimental setup and hyperparameters - Simple, clear construction of theoretical problems - **Room for Improvement**: Code not publicly available (paper does not mention code repository) ### Applicable Scenarios #### Suitable Application Scenarios 1. **Theoretical Research**: - Convergence analysis in federated learning - Lower bound research for optimization algorithms - Quantitative analysis of heterogeneity effects 2. **Algorithm Selection Guidance**: - High heterogeneity, partial participation federated learning scenarios - Critical applications requiring theoretical guarantees (healthcare, finance) #### Less Suitable Scenarios 1. **Large-Scale Non-Convex Optimization**: Theory based on strong convexity; applicability to deep learning requires caution 2. **Full Participation Scenarios**: Existing work [8] proves momentum is viable with full participation; negative results here don't apply 3. **Communication-Constrained Scenarios**: Does not consider communication costs; may underestimate practical value of momentum ### Overall Assessment This is an **excellent paper with rigorous theory and clear contributions**. Through innovative linear systems analysis framework, it is the first to systematically reveal fundamental limitations of classical momentum in federated learning. Despite limitations such as strong theoretical assumptions and limited experimental scale, its core insights—**momentum cannot eliminate heterogeneity effects, and fast-decaying step sizes are harmful**—provide important guidance for federated learning algorithm design. The paper's main value lies in: 1. **Clarifying theoretical boundaries**: Draws clear boundaries for applicability of momentum methods 2. **Providing analysis tools**: Linear systems modeling is generalizable to analysis of other distributed algorithms 3. **Indicating research directions**: Emphasizes necessity of new methods like GHBM Recommendations for future work: 1. Extend to non-convex optimization and random sampling 2. Detailed theoretical and experimental comparison with GHBM 3. Validate theory-guided effectiveness in large-scale practical systems **Recommendation Index**: ★★★★☆ (4.5/5) - Theoretical Depth: ★★★★★ - Practical Value: ★★★★☆ - Innovation: ★★★★★ - Completeness: ★★★★☆ ## Key References [1] Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics. [2] Hsu et al. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv:1909.06335. [8] Cheng et al. (2024). Momentum benefits non-iid federated learning simply and provably. ICLR. [9] Zaccone et al. (2025). Communication-efficient heterogeneous federated learning with generalized heavy-ball momentum. Transactions on Machine Learning Research. [15] Jentzen & von Wurstemberger (2020). Lower error bounds for the stochastic gradient descent optimization algorithm. Journal of Complexity.