On the Limits of Momentum in Decentralized and Federated Optimization

Zaccone, Karimireddy, Masone

Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $\Theta\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.

Basic Information

Paper ID: 2511.20168
Title: On the Limits of Momentum in Decentralized and Federated Optimization
Authors: Riccardo Zaccone (Polytechnic of Turin), Sai Praneeth Karimireddy (USC), Carlo Masone (Polytechnic of Turin)
Classification: cs.LG (Machine Learning), cs.AI
Publication Date: November 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.20168

Abstract

This paper provides an in-depth investigation into the theoretical limitations of momentum in federated learning and decentralized optimization. Although recent research has explored using momentum in local methods to enhance distributed SGD, particularly in federated learning to mitigate the effects of statistical heterogeneity, it remains unclear whether momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios with partial client participation. Through theoretical analysis of cyclic client participation patterns, this paper proves that momentum is inevitably affected by statistical heterogeneity. Furthermore, decreasing step sizes do not help: any schedule decaying faster than Θ(1/t) leads to convergence to a constant value dependent on initialization and heterogeneity bounds. Numerical experiments and deep learning experiments validate the theoretical findings and their relevance to practical scenarios.

Research Background and Motivation

Core Problem

The core problem addressed in this paper is: Can classical momentum methods guarantee convergence under unbounded heterogeneity in decentralized learning scenarios with partial client participation?

Problem Importance

Practical Requirements of Federated Learning: Modern deep learning applications require training on distributed data silos or personal devices, where clients typically cannot participate in every training round (due to network failures, privacy constraints, or temporary unavailability)
Statistical Heterogeneity Challenge: The non-independent and identically distributed (non-IID) nature of client data leads to client drift and biased server updates
Insufficient Theoretical Understanding: Although momentum is widely applied in distributed algorithms, theoretical understanding of its properties in decentralized environments remains incomplete

Limitations of Existing Methods

Momentum-based federated learning algorithms such as FedAvgM and FedCM perform well in practice but lack theoretical guarantees under partial participation
Existing theoretical results:
- [8] proved that momentum can converge under unbounded heterogeneity with full participation
- [9] proposed GHBM, which achieves similar guarantees under cyclic partial participation
- However, the theoretical properties of classical momentum under partial participation remain unclear

Research Motivation

The paper aims to clarify, through rigorous theoretical analysis, the fundamental limitations of classical momentum methods, and thereby to provide theoretical guidance for the design of federated learning algorithms.

Core Contributions

The main contributions of this paper include:

Momentum Cannot Eliminate Heterogeneity Effects: Under cyclic client sampling, the paper formally proves that momentum cannot eliminate the effects of data heterogeneity, a core problem in decentralized and federated learning
Negative Results on Decreasing Step Sizes: Proves that any step size schedule decaying faster than Θ(1/t) leads to convergence to a constant value dependent on initialization and heterogeneity bounds, rather than the optimal solution
Systematic Analysis Framework: Modeling the algorithm dynamics as discrete-time linear systems yields a clear decomposition:
- Zero-input response captures the shared objective across all clients
- Zero-state response isolates the client-specific (heterogeneous) part of the objectives
Experimental Validation: Validates theoretical findings through numerical experiments on theoretical problems and deep learning experiments (CIFAR-10) demonstrating their relevance to practical scenarios
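
The zero-input / zero-state split mentioned above is the standard decomposition for linear systems. As a generic sketch (the symbols $x_t$, $A$, $B$, and $u_t$ are illustrative notation, not necessarily the paper's), a time-invariant system $x_{t+1} = A x_t + B u_t$ unrolls to:

$x_t = A^t x_0 + \sum_{k=0}^{t-1} A^{t-1-k} B u_k$

The first term is the zero-input response (what remains when every $u_k = 0$) and the second is the zero-state response (what remains when $x_0 = 0$). With a time-varying step size the system matrix depends on $t$ and the powers of $A$ become products of per-round matrices, but the split into the two responses is unchanged.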

Methodology Details

Task Definition

Consider a distributed learning system where a set of clients S collaborate to solve a learning problem, formalized as a finite-sum optimization problem:

$\theta^* = \arg\min_{\theta \in \mathbb{R}^d} \left[ f(\theta) := \frac{1}{|S|} \sum_{i \in S} f_i(\theta) \right]$

Where:

$f_i(\theta)$ is the local objective function of client $i$
$f(\theta)$ is the global objective function
In each round $t$, only a subset $S_t \subset S$ of clients participates (partial participation)
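
To make the effect of partial participation concrete, here is a minimal sketch in Python. The helper `averaged_gradient` and the three toy quadratic clients are hypothetical, chosen only for illustration; they are not taken from the paper. At the global minimizer the gradient averaged over all clients vanishes, while the gradient averaged over a strict subset generally does not.

```python
from typing import Callable, Sequence

Grad = Callable[[float], float]

def averaged_gradient(grads: Sequence[Grad], participants: Sequence[int], theta: float) -> float:
    """Average the gradients of the participating clients at theta."""
    return sum(grads[i](theta) for i in participants) / len(participants)

# Hypothetical toy clients: f_i(theta) = 0.5 * (theta - c_i)^2, so grad f_i(theta) = theta - c_i.
centers = [-1.0, 0.0, 4.0]
grads = [lambda theta, c=c: theta - c for c in centers]

# For these quadratics the global minimizer is the mean of the centers.
theta_star = sum(centers) / len(centers)

full = averaged_gradient(grads, range(len(grads)), theta_star)  # S_t = S
partial = averaged_gradient(grads, [2], theta_star)             # S_t = {client 2}
```

Under these assumptions `full` is zero at `theta_star`, whereas `partial` is not, so an update computed from `S_t` alone moves the iterate away from the global minimizer.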

Theoretical Analysis Framework

1. Construction of Minimal Heterogeneity Problem

To analyze momentum behavior under heterogeneity, the paper constructs the simplest scenario most favorable to momentum:

Two clients: $f_1(\theta) = \frac{\mu}{2}\theta^2 + G\theta$, $f_2(\theta) = \frac{\mu}{2}\theta^2 - G\theta$
Cyclic sampling: Alternately select one client each round
Global objective: $f(\theta) = \frac{1}{2}(f_1(\theta) + f_2(\theta)) = \frac{\mu}{2}\theta^2$, with optimal solution $\theta^* = 0$
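
As a short worked step (derived directly from the definitions of $f_1$ and $f_2$ above), differentiating the two local objectives gives:

$\nabla f_1(\theta) = \mu\theta + G, \qquad \nabla f_2(\theta) = \mu\theta - G, \qquad \nabla f(\theta) = \frac{1}{2}\left(\nabla f_1(\theta) + \nabla f_2(\theta)\right) = \mu\theta$

Setting each local gradient to zero shows that the clients pull toward different points, $\theta_1^* = -G/\mu$ and $\theta_2^* = G/\mu$, which lie symmetrically around the global optimum $\theta^* = 0$ and are $2G/\mu$ apart. A round that samples only one client therefore always sees a gradient offset by $\pm G$ from the global one.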

This setup satisfies:

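$\mu$-strong convexity (Assumption III.1)
Bounded heterogeneity (gradient dissimilarity): $\frac{1}{|S|}\sum_{i=1}^{|S|} \|\nabla f_i(\theta) - \nabla f(\theta)\| \leq G$ (Assumption III.2); in this construction each client's gradient differs from the global gradient by exactly $G$ in norm, so the bound holds with equality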
Cyclic participation (Assumption III.3)
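
The construction above is small enough to simulate directly. The sketch below is only illustrative: it uses a generic heavy-ball update (an assumption, not the paper's exact FedAvgM or FedCM formulation), a polynomial step-size schedule $\eta_t = \eta_0/(t+1)^p$, and arbitrarily chosen values for `theta0`, `mu`, `G`, `beta`, and `eta0`. The function name `simulate` is hypothetical.

```python
def simulate(theta0, mu, G, beta, eta0, power, rounds):
    """Cyclic two-client run with a generic heavy-ball update.

    Assumption: this is a textbook momentum rule, not the paper's exact
    FedAvgM/FedCM formulation. Setting beta = 0 recovers plain SGD.
    """
    theta, m = theta0, 0.0
    history = [theta]
    for t in range(rounds):
        sign = 1.0 if t % 2 == 0 else -1.0  # client 1 on even rounds, client 2 on odd
        grad = mu * theta + sign * G        # gradient of the sampled client's objective
        m = beta * m + grad                 # momentum buffer
        eta = eta0 / (t + 1) ** power       # polynomially decaying step size
        theta -= eta * m
        history.append(theta)
    return history

common = dict(theta0=5.0, mu=1.0, G=1.0, eta0=0.05, rounds=10_000)
sgd_fast_decay = simulate(beta=0.0, power=2.0, **common)       # decays faster than 1/t
momentum_fast_decay = simulate(beta=0.9, power=2.0, **common)  # same schedule, with momentum
sgd_constant = simulate(beta=0.0, power=0.0, **common)         # constant step, for contrast
```

With these illustrative values, both fast-decay runs stall far from $\theta^* = 0$, at a point that depends on `theta0` and `G`, because the step sizes are summable and the iterate cannot travel far enough. The constant-step run does reach a neighborhood of the optimum, but keeps oscillating within it with an amplitude that scales with the step size and `G`.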

2. Discrete-Time Linear System Modeling (Lemma III.4)

Model the update rules of FedAvgM and FedCM as discrete-time linear systems: