Algorithms for decentralized optimization and learning rely on local optimization steps coupled with combination steps over a graph. Recent works have demonstrated that using a time-varying sequence of matrices that achieves finite-time consensus can improve the communication and iteration complexity of decentralized optimization algorithms based on gradient tracking. In practice, a sequence of matrices satisfying the exact finite-time consensus property may not be available due to imperfect knowledge of the network topology, a limit on the length of the sequence, or numerical instabilities. In this work, we quantify the impact of approximate finite-time consensus sequences on the convergence of a gradient-tracking based decentralized optimization algorithm. Our results hold for any periodic sequence of combination matrices. We clarify the interplay between approximation error of the finite-time consensus sequence and the length of the sequence as well as typical problem parameters such as smoothness and gradient noise.
Paper ID : 2505.23577Title : On the Convergence of Decentralized Stochastic Gradient-Tracking with Finite-Time ConsensusAuthors : Aaron Fainman, Stefan Vlaski (Imperial College London)Categories : math.OC (Optimization and Control), eess.SP (Signal Processing)Publication Date : November 24, 2025 (arXiv v2)Paper Link : https://arxiv.org/abs/2505.23577 Preliminary Version : Published at Asilomar Conference on Signals, Systems, and Computers, 2025This paper investigates the convergence performance of decentralized optimization algorithms based on gradient-tracking when using approximate finite-time consensus (FTC) matrix sequences. In practical applications, matrix sequences satisfying exact FTC properties may be unavailable due to imperfect network topology knowledge, sequence length constraints, or numerical instability. The paper quantifies the impact of approximate FTC sequences on algorithm convergence, with results applicable to arbitrary periodic composite matrix sequences. It clarifies the interactions between FTC sequence approximation error, sequence length, and problem parameters such as smoothness and gradient noise.
In decentralized optimization, K agents cooperatively solve an aggregated optimization problem:
w o = arg min w ∈ R M J ( w ) = arg min w ∈ R M 1 K ∑ k = 1 K J k ( w ) w^o = \arg\min_{w \in \mathbb{R}^M} J(w) = \arg\min_{w \in \mathbb{R}^M} \frac{1}{K}\sum_{k=1}^K J_k(w) w o = arg min w ∈ R M J ( w ) = arg min w ∈ R M K 1 ∑ k = 1 K J k ( w )
Each agent can only access its local data and coordinates optimization through neighbor communication on a network graph.
Performance bounds for decentralized optimization algorithms typically consist of two components:
Optimization Error : Arising from iterative gradient updates, representing an inherent lower bound for distributed algorithmsConsensus Error : Arising from limited network information diffusion, significantly affecting overall performance in sparsely connected networksClassical Weight Rules (e.g., Metropolis-Hastings): Only guarantee asymptotic convergenceExact FTC Sequences : Achieve exact consensus within finite steps τ, but are difficult to obtain in practice:
Imperfect network topology knowledge Numerical methods (eigenvalue decomposition, graph filtering) introduce errors Sequence length constraints Although empirical evidence suggests approximate FTC sequences still provide significant benefits, theoretical analysis quantifying their impact is lacking. This paper fills this theoretical gap by analyzing the specific impact of approximation error ϵ_τ on algorithm convergence.
Theoretical Guarantees : Provides complete performance bounds for gradient-tracking algorithms (Aug-DGM) using approximate FTC sequences, applicable to arbitrary periodic matrix sequences. Compared to existing bounds that don't exploit periodicity (e.g., DIGing's ( 1 − Θ ( μ ν ) ) 1 / ( 2 τ ) (1-\Theta(\mu\nu))^{1/(2\tau)} ( 1 − Θ ( μν ) ) 1/ ( 2 τ ) convergence rate), this achieves a significantly improved amortized convergence rate of 1 − Θ ( ν μ ) + Θ ( ν ϵ τ μ ) 1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu) 1 − Θ ( νμ ) + Θ ( ν ϵ τ μ ) .Dual Time-Scale Analysis : Proposes a novel analytical framework that reconciles the algorithm's dual time-scale behavior:Single-step monotonic decrease of centroid error Periodic τ-step decrease of consensus error Trade-off Characterization : Precisely quantifies the trade-off between approximation error ϵ_τ and sequence length τ, revealing that FTC sequences are particularly effective when τ is small, and in some cases approximate FTC outperforms exact FTC.Tight Bound Analysis : Steady-state error is O ( μ σ 2 K + μ 2 τ 2 σ 2 K ( 1 − ϵ τ ) 2 + μ 2 τ 2 σ 2 ( 1 − ϵ τ ) 2 ) O\left(\frac{\mu\sigma^2}{K} + \frac{\mu^2\tau^2\sigma^2}{K(1-\epsilon_\tau)^2} + \frac{\mu^2\tau^2\sigma^2}{(1-\epsilon_\tau)^2}\right) O ( K μ σ 2 + K ( 1 − ϵ τ ) 2 μ 2 τ 2 σ 2 + ( 1 − ϵ τ ) 2 μ 2 τ 2 σ 2 ) , clearly showing the impact of each parameter.Consider K agents on an undirected graph G=(N,E) cooperatively solving problem (1), where:
Each agent k can only access local objective J k ( w ) = E Q k ( w ; x k ) J_k(w) = \mathbb{E}Q_k(w;x_k) J k ( w ) = E Q k ( w ; x k ) Agents can only exchange information with neighbors N k \mathcal{N}_k N k Using periodic composite matrix sequences { A i } \{A_i\} { A i } satisfying approximate FTC property:
ϵ τ ≜ ∥ A τ ⋯ A 2 A 1 − 1 K 11 T ∥ \epsilon_\tau \triangleq \left\|A_\tau \cdots A_2 A_1 - \frac{1}{K}\mathbf{1}\mathbf{1}^T\right\| ϵ τ ≜ A τ ⋯ A 2 A 1 − K 1 1 1 T Agent k executes at iteration i:
Model Update :
w k , i = ∑ ℓ ∈ N k a ℓ k , i ( w ℓ , i − 1 − g ℓ , i − 1 ) w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k,i}(w_{\ell,i-1} - g_{\ell,i-1}) w k , i = ∑ ℓ ∈ N k a ℓ k , i ( w ℓ , i − 1 − g ℓ , i − 1 )
Gradient Tracking Update :
g k , i = ∑ ℓ ∈ N k a ℓ k , i ( g ℓ , i − 1 + μ ∇ ^ J ℓ ( w ℓ , i ) − μ ∇ ^ J ℓ ( w ℓ , i − 1 ) ) g_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k,i}\left(g_{\ell,i-1} + \mu\hat{\nabla}J_\ell(w_{\ell,i}) - \mu\hat{\nabla}J_\ell(w_{\ell,i-1})\right) g k , i = ∑ ℓ ∈ N k a ℓ k , i ( g ℓ , i − 1 + μ ∇ ^ J ℓ ( w ℓ , i ) − μ ∇ ^ J ℓ ( w ℓ , i − 1 ) )
Where:
μ \mu μ is the step sizeA i = A i % τ A_i = A_{i\%\tau} A i = A i % τ cycles through the FTC sequence periodically∇ ^ J k ( ⋅ ) \hat{\nabla}J_k(\cdot) ∇ ^ J k ( ⋅ ) is the stochastic gradient approximationNetwork-Level Representation :
W i = A i ( W i − 1 − G i − 1 ) W_i = \mathcal{A}_i(W_{i-1} - G_{i-1}) W i = A i ( W i − 1 − G i − 1 ) G i = A i ( G i − 1 + μ ∇ ^ J ( W i ) − μ ∇ ^ J ( W i − 1 ) ) G_i = \mathcal{A}_i(G_{i-1} + \mu\hat{\nabla}J(W_i) - \mu\hat{\nabla}J(W_{i-1})) G i = A i ( G i − 1 + μ ∇ ^ J ( W i ) − μ ∇ ^ J ( W i − 1 ))
Simplifies analysis through two transformations:
First: Y i ≜ G i − μ A i ∇ ^ J ( W i ) Y_i \triangleq G_i - \mu\mathcal{A}_i\hat{\nabla}J(W_i) Y i ≜ G i − μ A i ∇ ^ J ( W i ) Second: Z i ≜ Y i + μ A i ∇ J ( W c , i ) Z_i \triangleq Y_i + \mu\mathcal{A}_i\nabla J(W_{c,i}) Z i ≜ Y i + μ A i ∇ J ( W c , i ) Yielding coupled recursions:
W i = A i W i − 1 − A i Z i − 1 + drift terms W_i = \mathcal{A}_iW_{i-1} - \mathcal{A}_iZ_{i-1} + \text{drift terms} W i = A i W i − 1 − A i Z i − 1 + drift terms Z i = A i Z i − 1 + correction terms Z_i = \mathcal{A}_iZ_{i-1} + \text{correction terms} Z i = A i Z i − 1 + correction terms
Decomposes total error into:
Centroid Error : w ~ c , i ≜ w o − w c , i \tilde{w}_{c,i} \triangleq w^o - w_{c,i} w ~ c , i ≜ w o − w c , i , where w c , i = 1 K ∑ k = 1 K w k , i w_{c,i} = \frac{1}{K}\sum_{k=1}^K w_{k,i} w c , i = K 1 ∑ k = 1 K w k , i Consensus Error : X ^ i ≜ [ W ^ i T , Z ^ i T ] T \hat{X}_i \triangleq [\hat{W}_i^T, \hat{Z}_i^T]^T X ^ i ≜ [ W ^ i T , Z ^ i T ] T , where W ^ i = I ^ W i \hat{W}_i = \hat{I}W_i W ^ i = I ^ W i , I ^ = ( I K − 1 K 11 T ) ⊗ I M \hat{I} = (I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^T)\otimes I_M I ^ = ( I K − K 1 1 1 T ) ⊗ I M Consensus Error Analysis (Lemma 3):
Traces back from iteration i to the nearest complete FTC sequence at m τ + 1 m\tau+1 m τ + 1 (where m = ⌊ i / τ ⌋ − 1 m=\lfloor i/\tau \rfloor - 1 m = ⌊ i / τ ⌋ − 1 ):
X ^ i = G i : m τ + 1 X ^ m τ − μ ∑ j = m τ + 1 i − 1 G i : j + 1 h j − μ h i − noise terms \hat{X}_i = \mathcal{G}_{i:m\tau+1}\hat{X}_{m\tau} - \mu\sum_{j=m\tau+1}^{i-1}\mathcal{G}_{i:j+1}h_j - \mu h_i - \text{noise terms} X ^ i = G i : m τ + 1 X ^ m τ − μ ∑ j = m τ + 1 i − 1 G i : j + 1 h j − μ h i − noise terms
Key bound:
E ∥ X ^ i ∥ 2 ≤ θ 1 E ∥ X ^ m τ ∥ 2 + θ 2 ∑ j = m τ i − 1 E ∥ X ^ j ∥ 2 + θ 3 ∑ j = m τ i − 1 E ∥ w ~ c , j ∥ 2 + θ 4 \mathbb{E}\|\hat{X}_i\|^2 \leq \theta_1\mathbb{E}\|\hat{X}_{m\tau}\|^2 + \theta_2\sum_{j=m\tau}^{i-1}\mathbb{E}\|\hat{X}_j\|^2 + \theta_3\sum_{j=m\tau}^{i-1}\mathbb{E}\|\tilde{w}_{c,j}\|^2 + \theta_4 E ∥ X ^ i ∥ 2 ≤ θ 1 E ∥ X ^ m τ ∥ 2 + θ 2 ∑ j = m τ i − 1 E ∥ X ^ j ∥ 2 + θ 3 ∑ j = m τ i − 1 E ∥ w ~ c , j ∥ 2 + θ 4
Where θ 1 = ϵ τ 2 ( 1 + ϵ τ ) \theta_1 = \frac{\epsilon_\tau}{2}(1+\epsilon_\tau) θ 1 = 2 ϵ τ ( 1 + ϵ τ ) is nonzero only when ϵ τ > 0 \epsilon_\tau>0 ϵ τ > 0 .
Centroid Error Analysis (Lemma 4):
Single-step recursion:
E ∥ w ~ c , i ∥ 2 ≤ α 1 E ∥ w ~ c , i − 1 ∥ 2 + α 2 E ∥ X ^ i − 1 ∥ 2 + α 3 \mathbb{E}\|\tilde{w}_{c,i}\|^2 \leq \alpha_1\mathbb{E}\|\tilde{w}_{c,i-1}\|^2 + \alpha_2\mathbb{E}\|\hat{X}_{i-1}\|^2 + \alpha_3 E ∥ w ~ c , i ∥ 2 ≤ α 1 E ∥ w ~ c , i − 1 ∥ 2 + α 2 E ∥ X ^ i − 1 ∥ 2 + α 3
Where α 1 = 1 − ν μ 2 \alpha_1 = 1 - \frac{\nu\mu}{2} α 1 = 1 − 2 νμ (driven by strong convexity constant).
To reconcile the two time scales, define:
ω m = max i ∈ S m E ∥ w ~ c , i ∥ 2 \omega_m = \max_{i\in S_m}\mathbb{E}\|\tilde{w}_{c,i}\|^2 ω m = max i ∈ S m E ∥ w ~ c , i ∥ 2 χ m = max i ∈ S m E ∥ X ^ i ∥ 2 \chi_m = \max_{i\in S_m}\mathbb{E}\|\hat{X}_i\|^2 χ m = max i ∈ S m E ∥ X ^ i ∥ 2 Where S m = { ( m − 1 ) τ , … , m τ − 1 } S_m = \{(m-1)\tau, \ldots, m\tau-1\} S m = {( m − 1 ) τ , … , m τ − 1 } is the m-th sequence block.
Derives coupled recursion (core result):
[ ω m χ m ] ≤ H [ ω m − 1 χ m − 1 ] + p + O ( μ 3 ) \begin{bmatrix}\omega_m \\ \chi_m\end{bmatrix} \leq H\begin{bmatrix}\omega_{m-1} \\ \chi_{m-1}\end{bmatrix} + p + O(\mu^3) [ ω m χ m ] ≤ H [ ω m − 1 χ m − 1 ] + p + O ( μ 3 )
Where:
H = [ 1 − τ ν μ 4 α 2 τ ( 1 + 3 2 θ 1 ) 3 2 τ θ 3 3 2 ( θ 1 + τ θ 2 ) ] H = \begin{bmatrix}1-\frac{\tau\nu\mu}{4} & \alpha_2\tau(1+\frac{3}{2}\theta_1) \\ \frac{3}{2}\tau\theta_3 & \frac{3}{2}(\theta_1+\tau\theta_2)\end{bmatrix} H = [ 1 − 4 τνμ 2 3 τ θ 3 α 2 τ ( 1 + 2 3 θ 1 ) 2 3 ( θ 1 + τ θ 2 ) ]
Through careful analysis (Lemma 5):
ρ ( H ) ≤ 1 − τ ν μ 8 + 3 τ ν μ ϵ τ ( 1 + ϵ τ ) 32 + O ( μ 3 ) \rho(H) \leq 1 - \frac{\tau\nu\mu}{8} + \frac{3\tau\nu\mu\epsilon_\tau(1+\epsilon_\tau)}{32} + O(\mu^3) ρ ( H ) ≤ 1 − 8 τνμ + 32 3 τνμ ϵ τ ( 1 + ϵ τ ) + O ( μ 3 )
Condition: μ ≤ ν K ( 1 − ϵ τ ) 72 τ δ δ 2 + β 2 \mu \leq \frac{\nu\sqrt{K(1-\epsilon_\tau)}}{72\tau\delta\sqrt{\delta^2+\beta^2}} μ ≤ 72 τ δ δ 2 + β 2 ν K ( 1 − ϵ τ )
Perturbation Term Bounds (Lemma 2):Gradient noise term v i v_i v i : E [ ∥ v i ∥ 2 ∣ W i − 1 ] ≤ 9 β 2 ζ 2 + 9 β 2 K ∥ w ~ c , i − 1 ∥ 2 + 9 β 2 ∥ W ^ i − 1 ∥ 2 + 3 σ 2 \mathbb{E}[\|v_i\|^2|W_{i-1}] \leq 9\beta^2\zeta^2 + 9\beta^2K\|\tilde{w}_{c,i-1}\|^2 + 9\beta^2\|\hat{W}_{i-1}\|^2 + 3\sigma^2 E [ ∥ v i ∥ 2 ∣ W i − 1 ] ≤ 9 β 2 ζ 2 + 9 β 2 K ∥ w ~ c , i − 1 ∥ 2 + 9 β 2 ∥ W ^ i − 1 ∥ 2 + 3 σ 2 Drift term h i h_i h i : E [ ∥ h i ∥ 2 ∣ W i − 1 ] ≤ 3 ( 2 δ 2 + 1 2 β 2 ) ∥ W ^ i − 1 ∥ + 2 ( δ 2 + β 2 K ) ∥ w ~ c , i − 1 ∥ 2 + 3 2 β 2 ζ 2 + 1 2 σ 2 \mathbb{E}[\|h_i\|^2|W_{i-1}] \leq 3(2\delta^2+\frac{1}{2}\beta^2)\|\hat{W}_{i-1}\| + 2(\delta^2+\beta^2K)\|\tilde{w}_{c,i-1}\|^2 + \frac{3}{2}\beta^2\zeta^2 + \frac{1}{2}\sigma^2 E [ ∥ h i ∥ 2 ∣ W i − 1 ] ≤ 3 ( 2 δ 2 + 2 1 β 2 ) ∥ W ^ i − 1 ∥ + 2 ( δ 2 + β 2 K ) ∥ w ~ c , i − 1 ∥ 2 + 2 3 β 2 ζ 2 + 2 1 σ 2 Young's Inequality Application : Systematically uses Young's inequality to separate cross terms, with parameter selection α j = 1 γ − j \alpha_j = \frac{1}{\gamma-j} α j = γ − j 1 achieving optimal balance.Matrix Norm Techniques : Exploits block upper triangular structure ∥ G i : m τ + 1 ∥ ≤ ϵ τ \|\mathcal{G}_{i:m\tau+1}\| \leq \epsilon_\tau ∥ G i : m τ + 1 ∥ ≤ ϵ τ (when i ≥ ( m + 1 ) τ i \geq (m+1)\tau i ≥ ( m + 1 ) τ ).Linear Regression Problem :
Data generation model: γ k , n = h k , n T w o + v n \gamma_{k,n} = h_{k,n}^T w^o + v_n γ k , n = h k , n T w o + v n Feature dimension: M = 20 M=20 M = 20 Samples per agent: N = 30 N=30 N = 30 Feature distribution: h k , n ∼ N ( 0 , I M ) h_{k,n} \sim \mathcal{N}(0, I_M) h k , n ∼ N ( 0 , I M ) Noise distribution: v k , n ∼ N ( 0 , 0.1 ) v_{k,n} \sim \mathcal{N}(0, 0.1) v k , n ∼ N ( 0 , 0.1 ) Objective function: J k ( w ) = 1 2 N ∑ n = 1 N ( γ k , n − h k , n T w ) 2 J_k(w) = \frac{1}{2N}\sum_{n=1}^N(\gamma_{k,n} - h_{k,n}^T w)^2 J k ( w ) = 2 N 1 ∑ n = 1 N ( γ k , n − h k , n T w ) 2 Multiple graph structures tested:
Path Graph : K=16 agents, τ=7Hypercube : K=8 agents, τ=3Varying τ : Fixed K=16, varying τMean Squared Deviation (MSD) :
MSD = E ∥ W i − 1 ⊗ w o ∥ 2 \text{MSD} = \mathbb{E}\|W_i - \mathbf{1}\otimes w^o\|^2 MSD = E ∥ W i − 1 ⊗ w o ∥ 2
Exact FTC : Constructed using graph-theoretic methods satisfying ϵ τ = 0 \epsilon_\tau=0 ϵ τ = 0 Approximate FTC :
Add noise to nonzero elements Adjust diagonal elements maintaining double stochasticity Test ϵ τ ∈ { 0.3 , 0.4 , 0.6 } \epsilon_\tau \in \{0.3, 0.4, 0.6\} ϵ τ ∈ { 0.3 , 0.4 , 0.6 } Step size settings:
Path graph experiments: μ = 8 × 10 − 3 \mu = 8 \times 10^{-3} μ = 8 × 1 0 − 3 Varying τ experiments: μ = 5 × 10 − 3 \mu = 5 \times 10^{-3} μ = 5 × 1 0 − 3 Hypercube experiments: Not explicitly specified Stochastic gradient: Randomly select sample n i n_i n i per iteration to compute ∇ J ^ k ( w k , i ) = h k , n i ( h k , n i T w k , i − γ k , n i ) \nabla\hat{J}_k(w_{k,i}) = h_{k,n_i}(h_{k,n_i}^T w_{k,i} - \gamma_{k,n_i}) ∇ J ^ k ( w k , i ) = h k , n i ( h k , n i T w k , i − γ k , n i ) Comparison method: Metropolis-Hastings weights (classical static weights) Path Graph (K=16, τ=7) :
Exact FTC (ϵ τ = 0 \epsilon_\tau=0 ϵ τ = 0 ): MSD converges to lowest steady-state errorModerate Approximation (ϵ τ = 0.3 \epsilon_\tau=0.3 ϵ τ = 0.3 ): Steady-state error slightly increases but remains significantly better than MetropolisHigh Approximation Error (ϵ τ = 0.6 \epsilon_\tau=0.6 ϵ τ = 0.6 ): Steady-state error increases further, but performance degradation is mildKey Findings :
Performance degradation is graceful , indicating Aug-DGM's robustness to FTC sequence imprecision Convergence is maintained even at ϵ τ = 0.6 \epsilon_\tau=0.6 ϵ τ = 0.6 Validates theoretical prediction: steady-state error grows as ϵ τ ( 1 − ϵ τ ) 2 \frac{\epsilon_\tau}{(1-\epsilon_\tau)^2} ( 1 − ϵ τ ) 2 ϵ τ Fixed K=16, Varying τ :
Increasing τ leads to increased steady-state error Validates theoretical O ( μ 2 τ 2 ) O(\mu^2\tau^2) O ( μ 2 τ 2 ) dependence Physical Interpretation :
Longer τ means agents drift along local gradients for more time during complete FTC sequence Local model differences accumulate, requiring smaller step sizes to compensate (condition μ ≤ O ( 1 / τ ) \mu \leq O(1/\tau) μ ≤ O ( 1/ τ ) ) Detailed Observations on Path Graph :
Centroid Error : Monotonically decreases (per iteration)Consensus Error :
May grow within τ-step periods Significantly decreases every τ steps Exhibits sawtooth pattern Exact FTC on Path Graph (Figure 4b):
Error grows within sequence (agent drift) Sharp decrease every τ=7 steps (consensus recovery) Perfectly demonstrates the dual time-scale characteristic predicted by theory Hypercube Graph (Figure 4a, K=8, τ=3):
Exact FTC significantly outperforms Metropolis Low τ makes FTC particularly effective Counter-intuitive Result on Path Graph (Figure 4b):
Exact FTC (τ=7, ϵ τ = 0 \epsilon_\tau=0 ϵ τ = 0 ): Moderate performanceApproximate FTC (τ=3, ϵ τ = 0.4 \epsilon_\tau=0.4 ϵ τ = 0.4 ): Best performance Metropolis (τ=1, ϵ τ = λ = 0.95 \epsilon_\tau=\lambda=0.95 ϵ τ = λ = 0.95 ): Worst performanceCore Insight :
Steady-state error depends on ratio τ 2 ( 1 − ϵ τ ) 2 \frac{\tau^2}{(1-\epsilon_\tau)^2} ( 1 − ϵ τ ) 2 τ 2 :
Exact FTC: 49 1 = 49 \frac{49}{1} = 49 1 49 = 49 Approximate FTC: 9 0.36 = 25 \frac{9}{0.36} = 25 0.36 9 = 25 (superior!) Metropolis: 1 0.0025 = 400 \frac{1}{0.0025} = 400 0.0025 1 = 400 This reveals that for large-diameter graphs, deliberately using shorter approximate FTC sequences can outperform pursuing exact but longer sequences .
Robustness Verification : Algorithm exhibits strong robustness to FTC imprecision, with ϵ τ \epsilon_\tau ϵ τ reaching 0.6 still convergingTheory Alignment : All trends match theoretical bounds both qualitatively and quantitativelyDesign Guidance : Reveals practical design trade-offs—short approximate sequences may outperform long exact sequencesDual Time-Scale : Experiments clearly demonstrate the non-monotonic behavior predicted by theoryClassical Methods : Diffusion 3 , DGD 4 , EXTRA 5 Gradient Tracking : NEXT 6 , Exact Diffusion 7 , D² 8,11 Static Weight Optimization : Metropolis-Hastings 2 , topology-specific optimal weights 12 Theoretical Foundations : Linear averaging consensus 21-23 Construction Methods :
Eigenvalue decomposition 18-20 Graph filtering techniques 25,26 Decentralized learning 24 Applications : Sparse communication 14,17 , accelerated convergence 24 General Time-Varying Matrices : 6,29-34 assume τ-connectivityDIGing 29 : Achieves ( 1 − Θ ( μ ν ) ) 1 / ( 2 τ ) (1-\Theta(\mu\nu))^{1/(2\tau)} ( 1 − Θ ( μν ) ) 1/ ( 2 τ ) convergence rate under strong convexityThis Paper's Advantage : Exploits periodicity to achieve 1 − Θ ( ν μ ) + Θ ( ν ϵ τ μ ) 1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu) 1 − Θ ( νμ ) + Θ ( ν ϵ τ μ ) amortized rateMethod Objective Composite Matrix Steady-State Performance Convergence Rate ATC-Diffusion Strongly Convex Static O ( μ σ 2 / K + μ 2 λ 2 σ 2 / ( 1 − λ ) ) O(\mu\sigma^2/K + \mu^2\lambda^2\sigma^2/(1-\lambda)) O ( μ σ 2 / K + μ 2 λ 2 σ 2 / ( 1 − λ )) 1 − Θ ( μ ν ) 1-\Theta(\mu\nu) 1 − Θ ( μν ) ATC-GT PL Condition Static O ( μ σ 2 / K + μ 2 λ 4 σ 2 / ( 1 − λ ) ) O(\mu\sigma^2/K + \mu^2\lambda^4\sigma^2/(1-\lambda)) O ( μ σ 2 / K + μ 2 λ 4 σ 2 / ( 1 − λ )) 1 − Θ ( μ ν ) 1-\Theta(\mu\nu) 1 − Θ ( μν ) DIGing Strongly Convex Time-Varying - ( 1 − Θ ( μ ν ) ) 1 / ( 2 τ ) (1-\Theta(\mu\nu))^{1/(2\tau)} ( 1 − Θ ( μν ) ) 1/ ( 2 τ ) NEXT-FTC Non-Convex Exact FTC O ( μ σ 2 / K + μ 2 τ 3 σ 2 ) O(\mu\sigma^2/K + \mu^2\tau^3\sigma^2) O ( μ σ 2 / K + μ 2 τ 3 σ 2 ) - This Paper Strongly Convex Approximate FTC O ( μ σ 2 / K + μ 2 τ 2 σ 2 / ( K ( 1 − ϵ τ ) 2 ) ) O(\mu\sigma^2/K + \mu^2\tau^2\sigma^2/(K(1-\epsilon_\tau)^2)) O ( μ σ 2 / K + μ 2 τ 2 σ 2 / ( K ( 1 − ϵ τ ) 2 )) 1 − Θ ( ν μ ) + Θ ( ν ϵ τ μ ) 1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu) 1 − Θ ( νμ ) + Θ ( ν ϵ τ μ )
This Paper's Advantages :
First analysis of approximate FTC (ϵ τ > 0 \epsilon_\tau>0 ϵ τ > 0 ) Compared to DIGing, amortized convergence rate is τ times faster Compared to NEXT, improved τ dependence (τ 2 \tau^2 τ 2 vs τ 3 \tau^3 τ 3 ) Theoretical Contributions :First complete convergence analysis for approximate FTC sequences Steady-state MSD: O ( μ ( σ 2 + β 2 ζ 2 ) ν K + μ 2 τ 2 δ 2 ( 1 + ϵ τ ) ( σ 2 + β 2 ζ 2 ) ν 2 K ( 1 − ϵ τ ) 2 + μ 2 τ 2 ( 1 + ϵ τ ) ( σ 2 + β 2 ζ 2 ) ( 1 − ϵ τ ) 2 ) O\left(\frac{\mu(\sigma^2+\beta^2\zeta^2)}{\nu K} + \frac{\mu^2\tau^2\delta^2(1+\epsilon_\tau)(\sigma^2+\beta^2\zeta^2)}{\nu^2K(1-\epsilon_\tau)^2} + \frac{\mu^2\tau^2(1+\epsilon_\tau)(\sigma^2+\beta^2\zeta^2)}{(1-\epsilon_\tau)^2}\right) O ( ν K μ ( σ 2 + β 2 ζ 2 ) + ν 2 K ( 1 − ϵ τ ) 2 μ 2 τ 2 δ 2 ( 1 + ϵ τ ) ( σ 2 + β 2 ζ 2 ) + ( 1 − ϵ τ ) 2 μ 2 τ 2 ( 1 + ϵ τ ) ( σ 2 + β 2 ζ 2 ) ) Convergence rate: γ = 1 − τ ν μ 8 + 3 τ ν ϵ τ ( 1 + ϵ τ ) μ 32 \gamma = 1 - \frac{\tau\nu\mu}{8} + \frac{3\tau\nu\epsilon_\tau(1+\epsilon_\tau)\mu}{32} γ = 1 − 8 τνμ + 32 3 τν ϵ τ ( 1 + ϵ τ ) μ Practical Guidance :Approximate FTC sequences are sufficiently effective in practice Optimal trade-off exists between τ and ϵ τ \epsilon_\tau ϵ τ Short approximate sequences may outperform long exact sequences Methodological Innovation :Dual time-scale analysis framework applicable to other periodic algorithms Maximum error recursion technique handles non-monotonic behavior Theoretical Constraints :Requires ϵ τ ≤ 0.75 \epsilon_\tau \leq 0.75 ϵ τ ≤ 0.75 for analytical guarantees (experiments show larger values still converge) Step size condition μ ≤ O ( K ( 1 − ϵ τ ) τ ) \mu \leq O\left(\frac{\sqrt{K(1-\epsilon_\tau)}}{\tau}\right) μ ≤ O ( τ K ( 1 − ϵ τ ) ) is restrictive for large τ Assumption Constraints :Strong convexity (Assumption 1): Limits applicability scope Gradient noise structure (Assumption 2): Bounded variance assumption τ-strong connectivity (Assumption 3): Requires sequence product graph connectivity Experimental Scope :Only linear regression tested Small network scale (K≤16) No large-scale deep learning scenarios Limited Comparisons :No comparison with other gradient tracking variants (e.g., Push-Sum GT) Lacks comparison with latest FTC construction methods Research directions suggested by the paper:
Extension to Non-Convex Settings : Current analysis limited to strong convexity; extend to general non-convex or PL conditionsAdaptive τ Selection : Dynamically adjust sequence length based on network stateDistributed FTC Construction : Learn and optimize FTC sequences in a decentralized mannerCommunication Efficiency : Study sparse-activation FTC sequences (not all edges communicate every step)Asynchronous Settings : Analyze approximate FTC under asynchronous updatesTime-Varying Networks : FTC design when network topology changes dynamicallyComplete Convergence Analysis : Logically sound from assumptions to main theorem with detailed proofs (9-page appendix)Innovative Analysis Techniques : Dual time-scale framework cleverly handles different evolution speeds of centroid and consensus errorsTight Bounds : Through careful matrix norm analysis and Young's inequality applications, achieves bounds with explicit dependenciesProblem-Oriented : Directly addresses the practical difficulty of obtaining exact FTC sequencesDesign Guidance : The τ 2 ( 1 − ϵ τ ) 2 \frac{\tau^2}{(1-\epsilon_\tau)^2} ( 1 − ϵ τ ) 2 τ 2 trade-off provides clear engineering guidanceRobustness Guarantees : Proves algorithm tolerates imprecision wellVariable Transformation : Two clever transformations simplify original coupled recursionsMaximum Error Recursion : Successfully reconciles two time scales by considering maximum error within periodsSpectral Radius Bound : Lemma 5's proof technique (exploiting block upper triangular structure) is instructiveThorough Theory Verification : Figures 2-4 systematically validate all theoretical predictionsDeep Insights : Figure 4b's counter-intuitive result (approximate outperforms exact) has important implicationsClear Visualization : Figure 1 perfectly demonstrates dual time-scale phenomenonConservative Conditions : ϵ τ ≤ 0.75 \epsilon_\tau \leq 0.75 ϵ τ ≤ 0.75 and step size conditions may be overly restrictiveLarge Constants : Constants in bounds (e.g., 720) may indicate loosenessStrong Convexity Restriction : Cannot directly apply to modern deep learning (non-convex)Simple Problems : Only linear regression tested; lacks complex tasks (e.g., neural network training)Limited Scale : K≤16 cannot reflect large-scale distributed system characteristicsSingle Baseline : Only compared with Metropolis; lacks comparison with other advanced methodsFTC Construction : Doesn't discuss how to practically obtain approximate FTC sequencesCommunication Cost : Doesn't quantify communication complexity (though mentions single time-scale)Heterogeneity : Doesn't consider agent computational heterogeneity or data distribution heterogeneityDense Notation : Extensive mathematical symbols may impact readabilityMain Theorem Placement : Theorem 1 appears late (page 6), hindering quick grasp of core resultsExperimental Details : Some hyperparameter choices (e.g., hypercube's μ) unexplainedSignificant Theoretical Contribution : Fills gap in approximate FTC analysis, provides foundation for future researchMethodological Value : Dual time-scale analysis framework generalizable to other periodic algorithmsCitation Potential : Expected to gain attention in decentralized optimization and federated learningModerate to High :
Positive: Provides theoretical support for practical system design Negative: Requires further engineering (e.g., FTC construction algorithms) Applicable Scenarios :
Distributed estimation in sensor networks Federated learning in edge computing Cooperative control in robot swarms Theory Reproducible : Complete proofs, verifiable derivationsExperiments :
Positive: Clear data generation process Negative: No code link; insufficient FTC matrix construction details Medium-Scale Networks (K=10-100): Theory most effectiveSparse Topologies : Path graphs, rings, trees, etc. with low connectivityStrongly Convex Problems : Logistic regression, ridge regression, certain SVM variantsCommunication-Limited : Scenarios requiring reduced communication roundsLarge-Scale Deep Learning : Non-convexity and scale exceed theoretical scopeDensely Connected Networks : Complete or near-complete graphs where classical methods sufficeExtreme Asynchrony : Theory assumes synchronous updatesDynamic Topologies : Frequent network structure changesFederated Learning : Combine client sampling with FTC designPrivacy Protection : FTC sequences may facilitate differential privacy mechanismsCompressed Communication : Study quantization effects on FTC sequencesOnline Learning : FTC strategies for time-varying objectivesHeterogeneous Systems : Adapt to computational and statistical heterogeneity2 A. H. Sayed, "Adaptation, Learning, and Optimization over Networks," Foundations and Trends in Machine Learning, 2014. (Decentralized optimization foundations)
17 E. D. H. Nguyen et al., "On graphs with finite-time consensus and their use in gradient tracking," SIAM J. Optim., 2025. (Exact FTC in gradient tracking)
24 A. Fainman and S. Vlaski, "Learned finite-time consensus for distributed optimization," EUSIPCO, 2024. (Learning FTC sequences)
29 A. Nedić et al., "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM J. Optim., 2017. (DIGing algorithm)
35 J. Xu et al., "Augmented distributed gradient methods for multi-agent optimization," CDC, 2015. (Original Aug-DGM algorithm)
Overall Assessment : This is an excellent paper with rigorous theory and clear contributions. The dual time-scale analysis framework represents methodological innovation, convergence guarantees for approximate FTC fill a theoretical gap, and the τ-ϵ_τ trade-off revealed by experiments has important practical value. Main limitations are the strong convexity assumption restricting applicability and relatively small experimental scale. Recommend future work extend to non-convex settings and validate on large-scale problems. Suitable for publication in top-tier optimization or signal processing journals.