2025-11-26T14:28:18.825152

On the Convergence of Decentralized Stochastic Gradient-Tracking with Finite-Time Consensus

Fainman, Vlaski

Algorithms for decentralized optimization and learning rely on local optimization steps coupled with combination steps over a graph. Recent works have demonstrated that using a time-varying sequence of matrices that achieves finite-time consensus can improve the communication and iteration complexity of decentralized optimization algorithms based on gradient tracking. In practice, a sequence of matrices satisfying the exact finite-time consensus property may not be available due to imperfect knowledge of the network topology, a limit on the length of the sequence, or numerical instabilities. In this work, we quantify the impact of approximate finite-time consensus sequences on the convergence of a gradient-tracking based decentralized optimization algorithm. Our results hold for any periodic sequence of combination matrices. We clarify the interplay between approximation error of the finite-time consensus sequence and the length of the sequence as well as typical problem parameters such as smoothness and gradient noise.

academic

On the Convergence of Decentralized Stochastic Gradient-Tracking with Finite-Time Consensus

Basic Information

Paper ID: 2505.23577
Title: On the Convergence of Decentralized Stochastic Gradient-Tracking with Finite-Time Consensus
Authors: Aaron Fainman, Stefan Vlaski (Imperial College London)
Categories: math.OC (Optimization and Control), eess.SP (Signal Processing)
Publication Date: November 24, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2505.23577
Preliminary Version: Published at Asilomar Conference on Signals, Systems, and Computers, 2025

Abstract

This paper investigates the convergence performance of decentralized optimization algorithms based on gradient-tracking when using approximate finite-time consensus (FTC) matrix sequences. In practical applications, matrix sequences satisfying exact FTC properties may be unavailable due to imperfect network topology knowledge, sequence length constraints, or numerical instability. The paper quantifies the impact of approximate FTC sequences on algorithm convergence, with results applicable to arbitrary periodic composite matrix sequences. It clarifies the interactions between FTC sequence approximation error, sequence length, and problem parameters such as smoothness and gradient noise.

Research Background and Motivation

Problem Background

In decentralized optimization, K agents cooperatively solve an aggregated optimization problem: $w^o = \arg\min_{w \in \mathbb{R}^M} J(w) = \arg\min_{w \in \mathbb{R}^M} \frac{1}{K}\sum_{k=1}^K J_k(w)$

Each agent can only access its local data and coordinates optimization through neighbor communication on a network graph.

Core Challenges

Performance bounds for decentralized optimization algorithms typically consist of two components:

Optimization Error: Arising from iterative gradient updates, representing an inherent lower bound for distributed algorithms
Consensus Error: Arising from limited network information diffusion, significantly affecting overall performance in sparsely connected networks

Limitations of Existing Methods

Classical Weight Rules (e.g., Metropolis-Hastings): Only guarantee asymptotic convergence
Exact FTC Sequences: Achieve exact consensus within finite steps τ, but are difficult to obtain in practice:
- Imperfect network topology knowledge
- Numerical methods (eigenvalue decomposition, graph filtering) introduce errors
- Sequence length constraints

Research Motivation

Although empirical evidence suggests approximate FTC sequences still provide significant benefits, theoretical analysis quantifying their impact is lacking. This paper fills this theoretical gap by analyzing the specific impact of approximation error ϵ_τ on algorithm convergence.

Core Contributions

Theoretical Guarantees: Provides complete performance bounds for gradient-tracking algorithms (Aug-DGM) using approximate FTC sequences, applicable to arbitrary periodic matrix sequences. Compared to existing bounds that don't exploit periodicity (e.g., DIGing's $(1-\Theta(\mu\nu))^{1/(2\tau)}$ convergence rate), this achieves a significantly improved amortized convergence rate of $1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu)$ .
Dual Time-Scale Analysis: Proposes a novel analytical framework that reconciles the algorithm's dual time-scale behavior:
- Single-step monotonic decrease of centroid error
- Periodic τ-step decrease of consensus error
Trade-off Characterization: Precisely quantifies the trade-off between approximation error ϵ_τ and sequence length τ, revealing that FTC sequences are particularly effective when τ is small, and in some cases approximate FTC outperforms exact FTC.
Tight Bound Analysis: Steady-state error is $O\left(\frac{\mu\sigma^2}{K} + \frac{\mu^2\tau^2\sigma^2}{K(1-\epsilon_\tau)^2} + \frac{\mu^2\tau^2\sigma^2}{(1-\epsilon_\tau)^2}\right)$ , clearly showing the impact of each parameter.

Methodology Details

Problem Formulation

Consider K agents on an undirected graph G=(N,E) cooperatively solving problem (1), where:

Each agent k can only access local objective $J_k(w) = \mathbb{E}Q_k(w;x_k)$
Agents can only exchange information with neighbors $\mathcal{N}_k$
Using periodic composite matrix sequences $\{A_i\}$ satisfying approximate FTC property: $\epsilon_\tau \triangleq \left\|A_\tau \cdots A_2 A_1 - \frac{1}{K}\mathbf{1}\mathbf{1}^T\right\|$

Algorithm Architecture: Aug-DGM with FTC

Agent k executes at iteration i:

Model Update: $w_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k,i}(w_{\ell,i-1} - g_{\ell,i-1})$

Gradient Tracking Update: $g_{k,i} = \sum_{\ell \in \mathcal{N}_k} a_{\ell k,i}\left(g_{\ell,i-1} + \mu\hat{\nabla}J_\ell(w_{\ell,i}) - \mu\hat{\nabla}J_\ell(w_{\ell,i-1})\right)$

Where:

$\mu$ is the step size
$A_i = A_{i\%\tau}$ cycles through the FTC sequence periodically
$\hat{\nabla}J_k(\cdot)$ is the stochastic gradient approximation

Network-Level Representation: $W_i = \mathcal{A}_i(W_{i-1} - G_{i-1})$ $G_i = \mathcal{A}_i(G_{i-1} + \mu\hat{\nabla}J(W_i) - \mu\hat{\nabla}J(W_{i-1}))$

Technical Innovations

1. Variable Transformation Technique

Simplifies analysis through two transformations:

First: $Y_i \triangleq G_i - \mu\mathcal{A}_i\hat{\nabla}J(W_i)$
Second: $Z_i \triangleq Y_i + \mu\mathcal{A}_i\nabla J(W_{c,i})$

Yielding coupled recursions: $W_i = \mathcal{A}_iW_{i-1} - \mathcal{A}_iZ_{i-1} + \text{drift terms}$ $Z_i = \mathcal{A}_iZ_{i-1} + \text{correction terms}$

2. Error Decomposition Strategy

Decomposes total error into:

Centroid Error: $\tilde{w}_{c,i} \triangleq w^o - w_{c,i}$ , where $w_{c,i} = \frac{1}{K}\sum_{k=1}^K w_{k,i}$
Consensus Error: $\hat{X}_i \triangleq [\hat{W}_i^T, \hat{Z}_i^T]^T$ , where $\hat{W}_i = \hat{I}W_i$ , $\hat{I} = (I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^T)\otimes I_M$

3. Dual Time-Scale Analysis Framework

Consensus Error Analysis (Lemma 3): Traces back from iteration i to the nearest complete FTC sequence at $m\tau+1$ (where $m=\lfloor i/\tau \rfloor - 1$ ): $\hat{X}_i = \mathcal{G}_{i:m\tau+1}\hat{X}_{m\tau} - \mu\sum_{j=m\tau+1}^{i-1}\mathcal{G}_{i:j+1}h_j - \mu h_i - \text{noise terms}$

Key bound: $\mathbb{E}\|\hat{X}_i\|^2 \leq \theta_1\mathbb{E}\|\hat{X}_{m\tau}\|^2 + \theta_2\sum_{j=m\tau}^{i-1}\mathbb{E}\|\hat{X}_j\|^2 + \theta_3\sum_{j=m\tau}^{i-1}\mathbb{E}\|\tilde{w}_{c,j}\|^2 + \theta_4$

Where $\theta_1 = \frac{\epsilon_\tau}{2}(1+\epsilon_\tau)$ is nonzero only when $\epsilon_\tau>0$ .

Centroid Error Analysis (Lemma 4): Single-step recursion: $\mathbb{E}\|\tilde{w}_{c,i}\|^2 \leq \alpha_1\mathbb{E}\|\tilde{w}_{c,i-1}\|^2 + \alpha_2\mathbb{E}\|\hat{X}_{i-1}\|^2 + \alpha_3$

Where $\alpha_1 = 1 - \frac{\nu\mu}{2}$ (driven by strong convexity constant).

4. Maximum Error Recursion

To reconcile the two time scales, define:

$\omega_m = \max_{i\in S_m}\mathbb{E}\|\tilde{w}_{c,i}\|^2$
$\chi_m = \max_{i\in S_m}\mathbb{E}\|\hat{X}_i\|^2$

Where $S_m = \{(m-1)\tau, \ldots, m\tau-1\}$ is the m-th sequence block.

Derives coupled recursion (core result): $\begin{bmatrix}\omega_m \\ \chi_m\end{bmatrix} \leq H\begin{bmatrix}\omega_{m-1} \\ \chi_{m-1}\end{bmatrix} + p + O(\mu^3)$

Where: $H = \begin{bmatrix}1-\frac{\tau\nu\mu}{4} & \alpha_2\tau(1+\frac{3}{2}\theta_1) \\ \frac{3}{2}\tau\theta_3 & \frac{3}{2}(\theta_1+\tau\theta_2)\end{bmatrix}$

5. Spectral Radius Bound

Through careful analysis (Lemma 5): $\rho(H) \leq 1 - \frac{\tau\nu\mu}{8} + \frac{3\tau\nu\mu\epsilon_\tau(1+\epsilon_\tau)}{32} + O(\mu^3)$

Condition: $\mu \leq \frac{\nu\sqrt{K(1-\epsilon_\tau)}}{72\tau\delta\sqrt{\delta^2+\beta^2}}$

Key Technical Details

Perturbation Term Bounds (Lemma 2):
- Gradient noise term $v_i$ : $\mathbb{E}[\|v_i\|^2|W_{i-1}] \leq 9\beta^2\zeta^2 + 9\beta^2K\|\tilde{w}_{c,i-1}\|^2 + 9\beta^2\|\hat{W}_{i-1}\|^2 + 3\sigma^2$
- Drift term $h_i$ : $\mathbb{E}[\|h_i\|^2|W_{i-1}] \leq 3(2\delta^2+\frac{1}{2}\beta^2)\|\hat{W}_{i-1}\| + 2(\delta^2+\beta^2K)\|\tilde{w}_{c,i-1}\|^2 + \frac{3}{2}\beta^2\zeta^2 + \frac{1}{2}\sigma^2$
Young's Inequality Application: Systematically uses Young's inequality to separate cross terms, with parameter selection $\alpha_j = \frac{1}{\gamma-j}$ achieving optimal balance.
Matrix Norm Techniques: Exploits block upper triangular structure $\|\mathcal{G}_{i:m\tau+1}\| \leq \epsilon_\tau$ (when $i \geq (m+1)\tau$ ).

Experimental Setup

Datasets

Linear Regression Problem:

Data generation model: $\gamma_{k,n} = h_{k,n}^T w^o + v_n$
Feature dimension: $M=20$
Samples per agent: $N=30$
Feature distribution: $h_{k,n} \sim \mathcal{N}(0, I_M)$
Noise distribution: $v_{k,n} \sim \mathcal{N}(0, 0.1)$
Objective function: $J_k(w) = \frac{1}{2N}\sum_{n=1}^N(\gamma_{k,n} - h_{k,n}^T w)^2$

Network Topologies

Multiple graph structures tested:

Path Graph: K=16 agents, τ=7
Hypercube: K=8 agents, τ=3
Varying τ: Fixed K=16, varying τ

Evaluation Metrics

Mean Squared Deviation (MSD): $\text{MSD} = \mathbb{E}\|W_i - \mathbf{1}\otimes w^o\|^2$

FTC Sequence Generation

Exact FTC: Constructed using graph-theoretic methods satisfying $\epsilon_\tau=0$
Approximate FTC:
- Add noise to nonzero elements
- Adjust diagonal elements maintaining double stochasticity
- Test $\epsilon_\tau \in \{0.3, 0.4, 0.6\}$

Implementation Details

Step size settings:
- Path graph experiments: $\mu = 8 \times 10^{-3}$
- Varying τ experiments: $\mu = 5 \times 10^{-3}$
- Hypercube experiments: Not explicitly specified
Stochastic gradient: Randomly select sample $n_i$ per iteration to compute $\nabla\hat{J}_k(w_{k,i}) = h_{k,n_i}(h_{k,n_i}^T w_{k,i} - \gamma_{k,n_i})$
Comparison method: Metropolis-Hastings weights (classical static weights)

Experimental Results

Main Results

1. Approximation Error Impact (Figure 2)

Path Graph (K=16, τ=7):

Exact FTC ( $\epsilon_\tau=0$ ): MSD converges to lowest steady-state error
Moderate Approximation ( $\epsilon_\tau=0.3$ ): Steady-state error slightly increases but remains significantly better than Metropolis
High Approximation Error ( $\epsilon_\tau=0.6$ ): Steady-state error increases further, but performance degradation is mild

Key Findings:

Performance degradation is graceful, indicating Aug-DGM's robustness to FTC sequence imprecision
Convergence is maintained even at $\epsilon_\tau=0.6$
Validates theoretical prediction: steady-state error grows as $\frac{\epsilon_\tau}{(1-\epsilon_\tau)^2}$

2. Sequence Length Impact (Figure 3)

Fixed K=16, Varying τ:

Increasing τ leads to increased steady-state error
Validates theoretical $O(\mu^2\tau^2)$ dependence

Physical Interpretation:

Longer τ means agents drift along local gradients for more time during complete FTC sequence
Local model differences accumulate, requiring smaller step sizes to compensate (condition $\mu \leq O(1/\tau)$ )

3. Dual Time-Scale Behavior (Figures 1 and 4b)

Detailed Observations on Path Graph:

Centroid Error: Monotonically decreases (per iteration)
Consensus Error:
- May grow within τ-step periods
- Significantly decreases every τ steps
- Exhibits sawtooth pattern

Exact FTC on Path Graph (Figure 4b):

Error grows within sequence (agent drift)
Sharp decrease every τ=7 steps (consensus recovery)
Perfectly demonstrates the dual time-scale characteristic predicted by theory

4. Trade-off Between τ and ϵ_τ (Figure 4)

Hypercube Graph (Figure 4a, K=8, τ=3):

Exact FTC significantly outperforms Metropolis
Low τ makes FTC particularly effective

Counter-intuitive Result on Path Graph (Figure 4b):

Exact FTC (τ=7, $\epsilon_\tau=0$ ): Moderate performance
Approximate FTC (τ=3, $\epsilon_\tau=0.4$ ): Best performance
Metropolis (τ=1, $\epsilon_\tau=\lambda=0.95$ ): Worst performance

Core Insight: Steady-state error depends on ratio $\frac{\tau^2}{(1-\epsilon_\tau)^2}$ :

Exact FTC: $\frac{49}{1} = 49$
Approximate FTC: $\frac{9}{0.36} = 25$ (superior!)
Metropolis: $\frac{1}{0.0025} = 400$

This reveals that for large-diameter graphs, deliberately using shorter approximate FTC sequences can outperform pursuing exact but longer sequences.

Experimental Findings Summary

Robustness Verification: Algorithm exhibits strong robustness to FTC imprecision, with $\epsilon_\tau$ reaching 0.6 still converging
Theory Alignment: All trends match theoretical bounds both qualitatively and quantitatively
Design Guidance: Reveals practical design trade-offs—short approximate sequences may outperform long exact sequences
Dual Time-Scale: Experiments clearly demonstrate the non-monotonic behavior predicted by theory

Decentralized Optimization Foundations

Classical Methods: Diffusion 3, DGD 4, EXTRA 5
Gradient Tracking: NEXT 6, Exact Diffusion 7, D² 8,11
Static Weight Optimization: Metropolis-Hastings 2, topology-specific optimal weights 12

Finite-Time Consensus

Theoretical Foundations: Linear averaging consensus 21-23
Construction Methods:
- Eigenvalue decomposition 18-20
- Graph filtering techniques 25,26
- Decentralized learning 24
Applications: Sparse communication 14,17, accelerated convergence 24

Time-Varying Weight Analysis

General Time-Varying Matrices: 6,29-34 assume τ-connectivity
DIGing 29: Achieves $(1-\Theta(\mu\nu))^{1/(2\tau)}$ convergence rate under strong convexity
This Paper's Advantage: Exploits periodicity to achieve $1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu)$ amortized rate

Comparative Analysis (Table I)

Method	Objective	Composite Matrix	Steady-State Performance	Convergence Rate
ATC-Diffusion	Strongly Convex	Static	$O(\mu\sigma^2/K + \mu^2\lambda^2\sigma^2/(1-\lambda))$	$1-\Theta(\mu\nu)$
ATC-GT	PL Condition	Static	$O(\mu\sigma^2/K + \mu^2\lambda^4\sigma^2/(1-\lambda))$	$1-\Theta(\mu\nu)$
DIGing	Strongly Convex	Time-Varying	-	$(1-\Theta(\mu\nu))^{1/(2\tau)}$
NEXT-FTC	Non-Convex	Exact FTC	$O(\mu\sigma^2/K + \mu^2\tau^3\sigma^2)$	-
This Paper	Strongly Convex	Approximate FTC	$O(\mu\sigma^2/K + \mu^2\tau^2\sigma^2/(K(1-\epsilon_\tau)^2))$	$1-\Theta(\nu\mu)+\Theta(\nu\epsilon_\tau\mu)$

This Paper's Advantages:

First analysis of approximate FTC ( $\epsilon_\tau>0$ )
Compared to DIGing, amortized convergence rate is τ times faster
Compared to NEXT, improved τ dependence ( $\tau^2$ vs $\tau^3$ )

Conclusions and Discussion

Main Conclusions

Theoretical Contributions:
- First complete convergence analysis for approximate FTC sequences
- Steady-state MSD: $O\left(\frac{\mu(\sigma^2+\beta^2\zeta^2)}{\nu K} + \frac{\mu^2\tau^2\delta^2(1+\epsilon_\tau)(\sigma^2+\beta^2\zeta^2)}{\nu^2K(1-\epsilon_\tau)^2} + \frac{\mu^2\tau^2(1+\epsilon_\tau)(\sigma^2+\beta^2\zeta^2)}{(1-\epsilon_\tau)^2}\right)$
- Convergence rate: $\gamma = 1 - \frac{\tau\nu\mu}{8} + \frac{3\tau\nu\epsilon_\tau(1+\epsilon_\tau)\mu}{32}$
Practical Guidance:
- Approximate FTC sequences are sufficiently effective in practice
- Optimal trade-off exists between τ and $\epsilon_\tau$
- Short approximate sequences may outperform long exact sequences
Methodological Innovation:
- Dual time-scale analysis framework applicable to other periodic algorithms
- Maximum error recursion technique handles non-monotonic behavior

Limitations

Theoretical Constraints:
- Requires $\epsilon_\tau \leq 0.75$ for analytical guarantees (experiments show larger values still converge)
- Step size condition $\mu \leq O\left(\frac{\sqrt{K(1-\epsilon_\tau)}}{\tau}\right)$ is restrictive for large τ
Assumption Constraints:
- Strong convexity (Assumption 1): Limits applicability scope
- Gradient noise structure (Assumption 2): Bounded variance assumption
- τ-strong connectivity (Assumption 3): Requires sequence product graph connectivity
Experimental Scope:
- Only linear regression tested
- Small network scale (K≤16)
- No large-scale deep learning scenarios
Limited Comparisons:
- No comparison with other gradient tracking variants (e.g., Push-Sum GT)
- Lacks comparison with latest FTC construction methods

Future Directions

Research directions suggested by the paper:

Extension to Non-Convex Settings: Current analysis limited to strong convexity; extend to general non-convex or PL conditions
Adaptive τ Selection: Dynamically adjust sequence length based on network state
Distributed FTC Construction: Learn and optimize FTC sequences in a decentralized manner
Communication Efficiency: Study sparse-activation FTC sequences (not all edges communicate every step)
Asynchronous Settings: Analyze approximate FTC under asynchronous updates
Time-Varying Networks: FTC design when network topology changes dynamically

In-Depth Evaluation

Strengths

1. Theoretical Rigor

Complete Convergence Analysis: Logically sound from assumptions to main theorem with detailed proofs (9-page appendix)
Innovative Analysis Techniques: Dual time-scale framework cleverly handles different evolution speeds of centroid and consensus errors
Tight Bounds: Through careful matrix norm analysis and Young's inequality applications, achieves bounds with explicit dependencies

2. Practical Value

Problem-Oriented: Directly addresses the practical difficulty of obtaining exact FTC sequences
Design Guidance: The $\frac{\tau^2}{(1-\epsilon_\tau)^2}$ trade-off provides clear engineering guidance
Robustness Guarantees: Proves algorithm tolerates imprecision well

3. Methodological Innovation

Variable Transformation: Two clever transformations simplify original coupled recursions
Maximum Error Recursion: Successfully reconciles two time scales by considering maximum error within periods
Spectral Radius Bound: Lemma 5's proof technique (exploiting block upper triangular structure) is instructive

4. Experimental Design

Thorough Theory Verification: Figures 2-4 systematically validate all theoretical predictions
Deep Insights: Figure 4b's counter-intuitive result (approximate outperforms exact) has important implications
Clear Visualization: Figure 1 perfectly demonstrates dual time-scale phenomenon

Weaknesses

1. Theoretical Limitations

Conservative Conditions: $\epsilon_\tau \leq 0.75$ and step size conditions may be overly restrictive
Large Constants: Constants in bounds (e.g., 720) may indicate looseness
Strong Convexity Restriction: Cannot directly apply to modern deep learning (non-convex)

2. Experimental Insufficiency

Simple Problems: Only linear regression tested; lacks complex tasks (e.g., neural network training)
Limited Scale: K≤16 cannot reflect large-scale distributed system characteristics
Single Baseline: Only compared with Metropolis; lacks comparison with other advanced methods

3. Practical Concerns

FTC Construction: Doesn't discuss how to practically obtain approximate FTC sequences
Communication Cost: Doesn't quantify communication complexity (though mentions single time-scale)
Heterogeneity: Doesn't consider agent computational heterogeneity or data distribution heterogeneity

4. Writing Details

Dense Notation: Extensive mathematical symbols may impact readability
Main Theorem Placement: Theorem 1 appears late (page 6), hindering quick grasp of core results
Experimental Details: Some hyperparameter choices (e.g., hypercube's μ) unexplained

Impact Assessment

Academic Impact

Significant Theoretical Contribution: Fills gap in approximate FTC analysis, provides foundation for future research
Methodological Value: Dual time-scale analysis framework generalizable to other periodic algorithms
Citation Potential: Expected to gain attention in decentralized optimization and federated learning

Practical Value

Moderate to High:
- Positive: Provides theoretical support for practical system design
- Negative: Requires further engineering (e.g., FTC construction algorithms)
Applicable Scenarios:
- Distributed estimation in sensor networks
- Federated learning in edge computing
- Cooperative control in robot swarms

Reproducibility

Theory Reproducible: Complete proofs, verifiable derivations
Experiments:
- Positive: Clear data generation process
- Negative: No code link; insufficient FTC matrix construction details

Applicable Scenarios

Best Suited For

Medium-Scale Networks (K=10-100): Theory most effective
Sparse Topologies: Path graphs, rings, trees, etc. with low connectivity
Strongly Convex Problems: Logistic regression, ridge regression, certain SVM variants
Communication-Limited: Scenarios requiring reduced communication rounds

Not Suitable For

Large-Scale Deep Learning: Non-convexity and scale exceed theoretical scope
Densely Connected Networks: Complete or near-complete graphs where classical methods suffice
Extreme Asynchrony: Theory assumes synchronous updates
Dynamic Topologies: Frequent network structure changes

Potential Extensions

Federated Learning: Combine client sampling with FTC design
Privacy Protection: FTC sequences may facilitate differential privacy mechanisms
Compressed Communication: Study quantization effects on FTC sequences
Online Learning: FTC strategies for time-varying objectives
Heterogeneous Systems: Adapt to computational and statistical heterogeneity

Key References

2 A. H. Sayed, "Adaptation, Learning, and Optimization over Networks," Foundations and Trends in Machine Learning, 2014. (Decentralized optimization foundations)

17 E. D. H. Nguyen et al., "On graphs with finite-time consensus and their use in gradient tracking," SIAM J. Optim., 2025. (Exact FTC in gradient tracking)

24 A. Fainman and S. Vlaski, "Learned finite-time consensus for distributed optimization," EUSIPCO, 2024. (Learning FTC sequences)

29 A. Nedić et al., "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM J. Optim., 2017. (DIGing algorithm)

35 J. Xu et al., "Augmented distributed gradient methods for multi-agent optimization," CDC, 2015. (Original Aug-DGM algorithm)

Overall Assessment: This is an excellent paper with rigorous theory and clear contributions. The dual time-scale analysis framework represents methodological innovation, convergence guarantees for approximate FTC fill a theoretical gap, and the τ-ϵ_τ trade-off revealed by experiments has important practical value. Main limitations are the strong convexity assumption restricting applicability and relatively small experimental scale. Recommend future work extend to non-convex settings and validate on large-scale problems. Suitable for publication in top-tier optimization or signal processing journals.