2025-11-10T03:10:07.820308

Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives

An, Lu
The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
academic

Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives

Basic Information

  • Paper ID: 2501.17122
  • Title: Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives
  • Authors: Jing An, Jianfeng Lu (Duke University)
  • Classification: math.OC cs.LG cs.NA math.NA
  • Publication Date: January 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.17122

Abstract

Two-timescale gradient descent-ascent (GDA) algorithms are classical gradient methods for finding Nash equilibria in minimax games. This paper analyzes two-timescale GDA in both finite-dimensional and mean-field settings by investigating the influence of learning rate ratios on convergence behavior. For finite-dimensional quadratic minimax games, long-time convergence is obtained in the near-quasistatic regime through the hypocoercivity method. For mean-field GDA dynamics, convergence under finite scale ratios is studied using a hybrid synchronous-reflection coupling technique.

Research Background and Motivation

  1. Core Problem: Minimax optimization problems are ubiquitous in machine learning, including generative adversarial networks (GANs), multi-agent reinforcement learning, and optimal transport. Standard gradient descent-ascent algorithms may converge to limit cycles or diverge in nonconvex-nonconcave settings.
  2. Significance: Two-timescale GDA, which employs different learning rates for gradient descent and ascent updates, has become a popular alternative for addressing nonconvex-nonconcave difficulties. Understanding how learning rate ratios affect convergence behavior is crucial for algorithm design.
  3. Existing Limitations:
    • Optimal convergence results in finite dimensions require strong assumptions
    • In the mean-field setting, existing results are primarily restricted to quasistatic regimes (η ≫ 1 or η ≪ 1)
    • Lack of quantitative analysis regarding the learning rate ratio η
  4. Research Motivation: To answer the key question: "How does two-timescale GDA converge as a function of the learning rate ratio η?" and provide quantitative answers for both finite-dimensional and infinite-dimensional cases.

Core Contributions

  1. Finite-Dimensional Analysis: Analyzes two-timescale GDA dynamics for quadratic games through the hypocoercivity method, constructing Lyapunov functions to quantitatively estimate the relationship between convergence rate and learning rate ratio η.
  2. Preconditioner Design: Discusses how to design preconditioners for two-timescale dynamics to reduce sensitivity to extreme η values and improve convergence.
  3. Mean-Field Analysis: Studies convergence for entropy-regularized minimax problems using a hybrid synchronous-reflection coupling method, applicable to finite ranges of η and locally nonconvex-nonconcave objective functions.
  4. Unified Convergence Rates: Obtains convergence rates of the form min{√η, 1/√η} or min{1, η} in both settings, indicating that optimal η selection should be close to 1 rather than in the quasistatic regime.

Methodology Details

Problem Formulation

Finite-Dimensional Case: Consider quadratic games

min_{x∈ℝⁿ} max_{y∈ℝᵐ} K(x,y) = min_{x∈ℝⁿ} max_{y∈ℝᵐ} {½x⊤Qx + x⊤Py - ½y⊤Ry}

Infinite-Dimensional Case: Entropy-regularized minimax problems

min_{p∈P(X)} max_{q∈P(Y)} E_β(p,q) := ∫∫ K(x,y)p(x)q(y)dxdy + β⁻¹H(p) - β⁻¹H(q)

Model Architecture

Finite-Dimensional Two-Timescale GDA Dynamics

ẋ(t) = -∇_x K(x,y) = -Qx - Py
ẏ(t) = η∇_y K(x,y) = -ηRy + ηP⊤x

Through rescaling z(t) = √η x(t), the system can be written as:

φ̇(t) = -Dφ(t) - √η Lφ(t)

where D = Q 0; 0 ηR is a symmetric matrix and L = 0 P; -P⊤ 0 is a skew-symmetric matrix.

Mean-Field GDA Dynamics

∂_t p_t = ∇_x · (p_t ∫ ∇_x K(x,y)q_t(y)dy) + β⁻¹Δ_x p_t
∂_t q_t = η(-∇_y · (q_t ∫ ∇_y K(x,y)p_t(x)dx) + β⁻¹Δ_y q_t)

Technical Innovations

1. Hypocoercivity Method

Constructs a modified norm as a Lyapunov function:

H(φ) = ½‖φ‖² - ε⟨Mφ,φ⟩

where M = -(I + (LΠ)⊤LΠ)⁻¹(LΠ)⊤ and Π is a projection operator.

Key Assumptions:

  • Microscopic Coercivity: ⟨Sφ,φ⟩ ≥ λ‖(I-Π)φ‖²
  • Macroscopic Coercivity: ‖LΠφ‖² ≥ λ_L‖Πφ‖²

2. Hybrid Synchronous-Reflection Coupling

For the mean-field case, employs regularized reflection functions to avoid dependence on coupling times in local regions:

r_c^i(Z_t,Q_t)² + s_c^i(Z_t,Q_t)² = 1

Constructs distance function ρ_t = f₁(r₁(t)) + γf₂(r₂(t)), where f₁, f₂ are strictly increasing concave functions.

Experimental Setup

Theoretical Analysis Verification

The paper primarily provides theoretical analysis, verified through numerical experiments:

  • Randomly generated 10×10 symmetric positive semidefinite matrices Q, R and matrix P
  • η ranges from 0.01 to 10
  • Verification of the relationship between minimum eigenvalues and η

Evaluation Metrics

  • Finite-Dimensional: Convergence rate of the form exp(-Λmin{√η, 1/√η}s)
  • Mean-Field: Wasserstein-1 distance convergence rate W₁((p_t,q_t), (p*,q*)) ≤ Ae^{-cmin{1,η}t}W₁((p₀,q₀), (p*,q*))

Experimental Results

Main Theoretical Results

Theorem 3.1 (Finite-Dimensional Convergence)

Under appropriate assumptions, there exist constants C, Λ > 0 such that:

‖φ(s)‖² ≤ C exp(-Λ min{√η, 1/√η}s)‖φ₀‖²

Returning to original variables:

η‖x(t)‖² + ‖y(t)‖² ≤ Ce^{-Λmin{1,η}t}(η‖x(0)‖² + ‖y(0)‖²)

Theorem 5.1 (Mean-Field Convergence)

Under Assumption 5, if R ≤ √(2πβ⁻¹)min{√(m_x⁻¹), √(m_y⁻¹)} and gradient Lipschitz conditions are satisfied, then:

W₁((p_t,q_t), (p*,q*)) ≤ Amax{1,γ}e^{-ct}W₁((p₀,q₀), (p*,q*))

where c < min{c₁, ηc₂}.

Key Findings

  1. Optimal Learning Rate Ratio: Both settings indicate that η ≈ 1 is optimal, rather than in the quasistatic regime
  2. Unified Convergence Pattern: Convergence rates in both settings exhibit the form min{√η, 1/√η} or min{1, η}
  3. Necessity of Preconditioning: Extreme η values lead to deteriorated condition numbers, requiring appropriate preconditioners
  1. Two-Timescale Methods: Including classical two-timescale stochastic approximation, distributed optimization, and fixed-point finding in reinforcement learning
  2. Hypocoercivity Theory: Originally used for analyzing Boltzmann and Fokker-Planck equations, providing a variational alternative to spectral analysis
  3. Coupling Methods: Powerful tools in probability theory for comparing random variables, recently extended to contraction rate estimation for Langevin dynamics

Conclusions and Discussion

Main Conclusions

  1. The convergence behavior of two-timescale GDA strongly depends on the learning rate ratio η
  2. Optimal η selection should be close to 1, avoiding the quasistatic regime
  3. Hypocoercivity and coupling methods provide effective tools for analysis

Limitations

  1. Finite-dimensional analysis is restricted to quadratic games
  2. Mean-field analysis requires strong regularization assumptions
  3. Continuous-time analysis may not directly apply to discretized algorithms

Future Directions

  1. Extend the hypocoercivity method to nonlinear drift structures in mean-field GDA
  2. Study more general nonconvex-nonconcave objective functions
  3. Analyze the effects of discretization errors

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete convergence analysis with explicit convergence rates
  2. Methodological Innovation: Cleverly combines hypocoercivity and coupling methods to handle problems of different dimensions
  3. Practical Guidance: Provides theoretical guidance for learning rate selection
  4. Technical Depth: Addresses challenging nonlinear and infinite-dimensional problems

Weaknesses

  1. Application Scope: Finite-dimensional analysis is limited to quadratic cases with limited practical applicability
  2. Assumption Conditions: Mean-field analysis requires numerous technical assumptions
  3. Numerical Verification: Lacks large-scale numerical experiments to verify theoretical results

Impact

  1. Theoretical Contribution: Provides a new analytical framework for two-timescale optimization algorithms
  2. Methodological Value: The hypocoercivity method may be applicable to other optimization problems
  3. Practical Guidance: Provides algorithm designers with theoretical justification for learning rate selection

Applicable Scenarios

  1. Minimax optimization problems requiring theoretical guarantees
  2. Large-scale distributed game problems
  3. Stability analysis in generative model training

References

The paper cites 56 relevant references covering important works in optimization theory, probability theory, partial differential equations, and other fields, providing a solid theoretical foundation for the research.