2025-11-10T02:47:02.164832

Central Limit Theorems for Asynchronous Averaged Q-Learning

Liu
This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.
academic

Central Limit Theorems for Asynchronous Averaged Q-Learning

Basic Information

  • Paper ID: 2509.18964
  • Title: Central Limit Theorems for Asynchronous Averaged Q-Learning
  • Author: Xingtu Liu (Simon Fraser University)
  • Classification: cs.LG math.OC stat.ML
  • Conference: OPT2025: 17th Annual Workshop on Optimization for Machine Learning
  • Paper Link: https://arxiv.org/abs/2509.18964

Abstract

This paper establishes central limit theorems (CLTs) for Polyak-Ruppert averaged Q-learning under asynchronous updates. The authors prove a non-asymptotic CLT whose convergence rate in the Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, discount factor, and exploration quality. Additionally, a functional CLT is derived, demonstrating weak convergence of the partial sum process to Brownian motion.

Research Background and Motivation

Problem Background

  1. Importance of Q-Learning: Q-learning is one of the most widely used algorithms in reinforcement learning, learning optimal action-value functions directly from empirical trajectories. It has achieved tremendous success in domains such as Atari games, Go, robotic manipulation, and large language model alignment.
  2. Challenges in Theoretical Analysis:
    • Q-learning can be interpreted as an instance of stochastic approximation (SA), but asynchronous Q-learning is a nonlinear SA problem with Markovian noise
    • Compared to linear SA and temporal difference (TD) learning, Q-learning analysis is more challenging due to its nonlinearity, non-smooth operators, and non-stationary processes
    • Asynchronous updates further introduce Markovian noise, increasing analytical complexity
  3. Limitations of Existing Work:
    • Previous work established functional CLTs for synchronous Q-learning, but synchronous Q-learning only considers martingale noise
    • Zhang and Xie (2024) established functional CLTs for asynchronous Q-learning with constant step sizes, but constant step sizes do not satisfy necessary conditions for establishing non-asymptotic CLTs
    • Currently, no non-asymptotic CLT for Q-learning exists, even in synchronous settings

Research Motivation

Establishing central limit theorems is crucial for understanding the statistical properties of algorithms. This asymptotic normality is significant for uncertainty quantification and statistical inference in reinforcement learning.

Core Contributions

  1. First Non-Asymptotic CLT for Q-Learning: Proves a non-asymptotic CLT for asynchronous averaged Q-learning with convergence rate O~((SA)1/2K1/6ρ2(1γ)3)\tilde{O}((|S||A|)^{1/2}K^{-1/6}\rho^{-2}(1-\gamma)^{-3})
  2. Functional Central Limit Theorem: Establishes a functional CLT for asynchronous Q-learning with decaying step sizes, showing weak convergence of the partial sum process to Brownian motion
  3. Explicit Dependency Relationships: The convergence rate explicitly reflects dependencies on the number of iterations KK, state-action space size SA|S||A|, discount factor γ\gamma, and exploration quality ρ\rho
  4. Technical Innovation: Addresses analytical challenges posed by nonlinearity, Markovian noise, and non-smooth operators

Methodology Details

Problem Formulation

Consider an infinite-horizon discounted Markov Decision Process (MDP) M=S,A,P,r,γM = \langle S, A, P, r, \gamma \rangle, where:

  • SS: state space
  • AA: action space
  • P:S×AΔSP: S \times A \rightarrow \Delta_S: transition probability function
  • γ[0,1)\gamma \in [0,1): discount factor

The goal is to learn the optimal Q-function Q=maxπQπQ^* = \max_\pi Q^\pi.

Asynchronous Q-Learning Algorithm

Asynchronous Q-learning maintains a Q-function estimator QkQ_k with update rule: Qk+1=Qk+αk(FkQk)Q_{k+1} = Q_k + \alpha_k(F_k - Q_k)

where:

  • Fk=F(Qk,yk)F_k = F(Q_k, y_k), yk=(sk,ak,sk+1)y_k = (s_k, a_k, s_{k+1})
  • [F(Qk,sk,ak,sk+1)](s,a)=1{(sk,ak)=(s,a)}Γ(Qk,sk,ak,sk+1)+Qk(s,a)[F(Q_k, s_k, a_k, s_{k+1})](s,a) = \mathbf{1}_{\{(s_k,a_k)=(s,a)\}}\Gamma(Q_k, s_k, a_k, s_{k+1}) + Q_k(s,a)
  • Γ(Qk,sk,ak,sk+1)=rk(sk,ak)+γmaxaQk(sk+1,a)Qk(sk,ak)\Gamma(Q_k, s_k, a_k, s_{k+1}) = r_k(s_k, a_k) + \gamma\max_a Q_k(s_{k+1}, a) - Q_k(s_k, a_k)

Key Assumptions

Assumption 1: There exists an optimal policy π\pi^* such that for QRS×AQ \in \mathbb{R}^{|S|\times|A|}: (PπPπ)(QQ)LQQ2\|(P^\pi - P^{\pi^*})(Q-Q^*)\|_\infty \leq L\|Q-Q^*\|_2^\infty

Assumption 2: {yk}k0\{y_k\}_{k \geq 0} is an irreducible and aperiodic finite-state Markov chain.

Step Size Selection

Polynomial step sizes are chosen as αk=α(k+b)β\alpha_k = \alpha(k+b)^{-\beta}, where α,b>0\alpha, b > 0, β(0.5,1)\beta \in (0.5, 1).

Rationale for this choice:

  1. Satisfies key conditions for the Polyak-Juditsky averaging scheme
  2. Constant step sizes violate conditions (i) and (iii); linear step sizes violate condition (ii)
  3. Polynomial step sizes satisfy all necessary conditions

Main Theoretical Results

Non-Asymptotic Central Limit Theorem

Theorem 4: Under Assumptions 1 and 2: W1(K1/2k=1KΔk,N~)(SA)1/2ρ(1γ)2K1/2O~((ρ(1γ))β21β+Kβ/2ρ1(1γ)1+K1β+K1β2ρ1β(1γ)β)W_1\left(K^{-1/2}\sum_{k=1}^K \Delta_k, \tilde{N}\right) \leq \frac{(|S||A|)^{1/2}}{\rho(1-\gamma)^2K^{1/2}} \cdot \tilde{O}\left((\rho(1-\gamma))^{\frac{\beta-2}{1-\beta}} + K^{\beta/2}\rho^{-1}(1-\gamma)^{-1} + K^{1-\beta} + K^{\frac{1-\beta}{2}}\rho^{-1-\beta}(1-\gamma)^{-\beta}\right)

where Δk=QkQ\Delta_k = Q_k - Q^* and N~=(A1ΣA)1/2N(0,I)\tilde{N} = (A^{-1}\Sigma A^{-\top})^{1/2}N(0,I).

Corollary 5: When β=2/3\beta = 2/3, the convergence rate simplifies to: W1(K1/2k=1KΔk,(A1ΣA)1/2N(0,I))O~((SA)1/2K1/6ρ2(1γ)3)W_1\left(K^{-1/2}\sum_{k=1}^K \Delta_k, (A^{-1}\Sigma A^{-\top})^{1/2}N(0,I)\right) \leq \tilde{O}\left(\frac{(|S||A|)^{1/2}}{K^{1/6}\rho^2(1-\gamma)^3}\right)

Functional Central Limit Theorem

Theorem 6: Under the setting of Theorem 4, the partial sum process ΦK(ζ)=K1/2k=1ζKΔk\Phi_K(\zeta) = K^{-1/2}\sum_{k=1}^{\lfloor\zeta K\rfloor}\Delta_k converges weakly on D[0,1]D[0,1] to (A1ΣA)1/2B()(A^{-1}\Sigma A^{-\top})^{1/2}B(\cdot), where B()B(\cdot) is standard Brownian motion.

Technical Innovation and Proof Strategy

Main Technical Challenges

  1. Nonlinearity: Q-learning is nonlinear SA, more complex than linear SA
  2. Markovian Noise: Asynchronous updates introduce non-i.i.d. Markovian noise
  3. Non-Smooth Operators: The empirical Bellman operator in asynchronous Q-learning is non-smooth

Proof Strategy

  1. Upper and Lower Bound Techniques: Introduces upper bound sequence Δk\Delta_k^{\uparrow} and lower bound sequence Δk\Delta_k^{\downarrow}, utilizing the squeeze theorem
  2. Term Decomposition: Decomposes k=1KΔk\sum_{k=1}^K \Delta_k into six terms:
    • Term (1): Initial error term
    • Term (2): Nonlinearity error term
    • Term (3): Markovian noise term
    • Terms (4-5): Higher-order correction terms
    • Term (6): Martingale difference sequence
  3. Poisson Equation Technique: Transforms Markovian noise into martingale difference sequences
  4. Martingale Central Limit Theorem: Applies the non-asymptotic martingale CLT from Srikant (2024)

Stochastic Approximation Theory

  • Polyak and Juditsky (1992): Classical variance reduction technique through averaging
  • Anastasiou et al. (2019): Non-asymptotic CLT for Polyak-Ruppert averaged SGD
  • Mou et al. (2020): Non-asymptotic CLT for linear SA

CLTs in Reinforcement Learning

  • Xie and Zhang (2022), Li et al. (2023): Functional CLTs for synchronous Q-learning
  • Zhang and Xie (2024): Functional CLT for constant step-size asynchronous Q-learning
  • Srikant (2024), Samsonov et al. (2024): Non-asymptotic CLTs for TD learning

Conclusions and Discussion

Main Conclusions

  1. Establishes the first non-asymptotic CLT for Q-learning with convergence rate explicitly dependent on problem parameters
  2. Proves weak convergence of the partial sum process in asynchronous Q-learning
  3. Provides theoretical foundation for uncertainty quantification in reinforcement learning

Limitations

  1. Requires strong Lipschitz assumptions (Assumption 1)
  2. Only considers finite state-action spaces
  3. Convergence rate may not be optimal

Future Directions

  1. Improve convergence rates
  2. Extend beyond 1-Wasserstein distance to other metrics
  3. Consider function approximation settings

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contribution: First establishment of non-asymptotic CLT for Q-learning, filling an important theoretical gap
  2. Technical Innovation: Cleverly combines upper/lower bound techniques, Poisson equations, and martingale CLT to overcome technical challenges
  3. Complete Results: Provides both non-asymptotic and functional CLTs
  4. Explicit Dependencies: Convergence rate clearly reflects the impact of each parameter

Weaknesses

  1. Strong Assumptions: Lipschitz assumptions may be difficult to verify in practice
  2. Convergence Rate: The K1/6K^{-1/6} convergence rate is relatively slow
  3. Finite State Spaces: Does not address continuous state spaces or function approximation

Impact

  1. Theoretical Value: Provides new tools and perspectives for Q-learning theoretical analysis
  2. Practical Significance: Establishes theoretical foundation for uncertainty quantification in reinforcement learning algorithms
  3. Methodological: Proof techniques are generalizable to other nonlinear SA problems

Applicable Scenarios

  1. Theoretical analysis of tabular reinforcement learning problems
  2. Convergence analysis of asynchronous update algorithms
  3. Statistical inference and confidence interval construction in reinforcement learning

References

  • Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging.
  • Xie, C. and Zhang, Z. (2022). A statistical online inference approach in averaged stochastic approximation.
  • Zhang, Y. and Xie, Q. (2024). Constant stepsize q-learning: Distributional convergence, bias and extrapolation.