
Central Limit Theorems for Asynchronous Averaged Q-Learning

Liu

This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.



Basic Information

Paper ID: 2509.18964
Title: Central Limit Theorems for Asynchronous Averaged Q-Learning
Author: Xingtu Liu (Simon Fraser University)
Classification: cs.LG math.OC stat.ML
Conference: OPT2025: 17th Annual Workshop on Optimization for Machine Learning
Paper Link: https://arxiv.org/abs/2509.18964

Abstract

This paper establishes central limit theorems (CLTs) for Polyak-Ruppert averaged Q-learning under asynchronous updates. The authors prove a non-asymptotic CLT whose convergence rate in the Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, discount factor, and exploration quality. Additionally, a functional CLT is derived, demonstrating weak convergence of the partial sum process to Brownian motion.

Research Background and Motivation

Problem Background

Importance of Q-Learning: Q-learning is one of the most widely used algorithms in reinforcement learning, learning optimal action-value functions directly from empirical trajectories. It has achieved tremendous success in domains such as Atari games, Go, robotic manipulation, and large language model alignment.
Challenges in Theoretical Analysis:
- Q-learning can be interpreted as an instance of stochastic approximation (SA), but asynchronous Q-learning is a nonlinear SA problem with Markovian noise
- Compared to linear SA and temporal difference (TD) learning, Q-learning analysis is more challenging due to its nonlinearity, non-smooth operators, and non-stationary processes
- Asynchronous updates further introduce Markovian noise, increasing analytical complexity
Limitations of Existing Work:
- Previous work established functional CLTs for synchronous Q-learning, but synchronous Q-learning only considers martingale noise
- Zhang and Xie (2024) established functional CLTs for asynchronous Q-learning with constant step sizes, but constant step sizes do not satisfy necessary conditions for establishing non-asymptotic CLTs
- Prior to this work, no non-asymptotic CLT for Q-learning existed, even in the synchronous setting

Research Motivation

Establishing central limit theorems is crucial for understanding the statistical properties of algorithms. The asymptotic normality of the averaged iterates is what makes uncertainty quantification and statistical inference (for example, confidence intervals for $Q^*$) possible in reinforcement learning.

Core Contributions

First Non-Asymptotic CLT for Q-Learning: Proves a non-asymptotic CLT for asynchronous averaged Q-learning with convergence rate $\tilde{O}((|S||A|)^{1/2}K^{-1/6}\rho^{-2}(1-\gamma)^{-3})$
Functional Central Limit Theorem: Establishes a functional CLT for asynchronous Q-learning with decaying step sizes, showing weak convergence of the partial sum process to Brownian motion
Explicit Dependency Relationships: The convergence rate explicitly reflects dependencies on the number of iterations $K$, state-action space size $|S||A|$, discount factor $\gamma$, and exploration quality $\rho$
Technical Innovation: Addresses analytical challenges posed by nonlinearity, Markovian noise, and non-smooth operators

Methodology Details

Problem Formulation

Consider an infinite-horizon discounted Markov decision process (MDP) $M = \langle S, A, P, r, \gamma \rangle$, where:

$S$: state space
$A$: action space
$P: S \times A \rightarrow \Delta_S$: transition probability function
$r$: reward function
$\gamma \in [0,1)$: discount factor

The goal is to learn the optimal Q-function $Q^* = \max_\pi Q^\pi$, where the maximum is taken entrywise over policies $\pi$.
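
For orientation, $Q^*$ can also be characterized as the fixed point of the Bellman optimality operator. The equation below is the standard textbook form, written under the assumption that the reward $r(s,a)$ is deterministic; the paper's exact reward model is not restated in this summary.

$Q^*(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[\max_{a'} Q^*(s',a')\right]$ for all $(s,a) \in S \times A$

Q-learning can be read as a stochastic approximation scheme for solving this fixed-point equation from sampled transitions.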

Asynchronous Q-Learning Algorithm

Asynchronous Q-learning maintains a Q-function estimator $Q_k$ with update rule: $Q_{k+1} = Q_k + \alpha_k(F_k - Q_k)$

where:

$F_k = F(Q_k, y_k)$, $y_k = (s_k, a_k, s_{k+1})$
$[F(Q_k, s_k, a_k, s_{k+1})](s,a) = \mathbf{1}_{\{(s_k,a_k)=(s,a)\}}\Gamma(Q_k, s_k, a_k, s_{k+1}) + Q_k(s,a)$
$\Gamma(Q_k, s_k, a_k, s_{k+1}) = r_k(s_k, a_k) + \gamma\max_a Q_k(s_{k+1}, a) - Q_k(s_k, a_k)$
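
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the 2-state, 2-action MDP, the uniform behaviour policy, and the function name `async_q_learning` are all assumptions made for the example. It shows that only the visited entry $(s_k, a_k)$ changes at each step, and it maintains the Polyak-Ruppert average of the iterates alongside the last iterate.

```python
import random


def async_q_learning(P, r, gamma, K, alpha=1.0, b=1.0, beta=2 / 3, seed=0):
    """Run asynchronous Q-learning along a single trajectory.

    Returns the last iterate Q_K and the running average of Q_1..Q_K.
    """
    rng = random.Random(seed)
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    Q_bar = [[0.0] * n_a for _ in range(n_s)]
    s = 0
    for k in range(K):
        a = rng.randrange(n_a)  # uniform behaviour policy (assumption)
        s_next = rng.choices(range(n_s), weights=P[s][a])[0]
        step = alpha * (k + b) ** (-beta)  # polynomial step size alpha_k
        # Gamma(Q_k, s_k, a_k, s_{k+1}): the temporal-difference term
        td = r[s][a] + gamma * max(Q[s_next]) - Q[s][a]
        # Asynchronous update: only the visited entry (s_k, a_k) changes
        Q[s][a] += step * td
        # Running average of the iterates (Polyak-Ruppert averaging)
        for i in range(n_s):
            for j in range(n_a):
                Q_bar[i][j] += (Q[i][j] - Q_bar[i][j]) / (k + 1)
        s = s_next
    return Q, Q_bar


# Hypothetical 2-state, 2-action MDP used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
r = [[1.0, 0.0], [0.0, 0.5]]
Q_last, Q_avg = async_q_learning(P, r, gamma=0.5, K=20000)
print(Q_avg)
```

Because the rewards lie in $[0,1]$ and the step sizes never exceed 1 here, every iterate stays in $[0, 1/(1-\gamma)]$, which is a quick sanity check on any implementation.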

Key Assumptions

Assumption 1: There exists an optimal policy $\pi^*$ such that for all $Q \in \mathbb{R}^{|S|\times|A|}$: $\|(P^\pi - P^{\pi^*})(Q-Q^*)\|_\infty \leq L\|Q-Q^*\|_\infty^2$, where $\pi$ denotes the greedy policy with respect to $Q$.

Assumption 2: $\{y_k\}_{k \geq 0}$ is an irreducible and aperiodic finite-state Markov chain.

Step Size Selection

Polynomial step sizes are chosen as $\alpha_k = \alpha(k+b)^{-\beta}$, where $\alpha, b > 0$ and $\beta \in (0.5, 1)$.

Rationale for this choice:

Satisfies key conditions for the Polyak-Juditsky averaging scheme
Constant step sizes violate conditions (i) and (iii); linear step sizes violate condition (ii)
Polynomial step sizes satisfy all necessary conditions
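
The contrast between the three schedules can be checked numerically. Conditions (i) to (iii) are not restated in this summary, so the sketch below does not map onto them one by one. It only checks two classical requirements from the averaging literature, namely that the step size vanishes and that the relative decrease $(\alpha_k - \alpha_{k+1})/\alpha_k^2$ tends to zero. The helper names are hypothetical.

```python
def poly_step(k, alpha=1.0, b=1.0, beta=2 / 3):
    """Polynomial step size alpha_k = alpha * (k + b)^(-beta)."""
    return alpha * (k + b) ** (-beta)


def linear_step(k):
    """Linearly decaying step size alpha_k = 1 / (k + 1), i.e. beta = 1."""
    return 1.0 / (k + 1)


def constant_step(k, alpha=0.1):
    """Constant step size: never decays."""
    return alpha


def relative_decrease(step, k):
    """(alpha_k - alpha_{k+1}) / alpha_k^2; should vanish as k grows."""
    a0, a1 = step(k), step(k + 1)
    return (a0 - a1) / a0 ** 2


for name, step in [("polynomial", poly_step),
                   ("linear", linear_step),
                   ("constant", constant_step)]:
    k = 10 ** 6
    print(name, step(k), relative_decrease(step, k))
```

In this check the polynomial schedule passes both requirements, the constant schedule fails to vanish, and the linear schedule keeps a relative decrease close to 1 rather than tending to 0.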

Main Theoretical Results

Non-Asymptotic Central Limit Theorem

Theorem 4: Under Assumptions 1 and 2: $W_1\left(K^{-1/2}\sum_{k=1}^K \Delta_k, \tilde{N}\right) \leq \frac{(|S||A|)^{1/2}}{\rho(1-\gamma)^2K^{1/2}} \cdot \tilde{O}\left((\rho(1-\gamma))^{\frac{\beta-2}{1-\beta}} + K^{\beta/2}\rho^{-1}(1-\gamma)^{-1} + K^{1-\beta} + K^{\frac{1-\beta}{2}}\rho^{-1-\beta}(1-\gamma)^{-\beta}\right)$

where $\Delta_k = Q_k - Q^*$ and $\tilde{N} = (A^{-1}\Sigma A^{-\top})^{1/2}N(0,I)$.

Corollary 5: When $\beta = 2/3$, the convergence rate simplifies to: $W_1\left(K^{-1/2}\sum_{k=1}^K \Delta_k, (A^{-1}\Sigma A^{-\top})^{1/2}N(0,I)\right) \leq \tilde{O}\left(\frac{(|S||A|)^{1/2}}{K^{1/6}\rho^2(1-\gamma)^3}\right)$
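
The step from Theorem 4 to Corollary 5 can be reproduced by direct substitution. The following is a sketch of that arithmetic, assuming $\rho \in (0,1]$ and ignoring logarithmic factors; it is a reading of the stated bound rather than the paper's own derivation. With $\beta = 2/3$, multiplying each of the four terms inside $\tilde{O}(\cdot)$ by the prefactor $(|S||A|)^{1/2}\rho^{-1}(1-\gamma)^{-2}K^{-1/2}$ gives (omitting the common factor $(|S||A|)^{1/2}$):

$(\rho(1-\gamma))^{\frac{\beta-2}{1-\beta}} = (\rho(1-\gamma))^{-4} \;\Rightarrow\; K^{-1/2}\rho^{-5}(1-\gamma)^{-6}$
$K^{\beta/2}\rho^{-1}(1-\gamma)^{-1} = K^{1/3}\rho^{-1}(1-\gamma)^{-1} \;\Rightarrow\; K^{-1/6}\rho^{-2}(1-\gamma)^{-3}$
$K^{1-\beta} = K^{1/3} \;\Rightarrow\; K^{-1/6}\rho^{-1}(1-\gamma)^{-2}$
$K^{\frac{1-\beta}{2}}\rho^{-1-\beta}(1-\gamma)^{-\beta} = K^{1/6}\rho^{-5/3}(1-\gamma)^{-2/3} \;\Rightarrow\; K^{-1/3}\rho^{-8/3}(1-\gamma)^{-8/3}$

The second term dominates the third because $\rho(1-\gamma) \leq 1$, and it dominates the first and fourth once $K$ is large enough (for the first term, once $K \geq (\rho(1-\gamma))^{-9}$). This leaves the rate $K^{-1/6}\rho^{-2}(1-\gamma)^{-3}$ of Corollary 5. The value $\beta = 2/3$ is the one that balances the two slowest exponents in $K$: after the prefactor they are $(\beta-1)/2$ and $1/2-\beta$, which coincide exactly when $\beta/2 = 1-\beta$.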

Functional Central Limit Theorem

Theorem 6: Under the setting of Theorem 4, the partial sum process $\Phi_K(\zeta) = K^{-1/2}\sum_{k=1}^{\lfloor\zeta K\rfloor}\Delta_k$ converges weakly on $D[0,1]$ to $(A^{-1}\Sigma A^{-\top})^{1/2}B(\cdot)$, where $B(\cdot)$ is a standard Brownian motion.
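
To make the object in Theorem 6 concrete, the sketch below evaluates $\Phi_K(\zeta)$ on a grid for a single coordinate of $\Delta_k$. The function name `partial_sum_process` is hypothetical, and the demonstration feeds it i.i.d. standard Gaussian draws as a stand-in for the actual Q-learning errors, purely to show the shape of the computation.

```python
import math
import random


def partial_sum_process(deltas, zetas):
    """Evaluate Phi_K(zeta) = K^{-1/2} * sum_{k=1}^{floor(zeta*K)} Delta_k.

    `deltas` holds one scalar coordinate of Delta_1..Delta_K.
    """
    K = len(deltas)
    prefix = [0.0]  # prefix[m] = Delta_1 + ... + Delta_m
    for d in deltas:
        prefix.append(prefix[-1] + d)
    return [prefix[math.floor(z * K)] / math.sqrt(K) for z in zetas]


# Surrogate data (assumption): i.i.d. N(0, 1) instead of real errors.
rng = random.Random(0)
surrogate = [rng.gauss(0.0, 1.0) for _ in range(10000)]
path = partial_sum_process(surrogate, [i / 10 for i in range(11)])
print(path)
```

Under a functional CLT, the whole path $\zeta \mapsto \Phi_K(\zeta)$ behaves like a scaled Brownian motion, so for instance the variance of $\Phi_K(\zeta)$ grows roughly linearly in $\zeta$. This is the property that downstream online inference procedures rely on.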

Technical Innovation and Proof Strategy

Main Technical Challenges

Nonlinearity: Q-learning is nonlinear SA, more complex than linear SA
Markovian Noise: Asynchronous updates introduce non-i.i.d. Markovian noise
Non-Smooth Operators: The empirical Bellman operator in asynchronous Q-learning is non-smooth

Proof Strategy

Upper and Lower Bound Techniques: Introduces upper bound sequence $\Delta_k^{\uparrow}$ and lower bound sequence $\Delta_k^{\downarrow}$, utilizing the squeeze theorem
Term Decomposition: Decomposes $\sum_{k=1}^K \Delta_k$ into six terms:
- Term (1): Initial error term
- Term (2): Nonlinearity error term
- Term (3): Markovian noise term
- Terms (4-5): Higher-order correction terms
- Term (6): Martingale difference sequence
Poisson Equation Technique: Transforms Markovian noise into martingale difference sequences
Martingale Central Limit Theorem: Applies the non-asymptotic martingale CLT from Srikant (2024)
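
The Poisson equation step can be illustrated on a toy chain. The sketch below is a generic, time-homogeneous version of the technique, not the paper's construction: the two-state chain, the test function `f`, and the helper names are assumptions for the example. For a chain with transition matrix $P$ and stationary distribution $\mu$, it solves $h - Ph = f - \mu(f)$ and then uses the identity $f(y_k) - \mu(f) = [h(y_{k+1}) - (Ph)(y_k)] + [h(y_k) - h(y_{k+1})]$, in which the first bracket is a martingale difference and the second telescopes when summed over $k$.

```python
def stationary_two_state(p, q):
    """Stationary distribution of the chain [[1-p, p], [q, 1-q]]."""
    return [q / (p + q), p / (p + q)]


def apply_kernel(P, h):
    """Compute (P h)(i) = sum_j P[i][j] * h[j]."""
    return [sum(P[i][j] * h[j] for j in range(len(h))) for i in range(len(P))]


def solve_poisson(P, f, mu, n_iter=500):
    """Solve h - P h = f - mu(f) via the series h = sum_m P^m (f - mu(f))."""
    mean_f = sum(m * v for m, v in zip(mu, f))
    f_bar = [v - mean_f for v in f]
    h = [0.0] * len(f)
    for _ in range(n_iter):
        # Fixed-point iteration h <- f_bar + P h accumulates the series
        h = [fb + ph for fb, ph in zip(f_bar, apply_kernel(P, h))]
    return h, f_bar


p, q = 0.3, 0.2  # hypothetical transition probabilities
P_chain = [[1 - p, p], [q, 1 - q]]
mu = stationary_two_state(p, q)
h, f_bar = solve_poisson(P_chain, [1.0, 0.0], mu)
Ph = apply_kernel(P_chain, h)

# Residual of the Poisson equation h - P h = f_bar (should be ~0)
residual = max(abs(h[i] - Ph[i] - f_bar[i]) for i in range(2))
# Conditional mean of the martingale term h(y_{k+1}) - (P h)(y_k) given y_k = i
cond_mean = [sum(P_chain[i][j] * (h[j] - Ph[i]) for j in range(2)) for i in range(2)]
print(h, residual, cond_mean)
```

The fixed-point iteration converges here because the centered function decays geometrically under $P$ for an irreducible, aperiodic finite chain, which is the setting of Assumption 2.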

Related Work

Stochastic Approximation Theory

Polyak and Juditsky (1992): Classical variance reduction technique through averaging
Anastasiou et al. (2019): Non-asymptotic CLT for Polyak-Ruppert averaged SGD
Mou et al. (2020): Non-asymptotic CLT for linear SA

CLTs in Reinforcement Learning

Xie and Zhang (2022), Li et al. (2023): Functional CLTs for synchronous Q-learning
Zhang and Xie (2024): Functional CLT for constant step-size asynchronous Q-learning
Srikant (2024), Samsonov et al. (2024): Non-asymptotic CLTs for TD learning

Conclusions and Discussion

Main Conclusions

Establishes the first non-asymptotic CLT for Q-learning with convergence rate explicitly dependent on problem parameters
Proves weak convergence of the partial sum process in asynchronous Q-learning
Provides theoretical foundation for uncertainty quantification in reinforcement learning

Limitations

Requires strong Lipschitz assumptions (Assumption 1)
Only considers finite state-action spaces
Convergence rate may not be optimal

Future Directions

Improve convergence rates
Extend beyond 1-Wasserstein distance to other metrics
Consider function approximation settings

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: Establishes the first non-asymptotic CLT for Q-learning, filling an important theoretical gap
Technical Innovation: Cleverly combines upper/lower bound techniques, Poisson equations, and martingale CLT to overcome technical challenges
Complete Results: Provides both non-asymptotic and functional CLTs
Explicit Dependencies: Convergence rate clearly reflects the impact of each parameter

Weaknesses

Strong Assumptions: Lipschitz assumptions may be difficult to verify in practice
Convergence Rate: The $K^{-1/6}$ convergence rate is relatively slow
Finite State Spaces: Does not address continuous state spaces or function approximation

Impact

Theoretical Value: Provides new tools and perspectives for Q-learning theoretical analysis
Practical Significance: Establishes theoretical foundation for uncertainty quantification in reinforcement learning algorithms
Methodological: Proof techniques are generalizable to other nonlinear SA problems

Applicable Scenarios

Theoretical analysis of tabular reinforcement learning problems
Convergence analysis of asynchronous update algorithms
Statistical inference and confidence interval construction in reinforcement learning

References

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging.
Xie, C. and Zhang, Z. (2022). A statistical online inference approach in averaged stochastic approximation.
Zhang, Y. and Xie, Q. (2024). Constant stepsize Q-learning: Distributional convergence, bias and extrapolation.