This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.
- Paper ID: 2509.18964
- Title: Central Limit Theorems for Asynchronous Averaged Q-Learning
- Author: Xingtu Liu (Simon Fraser University)
- Classification: cs.LG math.OC stat.ML
- Conference: OPT2025: 17th Annual Workshop on Optimization for Machine Learning
- Paper Link: https://arxiv.org/abs/2509.18964
This paper establishes central limit theorems (CLTs) for Polyak-Ruppert averaged Q-learning under asynchronous updates. The authors prove a non-asymptotic CLT whose convergence rate in the Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, discount factor, and exploration quality. Additionally, a functional CLT is derived, demonstrating weak convergence of the partial sum process to Brownian motion.
- Importance of Q-Learning: Q-learning is one of the most widely used algorithms in reinforcement learning, learning optimal action-value functions directly from empirical trajectories. It has achieved tremendous success in domains such as Atari games, Go, robotic manipulation, and large language model alignment.
- Challenges in Theoretical Analysis:
- Q-learning can be interpreted as an instance of stochastic approximation (SA), but asynchronous Q-learning is a nonlinear SA problem with Markovian noise
- Compared to linear SA and temporal difference (TD) learning, Q-learning analysis is more challenging due to its nonlinearity, non-smooth operators, and non-stationary processes
- Asynchronous updates further introduce Markovian noise, increasing analytical complexity
- Limitations of Existing Work:
- Previous work established functional CLTs for synchronous Q-learning, but synchronous Q-learning only considers martingale noise
- Zhang and Xie (2024) established functional CLTs for asynchronous Q-learning with constant step sizes, but constant step sizes do not satisfy necessary conditions for establishing non-asymptotic CLTs
- Currently, no non-asymptotic CLT for Q-learning exists, even in synchronous settings
Establishing central limit theorems is crucial for understanding the statistical properties of algorithms. This asymptotic normality is significant for uncertainty quantification and statistical inference in reinforcement learning.
- First Non-Asymptotic CLT for Q-Learning: Proves a non-asymptotic CLT for asynchronous averaged Q-learning with convergence rate O~((∣S∣∣A∣)1/2K−1/6ρ−2(1−γ)−3)
- Functional Central Limit Theorem: Establishes a functional CLT for asynchronous Q-learning with decaying step sizes, showing weak convergence of the partial sum process to Brownian motion
- Explicit Dependency Relationships: The convergence rate explicitly reflects dependencies on the number of iterations K, state-action space size ∣S∣∣A∣, discount factor γ, and exploration quality ρ
- Technical Innovation: Addresses analytical challenges posed by nonlinearity, Markovian noise, and non-smooth operators
Consider an infinite-horizon discounted Markov Decision Process (MDP) M=⟨S,A,P,r,γ⟩, where:
- S: state space
- A: action space
- P:S×A→ΔS: transition probability function
- γ∈[0,1): discount factor
The goal is to learn the optimal Q-function Q∗=maxπQπ.
Asynchronous Q-learning maintains a Q-function estimator Qk with update rule:
Qk+1=Qk+αk(Fk−Qk)
where:
- Fk=F(Qk,yk), yk=(sk,ak,sk+1)
- [F(Qk,sk,ak,sk+1)](s,a)=1{(sk,ak)=(s,a)}Γ(Qk,sk,ak,sk+1)+Qk(s,a)
- Γ(Qk,sk,ak,sk+1)=rk(sk,ak)+γmaxaQk(sk+1,a)−Qk(sk,ak)
Assumption 1: There exists an optimal policy π∗ such that for Q∈R∣S∣×∣A∣:
∥(Pπ−Pπ∗)(Q−Q∗)∥∞≤L∥Q−Q∗∥2∞
Assumption 2: {yk}k≥0 is an irreducible and aperiodic finite-state Markov chain.
Polynomial step sizes are chosen as αk=α(k+b)−β, where α,b>0, β∈(0.5,1).
Rationale for this choice:
- Satisfies key conditions for the Polyak-Juditsky averaging scheme
- Constant step sizes violate conditions (i) and (iii); linear step sizes violate condition (ii)
- Polynomial step sizes satisfy all necessary conditions
Theorem 4: Under Assumptions 1 and 2:
W1(K−1/2∑k=1KΔk,N~)≤ρ(1−γ)2K1/2(∣S∣∣A∣)1/2⋅O~((ρ(1−γ))1−ββ−2+Kβ/2ρ−1(1−γ)−1+K1−β+K21−βρ−1−β(1−γ)−β)
where Δk=Qk−Q∗ and N~=(A−1ΣA−⊤)1/2N(0,I).
Corollary 5: When β=2/3, the convergence rate simplifies to:
W1(K−1/2∑k=1KΔk,(A−1ΣA−⊤)1/2N(0,I))≤O~(K1/6ρ2(1−γ)3(∣S∣∣A∣)1/2)
Theorem 6: Under the setting of Theorem 4, the partial sum process ΦK(ζ)=K−1/2∑k=1⌊ζK⌋Δk converges weakly on D[0,1] to (A−1ΣA−⊤)1/2B(⋅), where B(⋅) is standard Brownian motion.
- Nonlinearity: Q-learning is nonlinear SA, more complex than linear SA
- Markovian Noise: Asynchronous updates introduce non-i.i.d. Markovian noise
- Non-Smooth Operators: The empirical Bellman operator in asynchronous Q-learning is non-smooth
- Upper and Lower Bound Techniques: Introduces upper bound sequence Δk↑ and lower bound sequence Δk↓, utilizing the squeeze theorem
- Term Decomposition: Decomposes ∑k=1KΔk into six terms:
- Term (1): Initial error term
- Term (2): Nonlinearity error term
- Term (3): Markovian noise term
- Terms (4-5): Higher-order correction terms
- Term (6): Martingale difference sequence
- Poisson Equation Technique: Transforms Markovian noise into martingale difference sequences
- Martingale Central Limit Theorem: Applies the non-asymptotic martingale CLT from Srikant (2024)
- Polyak and Juditsky (1992): Classical variance reduction technique through averaging
- Anastasiou et al. (2019): Non-asymptotic CLT for Polyak-Ruppert averaged SGD
- Mou et al. (2020): Non-asymptotic CLT for linear SA
- Xie and Zhang (2022), Li et al. (2023): Functional CLTs for synchronous Q-learning
- Zhang and Xie (2024): Functional CLT for constant step-size asynchronous Q-learning
- Srikant (2024), Samsonov et al. (2024): Non-asymptotic CLTs for TD learning
- Establishes the first non-asymptotic CLT for Q-learning with convergence rate explicitly dependent on problem parameters
- Proves weak convergence of the partial sum process in asynchronous Q-learning
- Provides theoretical foundation for uncertainty quantification in reinforcement learning
- Requires strong Lipschitz assumptions (Assumption 1)
- Only considers finite state-action spaces
- Convergence rate may not be optimal
- Improve convergence rates
- Extend beyond 1-Wasserstein distance to other metrics
- Consider function approximation settings
- Significant Theoretical Contribution: First establishment of non-asymptotic CLT for Q-learning, filling an important theoretical gap
- Technical Innovation: Cleverly combines upper/lower bound techniques, Poisson equations, and martingale CLT to overcome technical challenges
- Complete Results: Provides both non-asymptotic and functional CLTs
- Explicit Dependencies: Convergence rate clearly reflects the impact of each parameter
- Strong Assumptions: Lipschitz assumptions may be difficult to verify in practice
- Convergence Rate: The K−1/6 convergence rate is relatively slow
- Finite State Spaces: Does not address continuous state spaces or function approximation
- Theoretical Value: Provides new tools and perspectives for Q-learning theoretical analysis
- Practical Significance: Establishes theoretical foundation for uncertainty quantification in reinforcement learning algorithms
- Methodological: Proof techniques are generalizable to other nonlinear SA problems
- Theoretical analysis of tabular reinforcement learning problems
- Convergence analysis of asynchronous update algorithms
- Statistical inference and confidence interval construction in reinforcement learning
- Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging.
- Xie, C. and Zhang, Z. (2022). A statistical online inference approach in averaged stochastic approximation.
- Zhang, Y. and Xie, Q. (2024). Constant stepsize q-learning: Distributional convergence, bias and extrapolation.