Convergence of actor-critic for entropy regularised MDPs in general action spaces
Zorba, Šiška, Szpruch
We prove the stability and global convergence of a coupled actor-critic gradient flow for infinite-horizon and entropy-regularised Markov decision processes (MDPs) in continuous state and action space with linear function approximation under Q-function realisability. We consider a version of the actor critic gradient flow where the critic is updated using temporal difference (TD) learning while the policy is updated using a policy mirror descent method on a separate timescale. We demonstrate stability and exponential convergence of the actor critic flow to the optimal policy. Finally, we address the interplay of the timescale separation and entropy regularisation and its effect on stability and convergence.
This paper establishes the stability and global convergence of coupled actor-critic gradient flows for infinite-horizon entropy-regularized Markov Decision Processes (MDPs) in continuous state and action spaces with linear function approximation and Q-function realizability conditions. The study considers an actor-critic gradient flow variant where the critic is updated using temporal difference (TD) learning, while the policy is updated using policy mirror descent methods on different timescales. The paper proves stability and exponential convergence of the actor-critic flow toward the optimal policy, and analyzes the interplay between timescale separation and entropy regularization on stability and convergence.
The core problem addressed in this paper is the stability and convergence analysis of actor-critic methods in entropy-regularized MDPs with general action spaces (continuous or infinite). Specifically:
Stability Problem: Whether coupled updates of actor and critic under continuous-time dynamics lead to system instability
Convergence Problem: Whether the system converges to the optimal policy and, if so, at what rate
Timescale Separation: How the difference in update speeds between actor and critic affects stability and convergence
Continuous-Time Analysis: Analyzes the coupled updates in the continuous-time limit, which yields a semi-gradient flow for the critic and an approximate Fisher-Rao gradient flow for the actor
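To make the coupled dynamics concrete, the following is a minimal sketch of an Euler discretisation of such a flow on a small finite MDP. It is an illustration under stated assumptions, not the paper's algorithm. The MDP is random and tabular. The features are one-hot, so Q-function realisability holds trivially. The critic rate `eta` is held constant rather than following a schedule. The objective is treated as a cost to be minimised. All function and variable names are hypothetical.

```python
import numpy as np


def soft_optimal_value(c, P, gamma, tau, iters=2000):
    """Soft value iteration for the entropy-regularised cost-minimisation problem."""
    V = np.zeros(c.shape[0])
    for _ in range(iters):
        Q = c + gamma * P @ V
        # Soft minimum over actions: V(s) = -tau * log sum_a exp(-Q(s, a) / tau)
        V = -tau * np.log(np.exp(-Q / tau).sum(axis=1))
    return V


def policy_value(pi, c, P, gamma, tau):
    """Exact regularised value of a fixed policy, via a linear solve."""
    r_pi = (pi * (c + tau * np.log(pi))).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def actor_critic_flow(c, P, gamma, tau, eta=20.0, dt=0.01, steps=5000):
    """Euler steps of a coupled flow: TD critic (fast) and mirror-descent actor (slow)."""
    n_states, n_actions = c.shape
    theta = np.zeros((n_states, n_actions))  # critic weights (one-hot features)
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))  # uniform policy
    for _ in range(steps):
        pi = np.exp(log_pi)
        # Regularised state value implied by the current critic and policy
        V = (pi * (theta + tau * log_pi)).sum(axis=1)
        # Critic: semi-gradient TD step (the bootstrapped target is held fixed)
        new_theta = theta + dt * eta * (c + gamma * P @ V - theta)
        # Actor: mirror-descent step on log-densities, then renormalise per state
        log_pi = log_pi - dt * (theta + tau * log_pi)
        log_pi -= np.log(np.exp(log_pi).sum(axis=1, keepdims=True))
        theta = new_theta
    return np.exp(log_pi), theta


rng = np.random.default_rng(0)
n_states, n_actions, gamma, tau = 3, 2, 0.8, 0.5
c = rng.uniform(size=(n_states, n_actions))                       # costs in [0, 1]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
rho = np.full(n_states, 1.0 / n_states)                           # initial distribution

pi, theta = actor_critic_flow(c, P, gamma, tau)
gap = rho @ (policy_value(pi, c, P, gamma, tau) - soft_optimal_value(c, P, gamma, tau))
print(f"value gap after the flow: {gap:.2e}")
```

The ratio between `eta` and the actor's unit speed plays the role of the timescale separation, and `tau` sets how strongly the policy is pulled toward the soft-greedy policy of the current critic. Varying either one gives a feel for the interplay the analysis studies.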
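Under appropriate conditions, there exist $\eta_t = \eta_0 e^{k_1 t}$ and a constant $k_2 > 0$ such that:
$$\min_{r \in [0,t]} V_\tau^{\pi_r}(\rho) - V_\tau^{\pi^*}(\rho) \le \frac{\tau e^{-\frac{\tau}{2} t}}{2(1-\gamma)\left(1 - e^{-\frac{\tau}{2} t}\right)} \left( \int_S \mathrm{KL}\big(\pi^*(\cdot \mid s) \,\big|\, \pi_0(\cdot \mid s)\big)\, d_\rho^{\pi^*}(ds) + \frac{k_2}{2\tau} \right)$$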
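As a reading aid, the time-dependent prefactor of the bound can be simplified. This is a sketch that assumes the prefactor is a ratio of the form shown below. The symbol $c > 0$ is a placeholder introduced here for the coefficient of $t$ in the exponent and is not notation from the paper. Multiplying numerator and denominator by $e^{ct}$ gives

$$\frac{\tau e^{-ct}}{2(1-\gamma)\left(1 - e^{-ct}\right)} = \frac{\tau}{2(1-\gamma)\left(e^{ct} - 1\right)}.$$

For small $t$, $e^{ct} - 1 \approx ct$, so the prefactor behaves like $\frac{\tau}{2(1-\gamma)\,c\,t}$. For large $t$ it behaves like $\frac{\tau}{2(1-\gamma)} e^{-ct}$, which is the exponential convergence stated in the summary.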
Entropy-regularized MDPs (Kerimkulov et al., 2024)
Policy gradient methods (Schulman et al., 2015, 2017)
Function approximation theory (Bhandari et al., 2021)
Overall Assessment: This is a high-quality theoretical paper providing rigorous mathematical analysis of actor-critic methods in entropy-regularized MDPs. While it has limitations in practical application, its theoretical contributions and methodological value are significant, laying an important foundation for further development in this field.