
Convergence of actor-critic for entropy regularised MDPs in general action spaces

Zorba, Šiška, Szpruch

We prove the stability and global convergence of a coupled actor-critic gradient flow for infinite-horizon and entropy-regularised Markov decision processes (MDPs) in continuous state and action space with linear function approximation under Q-function realisability. We consider a version of the actor critic gradient flow where the critic is updated using temporal difference (TD) learning while the policy is updated using a policy mirror descent method on a separate timescale. We demonstrate stability and exponential convergence of the actor critic flow to the optimal policy. Finally, we address the interplay of the timescale separation and entropy regularisation and its effect on stability and convergence.

Basic Information

Paper ID: 2510.14898
Title: Convergence of actor-critic for entropy regularised MDPs in general action spaces
Authors: Denis Zorba, David Šiška, Lukasz Szpruch
Classification: math.OC (Optimization and Control)
Publication Date: October 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14898

Abstract

This paper establishes the stability and global convergence of coupled actor-critic gradient flows for infinite-horizon entropy-regularized Markov Decision Processes (MDPs) in continuous state and action spaces with linear function approximation and Q-function realizability conditions. The study considers an actor-critic gradient flow variant where the critic is updated using temporal difference (TD) learning, while the policy is updated using policy mirror descent methods on different timescales. The paper proves stability and exponential convergence of the actor-critic flow toward the optimal policy, and analyzes the interplay between timescale separation and entropy regularization on stability and convergence.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is the stability and convergence analysis of actor-critic methods in entropy-regularized MDPs with general action spaces (continuous or infinite). Specifically:

Stability Problem: Whether coupled updates of actor and critic under continuous-time dynamics lead to system instability
Convergence Problem: Whether the system converges to the optimal policy and, if so, at what rate
Timescale Separation: The impact of different update speeds on system performance

Research Significance

Theoretical Foundation: Provides rigorous theoretical guarantees for actor-critic algorithms widely used in practical applications
Generalization Extension: Extends existing finite action space results to continuous/infinite action spaces
Entropy Regularization: Analyzes the role of entropy regularization in promoting exploration and accelerating convergence

Limitations of Existing Methods

Action Space Restrictions: Existing convergence results for entropy-regularized MDPs are primarily limited to finite action spaces
Function Approximation Challenges: Lack of a priori bounds for function approximation in general state and action spaces
Coupled Analysis Complexity: Requires combining convex analysis tools on Euclidean spaces and measure spaces

Core Contributions

Stability Framework: Develops a Lyapunov-based stability framework capturing the interplay between entropy regularization and timescale separation
Convergence Proof: Proves convergence of actor-critic dynamics in entropy-regularized MDPs with infinite action spaces
Exponential Convergence Rate: Establishes exponential convergence rate toward the optimal policy
Continuous-Time Analysis: Analyzes the coupled updates in the continuous-time limit, which form a semi-gradient flow for the critic and an approximate Fisher-Rao gradient flow for the actor

Methodology Details

Task Definition

Consider an infinite-horizon MDP $(S,A,P,c,γ)$, where:

$S$, $A$: Polish spaces (state and action spaces)
$P \in \mathcal{P}(S|S \times A)$: state transition kernel
$c$: bounded cost function
$γ \in (0,1)$: discount factor
$τ > 0$: regularization parameter

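The entropy-regularized value function is defined as: $V^π_τ(s) = E^π_s\left[\sum_{n=0}^∞ γ^n(c(s_n,a_n) + τ \text{KL}(π(·|s_n)|μ))\right]$, where $μ$ is a fixed reference measure on $A$ and $\text{KL}$ denotes the Kullback-Leibler divergence (relative entropy).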
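For intuition, the definition above can be evaluated exactly when the state and action spaces are finite, because the value function then solves a linear system. The following is a minimal sketch under that finite-space assumption (the paper itself works in Polish spaces); the function name `soft_value` and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_value(P, c, pi, mu, gamma, tau):
    """Entropy-regularised value V^pi_tau on a finite MDP (illustrative sketch).

    P: (S, A, S) transition kernel, c: (S, A) cost, pi: (S, A) policy,
    mu: (A,) reference measure, gamma: discount, tau: regularisation.
    """
    kl = np.sum(pi * np.log(pi / mu), axis=1)      # KL(pi(.|s) | mu) for each state
    r = np.sum(pi * c, axis=1) + tau * kl          # expected regularised one-step cost
    P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state kernel under pi
    # V = r + gamma * P_pi V, solved as a linear system
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r)

# Hypothetical toy problem: 3 states, 2 actions, uniform reference measure.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
c = rng.uniform(size=(3, 2))
mu = np.full(2, 0.5)
pi = rng.dirichlet(np.ones(2), size=3)
V = soft_value(P, c, pi, mu, gamma=0.9, tau=0.1)
```

When $π = μ$ the KL term vanishes and the result is the ordinary discounted cost of the reference policy.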

Model Architecture

1. Policy Parameterization

The policy belongs to the admissible policy class $Π_μ$: $π(da|s) = \frac{\exp(f(s,a))}{\int_A \exp(f(s,a))μ(da)}μ(da)$
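On a finite action grid the integral in the denominator becomes a weighted sum, so the parameterization is a softmax tilted by the reference measure. A minimal sketch, assuming a discretised action space; `gibbs_policy` is a hypothetical helper name.

```python
import numpy as np

def gibbs_policy(f, mu):
    """pi(a|s) proportional to exp(f(s,a)) * mu(a). f: (S, A), mu: (A,)."""
    z = f - f.max(axis=1, keepdims=True)   # shift for numerical stability; cancels on normalising
    w = np.exp(z) * mu
    return w / w.sum(axis=1, keepdims=True)

mu = np.array([0.2, 0.3, 0.5])
f = np.array([[0.0, 0.0, 0.0],
              [1.0, -1.0, 0.5]])
pi = gibbs_policy(f, mu)
```

With $f ≡ 0$ the policy coincides with the reference measure, which is the first row of `pi` here.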

2. Q-Function Linear Approximation

Using a feature mapping $φ: S \times A → R^N$: $Q(s,a;θ) = ⟨θ, φ(s,a)⟩$

3. Coupled Dynamical System

Continuous-time actor-critic flow:

$\frac{dθ_t}{dt} = -η_t g(θ_t, π_t)$

$∂_t π_t(da|s) = -A_t(s,a)π_t(da|s)$

where:

$g(θ,π)$: semi-gradient of the mean squared Bellman error (MSBE)
$A_t(s,a)$: approximate soft advantage function
$η_t$: timescale separation parameter
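To make the coupled flow concrete, the sketch below applies an explicit Euler discretisation on a small finite MDP. Several ingredients are assumptions rather than statements from the paper: the semi-gradient is taken to be $g(θ,π) = \int (Q_θ - c - γ P V_θ^π) φ \, dβ$ with $V_θ^π(s) = \int (Q_θ + τ \ln\frac{dπ}{dμ}) dπ$, the approximate soft advantage is $A_t = Q_{θ_t} + τ \ln\frac{dπ_t}{dμ} - V_{θ_t}^{π_t}$, $η$ is held constant, and one-hot features are used so that realizability holds trivially. All function names are hypothetical.

```python
import numpy as np

def soft_value(P, c, pi, mu, gamma, tau):
    r = np.sum(pi * (c + tau * np.log(pi / mu)), axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r)

def soft_optimal_value(P, c, mu, gamma, tau, iters=2000):
    # Soft value iteration for cost minimisation (reference solution only).
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = c + gamma * P @ V
        V = -tau * np.log(np.sum(mu * np.exp(-Q / tau), axis=1))
    return V

def actor_critic_euler(P, c, mu, phi, beta, gamma, tau, eta, h, steps):
    S, A, N = phi.shape
    theta = np.zeros(N)
    log_ratio = np.zeros((S, A))                     # ln(dpi/dmu), i.e. pi_0 = mu
    for _ in range(steps):
        pi = mu * np.exp(log_ratio)
        Q = phi @ theta                              # linear critic, shape (S, A)
        soft = Q + tau * log_ratio
        V = np.sum(pi * soft, axis=1)
        # Critic: semi-gradient TD step on the fast timescale (factor eta).
        delta = Q - c - gamma * P @ V
        g = np.einsum("sa,sa,san->n", beta, delta, phi)
        theta = theta - h * eta * g
        # Actor: d/dt ln(dpi/dmu) = -A_t, using the critic's Q.
        log_ratio = log_ratio - h * (soft - V[:, None])
        # Renormalise to remove the O(h^2) drift of the Euler step.
        log_ratio -= np.log(np.sum(mu * np.exp(log_ratio), axis=1, keepdims=True))
    return theta, mu * np.exp(log_ratio)

rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
c = rng.uniform(size=(S, A))
mu = np.full(A, 1.0 / A)
phi = np.eye(S * A).reshape(S, A, S * A)             # one-hot (tabular) features
beta = np.full((S, A), 1.0 / (S * A))
gamma, tau = 0.5, 0.2
theta, pi = actor_critic_euler(P, c, mu, phi, beta, gamma, tau,
                               eta=20.0, h=0.02, steps=5000)
gap = np.mean(soft_value(P, c, pi, mu, gamma, tau)
              - soft_optimal_value(P, c, mu, gamma, tau))
```

Here `gap` is the suboptimality of the final policy averaged over a uniform initial distribution; the step size `h` and the constant `eta` are arbitrary illustrative values, not tuned recommendations.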

Technical Innovations

1. Fisher-Rao Gradient Flow

Models policy updates as Fisher-Rao gradient flow on the space of probability measures: $∂_t \ln\frac{dπ_t}{dμ}(s,a) = -A^{π_t}_τ(s,a)$
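A useful sanity check on this flow: if $Q$ is frozen and the soft advantage is taken to be $A = Q + τ \ln\frac{dπ}{dμ} - \int (Q + τ \ln\frac{dπ}{dμ}) dπ$ (an assumed form, standard for entropy-regularised problems), the log-density equation is linear and has the closed-form solution $π_t ∝ μ \, (dπ_0/dμ)^{e^{-τt}} \exp(-(1-e^{-τt})Q/τ)$. The sketch below compares that formula against an Euler integration on a finite action set; both function names are hypothetical.

```python
import numpy as np

def fisher_rao_closed_form(pi0, mu, Q, tau, t):
    """Closed-form policy at time t for a frozen Q (finite actions)."""
    decay = np.exp(-tau * t)
    log_w = decay * np.log(pi0 / mu) - (1.0 - decay) * Q / tau
    w = mu * np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def fisher_rao_euler(pi0, mu, Q, tau, t, steps):
    """Explicit Euler steps on ln(dpi/dmu) with renormalisation."""
    h = t / steps
    log_ratio = np.log(pi0 / mu)
    for _ in range(steps):
        pi = mu * np.exp(log_ratio)
        soft = Q + tau * log_ratio
        adv = soft - np.sum(pi * soft, axis=1, keepdims=True)
        log_ratio = log_ratio - h * adv
        log_ratio -= np.log(np.sum(mu * np.exp(log_ratio), axis=1, keepdims=True))
    return mu * np.exp(log_ratio)

rng = np.random.default_rng(1)
mu = np.full(4, 0.25)
pi0 = rng.dirichlet(np.ones(4), size=3)
Q = rng.uniform(size=(3, 4))
exact = fisher_rao_closed_form(pi0, mu, Q, tau=0.5, t=2.0)
approx = fisher_rao_euler(pi0, mu, Q, tau=0.5, t=2.0, steps=4000)
max_err = np.max(np.abs(exact - approx))
```

As $t → ∞$ the closed form tends to $μ \exp(-Q/τ)$ up to normalisation, and the influence of $π_0$ is forgotten at rate $τ$.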

2. Two-Timescale Analysis

Critic updates on fast timescale (TD learning)
Actor updates on slow timescale (policy mirror descent)

3. Lyapunov Stability Analysis

Constructs Lyapunov functions to analyze system stability, combining:

Convex analysis on Euclidean spaces
Convex analysis on measure spaces

Theoretical Analysis

Key Assumptions

Assumption 4.1 ($Q^π_τ$-Realizability): For every $π ∈ Π_μ$ there exists $θ^π ∈ R^N$ such that, for all $(s,a) ∈ S × A$: $Q^π_τ(s,a) = ⟨θ^π, φ(s,a)⟩$

Assumption 4.2: $|φ(s,a)| ≤ 1$ for all $(s,a) ∈ S × A$

Assumption 4.3: The matrix $\int_{S×A} φ(s,a)φ(s,a)^⊤ β(ds,da)$ has minimum eigenvalue $λ_β > 0$
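Assumptions 4.2 and 4.3 are straightforward to check numerically once $β$ is replaced by a weighted sample of state-action points. A minimal sketch, assuming such a finite sample; `check_feature_assumptions` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def check_feature_assumptions(phi, beta):
    """phi: (M, N) features at M state-action points; beta: (M,) weights summing to 1.

    Returns (largest feature norm, smallest eigenvalue of the second-moment matrix).
    """
    max_norm = float(np.max(np.linalg.norm(phi, axis=1)))   # Assumption 4.2 needs <= 1
    G = phi.T @ (beta[:, None] * phi)                       # sum_i beta_i phi_i phi_i^T
    lam_beta = float(np.linalg.eigvalsh(G)[0])              # eigvalsh sorts ascending
    return max_norm, lam_beta

phi = np.eye(4)               # one-hot features on 4 state-action pairs
beta = np.full(4, 0.25)
max_norm, lam_beta = check_feature_assumptions(phi, beta)
```

For one-hot features with uniform weights the second-moment matrix is $\frac{1}{4}I$, so `lam_beta` equals 0.25.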

Main Theoretical Results

Stability Theorem (Theorem 5.1)

Let $η_0 > \frac{τ}{Γ}$, where $Γ = λ_β(1-γ)(1-\sqrt{γ})$. Then there exist constants $a_1, a_2 > 0$ such that: $K_t^2 ≤ a_1 + a_2 \int_0^t e^{-τ(t-r)} K_r^2 dr$

where $K_t = \sup_{s∈S} \text{KL}(π_t(·|s)|μ)$.
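The threshold on $η_0$ and the quantity $K_t$ are both directly computable in a finite setting. The sketch below simply evaluates the formulas as stated in this summary; the function names are hypothetical, and the finite state-action arrays are an assumption made for illustration.

```python
import math
import numpy as np

def eta0_threshold(tau, lam_beta, gamma):
    """Returns tau / Gamma with Gamma = lam_beta * (1 - gamma) * (1 - sqrt(gamma))."""
    big_gamma = lam_beta * (1.0 - gamma) * (1.0 - math.sqrt(gamma))
    return tau / big_gamma

def sup_kl(pi, mu):
    """K = max over states of KL(pi(.|s) | mu). pi: (S, A), mu: (A,)."""
    return float(np.max(np.sum(pi * np.log(pi / mu), axis=1)))

threshold = eta0_threshold(tau=0.2, lam_beta=1.0 / 6.0, gamma=0.5)
K0 = sup_kl(np.full((3, 2), 0.5), np.full(2, 0.5))
```

Because $Γ$ shrinks as $γ → 1$ or $λ_β → 0$, the required timescale separation grows in poorly conditioned or weakly discounted problems, while a smaller $τ$ lowers the threshold.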

Convergence Theorem (Theorem 6.1)

For all $t > 0$ : $\min_{r∈[0,t]} V^{π_r}_τ(ρ) - V^{π^*}_τ(ρ) ≤ \frac{τ}{2(1-γ)(1-e^{-\frac{τ}{2}t})}\left(e^{-\frac{τ}{2}t}\int_S \text{KL}(π^*(·|s)|π_0(·|s))d^{π^*}_ρ(ds) + \frac{1}{2τ}\int_0^t e^{-\frac{τ}{2}(t-r)}|θ_r - θ^{π_r}|^2 dr\right)$

Exponential Convergence (Theorem 6.3)

Under appropriate conditions, there exist $η_t = η_0 e^{k_1 t}$ and constant $k_2 > 0$ such that: $\min_{r∈[0,t]} V^{π_r}_τ(ρ) - V^{π^*}_τ(ρ) ≤ \frac{τe^{-\frac{τ}{2}t}}{2(1-γ)(1-e^{-\frac{τ}{2}t})}\left(\int_S \text{KL}(π^*(·|s)|π_0(·|s))d^{π^*}_ρ(ds) + \frac{k_2}{2τ}\right)$
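To see how the right-hand side of Theorem 6.3 behaves, it can be evaluated as a function of $t$. In this sketch `kl0` stands for the integrated initial KL term and `k2` for the constant $k_2$; both are treated as given inputs, since the summary does not specify how they are computed.

```python
import math

def exponential_bound(t, tau, gamma, kl0, k2):
    """Right-hand side of the Theorem 6.3 bound as stated above."""
    decay = math.exp(-0.5 * tau * t)
    prefactor = tau * decay / (2.0 * (1.0 - gamma) * (1.0 - decay))
    return prefactor * (kl0 + k2 / (2.0 * tau))

# Hypothetical constants, chosen only to illustrate the decay.
values = [exponential_bound(t, tau=1.0, gamma=0.5, kl0=1.0, k2=2.0)
          for t in (1.0, 5.0, 10.0, 20.0)]
```

For large $t$ the factor $1 - e^{-\frac{τ}{2}t}$ is close to 1, so the bound decays like $e^{-\frac{τ}{2}t}$; for small $t$ the same factor makes the bound large.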

Key Technical Tools

1. Performance Difference Lemma

$V^π_τ(ρ) - V^{π'}_τ(ρ) = \frac{1}{1-γ}\int_S \left[\int_A (Q^{π'}_τ(s,a) + τ\ln\frac{dπ'}{dμ}(s,a))(π-π')(da|s) + τ\text{KL}(π(·|s)|π'(·|s))\right] d^π_ρ(ds)$
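The identity can be verified numerically on a small finite MDP. The sketch below assumes $Q^π_τ = c + γPV^π_τ$ and the normalised state occupancy $d^π_ρ = (1-γ)\sum_n γ^n ρP_π^n$; these are the usual conventions, stated here as assumptions, and the helper names are hypothetical.

```python
import numpy as np

def evaluate(P, c, pi, mu, gamma, tau):
    """Returns (V, Q, P_pi) for the entropy-regularised problem on a finite MDP."""
    r = np.sum(pi * (c + tau * np.log(pi / mu)), axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r)
    return V, c + gamma * P @ V, P_pi

def performance_difference_rhs(P, c, pi, pi2, mu, rho, gamma, tau):
    _, Q2, _ = evaluate(P, c, pi2, mu, gamma, tau)
    _, _, P_pi = evaluate(P, c, pi, mu, gamma, tau)
    # d = (1 - gamma) * rho (I - gamma P_pi)^{-1}, computed via the transpose system
    d = (1.0 - gamma) * np.linalg.solve((np.eye(len(rho)) - gamma * P_pi).T, rho)
    inner = np.sum((Q2 + tau * np.log(pi2 / mu)) * (pi - pi2), axis=1)
    kl = np.sum(pi * np.log(pi / pi2), axis=1)
    return float(np.sum(d * (inner + tau * kl)) / (1.0 - gamma))

rng = np.random.default_rng(2)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
c = rng.uniform(size=(S, A))
mu = np.full(A, 1.0 / A)
rho = np.full(S, 1.0 / S)
pi = rng.dirichlet(np.ones(A), size=S)
pi2 = rng.dirichlet(np.ones(A), size=S)
gamma, tau = 0.8, 0.3
V, _, _ = evaluate(P, c, pi, mu, gamma, tau)
V2, _, _ = evaluate(P, c, pi2, mu, gamma, tau)
lhs = float(rho @ (V - V2))
rhs = performance_difference_rhs(P, c, pi, pi2, mu, rho, gamma, tau)
```

The two sides agree up to floating-point error for any pair of strictly positive policies.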

2. Gronwall Inequality Application

Used to control the growth of KL divergence and parameter norms.

3. State-Action Occupancy Measure Properties

Lemma 5.1:

$d^π_{J_π β}(E) = J_π d^π_β(E)$

$d^π_β(E) - γd^π_{J_π β}(E) = (1-γ)β(E)$
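Both identities can be checked on a finite MDP once $J_π$ and $d^π_β$ are made explicit. The summary does not define them, so the sketch assumes that $J_π$ is the one-step state-action transition kernel, $J_π((s,a),(s',a')) = P(s'|s,a)π(a'|s')$ acting on measures from the right, and that $d^π_β = (1-γ)\sum_n γ^n βJ_π^n$. The function names are hypothetical.

```python
import numpy as np

def state_action_kernel(P, pi):
    """J[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s'), flattened to (S*A, S*A)."""
    S, A, _ = P.shape
    return np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)

def occupancy(beta, J, gamma):
    """d_beta = (1 - gamma) * sum_n gamma^n * beta J^n, as a row vector."""
    n = len(beta)
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * J).T, beta)

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.7
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
beta = rng.dirichlet(np.ones(S * A))
J = state_action_kernel(P, pi)

d_beta = occupancy(beta, J, gamma)
d_Jbeta = occupancy(beta @ J, J, gamma)
first_identity_err = np.max(np.abs(d_Jbeta - d_beta @ J))
second_identity_err = np.max(np.abs(d_beta - gamma * d_Jbeta - (1.0 - gamma) * beta))
```

Under these definitions the first identity holds because $J_π$ commutes with its own resolvent, and the second is the resolvent identity rearranged.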

Related Work

Unregularized Settings

Borkar & Konda (1997): Two-timescale stochastic approximation
Bhandari et al. (2021): Finite-time analysis with linear function approximation
Zhang et al. (2021): Wasserstein flows and representation learning

Entropy-Regularized Settings

Cayci et al. (2024): Natural policy gradient for finite action spaces
This paper extends these results to general action spaces

Technical Contribution Comparison

Advantages of this paper over existing work:

Handles continuous/infinite action spaces
Rigorous stability and convergence proofs
Analysis of interplay between entropy regularization and timescale separation

Conclusions and Discussion

Main Conclusions

Stability Guarantee: System remains stable under appropriate timescale separation conditions
Exponential Convergence: Exponential convergence rate toward the optimal policy
Entropy Regularization Effect: Entropy regularization ensures a unique optimal policy and accelerates convergence

Limitations

Continuous-Time Assumption: Only continuous-time dynamics are analyzed, whereas practical algorithms run in discrete time
Linear Function Approximation: Practical applications often use nonlinear neural networks
Exact Integration Assumption: Practical implementations require sampling estimates, introducing Monte Carlo errors
Q-Function Realizability: Strong assumption that may not hold in practice

Future Directions

Rigorous analysis of discrete-time algorithms
Extension to nonlinear function approximation
Handling of sampling errors
Weaker realizability conditions

In-Depth Evaluation

Strengths

Theoretical Rigor: Provides complete stability and convergence proofs
Technical Innovation: Cleverly combines Fisher-Rao geometry with Lyapunov analysis
Generality: Extends to continuous action spaces, filling theoretical gaps
Clear Presentation: Detailed mathematical derivations with clear logic

Weaknesses

Practical Limitations: Strong assumptions are difficult to satisfy in practice
Missing Experimental Validation: Pure theoretical work lacking numerical verification
Computational Complexity: Does not discuss algorithmic computational complexity
Limited Applicability: Continuous-time assumption restricts practical applications

Impact

Theoretical Contribution: Provides important theoretical foundation for entropy-regularized MDPs
Methodological Value: Analysis techniques applicable to other reinforcement learning algorithms
Future Research: Establishes foundation for research in discrete-time and more general settings

Applicable Scenarios

Theoretical Research: Provides theoretical tools and insights for other research
Algorithm Design: Guides parameter selection and convergence analysis in practical algorithms
Continuous Control: Control problems in continuous state-action spaces

References

The paper cites 25 important references, covering:

Classical actor-critic methods (Konda & Tsitsiklis, 1999)
Entropy-regularized MDPs (Kerimkulov et al., 2024)
Policy gradient methods (Schulman et al., 2015, 2017)
Function approximation theory (Bhandari et al., 2021)

Overall Assessment: This is a high-quality theoretical paper providing rigorous mathematical analysis of actor-critic methods in entropy-regularized MDPs. While it has limitations in practical application, its theoretical contributions and methodological value are significant, laying an important foundation for further development in this field.