Convergence of actor-critic for entropy regularised MDPs in general action spaces

Zorba, Šiška, Szpruch
We prove the stability and global convergence of a coupled actor-critic gradient flow for infinite-horizon and entropy-regularised Markov decision processes (MDPs) in continuous state and action space with linear function approximation under Q-function realisability. We consider a version of the actor critic gradient flow where the critic is updated using temporal difference (TD) learning while the policy is updated using a policy mirror descent method on a separate timescale. We demonstrate stability and exponential convergence of the actor critic flow to the optimal policy. Finally, we address the interplay of the timescale separation and entropy regularisation and its effect on stability and convergence.

Basic Information

  • Paper ID: 2510.14898
  • Title: Convergence of actor-critic for entropy regularised MDPs in general action spaces
  • Authors: Denis Zorba, David Šiška, Lukasz Szpruch
  • Classification: math.OC (Optimization and Control)
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14898

Abstract

This paper establishes the stability and global convergence of coupled actor-critic gradient flows for infinite-horizon entropy-regularized Markov Decision Processes (MDPs) in continuous state and action spaces with linear function approximation and Q-function realizability conditions. The study considers an actor-critic gradient flow variant where the critic is updated using temporal difference (TD) learning, while the policy is updated using policy mirror descent methods on different timescales. The paper proves stability and exponential convergence of the actor-critic flow toward the optimal policy, and analyzes the interplay between timescale separation and entropy regularization on stability and convergence.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is the stability and convergence analysis of actor-critic methods in entropy-regularized MDPs with general action spaces (continuous or infinite). Specifically:

  1. Stability Problem: Whether the coupled actor and critic updates remain stable under continuous-time dynamics
  2. Convergence Problem: Whether the system converges to the optimal policy, and at what rate
  3. Timescale Separation: How the relative update speeds of actor and critic affect stability and convergence

Research Significance

  1. Theoretical Foundation: Provides rigorous theoretical guarantees for actor-critic algorithms widely used in practical applications
  2. Generalization Extension: Extends existing finite action space results to continuous/infinite action spaces
  3. Entropy Regularization: Analyzes the role of entropy regularization in promoting exploration and accelerating convergence

Limitations of Existing Methods

  1. Action Space Restrictions: Existing convergence results for entropy-regularized MDPs are primarily limited to finite action spaces
  2. Function Approximation Challenges: Lack of a priori bounds on the function approximation in general state and action spaces
  3. Coupled Analysis Complexity: Requires combining convex analysis tools on Euclidean spaces and measure spaces

Core Contributions

  1. Stability Framework: Develops a Lyapunov-based stability framework capturing the interplay between entropy regularization and timescale separation
  2. Convergence Proof: Proves convergence of actor-critic dynamics in entropy-regularized MDPs with infinite action spaces
  3. Exponential Convergence Rate: Establishes exponential convergence rate toward the optimal policy
  4. Continuous-Time Analysis: Analyzes coupled updates in the continuous-time limit, forming a semi-gradient flow for the critic and approximate Fisher-Rao gradient flow for the actor

Methodology Details

Task Definition

Consider an infinite-horizon MDP $(S, A, P, c, \gamma)$, where:

  • $S$, $A$: Polish spaces (state and action spaces)
  • $P \in \mathcal{P}(S \mid S \times A)$: state transition kernel
  • $c$: bounded cost function
  • $\gamma \in (0,1)$: discount factor
  • $\tau > 0$: regularization parameter

The entropy-regularized value function is defined as:

$$V^\pi_\tau(s) = \mathbb{E}^\pi_s\left[\sum_{n=0}^\infty \gamma^n \left(c(s_n,a_n) + \tau\,\mathrm{KL}(\pi(\cdot|s_n)\,|\,\mu)\right)\right]$$
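The definition can be made concrete on a toy problem. The following is a minimal sketch, assuming finite state and action sets (the paper itself works in Polish spaces) and evaluating the regularized value by fixed-point iteration of the recursion implied by the definition. The function names and the example MDP are hypothetical.

```python
import math

def kl(p, q):
    """KL(p | q) for finite distributions given as lists (q > 0 wherever p > 0)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def soft_value(P, c, policy, mu, gamma, tau, iters=2000):
    """Iterate V(s) = sum_a pi(a|s) [c(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    + tau * KL(pi(.|s) | mu), which is a gamma-contraction."""
    n_s, n_a = len(c), len(c[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [
            sum(policy[s][a] * (c[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s)))
                for a in range(n_a))
            + tau * kl(policy[s], mu)
            for s in range(n_s)
        ]
    return V

# Hypothetical 2-state, 2-action example; P[s][a][s'] is the transition kernel.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
c = [[1.0, 0.0], [0.5, 2.0]]
mu = [0.5, 0.5]                      # reference measure on the actions
pi = [[0.9, 0.1], [0.3, 0.7]]
print(soft_value(P, c, pi, mu, gamma=0.9, tau=0.1))
```

When the policy equals the reference measure, the KL term vanishes and the value reduces to the ordinary discounted cost.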

Model Architecture

1. Policy Parameterization

The policy belongs to the admissible policy class $\Pi_\mu$:

$$\pi(da|s) = \frac{\exp(f(s,a))}{\int_A \exp(f(s,a))\,\mu(da)}\,\mu(da)$$
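On a finite action grid this policy class is a softmax weighted by the reference measure. A minimal sketch, assuming a discretized action set; `gibbs_policy` is a hypothetical helper name:

```python
import math

def gibbs_policy(f_row, mu):
    """Return pi(a|s) proportional to exp(f(s,a)) * mu(a) for one state s.

    The maximum of f is subtracted before exponentiating; this leaves the
    normalized result unchanged and avoids overflow for large f.
    """
    m = max(f_row)
    weights = [math.exp(f - m) * q for f, q in zip(f_row, mu)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical example with three actions and a non-uniform reference measure.
print(gibbs_policy([0.0, 1.0, -1.0], [0.2, 0.3, 0.5]))
```

A constant $f$ returns the reference measure itself, so $\mu$ acts as the default policy that the KL penalty pulls towards.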

2. Q-Function Linear Approximation

Using a feature mapping $\varphi: S \times A \to \mathbb{R}^N$:

$$Q(s,a;\theta) = \langle \theta, \varphi(s,a) \rangle$$

3. Coupled Dynamical System

Continuous-time actor-critic flow:

$$\frac{d\theta_t}{dt} = -\eta_t\, g(\theta_t, \pi_t), \qquad \partial_t \pi_t(da|s) = -A_t(s,a)\,\pi_t(da|s)$$

where:

  • $g(\theta,\pi)$: semi-gradient of the mean squared Bellman error (MSBE)
  • $A_t(s,a)$: approximate soft advantage function
  • $\eta_t$: timescale separation parameter
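The coupled flow can be illustrated with an explicit Euler discretization. This is a sketch under strong assumptions, not the paper's algorithm: finite state and action sets, one-hot (tabular) features so that realizability holds trivially, a uniform sampling measure, and a constant critic speed `eta` in place of a time-dependent one. The concrete forms of the semi-gradient and of the approximate advantage used here are standard choices and are assumptions; all names are hypothetical.

```python
import math

def gibbs(f_row, mu):
    """pi(a|s) proportional to exp(f(s,a)) * mu(a)."""
    m = max(f_row)
    w = [math.exp(f - m) * q for f, q in zip(f_row, mu)]
    z = sum(w)
    return [x / z for x in w]

def actor_critic_euler(P, c, mu, gamma, tau, eta, dt, steps):
    """Explicit Euler steps for a tabular, finite-MDP stand-in of the coupled flow."""
    n_s, n_a = len(c), len(c[0])
    beta = 1.0 / (n_s * n_a)                   # uniform sampling measure (assumption)
    theta = [[0.0] * n_a for _ in range(n_s)]  # one-hot features: Q(s,a;theta) = theta[s][a]
    f = [[0.0] * n_a for _ in range(n_s)]      # ln(dpi/dmu) up to a per-state constant
    for _ in range(steps):
        pi = [gibbs(f[s], mu) for s in range(n_s)]
        # Soft state value implied by the current critic and policy.
        v = [sum(pi[s][a] * (theta[s][a] + tau * math.log(pi[s][a] / mu[a]))
                 for a in range(n_a)) for s in range(n_s)]
        # Critic: semi-gradient of the Bellman error (the target is held fixed).
        g = [[beta * (theta[s][a] - c[s][a]
                      - gamma * sum(P[s][a][t] * v[t] for t in range(n_s)))
              for a in range(n_a)] for s in range(n_s)]
        # Actor: d/dt ln(dpi/dmu) = -A_t. The per-state centring term of A_t is
        # dropped because gibbs() renormalizes.
        f = [[f[s][a] - dt * (theta[s][a] + tau * f[s][a]) for a in range(n_a)]
             for s in range(n_s)]
        theta = [[theta[s][a] - dt * eta * g[s][a] for a in range(n_a)]
                 for s in range(n_s)]
    return theta, [gibbs(f[s], mu) for s in range(n_s)]

# Hypothetical example: transitions ignore the action, and action 0 is cheaper.
P = [[[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.9, 0.1]]]  # P[s][a][s']
c = [[0.0, 1.0], [0.0, 1.0]]
gamma, tau = 0.5, 0.5
# Uniform one-hot features over 4 state-action pairs give a minimum eigenvalue of 1/4.
Gamma = 0.25 * (1.0 - gamma) * (1.0 - math.sqrt(gamma))
theta, pi = actor_critic_euler(P, c, [0.5, 0.5], gamma, tau,
                               eta=2.0 * tau / Gamma, dt=0.01, steps=5000)
print(pi)
```

In this example the action only changes the immediate cost, so the regularized optimum is $\pi^*(0|s) = 1/(1 + e^{-1/\tau})$ in every state, and the iterates approach it. The critic speed is chosen above the stability threshold $\tau/\Gamma$ from Theorem 5.1.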

Technical Innovations

1. Fisher-Rao Gradient Flow

Models policy updates as a Fisher-Rao gradient flow on the space of probability measures:

$$\partial_t \ln\frac{d\pi_t}{d\mu}(s,a) = -A^{\pi_t}_\tau(s,a)$$
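One way to see why this flow stays on the space of probability measures is the short check below. It assumes the soft advantage has the standard centred form $A^{\pi}_\tau(s,a) = Q^{\pi}_\tau(s,a) + \tau \ln\frac{d\pi}{d\mu}(s,a) - V^{\pi}_\tau(s)$, which is an assumption here rather than a quotation from the paper, together with the identity $V^{\pi}_\tau(s) = \int_A \big(Q^{\pi}_\tau(s,a) + \tau \ln\frac{d\pi}{d\mu}(s,a)\big)\pi(da|s)$ that follows from the definition of the value function.

```latex
\begin{align*}
\int_A A^{\pi}_\tau(s,a)\,\pi(da|s)
  &= \int_A \Big(Q^{\pi}_\tau(s,a) + \tau \ln\frac{d\pi}{d\mu}(s,a)\Big)\,\pi(da|s) - V^{\pi}_\tau(s) = 0, \\
\frac{d}{dt}\int_A \pi_t(da|s)
  &= \int_A \partial_t \ln\frac{d\pi_t}{d\mu}(s,a)\,\pi_t(da|s)
   = -\int_A A^{\pi_t}_\tau(s,a)\,\pi_t(da|s) = 0.
\end{align*}
```

The total mass of $\pi_t(\cdot|s)$ is therefore constant in time, so a flow started at a probability measure remains one.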

2. Two-Timescale Analysis

  • Critic updates on fast timescale (TD learning)
  • Actor updates on slow timescale (policy mirror descent)

3. Lyapunov Stability Analysis

Constructs Lyapunov functions to analyze system stability, combining:

  • Convex analysis on Euclidean spaces
  • Convex analysis on measure spaces

Theoretical Analysis

Key Assumptions

Assumption 4.1 ($Q^\pi_\tau$-Realizability): For every $\pi \in \Pi_\mu$ there exists $\theta^\pi \in \mathbb{R}^N$ such that, for all $(s,a) \in S \times A$: $Q^\pi_\tau(s,a) = \langle \theta^\pi, \varphi(s,a) \rangle$

Assumption 4.2: $|\varphi(s,a)| \le 1$ for all $(s,a) \in S \times A$

Assumption 4.3: The matrix $\int_{S \times A} \varphi(s,a)\varphi(s,a)^\top \beta(ds,da)$ has minimum eigenvalue $\lambda_\beta > 0$
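Assumption 4.3 can be checked directly when the sampling measure has finite support. A minimal sketch, assuming two-dimensional features so that the smallest eigenvalue has a closed form; the helper names and sample features are hypothetical:

```python
import math

def feature_second_moment(phis, weights):
    """Return the N x N matrix sum_i weights[i] * phi_i phi_i^T."""
    n = len(phis[0])
    M = [[0.0] * n for _ in range(n)]
    for phi, w in zip(phis, weights):
        for i in range(n):
            for j in range(n):
                M[i][j] += w * phi[i] * phi[j]
    return M

def min_eig_sym_2x2(M):
    """Smallest eigenvalue of a symmetric 2 x 2 matrix [[a, b], [b, d]]."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    return 0.5 * (a + d) - math.sqrt(0.25 * (a - d) ** 2 + b * b)

# Features spanning both directions: the assumption holds.
good = feature_second_moment([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
# Features confined to one direction: the matrix is singular and the assumption fails.
bad = feature_second_moment([[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])
print(min_eig_sym_2x2(good), min_eig_sym_2x2(bad))
```

Both feature sets satisfy the norm bound of Assumption 4.2; only the first one excites every direction of the parameter space under the sampling measure.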

Main Theoretical Results

Stability Theorem (Theorem 5.1)

Let $\eta_0 > \frac{\tau}{\Gamma}$, where $\Gamma = \lambda_\beta (1-\gamma)(1-\sqrt{\gamma})$. Then there exist constants $a_1, a_2 > 0$ such that:

$$K_t^2 \le a_1 + a_2 \int_0^t e^{-\tau(t-r)} K_r^2\, dr$$

where $K_t = \sup_{s \in S} \mathrm{KL}(\pi_t(\cdot|s)\,|\,\mu)$.
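The threshold on $\eta_0$ is simple arithmetic in $\lambda_\beta$, $\gamma$ and $\tau$. A small sketch of that computation; the function name and the numerical values are illustrative assumptions:

```python
import math

def stability_threshold(lam_beta, gamma, tau):
    """Return tau / Gamma with Gamma = lam_beta * (1 - gamma) * (1 - sqrt(gamma)).

    Theorem 5.1 asks for an initial critic speed eta_0 strictly above this value.
    """
    big_gamma = lam_beta * (1.0 - gamma) * (1.0 - math.sqrt(gamma))
    return tau / big_gamma

# Illustrative values: the required critic speed grows quickly as gamma approaches 1.
for g in (0.5, 0.9, 0.99):
    print(g, stability_threshold(lam_beta=0.25, gamma=g, tau=0.1))
```

The threshold scales linearly in $\tau$, so stronger entropy regularization calls for a faster critic, while a larger $\lambda_\beta$ relaxes the requirement.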

Convergence Theorem (Theorem 6.1)

For all $t > 0$:

$$\min_{r \in [0,t]} V^{\pi_r}_\tau(\rho) - V^{\pi^*}_\tau(\rho) \le \frac{\tau}{2(1-\gamma)\left(1-e^{-\frac{\tau}{2}t}\right)}\left(e^{-\frac{\tau}{2}t}\int_S \mathrm{KL}(\pi^*(\cdot|s)\,|\,\pi_0(\cdot|s))\, d^{\pi^*}_\rho(ds) + \frac{1}{2\tau}\int_0^t e^{-\frac{\tau}{2}(t-r)}\,|\theta_r - \theta^{\pi_r}|^2\, dr\right)$$

Exponential Convergence (Theorem 6.3)

Under appropriate conditions, there exist $\eta_t = \eta_0 e^{k_1 t}$ and a constant $k_2 > 0$ such that:

$$\min_{r \in [0,t]} V^{\pi_r}_\tau(\rho) - V^{\pi^*}_\tau(\rho) \le \frac{\tau e^{-\frac{\tau}{2}t}}{2(1-\gamma)\left(1-e^{-\frac{\tau}{2}t}\right)}\left(\int_S \mathrm{KL}(\pi^*(\cdot|s)\,|\,\pi_0(\cdot|s))\, d^{\pi^*}_\rho(ds) + \frac{k_2}{2\tau}\right)$$
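To get a feel for the rate, the right-hand side of the bound can be evaluated numerically. A minimal sketch, in which the initial KL term `kl0` and the constant `k2` are placeholder inputs chosen for illustration, not values from the paper:

```python
import math

def exp_rate_bound(t, tau, gamma, kl0, k2):
    """Right-hand side of the exponential bound for t > 0.

    kl0 stands for the integrated KL term between the optimal and the initial
    policy, and k2 for the constant in the theorem; both are supplied by the caller.
    """
    decay = math.exp(-0.5 * tau * t)
    prefactor = tau * decay / (2.0 * (1.0 - gamma) * (1.0 - decay))
    return prefactor * (kl0 + k2 / (2.0 * tau))

# Illustrative values only.
for t in (1.0, 10.0, 20.0, 40.0):
    print(t, exp_rate_bound(t, tau=1.0, gamma=0.9, kl0=1.0, k2=1.0))
```

For large $t$ the factor $1 - e^{-\frac{\tau}{2}t}$ is close to 1, so the bound decays like $e^{-\frac{\tau}{2}t}$: the rate is set by the regularization strength $\tau$.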

Key Technical Tools

1. Performance Difference Lemma

$$V^\pi_\tau(\rho) - V^{\pi'}_\tau(\rho) = \frac{1}{1-\gamma}\int_S \left[\int_A \left(Q^{\pi'}_\tau(s,a) + \tau \ln\frac{d\pi'}{d\mu}(s,a)\right)(\pi - \pi')(da|s) + \tau\,\mathrm{KL}(\pi(\cdot|s)\,|\,\pi'(\cdot|s))\right] d^\pi_\rho(ds)$$
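The lemma can be sanity-checked numerically on a small finite MDP. This is a sketch under assumptions: finite sets, strictly positive policies, and the normalized occupancy $d^\pi_\rho = (1-\gamma)\sum_n \gamma^n \mathrm{Law}(s_n)$. Function names and example data are hypothetical.

```python
import math

def kl(p, q):
    """KL(p | q) for finite distributions given as lists."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def soft_value(P, c, pi, mu, gamma, tau, iters=500):
    """Fixed-point iteration for the entropy-regularized value function."""
    n_s, n_a = len(c), len(c[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [sum(pi[s][a] * (c[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s)))
                 for a in range(n_a)) + tau * kl(pi[s], mu)
             for s in range(n_s)]
    return V

def occupancy(P, pi, rho, gamma, iters=500):
    """Discounted state occupancy, from d = (1 - gamma) * rho + gamma * d P_pi."""
    n_s, n_a = len(pi), len(pi[0])
    d = list(rho)
    for _ in range(iters):
        d = [(1.0 - gamma) * rho[t]
             + gamma * sum(d[s] * pi[s][a] * P[s][a][t]
                           for s in range(n_s) for a in range(n_a))
             for t in range(n_s)]
    return d

def performance_difference_gap(P, c, pi, pi2, mu, rho, gamma, tau):
    """Left-hand side minus right-hand side of the lemma (pi2 plays the role of pi')."""
    n_s, n_a = len(c), len(c[0])
    V1 = soft_value(P, c, pi, mu, gamma, tau)
    V2 = soft_value(P, c, pi2, mu, gamma, tau)
    Q2 = [[c[s][a] + gamma * sum(P[s][a][t] * V2[t] for t in range(n_s))
           for a in range(n_a)] for s in range(n_s)]
    d = occupancy(P, pi, rho, gamma)
    lhs = sum(rho[s] * (V1[s] - V2[s]) for s in range(n_s))
    rhs = 0.0
    for s in range(n_s):
        inner = sum((Q2[s][a] + tau * math.log(pi2[s][a] / mu[a])) * (pi[s][a] - pi2[s][a])
                    for a in range(n_a))
        rhs += d[s] * (inner + tau * kl(pi[s], pi2[s]))
    return lhs - rhs / (1.0 - gamma)

# Hypothetical example; the gap should vanish up to floating-point error.
P = [[[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.5], [0.9, 0.1]]]  # P[s][a][s']
c = [[1.0, 0.0], [0.5, 2.0]]
print(performance_difference_gap(P, c, [[0.9, 0.1], [0.3, 0.7]], [[0.4, 0.6], [0.5, 0.5]],
                                 [0.5, 0.5], [0.25, 0.75], gamma=0.9, tau=0.1))
```

The occupancy measure is taken under $\pi$ while the Q-function belongs to $\pi'$, which is what lets the lemma compare two policies using quantities of only one of them.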

2. Gronwall Inequality Application

Used to control the growth of KL divergence and parameter norms.

3. State-Action Occupancy Measure Properties

Lemma 5.1: $d^\pi_{J_\pi \beta}(E) = J_\pi d^\pi_\beta(E)$ and $d^\pi_\beta(E) - \gamma\, d^\pi_{J_\pi \beta}(E) = (1-\gamma)\,\beta(E)$
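Both identities follow from a telescoping series. The derivation below is a sketch under two assumptions that are not spelled out in this summary: that $J_\pi$ is the one-step transition operator acting linearly on state-action measures, and that the occupancy measure is $d^\pi_\beta = (1-\gamma)\sum_{n \ge 0} \gamma^n J_\pi^n \beta$.

```latex
\begin{align*}
d^\pi_{J_\pi\beta}
  &= (1-\gamma)\sum_{n\ge 0}\gamma^n J_\pi^{n}\,(J_\pi\beta)
   = J_\pi\Big((1-\gamma)\sum_{n\ge 0}\gamma^n J_\pi^{n}\beta\Big)
   = J_\pi d^\pi_\beta, \\
d^\pi_\beta - \gamma\, d^\pi_{J_\pi\beta}
  &= (1-\gamma)\sum_{n\ge 0}\gamma^n J_\pi^n\beta
   - (1-\gamma)\sum_{n\ge 0}\gamma^{n+1} J_\pi^{n+1}\beta
   = (1-\gamma)\,\beta .
\end{align*}
```

In the second line every term with $n \ge 1$ cancels, leaving only the $n = 0$ term of the first sum.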

Related Work

Unregularized Settings

  • Borkar & Konda (1997): Two-timescale stochastic approximation
  • Bhandari et al. (2021): Finite-time analysis with linear function approximation
  • Zhang et al. (2021): Wasserstein flows and representation learning

Entropy-Regularized Settings

  • Cayci et al. (2024): Natural policy gradient for finite action spaces
  • This paper extends to general action spaces

Technical Contribution Comparison

Advantages of this paper over existing work:

  1. Handles continuous/infinite action spaces
  2. Rigorous stability and convergence proofs
  3. Analysis of interplay between entropy regularization and timescale separation

Conclusions and Discussion

Main Conclusions

  1. Stability Guarantee: System remains stable under appropriate timescale separation conditions
  2. Exponential Convergence: Exponential convergence rate toward the optimal policy
  3. Entropy Regularization Effect: Entropy regularization ensures a unique optimal policy and accelerates convergence

Limitations

  1. Continuous-Time Assumption: Only continuous-time dynamics are analyzed, whereas practical algorithms run in discrete time
  2. Linear Function Approximation: Practical applications often use nonlinear neural networks
  3. Exact Integration Assumption: Practical implementations require sampling estimates, introducing Monte Carlo errors
  4. Q-Function Realizability: Strong assumption that may not hold in practice

Future Directions

  1. Rigorous analysis of discrete-time algorithms
  2. Extension to nonlinear function approximation
  3. Handling of sampling errors
  4. Weaker realizability conditions

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete stability and convergence proofs
  2. Technical Innovation: Cleverly combines Fisher-Rao geometry with Lyapunov analysis
  3. Generality: Extends to continuous action spaces, filling theoretical gaps
  4. Clear Presentation: Detailed mathematical derivations with clear logic

Weaknesses

  1. Practical Limitations: Strong assumptions are difficult to satisfy in practice
  2. Missing Experimental Validation: Purely theoretical work with no numerical verification
  3. Computational Complexity: Does not discuss algorithmic computational complexity
  4. Limited Applicability: Continuous-time assumption restricts practical applications

Impact

  1. Theoretical Contribution: Provides important theoretical foundation for entropy-regularized MDPs
  2. Methodological Value: Analysis techniques applicable to other reinforcement learning algorithms
  3. Future Research: Establishes foundation for research in discrete-time and more general settings

Applicable Scenarios

  1. Theoretical Research: Provides theoretical tools and insights for other research
  2. Algorithm Design: Guides parameter selection and convergence analysis in practical algorithms
  3. Continuous Control: Control problems in continuous state-action spaces

References

The paper cites 25 important references, covering:

  • Classical actor-critic methods (Konda & Tsitsiklis, 1999)
  • Entropy-regularized MDPs (Kerimkulov et al., 2024)
  • Policy gradient methods (Schulman et al., 2015, 2017)
  • Function approximation theory (Bhandari et al., 2021)

Overall Assessment: This is a high-quality theoretical paper providing rigorous mathematical analysis of actor-critic methods in entropy-regularized MDPs. While it has limitations in practical application, its theoretical contributions and methodological value are significant, laying an important foundation for further development in this field.