Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Ding, Zhang, Duan et al.

We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.

Basic Information

Paper ID: 2206.02346
Title: Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
Authors: Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović
Classification: math.OC cs.AI cs.LG cs.SY eess.SY
Published Journal: Journal of Machine Learning Research 26 (2025) 1-76
Paper Link: https://arxiv.org/abs/2206.02346

Abstract

This paper investigates the sequential decision-making problem of maximizing expected cumulative rewards subject to expected cumulative utility constraints. The authors employ natural policy gradient methods to solve the discounted infinite-horizon optimal control problem for constrained Markov Decision Processes (constrained MDPs). Specifically, they propose a novel Natural Policy Gradient Primal-Dual (NPG-PD) method that updates primal variables via natural policy gradient ascent and dual variables via projected subgradient descent. Despite the underlying maximization problem involving non-concave objectives and non-convex constraint sets, the method achieves sublinear global convergence rates in both optimality gap and constraint violation under softmax policy parameterization, independent of the state-action space size (dimension-free). Furthermore, for log-linear and general smooth policy parameterizations, sublinear convergence rates are established up to function approximation errors induced by restricted policy parameterization.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is optimal policy learning in Constrained MDPs:

Objective: Maximize expected cumulative reward $V^π_r(ρ)$
Constraints: Satisfy expected cumulative utility constraints $V^π_g(ρ) ≥ b$
Challenges: Non-concave objective function, non-convex constraint set

Significance

Constrained MDPs are crucial in safety-critical applications:

Autonomous Driving: Requires maximizing performance while ensuring safety constraints
Robotics: Must satisfy physical and safety limitations during task execution
Cybersecurity: Maintains security policies while optimizing system performance
Financial Management: Pursues returns while controlling risk exposure

Limitations of Existing Methods

Insufficient Theoretical Guarantees: Most existing methods provide only asymptotic or local convergence guarantees
Dimension Dependence: Convergence rates typically depend on state-action space size
Function Approximation Errors: Lack rigorous analysis under function approximation
Sample Complexity: Absence of finite-sample complexity theoretical guarantees

Core Contributions

Proposes NPG-PD Algorithm: Designs a novel algorithmic framework combining natural policy gradient and primal-dual methods
Global Convergence Guarantees: Proves dimension-free global convergence under softmax parameterization
Function Approximation Theory: Establishes convergence theory for log-linear and general smooth policy parameterizations
Sample Complexity Analysis: Provides finite-sample complexity guarantees for two sample-based NPG-PD algorithms
Experimental Validation: Verifies method effectiveness through robotic simulation tasks

Methodology Details

Task Definition

A constrained MDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, P, r, g, b, γ, ρ)$ :

$\mathcal{S}$ : Finite state space
$\mathcal{A}$ : Finite action space
$P$ : Transition probability
$r, g$ : Reward and utility functions
$b$ : Constraint threshold
$γ$ : Discount factor
$ρ$ : Initial state distribution

Optimization Problem: $\max_{π ∈ Π} V^π_r(ρ) \quad \text{s.t.} \quad V^π_g(ρ) ≥ b$
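
To make the two quantities in this problem concrete, the sketch below evaluates $V^π_r(ρ)$ and $V^π_g(ρ)$ for a fixed policy by iterating the Bellman expectation equation. It is a minimal illustration, not the paper's code: the function name `policy_value`, the nested-list layout, and the one-state toy CMDP are all assumptions made for this example.

```python
def policy_value(P, signal, policy, rho, gamma, sweeps=2000):
    """Approximate V^pi(rho) for a per-step signal (reward r or utility g).

    P[s][a][s2]  : transition probability from s to s2 under action a
    signal[s][a] : per-step reward or utility
    policy[s][a] : probability of action a in state s
    rho[s]       : initial state distribution
    """
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(sweeps):
        # One synchronous Bellman expectation backup over all states.
        v = [
            sum(
                policy[s][a]
                * (signal[s][a]
                   + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states)))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
    return sum(rho[s] * v[s] for s in range(n_states))


# Hypothetical toy CMDP: one state, two actions.
# Action 0 earns reward only, action 1 earns utility only.
P = [[[1.0], [1.0]]]
r = [[1.0, 0.0]]
g = [[0.0, 1.0]]
rho = [1.0]
gamma = 0.9
b = 3.0

uniform = [[0.5, 0.5]]
V_r = policy_value(P, r, uniform, rho, gamma)
V_g = policy_value(P, g, uniform, rho, gamma)
feasible = V_g >= b  # the constraint V_g(rho) >= b from the problem above
```

In this toy, each value is the action probability divided by $1-γ$, so the constraint $V^π_g(ρ) ≥ b$ reduces to a lower bound on how often action 1 is played.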

Model Architecture

1. Lagrangian Duality

Transforms the constrained optimization problem into a saddle-point problem: $\max_{π ∈ Π} \min_{λ ≥ 0} V^π_r(ρ) + λ(V^π_g(ρ) - b)$
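
As a short worked step (standard Lagrangian reasoning, not quoted from the paper), the inner minimization over $λ$ is what encodes the constraint. For a fixed policy $π$:

```latex
\min_{\lambda \ge 0} \Big[ V^{\pi}_r(\rho) + \lambda \big( V^{\pi}_g(\rho) - b \big) \Big]
=
\begin{cases}
V^{\pi}_r(\rho), & \text{if } V^{\pi}_g(\rho) \ge b, \\
-\infty, & \text{if } V^{\pi}_g(\rho) < b.
\end{cases}
```

An infeasible policy is therefore never selected by the outer maximization, and on feasible policies the saddle-point objective coincides with the original reward objective.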

2. NPG-PD Algorithm Core Updates

Primal Variable Update (Natural Policy Gradient): $θ^{(t+1)} = θ^{(t)} + η_1 F^†_ρ(θ^{(t)})∇_θ V^{θ^{(t)},λ^{(t)}}_L(ρ)$

Dual Variable Update (Projected Subgradient Descent): $λ^{(t+1)} = P_Λ\left(λ^{(t)} - η_2(V^{θ^{(t)}}_g(ρ) - b)\right)$

Where:

$F^†_ρ(θ)$ : Moore-Penrose inverse of Fisher information matrix
$P_Λ$ : Projection onto interval $[0, 2/((1-γ)ξ)]$
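
The dual update can be sketched in a few lines. This is an illustrative sketch only: the function name `dual_step` and its argument names are assumptions, and `v_g` stands for the current value $V^{θ^{(t)}}_g(ρ)$, however it is obtained.

```python
def dual_step(lam, v_g, b, eta2, gamma, xi):
    """One projected subgradient descent step on the dual variable.

    The unprojected step lam - eta2 * (v_g - b) raises lam when the
    constraint is violated (v_g < b) and lowers it otherwise; the result
    is then clipped to the interval [0, 2 / ((1 - gamma) * xi)].
    """
    upper = 2.0 / ((1.0 - gamma) * xi)
    return min(max(lam - eta2 * (v_g - b), 0.0), upper)


# Hypothetical numbers: the constraint is violated, so the multiplier grows.
lam_next = dual_step(lam=0.0, v_g=2.0, b=3.0, eta2=0.5, gamma=0.9, xi=1.0)
```

The clipping step is the projection $P_Λ$; keeping $λ$ bounded is what allows the penalty on the utility term to stay finite throughout the iterations.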

3. Simplified Form Under Softmax Policy Parameterization

Under softmax parameterization $π_θ(a|s) = \frac{\exp(θ_{s,a})}{\sum_{a'} \exp(θ_{s,a'})}$ , the update simplifies to:

$θ^{(t+1)}_{s,a} = θ^{(t)}_{s,a} + \frac{η_1}{1-γ}A^{(t)}_L(s,a)$

Equivalent to multiplicative weight update: $π^{(t+1)}(a|s) = \frac{π^{(t)}(a|s)\exp\left(\frac{η_1}{1-γ}A^{(t)}_L(s,a)\right)}{Z^{(t)}(s)}$
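
The equivalence between the additive logit update and the multiplicative weight update can be checked numerically for a single state. The sketch below is a minimal illustration; the helper names and the sample logits and advantages are assumptions chosen for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one state's action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def npg_logit_step(theta_s, adv_s, eta1, gamma):
    """Additive update: theta[s, a] += eta1 / (1 - gamma) * A_L(s, a)."""
    step = eta1 / (1.0 - gamma)
    return [th + step * a for th, a in zip(theta_s, adv_s)]


def mwu_step(pi_s, adv_s, eta1, gamma):
    """Multiplicative update: pi(a|s) * exp(step * A_L(s, a)), renormalized by Z(s)."""
    step = eta1 / (1.0 - gamma)
    weights = [p * math.exp(step * a) for p, a in zip(pi_s, adv_s)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical logits and Lagrangian advantages for one state with three actions.
theta_s = [0.0, 0.5, -1.0]
adv_s = [0.2, -0.1, 0.05]
via_logits = softmax(npg_logit_step(theta_s, adv_s, eta1=0.1, gamma=0.9))
via_mwu = mwu_step(softmax(theta_s), adv_s, eta1=0.1, gamma=0.9)
max_diff = max(abs(p - q) for p, q in zip(via_logits, via_mwu))
```

Both routes give the same next policy because exponentiating a sum of logits is the same as multiplying the exponentials, and the normalizer $Z^{(t)}(s)$ absorbs the common factor.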

Technical Innovations

Dimension-Free Convergence: Leverages softmax structure to achieve convergence rates independent of state-action space size
Non-Convex Constraint Handling: Processes non-convex constraint sets through novel primal-dual analysis
Function Approximation Error Decomposition: Introduces estimation-propagation error decomposition framework
Regret-Type Analysis: Employs regret analysis techniques from online learning

Theoretical Results

Main Convergence Theorem

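Theorem 10 (Global Convergence under Softmax Parameterization): Under the Slater condition, i.e., some policy $\bar{π}$ satisfies $V^{\bar{π}}_g(ρ) - b ≥ ξ$ for a slack $ξ > 0$, and with step sizes $η_1 = 2\log|\mathcal{A}|$ and $η_2 = 2(1-γ)/\sqrt{T}$ , the NPG-PD algorithm satisfies: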

Optimality Gap: $\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ)) ≤ \frac{7}{(1-γ)^2}\frac{1}{\sqrt{T}}$

Constraint Violation: $\left[\frac{1}{T}\sum_{t=0}^{T-1}(b - V^{(t)}_g(ρ))\right]_+ ≤ \frac{2/ξ + 4ξ}{(1-γ)^2}\frac{1}{\sqrt{T}}$
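
To show how the pieces fit together, here is a minimal tabular sketch of the NPG-PD loop with the step sizes from the theorem. It is an illustration under stated assumptions, not the authors' implementation: values and advantages come from exact policy evaluation rather than samples, the names `evaluate` and `npg_pd` are invented for this example, and the one-state CMDP is a hypothetical toy whose Slater slack is $ξ = 7$ (always playing action 1 gives $V_g = 10$ against $b = 3$).

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def evaluate(P, c, pi, gamma, sweeps=300):
    """Return (Q, V) of policy pi for per-step signal c via Bellman iteration."""
    n_s, n_a = len(P), len(P[0])
    v = [0.0] * n_s
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        q = [[c[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        v = [sum(pi[s][a] * q[s][a] for a in range(n_a)) for s in range(n_s)]
    return q, v


def npg_pd(P, r, g, b, rho, gamma, xi, T):
    n_s, n_a = len(P), len(P[0])
    eta1 = 2.0 * math.log(n_a)                 # primal step size from the theorem
    eta2 = 2.0 * (1.0 - gamma) / math.sqrt(T)  # dual step size from the theorem
    lam_max = 2.0 / ((1.0 - gamma) * xi)       # upper end of the projection interval
    theta = [[0.0] * n_a for _ in range(n_s)]
    lam, avg_r, avg_g = 0.0, 0.0, 0.0
    for _ in range(T):
        pi = [softmax(row) for row in theta]
        _, v_r = evaluate(P, r, pi, gamma)
        _, v_g = evaluate(P, g, pi, gamma)
        # Lagrangian signal r + lam * g; the constant -lam * b cancels in advantages.
        lagr = [[r[s][a] + lam * g[s][a] for a in range(n_a)] for s in range(n_s)]
        q_l, v_l = evaluate(P, lagr, pi, gamma)
        val_r = sum(rho[s] * v_r[s] for s in range(n_s))
        val_g = sum(rho[s] * v_g[s] for s in range(n_s))
        avg_r += val_r / T
        avg_g += val_g / T
        # Primal step: theta[s][a] += eta1 / (1 - gamma) * A_L(s, a).
        theta = [[theta[s][a] + eta1 / (1.0 - gamma) * (q_l[s][a] - v_l[s])
                  for a in range(n_a)] for s in range(n_s)]
        # Dual step: projected subgradient descent on lam.
        lam = min(max(lam - eta2 * (val_g - b), 0.0), lam_max)
    return avg_r, avg_g, lam


# Hypothetical one-state, two-action toy: action 0 pays reward, action 1 pays utility.
P = [[[1.0], [1.0]]]
r = [[1.0, 0.0]]
g = [[0.0, 1.0]]
avg_r, avg_g, lam = npg_pd(P, r, g, b=3.0, rho=[1.0], gamma=0.9, xi=7.0, T=200)
```

The function returns time-averaged values because the bounds above are stated for averages over iterates $t = 0, \dots, T-1$, not for the last iterate; individual iterates may oscillate.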

Function Approximation Case

Theorem 16 (Log-Linear Parameterization): Under function approximation setting, the convergence rate is: $E\left[\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ))\right] ≤ \frac{C_3}{(1-γ)^5}\frac{1}{\sqrt{T}} + \text{Function Approximation Error}$

Sample Complexity

Theorem 28/29 (Sample Complexity):

Iteration Complexity: $O(1/ε^2)$
Sample Complexity: $O(1/ε^4)$

This represents significant improvement over previous $O(1/ε^8)$ results.
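
The $O(1/ε^2)$ iteration count follows from any bound that decays as $1/\sqrt{T}$. As an illustration, the sketch below inverts the softmax-case optimality-gap bound $\frac{7}{(1-γ)^2}\frac{1}{\sqrt{T}} ≤ ε$ for $T$. The function names are assumptions, and the constants in the sample-based theorems differ from this exact-gradient case.

```python
import math


def optimality_gap_bound(T, gamma):
    """Right-hand side of the softmax-case bound: 7 / ((1 - gamma)^2 * sqrt(T))."""
    return 7.0 / ((1.0 - gamma) ** 2 * math.sqrt(T))


def iterations_for_gap(eps, gamma):
    """An integer T with bound <= eps, from T >= (7 / ((1 - gamma)^2 * eps))^2."""
    return math.ceil((7.0 / ((1.0 - gamma) ** 2 * eps)) ** 2)


# Halving the target accuracy roughly quadruples the required iterations.
T_coarse = iterations_for_gap(eps=0.1, gamma=0.9)
T_fine = iterations_for_gap(eps=0.05, gamma=0.9)
ratio = T_fine / T_coarse
```

The $(1-γ)^{-4}$ factor in $T$ is visible here as well: discount factors close to 1 inflate the iteration count far more than a tighter $ε$ does.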

Experimental Setup

Robotic Simulation Tasks

The experiments use six robotic locomotion tasks in the MuJoCo environment:

Ant-v1, Humanoid-v1, HalfCheetah-v1, Walker2d-v1, Hopper-v1, Swimmer-v1
Constraints: Movement velocity not exceeding specified threshold (safety constraint)

Comparison Methods

Classical Primal-Dual Methods: TRPOLag, PPOLag
Latest Policy Optimization Methods: CUP, FOCOPS

Evaluation Metrics

Average Reward: Task performance
Average Cost: Constraint violation degree (average velocity)

Experimental Results

Main Findings

Performance Advantage: NPG-PD achieves higher rewards in most tasks while maintaining similar constraint satisfaction
Convergence Speed: Converges faster than classical Lagrangian methods
Competitive Performance: Comparable or superior to latest methods (FOCOPS, CUP)

Detailed Result Analysis

Ant-v1 and Humanoid-v1: NPG-PD uniformly outperforms the other four methods
HalfCheetah-v1 and Walker2d-v1: NPG-PD performance similar to PPOLag, both superior to other methods
Hopper-v1 and Swimmer-v1: NPG-PD competes closely with FOCOPS and CUP, achieving higher rewards despite early oscillations

Related Work

Constrained MDP Algorithm Development

Early Work: Lagrangian-based methods (Altman 1999, Borkar 2005)
Local Convergence Methods: CPG, accelerated PDPO, CPO, etc.
Global Convergence Research: This paper is the first to provide finite-time global convergence guarantees

Policy Gradient Methods

Unconstrained Convergence Theory: Agarwal et al. (2021), etc.
Constrained Optimization Challenges: Additional difficulties in handling non-convex constraint sets

Conclusions and Discussion

Main Conclusions

Theoretical Breakthrough: First to provide dimension-free global convergence guarantees for policy gradient methods in constrained MDPs
Practical Algorithm: NPG-PD algorithm is simple and effective, suitable for large-scale problems
Function Approximation Theory: Establishes comprehensive function approximation error analysis framework

Limitations

Oscillatory Behavior: Early oscillations common in primal-dual methods
Slater Condition: Requires strict feasibility assumption
Softmax Parameterization: Strongest results only apply to specific parameterization

Future Directions

Policy Iteration Convergence: Study policy iteration convergence for single-timescale algorithms
Regularization Techniques: Introduce regularization to eliminate oscillatory behavior
Continuous Space Extension: Extend to continuous state-action spaces
Robustness Analysis: Consider effects of model uncertainty

In-Depth Evaluation

Strengths

Theoretical Innovation: First to establish dimension-free global convergence theory for constrained MDPs
Technical Depth: Cleverly combines online learning and constrained optimization techniques
Comprehensive Analysis: Complete theoretical framework from tabular case to function approximation
Experimental Validation: Verifies theoretical predictions on practical robotic tasks

Weaknesses

Parameterization Restrictions: Strongest theoretical results apply only to softmax parameterization
Limited Experimental Scope: Experiments primarily concentrated in robotic control domain
Convergence Rate: Sublinear convergence rate may be slow in practical applications
Oscillation Problem: Insufficient resolution of oscillation phenomena in primal-dual methods

Impact

Theoretical Contribution: Provides important theoretical foundation for constrained reinforcement learning
Methodological Value: NPG-PD framework extensible to other constrained optimization problems
Practical Value: Algorithm is simple to implement and suitable for engineering applications
Future Research: Establishes theoretical foundation for subsequent research in this field

Applicable Scenarios

Safety-Critical Systems: Autonomous driving, medical robotics requiring hard constraints
Resource-Constrained Environments: Online learning with limited computational and storage resources
Large-Scale MDPs: Complex decision problems with massive state-action spaces
Multi-Objective Optimization: Applications requiring balance among multiple performance metrics

Multiplicative Weight Update Connection: Interprets NPG update as Multiplicative Weights Update in online learning
Fisher Information Matrix Inverse: Leverages softmax structure to simplify NPG computation
Strong Duality: Establishes strong duality relationship under Slater condition
Constraint Violation Bound: Bounds constraint violation through convex analysis techniques

This paper makes important contributions to constrained reinforcement learning theory, providing a solid theoretical foundation and practical algorithmic framework for the field's development.