2025-11-20T12:37:14.096690

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Ding, Zhang, Duan et al.
We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is~dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
academic

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Basic Information

  • Paper ID: 2206.02346
  • Title: Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
  • Authors: Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović
  • Classification: math.OC cs.AI cs.LG cs.SY eess.SY
  • Published Journal: Journal of Machine Learning Research 26 (2025) 1-76
  • Paper Link: https://arxiv.org/abs/2206.02346

Abstract

This paper investigates the sequential decision-making problem of maximizing expected cumulative rewards subject to expected cumulative utility constraints. The authors employ natural policy gradient methods to solve the discounted infinite-horizon optimal control problem for constrained Markov Decision Processes (constrained MDPs). Specifically, they propose a novel Natural Policy Gradient Primal-Dual (NPG-PD) method that updates primal variables via natural policy gradient ascent and dual variables via projected subgradient descent. Despite the underlying maximization problem involving non-concave objectives and non-convex constraint sets, the method achieves sublinear global convergence rates in both optimality gap and constraint violation under softmax policy parameterization, independent of the state-action space size (dimension-free). Furthermore, for log-linear and general smooth policy parameterizations, sublinear convergence rates are established up to function approximation errors induced by restricted policy parameterization.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is optimal policy learning in Constrained MDPs:

  • Objective: Maximize expected cumulative reward Vrπ(ρ)V^π_r(ρ)
  • Constraints: Satisfy expected cumulative utility constraints Vgπ(ρ)bV^π_g(ρ) ≥ b
  • Challenges: Non-concave objective function, non-convex constraint set

Significance

Constrained MDPs are crucial in safety-critical applications:

  1. Autonomous Driving: Requires maximizing performance while ensuring safety constraints
  2. Robotics: Must satisfy physical and safety limitations during task execution
  3. Cybersecurity: Maintains security policies while optimizing system performance
  4. Financial Management: Pursues returns while controlling risk exposure

Limitations of Existing Methods

  1. Insufficient Theoretical Guarantees: Most existing methods provide only asymptotic or local convergence guarantees
  2. Dimension Dependence: Convergence rates typically depend on state-action space size
  3. Function Approximation Errors: Lack rigorous analysis under function approximation
  4. Sample Complexity: Absence of finite-sample complexity theoretical guarantees

Core Contributions

  1. Proposes NPG-PD Algorithm: Designs a novel algorithmic framework combining natural policy gradient and primal-dual methods
  2. Global Convergence Guarantees: Proves dimension-free global convergence under softmax parameterization
  3. Function Approximation Theory: Establishes convergence theory for log-linear and general smooth policy parameterizations
  4. Sample Complexity Analysis: Provides finite-sample complexity guarantees for two sample-based NPG-PD algorithms
  5. Experimental Validation: Verifies method effectiveness through robotic simulation tasks

Methodology Details

Task Definition

Constrained MDP is defined as a seven-tuple (S,A,P,r,g,b,γ,ρ)(\mathcal{S}, \mathcal{A}, P, r, g, b, γ, ρ):

  • S\mathcal{S}: Finite state space
  • A\mathcal{A}: Finite action space
  • PP: Transition probability
  • r,gr, g: Reward and utility functions
  • bb: Constraint threshold
  • γγ: Discount factor
  • ρρ: Initial state distribution

Optimization Problem: maxπΠVrπ(ρ)s.t.Vgπ(ρ)b\max_{π ∈ Π} V^π_r(ρ) \quad \text{s.t.} \quad V^π_g(ρ) ≥ b

Model Architecture

1. Lagrangian Duality

Transforms the constrained optimization problem into a saddle-point problem: maxπΠminλ0Vrπ(ρ)+λ(Vgπ(ρ)b)\max_{π ∈ Π} \min_{λ ≥ 0} V^π_r(ρ) + λ(V^π_g(ρ) - b)

2. NPG-PD Algorithm Core Updates

Primal Variable Update (Natural Policy Gradient): θ(t+1)=θ(t)+η1Fρ(θ(t))θVLθ(t),λ(t)(ρ)θ^{(t+1)} = θ^{(t)} + η_1 F^†_ρ(θ^{(t)})∇_θ V^{θ^{(t)},λ^{(t)}}_L(ρ)

Dual Variable Update (Projected Subgradient Descent): λ(t+1)=PΛ(λ(t)η2(Vgθ(t)(ρ)b))λ^{(t+1)} = P_Λ\left(λ^{(t)} - η_2(V^{θ^{(t)}}_g(ρ) - b)\right)

Where:

  • Fρ(θ)F^†_ρ(θ): Moore-Penrose inverse of Fisher information matrix
  • PΛP_Λ: Projection onto interval [0,2/((1γ)ξ)][0, 2/((1-γ)ξ)]

3. Simplified Form Under Softmax Policy Parameterization

Under softmax parameterization πθ(as)=exp(θs,a)aexp(θs,a)π_θ(a|s) = \frac{\exp(θ_{s,a})}{\sum_{a'} \exp(θ_{s,a'})}, the update simplifies to:

θs,a(t+1)=θs,a(t)+η11γAL(t)(s,a)θ^{(t+1)}_{s,a} = θ^{(t)}_{s,a} + \frac{η_1}{1-γ}A^{(t)}_L(s,a)

Equivalent to multiplicative weight update: π(t+1)(as)=π(t)(as)exp(η11γAL(t)(s,a))Z(t)(s)π^{(t+1)}(a|s) = \frac{π^{(t)}(a|s)\exp\left(\frac{η_1}{1-γ}A^{(t)}_L(s,a)\right)}{Z^{(t)}(s)}

Technical Innovations

  1. Dimension-Free Convergence: Leverages softmax structure to achieve convergence rates independent of state-action space size
  2. Non-Convex Constraint Handling: Processes non-convex constraint sets through novel primal-dual analysis
  3. Function Approximation Error Decomposition: Introduces estimation-propagation error decomposition framework
  4. Regret-Type Analysis: Employs regret analysis techniques from online learning

Theoretical Results

Main Convergence Theorem

Theorem 10 (Global Convergence under Softmax Parameterization): Under Slater condition, with η1=2logAη_1 = 2\log|A|, η2=2(1γ)/Tη_2 = 2(1-γ)/\sqrt{T}, the NPG-PD algorithm satisfies:

Optimality Gap: 1Tt=0T1(Vr(ρ)Vr(t)(ρ))7(1γ)21T\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ)) ≤ \frac{7}{(1-γ)^2}\frac{1}{\sqrt{T}}

Constraint Violation: [1Tt=0T1(bVg(t)(ρ))]+2ξ+4ξ(1γ)21T\left[\frac{1}{T}\sum_{t=0}^{T-1}(b - V^{(t)}_g(ρ))\right]_+ ≤ \frac{2}{ξ} + \frac{4ξ}{(1-γ)^2}\frac{1}{\sqrt{T}}

Function Approximation Case

Theorem 16 (Log-Linear Parameterization): Under function approximation setting, the convergence rate is: E[1Tt=0T1(Vr(ρ)Vr(t)(ρ))]C3(1γ)51T+Function Approximation ErrorE\left[\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ))\right] ≤ \frac{C_3}{(1-γ)^5}\frac{1}{\sqrt{T}} + \text{Function Approximation Error}

Sample Complexity

Theorem 28/29 (Sample Complexity):

  • Iteration Complexity: O(1/ε2)O(1/ε^2)
  • Sample Complexity: O(1/ε4)O(1/ε^4)

This represents significant improvement over previous O(1/ε8)O(1/ε^8) results.

Experimental Setup

Robotic Simulation Tasks

Uses six robotic locomotion tasks in MuJoCo environment:

  • Ant-v1, Humanoid-v1, HalfCheetah-v1, Walker2d-v1, Hopper-v1, Swimmer-v1
  • Constraints: Movement velocity not exceeding specified threshold (safety constraint)

Comparison Methods

  1. Classical Primal-Dual Methods: TRPOLag, PPOLag
  2. Latest Policy Optimization Methods: CUP, FOCOPS

Evaluation Metrics

  • Average Reward: Task performance
  • Average Cost: Constraint violation degree (average velocity)

Experimental Results

Main Findings

  1. Performance Advantage: NPG-PD achieves higher rewards in most tasks while maintaining similar constraint satisfaction
  2. Convergence Speed: Converges faster than classical Lagrangian methods
  3. Competitive Performance: Comparable or superior to latest methods (FOCOPS, CUP)

Detailed Result Analysis

  • Ant-v1 and Humanoid-v1: NPG-PD uniformly outperforms other four methods
  • HalfCheetah-v1 and Walker2d-v1: NPG-PD performance similar to PPOLag, both superior to other methods
  • Hopper-v1 and Swimmer-v1: NPG-PD competes closely with FOCOPS and CUP, achieving higher rewards despite early oscillations

Constrained MDP Algorithm Development

  1. Early Work: Lagrangian-based methods (Altman 1999, Borkar 2005)
  2. Local Convergence Methods: CPG, accelerated PDPO, CPO, etc.
  3. Global Convergence Research: This paper is the first to provide finite-time global convergence guarantees

Policy Gradient Methods

  1. Unconstrained Convergence Theory: Agarwal et al. (2021), etc.
  2. Constrained Optimization Challenges: Additional difficulties in handling non-convex constraint sets

Conclusions and Discussion

Main Conclusions

  1. Theoretical Breakthrough: First to provide dimension-free global convergence guarantees for policy gradient methods in constrained MDPs
  2. Practical Algorithm: NPG-PD algorithm is simple and effective, suitable for large-scale problems
  3. Function Approximation Theory: Establishes comprehensive function approximation error analysis framework

Limitations

  1. Oscillatory Behavior: Early oscillations common in primal-dual methods
  2. Slater Condition: Requires strict feasibility assumption
  3. Softmax Parameterization: Strongest results only apply to specific parameterization

Future Directions

  1. Policy Iteration Convergence: Study policy iteration convergence for single-timescale algorithms
  2. Regularization Techniques: Introduce regularization to eliminate oscillatory behavior
  3. Continuous Space Extension: Extend to continuous state-action spaces
  4. Robustness Analysis: Consider effects of model uncertainty

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: First to establish dimension-free global convergence theory for constrained MDPs
  2. Technical Depth: Cleverly combines online learning and constrained optimization techniques
  3. Comprehensive Analysis: Complete theoretical framework from tabular case to function approximation
  4. Experimental Validation: Verifies theoretical predictions on practical robotic tasks

Weaknesses

  1. Parameterization Restrictions: Strongest theoretical results apply only to softmax parameterization
  2. Limited Experimental Scope: Experiments primarily concentrated in robotic control domain
  3. Convergence Rate: Sublinear convergence rate may be slow in practical applications
  4. Oscillation Problem: Insufficient resolution of oscillation phenomena in primal-dual methods

Impact

  1. Theoretical Contribution: Provides important theoretical foundation for constrained reinforcement learning
  2. Methodological Value: NPG-PD framework extensible to other constrained optimization problems
  3. Practical Value: Algorithm is simple to implement and suitable for engineering applications
  4. Future Research: Establishes theoretical foundation for subsequent research in this field

Applicable Scenarios

  1. Safety-Critical Systems: Autonomous driving, medical robotics requiring hard constraints
  2. Resource-Constrained Environments: Online learning with limited computational and storage resources
  3. Large-Scale MDPs: Complex decision problems with massive state-action spaces
  4. Multi-Objective Optimization: Applications requiring balance among multiple performance metrics

Technical Details Supplement

Key Lemmas

Lemma 11 (Non-Monotonic Improvement): Each primal variable update improves the Lagrangian term, but reward and utility functions themselves may be non-monotonic.

Lemma 12 (Bounded Average Performance): Establishes average performance bounds through regret analysis.

Proof Techniques

  1. Multiplicative Weight Update Connection: Interprets NPG update as Multiplicative Weights Update in online learning
  2. Fisher Information Matrix Inverse: Leverages softmax structure to simplify NPG computation
  3. Strong Duality: Establishes strong duality relationship under Slater condition
  4. Constraint Violation Bound: Bounds constraint violation through convex analysis techniques

This paper makes important contributions to constrained reinforcement learning theory, providing a solid theoretical foundation and practical algorithmic framework for the field's development.