
Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Ding, Zhang, Duan et al.

We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.

Basic Information

Paper ID: 2206.02346
Title: Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
Authors: Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović
Categories: math.OC cs.AI cs.LG cs.SY eess.SY
Journal: Journal of Machine Learning Research 26 (2025) 1-76
Paper link: https://arxiv.org/abs/2206.02346

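Summary

This paper studies the sequential decision-making problem of maximizing the expected total reward subject to a constraint on the expected total utility. The authors use the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for constrained Markov decision processes (constrained MDPs). Specifically, they propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization the method converges globally at sublinear rates in both the optimality gap and the constraint violation. This convergence is independent of the size of the state-action space, i.e., it is dimension-free. In addition, for log-linear and general smooth policy parametrizations, the paper establishes sublinear convergence rates up to a function approximation error caused by the restricted policy parametrization.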

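Background and Motivation

Problem Definition

The core problem addressed in this paper is learning an optimal policy in constrained Markov decision processes (constrained MDPs):

Objective: maximize the expected total reward $V^π_r(ρ)$
Constraint: satisfy the expected total utility constraint $V^π_g(ρ) ≥ b$
Challenge: the objective function is nonconcave and the constraint set is nonconvex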

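Significance

Constrained MDPs matter in safety-critical applications:

Autonomous driving: performance must be maximized while safety constraints are guaranteed
Robotics: physical and safety limits must be respected while a task is carried out
Network security: system performance is optimized while security policies are maintained
Financial management: returns are pursued while risk is kept under control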

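Limitations of Existing Methods

Insufficient theoretical guarantees: most existing methods provide only asymptotic or local convergence guarantees
Dimension dependence: convergence rates typically depend on the size of the state-action space
Function approximation error: rigorous analysis under function approximation is lacking
Sample complexity: finite-sample complexity guarantees are lacking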

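Core Contributions

NPG-PD algorithm: a new algorithmic framework that combines natural policy gradient with a primal-dual method
Global convergence guarantee: dimension-free global convergence is proved under the softmax parametrization
Function approximation theory: convergence theory is established for log-linear and general smooth policy parametrizations
Sample complexity analysis: finite-sample complexity guarantees are given for two sample-based NPG-PD algorithms
Experimental validation: the method's effectiveness is demonstrated on robotic simulation tasks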

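Method Details

Task Definition

A constrained MDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, P, r, g, b, γ, ρ)$:

$\mathcal{S}$: finite state space
$\mathcal{A}$: finite action space
$P$: transition probabilities
$r, g$: reward and utility functions
$b$: constraint threshold
$γ$: discount factor
$ρ$: initial state distribution

Optimization problem: $\max_{π ∈ Π} V^π_r(ρ) \quad \text{s.t.} \quad V^π_g(ρ) ≥ b$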
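To make the objects in this definition concrete, here is a minimal sketch of evaluating $V^π_r(ρ)$ and $V^π_g(ρ)$ for a fixed policy in a tabular constrained MDP and checking the constraint. It assumes the unnormalized discounted value $E[\sum_t γ^t r(s_t, a_t)]$; the function name `policy_value`, the nested-list layout, and every number in the toy instance are hypothetical and not taken from the paper.

```python
def policy_value(P, signal, policy, rho, gamma, iters=2000):
    """Approximate V^pi_signal(rho) by repeated Bellman backups.

    P[s][a][s2] is a transition probability, signal[s][a] is r or g,
    policy[s][a] is pi(a|s), and rho[s] is the initial distribution.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        # One synchronous backup: v(s) <- sum_a pi(a|s) [signal + gamma * E v(s')]
        v = [
            sum(
                policy[s][a]
                * (signal[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n)))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
    return sum(rho[s] * v[s] for s in range(n))


# Hypothetical 2-state, 2-action constrained MDP (all numbers are made up).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
r = [[1.0, 0.0], [0.5, 0.2]]   # reward r(s, a)
g = [[0.0, 1.0], [0.3, 1.0]]   # utility g(s, a)
rho = [1.0, 0.0]               # always start in state 0
gamma, b = 0.9, 4.0
uniform = [[0.5, 0.5], [0.5, 0.5]]

v_r = policy_value(P, r, uniform, rho, gamma)
v_g = policy_value(P, g, uniform, rho, gamma)
feasible = v_g >= b            # does the uniform policy satisfy V_g >= b?
```

Iterating the backup is chosen here only to avoid a linear-algebra dependency; solving the linear system $(I - γP_π)v = r_π$ directly would give the same values.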

Model Architecture

1. Lagrangian Dualization

The constrained optimization problem is recast as a saddle-point problem: $\max_{π ∈ Π} \min_{λ ≥ 0} V^π_r(ρ) + λ(V^π_g(ρ) - b)$
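To see why the saddle-point form recovers the constrained problem, it may help to evaluate the inner minimization for a fixed policy. The case split below is a standard Lagrangian observation written in the notation above, not a statement quoted from the paper:

```latex
\min_{\lambda \ge 0} \Big[ V^{\pi}_r(\rho) + \lambda \big( V^{\pi}_g(\rho) - b \big) \Big]
=
\begin{cases}
V^{\pi}_r(\rho), & \text{if } V^{\pi}_g(\rho) \ge b \quad (\text{attained at } \lambda = 0),\\[2pt]
-\infty, & \text{if } V^{\pi}_g(\rho) < b \quad (\text{let } \lambda \to \infty).
\end{cases}
```

The outer maximization therefore never selects an infeasible policy, and on feasible policies it maximizes $V^π_r(ρ)$.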

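2. Core NPG-PD Updates

Primal update (natural policy gradient): $θ^{(t+1)} = θ^{(t)} + η_1 F^†_ρ(θ^{(t)})∇_θ V^{θ^{(t)},λ^{(t)}}_L(ρ)$

Dual update (projected subgradient descent): $λ^{(t+1)} = P_Λ\left(λ^{(t)} - η_2(V^{θ^{(t)}}_g(ρ) - b)\right)$

where:

$V^{θ,λ}_L(ρ)$: the Lagrangian $V^θ_r(ρ) + λ(V^θ_g(ρ) - b)$ from the saddle-point problem
$F^†_ρ(θ)$: Moore-Penrose inverse of the Fisher information matrix
$P_Λ$: projection onto the interval $[0, 2/((1-γ)ξ)]$, where $ξ > 0$ is the slack in the Slater condition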
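The dual update is a one-dimensional step followed by clipping, which the following minimal sketch spells out. The function name `dual_step` and its argument names are hypothetical; the value `v_g` is assumed to be supplied by some policy-evaluation routine, exact or estimated.

```python
def dual_step(lam, v_g, b, eta2, gamma, xi):
    """One projected subgradient step on the dual variable.

    lam  : current multiplier lambda^(t)
    v_g  : utility value V_g(rho) of the current policy (assumed given)
    b    : constraint threshold
    eta2 : dual step size
    xi   : slack constant that sets the upper end of the interval
    """
    lam_max = 2.0 / ((1.0 - gamma) * xi)   # upper end of the projection interval
    raw = lam - eta2 * (v_g - b)           # lambda grows when V_g < b
    return min(max(raw, 0.0), lam_max)     # clip to [0, lam_max]
```

When the constraint is violated ($V_g(ρ) < b$) the multiplier increases, which puts more weight on the utility in the next primal step; when the constraint holds with room to spare, the multiplier shrinks toward zero.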

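3. Simplified Form under the Softmax Policy Parametrization

Under the softmax parametrization $π_θ(a|s) = \frac{\exp(θ_{s,a})}{\sum_{a'} \exp(θ_{s,a'})}$, the update simplifies to:

$θ^{(t+1)}_{s,a} = θ^{(t)}_{s,a} + \frac{η_1}{1-γ}A^{(t)}_L(s,a)$

which is equivalent to the multiplicative weights update: $π^{(t+1)}(a|s) = \frac{π^{(t)}(a|s)\exp\left(\frac{η_1}{1-γ}A^{(t)}_L(s,a)\right)}{Z^{(t)}(s)}$, where $A^{(t)}_L$ is the advantage function associated with the Lagrangian and $Z^{(t)}(s)$ normalizes $π^{(t+1)}(\cdot|s)$ to sum to one.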
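The equivalence between the additive update on $θ$ and the multiplicative weights update on $π$ can be checked numerically for a single state. This is a minimal sketch: the helper names are hypothetical, and the advantage values are made-up numbers standing in for $A^{(t)}_L(s,\cdot)$.

```python
import math


def softmax(theta_s):
    """Softmax over the action logits of one state (shifted for stability)."""
    m = max(theta_s)
    exps = [math.exp(x - m) for x in theta_s]
    z = sum(exps)
    return [e / z for e in exps]


def npg_softmax_step(theta_s, adv_s, eta1, gamma):
    """Additive form: theta <- theta + eta1 / (1 - gamma) * A_L."""
    step = eta1 / (1.0 - gamma)
    return [th + step * adv for th, adv in zip(theta_s, adv_s)]


def mw_step(pi_s, adv_s, eta1, gamma):
    """Multiplicative weights form: pi <- pi * exp(eta1 / (1 - gamma) * A_L) / Z."""
    step = eta1 / (1.0 - gamma)
    weights = [p * math.exp(step * a) for p, a in zip(pi_s, adv_s)]
    z = sum(weights)                        # normalizer Z(s)
    return [w / z for w in weights]


theta = [0.1, -0.3, 0.5]                    # logits for one state, three actions
adv = [0.2, -0.1, 0.05]                     # placeholder Lagrangian advantages
eta1, gamma = 1.0, 0.9

via_theta = softmax(npg_softmax_step(theta, adv, eta1, gamma))
via_mw = mw_step(softmax(theta), adv, eta1, gamma)
max_diff = max(abs(x - y) for x, y in zip(via_theta, via_mw))
```

Both routes give the same distribution up to floating-point error, because adding a term to a logit multiplies the corresponding unnormalized probability by its exponential.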

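Technical Innovations

Dimension-free convergence: exploits the softmax structure to obtain convergence rates independent of the size of the state-action space
Handling the nonconvex constraint: a new primal-dual analysis deals with the nonconvex constraint set
Function approximation error decomposition: introduces a decomposition into estimation error and transfer error
Regret-type analysis: adopts regret analysis techniques from online learning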

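Theoretical Results

Main Convergence Theorems

Theorem 10 (global convergence under the softmax parametrization): under the Slater condition, with $η_1 = 2\log|\mathcal{A}|$ and $η_2 = 2(1-γ)/\sqrt{T}$, the NPG-PD algorithm satisfies:

Optimality gap: $\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ)) ≤ \frac{7}{(1-γ)^2}\frac{1}{\sqrt{T}}$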

Constraint violation: $\left[\frac{1}{T}\sum_{t=0}^{T-1}(b - V^{(t)}_g(ρ))\right]_+ ≤ \frac{2/ξ + 4ξ}{(1-γ)^2}\frac{1}{\sqrt{T}}$
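As a worked example of how the optimality-gap bound translates into an iteration count, one can solve the bound for $T$. The target accuracy $\epsilon$ and the numerical values below are illustrative assumptions, not figures from the paper:

```latex
\frac{7}{(1-\gamma)^2}\,\frac{1}{\sqrt{T}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \frac{49}{(1-\gamma)^4\,\epsilon^2}.
% Example: \gamma = 0.9 and \epsilon = 1 give (1-\gamma)^4 = 10^{-4}, so
% T \ge 49 / 10^{-4} = 4.9 \times 10^{5} iterations.
```

The resulting count depends on the discount factor and the accuracy only, and not on $|\mathcal{S}|$ or $|\mathcal{A}|$, which is the sense in which the rate is dimension-free.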

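Function Approximation Setting

Theorem 16 (log-linear parametrization): in the function approximation setting, the convergence rate is: $E\left[\frac{1}{T}\sum_{t=0}^{T-1}(V^*_r(ρ) - V^{(t)}_r(ρ))\right] ≤ \frac{C_3}{(1-γ)^5}\frac{1}{\sqrt{T}} + \text{function approximation error}$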