Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
Ding, Zhang, Duan et al.
We study the sequential decision making problem of maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization, we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is~dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. We use a set of computational experiments to showcase the effectiveness of our approach.
academic
Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs
This paper investigates the sequential decision-making problem of maximizing expected cumulative rewards subject to expected cumulative utility constraints. The authors employ natural policy gradient methods to solve the discounted infinite-horizon optimal control problem for constrained Markov Decision Processes (constrained MDPs). Specifically, they propose a novel Natural Policy Gradient Primal-Dual (NPG-PD) method that updates primal variables via natural policy gradient ascent and dual variables via projected subgradient descent. Despite the underlying maximization problem involving non-concave objectives and non-convex constraint sets, the method achieves sublinear global convergence rates in both optimality gap and constraint violation under softmax policy parameterization, independent of the state-action space size (dimension-free). Furthermore, for log-linear and general smooth policy parameterizations, sublinear convergence rates are established up to function approximation errors induced by restricted policy parameterization.
Theorem 10 (Global Convergence under Softmax Parameterization):
Under Slater condition, with η1=2log∣A∣, η2=2(1−γ)/T, the NPG-PD algorithm satisfies:
Theorem 16 (Log-Linear Parameterization):
Under function approximation setting, the convergence rate is:
E[T1∑t=0T−1(Vr∗(ρ)−Vr(t)(ρ))]≤(1−γ)5C3T1+Function Approximation Error
Lemma 11 (Non-Monotonic Improvement): Each primal variable update improves the Lagrangian term, but reward and utility functions themselves may be non-monotonic.
Lemma 12 (Bounded Average Performance): Establishes average performance bounds through regret analysis.
Multiplicative Weight Update Connection: Interprets NPG update as Multiplicative Weights Update in online learning
Fisher Information Matrix Inverse: Leverages softmax structure to simplify NPG computation
Strong Duality: Establishes strong duality relationship under Slater condition
Constraint Violation Bound: Bounds constraint violation through convex analysis techniques
This paper makes important contributions to constrained reinforcement learning theory, providing a solid theoretical foundation and practical algorithmic framework for the field's development.