Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Hu, Chen, Huang

Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions, through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2 T$ in the static-reward case.

Basic Information

Paper ID: 2510.12334
Title: Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Authors: Rui Hu, Yu Chen, Longbo Huang (IIIS, Tsinghua University)
Categories: cs.LG (Machine Learning), cs.AI (Artificial Intelligence)
Published: October 14, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.12334v1

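Abstract

Many popular reinforcement learning algorithms use evolving reward functions, through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm with an evolving reward function under Markovian sampling. The setting allows the reward parameters to change at every time step, which affects both policy optimization and value estimation. Under standard assumptions, the paper derives non-asymptotic bounds on the actor and critic errors. The results show that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. The rate is preserved when the reward is updated by a gradient-based rule with bounded gradient on the same timescale as the actor and critic, which gives a theoretical foundation for many popular RL techniques.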

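Research Background and Motivation

Problem Background

Gap between theory and practice: RL theory is usually built on Markov decision processes (MDPs) with a static reward function, whereas practical applications make wide use of evolving-reward techniques.
Prevalence of evolving rewards: Practical RL algorithms routinely use reward shaping, entropy regularization, curriculum learning, and similar techniques to improve learning.
Design challenge: In realistic scenarios it is difficult to design a reward function that is both learnable and aligned with the intended task.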

Core Question

How fast can the reward function change while the RL algorithm is still guaranteed to converge?

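Limitations of Existing Approaches

Existing theoretical analyses focus mainly on the static-reward setting.
There are no theoretical convergence guarantees for actor-critic algorithms under evolving rewards.
The analysis of distribution mismatch under Markovian sampling leaves room for improvement.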

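Core Contributions

First theoretical analysis: Provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm under evolving rewards.
Convergence-rate guarantee: Proves that an $O(1/\sqrt{T})$ convergence rate is achievable when the reward parameters evolve slowly enough, matching the static-reward case.
Practical relevance: Shows that gradient-based reward update rules with bounded gradient satisfy the convergence condition, which gives theoretical support to RL techniques used in practice.
Technical improvement: Introduces a new analysis of distribution mismatch under Markovian sampling that improves the best-known rate in the static-reward case by a factor of $\log^2 T$.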

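Method Details

Task Definition

The paper studies an infinite-horizon discounted Markov decision process $M = (S,A,P,r,\gamma)$ in which the reward function $r$ may evolve over time. The goal is to analyze the convergence of the actor-critic algorithm in this evolving-reward setting.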

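Model Architecture

1. Evolving Reward Framework

A general reward parameter $\phi$ is introduced that collects all factors determining the regularized reward $\tilde{r}_{\phi,\theta}(s,a)$, which with entropy regularization takes the form: $\tilde{r}_{\phi,\theta}(s,a) = r(s,a) - \alpha \log \pi_\theta(a|s)$

where $\alpha \geq 0$ is the entropy regularization coefficient.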

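2. Actor-Critic Update Rules

Actor update: $\theta_{t+1} \leftarrow \theta_t + \eta_t^\theta \hat{\delta}_t \nabla_\theta \log \pi_{\theta_t}(a_t|s_t)$

Critic update: $\omega_{t+1} \leftarrow \text{Proj}_{C_\omega}(\omega_t + \eta_t^\omega \hat{\delta}_t \phi(s_t))$

where the temporal-difference (TD) error is: $\hat{\delta}_t = \tilde{r}_{\phi_t,\theta_t}(s_t,a_t) + (\gamma\phi(s'_t) - \phi(s_t))^\top \omega_t$. Note the overloaded symbol: $\phi(s)$ with a state argument is the critic's feature vector, whereas the subscript $\phi_t$ is the reward parameter at step $t$.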
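
The two updates and the TD error above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tabular softmax policy, the function names `softmax_policy` and `ac_step`, and the choice of an L2 ball of radius `radius` as the projection set $C_\omega$ are all assumptions made for the example.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax policy; theta has shape (n_states, n_actions)."""
    logits = theta[s] - theta[s].max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def ac_step(theta, omega, features, s, a, s_next, r, alpha, gamma,
            eta_theta, eta_omega, radius):
    """One single-timescale actor-critic update with a linear critic."""
    pi = softmax_policy(theta, s)
    # Entropy-regularized reward: r(s, a) - alpha * log pi(a | s).
    r_tilde = r - alpha * np.log(pi[a])
    # TD error with the linear value estimate V(s) = features[s] @ omega.
    delta = r_tilde + (gamma * features[s_next] - features[s]) @ omega
    # Score function of a tabular softmax: (e_a - pi) in row s, zero elsewhere.
    grad_log = np.zeros_like(theta)
    grad_log[s] = -pi
    grad_log[s, a] += 1.0
    theta_new = theta + eta_theta * delta * grad_log
    # Critic step, then projection onto the assumed L2 ball.
    omega_new = omega + eta_omega * delta * features[s]
    norm = np.linalg.norm(omega_new)
    if norm > radius:
        omega_new *= radius / norm
    return theta_new, omega_new, delta
```

Both parameters are moved by the same TD error $\hat{\delta}_t$, and in the single-timescale setting `eta_theta` and `eta_omega` decay at the same rate and differ only by a constant factor.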

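3. Markovian Sampling Scheme

Sampling uses the kernel $\hat{P}(\cdot|s,a) = \gamma P(\cdot|s,a) + (1-\gamma)\rho(\cdot)$ to ensure ergodicity: with probability $\gamma$ the next state follows the true transition $P$, and with probability $1-\gamma$ it is redrawn from $\rho$.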
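
A minimal sketch of this sampling kernel for a finite MDP follows. The array layout (`P` of shape `(n_states, n_actions, n_states)`, `rho` of shape `(n_states,)`) and the function names are assumptions made for illustration.

```python
import numpy as np

def sample_next_state(P, rho, s, a, gamma, rng):
    """Draw s' from gamma * P(.|s, a) + (1 - gamma) * rho(.)."""
    if rng.random() < gamma:
        # Follow the environment transition with probability gamma.
        return rng.choice(len(rho), p=P[s, a])
    # Otherwise restart from rho with probability 1 - gamma.
    return rng.choice(len(rho), p=rho)

def mixed_kernel(P, rho, gamma):
    """Explicit mixture kernel, same shape as P."""
    return gamma * P + (1.0 - gamma) * rho[None, None, :]
```

Every entry of the mixed kernel is at least $(1-\gamma)\rho(s')$, so any state in the support of $\rho$ is reachable in one step from every state-action pair.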

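Technical Innovations

1. Lipschitz Continuity Analysis under Evolving Rewards

Establishes Lipschitz continuity of the policy objective $J_\phi(\theta)$ and of the optimal critic parameter $\omega^*(\phi,\theta)$ with respect to the reward parameter $\phi$:

$J_\phi(\theta)$ is $D_J$-Lipschitz in $\phi$
$\omega^*(\phi,\theta)$ is $D_\omega$-Lipschitz in $\phi$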

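2. Novel Distribution Mismatch Analysis

The key result, Proposition 4.8, directly exploits the contraction property of the induced operator on state distributions: $E\|\hat{\nu}_t - \nu_\rho^{\pi_{\theta_t}}\|_1 \leq LC_\delta L_\nu \sum_{k=0}^{t-1} \gamma^{t-1-k}\eta_k^\theta + \gamma^t\|\rho - \nu_\rho^{\pi_{\theta_0}}\|_1$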
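
The size of the discounted sum in this bound can be checked numerically. The sketch below assumes the indexing $\eta_k^\theta = c_\theta/\sqrt{k+1}$ (shifted by one so that $k = 0$ is well defined); the function name is hypothetical.

```python
import numpy as np

def discounted_step_sum(t, gamma, c_theta):
    """Sum over k = 0..t-1 of gamma**(t-1-k) * c_theta / sqrt(k + 1)."""
    k = np.arange(t)
    eta = c_theta / np.sqrt(k + 1)
    return float(np.sum(gamma ** (t - 1 - k) * eta))

# Compare against the most recent step size divided by (1 - gamma).
for t in (100, 10_000):
    print(t, discounted_step_sum(t, 0.9, 1.0), (1.0 / np.sqrt(t)) / (1 - 0.9))
```

For large $t$ the sum stays close to $\eta_t^\theta/(1-\gamma)$, so under this step-size schedule the first term of the bound decays like $1/\sqrt{t}$ with no extra logarithmic factor in $t$.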

3. Solving the System of Inequalities

The actor and critic errors are decoupled through the algebraic inequality $2\sqrt{G_T W_T} \leq \frac{1-\gamma}{2L}G_T + \frac{2L}{1-\gamma}W_T$.
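
For completeness, this inequality is an instance of the weighted AM-GM (Young) inequality. For any $c > 0$ and nonnegative $G_T$, $W_T$:

$$0 \leq \left(\sqrt{c\,G_T} - \sqrt{W_T/c}\right)^2 = c\,G_T - 2\sqrt{G_T W_T} + \frac{W_T}{c} \quad\Longrightarrow\quad 2\sqrt{G_T W_T} \leq c\,G_T + \frac{1}{c}\,W_T$$

Choosing $c = \frac{1-\gamma}{2L}$ gives $\frac{1}{c} = \frac{2L}{1-\gamma}$, which is exactly the stated form. The cross term $\sqrt{G_T W_T}$ is thereby split into one part proportional to the actor error and one part proportional to the critic error.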

Experimental Setup

Theoretical Analysis Framework

The paper is primarily a theoretical analysis and uses the following setup:

Evaluation Metrics

Actor error: $G_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\nabla_\theta J_{\phi_t}(\theta_t)\|_2^2$
Critic error: $W_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\omega_t - \omega_t^*\|_2^2$
Reward variation: $F_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\phi_{t+1} - \phi_t\|_2^2$
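
All three metrics are averages over the second half of the run. A minimal sketch of that tail average on logged data from a single run is given below; the helper names and the `(T + 1, d)` layout of the logged reward parameters are assumptions, and the expectation $E$ would additionally require averaging over independent runs.

```python
import numpy as np

def tail_average(sq_norms):
    """Average of entries t = T/2, ..., T-1 of a length-T sequence (T even)."""
    T = len(sq_norms)
    return float(np.mean(sq_norms[T // 2:]))

def reward_drift(phis):
    """Single-run estimate of F_T from reward parameters phi_0, ..., phi_T."""
    diffs = np.diff(phis, axis=0)  # row t holds phi_{t+1} - phi_t
    return tail_average(np.sum(diffs ** 2, axis=1))
```

The same `tail_average` helper applies to logged values of $\|\nabla_\theta J_{\phi_t}(\theta_t)\|_2^2$ and $\|\omega_t - \omega_t^*\|_2^2$ when those quantities can be computed, for example in a small tabular problem.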

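Key Assumptions

Sufficient exploration (Assumption 4.1): For any $\theta \in \Omega(\theta)$, $A_\theta$ is negative definite and its eigenvalues are upper bounded by $-\lambda$.
Policy Lipschitz continuity (Assumption 4.3): $\|\nabla_\theta \log \pi_\theta(a|s)\|_2 \leq L$
Lipschitz continuity of the regularized reward (Assumption 4.5): the regularized reward is Lipschitz in $\phi$ with constant $D$.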

Experimental Results

Main Theoretical Results

Theorem 4.6 (Main Convergence Theorem)

With step sizes $\eta_t^\theta = \frac{c_\theta}{\sqrt{t}}$ and $\eta_t^\omega = \frac{c_\omega}{\sqrt{t}}$ satisfying $\frac{c_\theta}{c_\omega} \leq \frac{\lambda}{LS_\omega} \wedge \frac{1}{16LL_\omega}$:

$G_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)$

$W_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)$

Corollary 4.7 (Gradient-Based Update Rule)

When the reward parameter follows the gradient-based update $\phi_{t+1} \leftarrow \phi_t + \eta_t^\phi h_\phi(t)$ with $E\|h_\phi(t)\|_2^2 \leq C_\phi^2$ and a step size on the same timescale as the actor and critic, $\eta_t^\phi = \frac{c_\phi}{\sqrt{t}}$:

$F_T = O\left(\frac{1}{T}\right) \Rightarrow G_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon), \quad W_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon)$
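
A short calculation shows where the $O(1/T)$ bound on $F_T$ comes from, assuming the reward step size is on the same timescale as the actor and critic, $\eta_t^\phi = c_\phi/\sqrt{t}$, as the abstract describes. For every $t \geq T/2$:

$$E\|\phi_{t+1} - \phi_t\|_2^2 = (\eta_t^\phi)^2\, E\|h_\phi(t)\|_2^2 \leq \frac{c_\phi^2 C_\phi^2}{t} \leq \frac{2 c_\phi^2 C_\phi^2}{T}$$

Averaging over $t = T/2, \dots, T-1$ gives $F_T \leq 2 c_\phi^2 C_\phi^2 / T$. Substituting into the bounds of Theorem 4.6 yields $F_T\sqrt{T} = O(1/\sqrt{T})$ and $\sqrt{F_T/T} = O(1/T)$, so both reward-dependent terms are absorbed into the leading $O(1/\sqrt{T})$ term.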