2025-11-10T02:30:58.102691

Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Hu, Chen, Huang

Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.

academic

Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Basic Information

Paper ID: 2510.12334
Title: Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Authors: Rui Hu, Yu Chen, Longbo Huang (Tsinghua University IIIS)
Classification: cs.LG (Machine Learning), cs.AI (Artificial Intelligence)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12334v1

Abstract

Many popular reinforcement learning algorithms employ evolving reward functions through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain incomplete. This paper provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving reward functions under Markovian sampling. The study considers settings where reward parameters may change at each time step, simultaneously affecting both policy optimization and value estimation. Under standard assumptions, non-asymptotic bounds on Actor and Critic errors are derived. The results demonstrate that an $O(1/\sqrt{T})$ convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the best-known rate for static rewards. This convergence rate is maintained when rewards are updated via gradient-based rules on the same timescale as the Actor and Critic, providing theoretical foundations for many popular reinforcement learning techniques.

Research Background and Motivation

Problem Background

Theory-Practice Gap: Reinforcement learning theory is typically established on Markov Decision Processes (MDPs) with static reward functions, while practical applications widely employ evolving reward techniques
Prevalence of Evolving Rewards: Practical RL algorithms commonly use reward shaping, entropy regularization, and curriculum learning to improve learning performance
Design Challenges: Designing reward functions that are both learnable and aligned with desired tasks presents significant difficulties in real-world scenarios

Core Research Question

How rapidly can reward functions evolve while still guaranteeing convergence of RL algorithms?

Limitations of Existing Approaches

Existing theoretical analyses primarily focus on static reward settings
Lack of theoretical convergence guarantees for Actor-Critic algorithms under evolving rewards
Distribution mismatch analysis under Markovian sampling requires improvement

Core Contributions

Pioneering Theoretical Analysis: Provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving rewards
Convergence Rate Guarantee: Proves that an $O(1/\sqrt{T})$ convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the static reward case
Practical Validation: Demonstrates that gradient-based reward update rules satisfy convergence conditions, providing theoretical support for practical RL techniques
Technical Improvements: Introduces novel distribution mismatch analysis under Markovian sampling, improving the convergence rate by a factor of $\log^2 T$ compared to the static reward case

Methodology Details

Task Definition

The paper studies infinite-horizon discounted Markov Decision Processes $M = (S,A,P,r,\gamma)$ where the reward function $r$ may evolve over time. The objective is to analyze Actor-Critic algorithm convergence under evolving reward settings.

Model Architecture

1. Evolving Reward Framework

Introduces a generic reward parameter $\phi$ encompassing all factors determining the regularized reward $\tilde{r}_{\phi,\theta}(s,a)$ : $\tilde{r}_{\phi,\theta}(s,a) = r(s,a) - \alpha \log \pi_\theta(a|s)$

where $\alpha \geq 0$ is the entropy regularization parameter.

2. Actor-Critic Update Rules

Actor Update: $\theta_{t+1} \leftarrow \theta_t + \eta_t^\theta \hat{\delta}_t \nabla_\theta \log \pi_\theta(a_t|s_t)$

Critic Update: $\omega_{t+1} \leftarrow \text{Proj}_{C_\omega}(\omega_t + \eta_t^\omega \hat{\delta}_t \phi(s_t))$

where the temporal difference error is: $\hat{\delta}_t = \tilde{r}_{\phi_t,\theta_t}(s_t,a_t) + (\gamma\phi(s'_t) - \phi(s_t))^\top \omega_t$

3. Markovian Sampling Strategy

Employs a sampling kernel $\hat{P}(\cdot|s,a) = \gamma P(\cdot|s,a) + (1-\gamma)\rho(\cdot)$ to ensure ergodicity.

Technical Innovations

1. Lipschitz Continuity Analysis of Evolving Rewards

Establishes Lipschitz continuity of the policy objective $J_\phi(\theta)$ and optimal Critic parameters $\omega^*(\phi,\theta)$ with respect to reward parameter $\phi$ :

$J_\phi(\theta)$ is $D_J$ -Lipschitz with respect to $\phi$
$\omega^*(\phi,\theta)$ is $D_\omega$ -Lipschitz with respect to $\phi$

2. Novel Distribution Mismatch Analysis

Proposes key Proposition 4.8, directly exploiting the contraction property of the induced operator on state distributions: $E\|\hat{\nu}_t - \nu_\rho^{\pi_{\theta_t}}\|_1 \leq LC_\delta L_\nu \sum_{k=0}^{t-1} \gamma^{t-1-k}\eta_k^\theta + \gamma^t\|\rho - \nu_\rho^{\pi_{\theta_0}}\|_1$

3. Systematic Inequality Resolution

Decouples Actor and Critic errors through the algebraic inequality $2\sqrt{G_T W_T} \leq \frac{1-\gamma}{2L}G_T + \frac{2L}{1-\gamma}W_T$ .

Experimental Setup

Theoretical Analysis Framework

This paper primarily conducts theoretical analysis with the following setup:

Evaluation Metrics

Actor Error: $G_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\nabla_\theta J_{\phi_t}(\theta_t)\|_2^2$
Critic Error: $W_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\omega_t - \omega_t^*\|_2^2$
Reward Variation: $F_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\phi_{t+1} - \phi_t\|_2^2$

Key Assumptions

Sufficient Exploration (Assumption 4.1): For any $\theta \in \Omega(\theta)$ , $A_\theta$ is negative definite with singular values bounded by $-\lambda$
Policy Lipschitz Continuity (Assumption 4.3): $\|\nabla_\theta \log \pi_\theta(a|s)\|_2 \leq L$
Regularized Reward Lipschitz Continuity (Assumption 4.5): Lipschitz constant with respect to $\phi$ is $D$

Experimental Results

Main Theoretical Results

Theorem 4.6 (Main Convergence Theorem)

Under step sizes $\eta_t^\theta = \frac{c_\theta}{\sqrt{t}}$ and $\eta_t^\omega = \frac{c_\omega}{\sqrt{t}}$ with $\frac{c_\theta}{c_\omega} \leq \frac{\lambda}{LS_\omega} \wedge \frac{1}{16LL_\omega}$ :

$G_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)$

$W_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)$

Corollary 4.7 (Gradient Update Rule)

When reward parameters are updated via gradient rule $\phi_{t+1} \leftarrow \phi_t + \eta_t^\phi h_\phi(t)$ with $E\|h_\phi(t)\|_2^2 \leq C_\phi^2$ and $\eta_t^\phi = \frac{c_\phi}{t}$ :

$F_T = O\left(\frac{1}{T}\right) \Rightarrow G_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon), \quad W_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon)$

Key Findings

1. Convergence Conditions

Asymptotic Convergence: Requires $F_T = o(1/\sqrt{T})$
Maintaining $O(1/\sqrt{T})$ Convergence Rate: Requires $F_T = O(1/T)$

2. Improvement for Static Rewards

When $F_T \equiv 0$ , the algorithm achieves the standard $O(1/\sqrt{T})$ convergence rate, eliminating the $\log^2 T$ factor compared to prior work.

3. Practical Validation

Demonstrates that a broad range of practical techniques, including curiosity-driven reward shaping, random network distillation, and Soft Actor-Critic automatic entropy adjustment, satisfy the theoretical guarantee conditions.

Finite-time Analysis of Policy Gradient Methods

Agarwal et al. (2021), Mei et al. (2020): Convergence guarantees under exact gradient oracle assumptions
Liu et al. (2020), Ding et al. (2022): Sample complexity under stochastic settings

Finite-time Analysis of Actor-Critic Methods

Two-loop Settings: Yang et al. (2019), Kumar et al. (2023)
Two-timescale: Wu et al. (2020), Xu et al. (2020b)
Single-timescale: Chen et al. (2021), Olshevsky & Gharesifard (2023), Chen & Zhao (2025)

Evolving Reward Techniques

Reward Shaping: Ng et al. (1999), Pathak et al. (2017), Burda et al. (2019)
Entropy/KL Regularization: Haarnoja et al. (2018a,b), Jaques et al. (2019)
Curriculum Learning: Narvekar et al. (2020)

Conclusions and Discussion

Main Conclusions

Single-timescale Actor-Critic algorithms exhibit significant robustness to reward non-stationarity
The standard $O(1/\sqrt{T})$ convergence rate can be maintained when reward parameter evolution is controlled
Gradient-based reward updates satisfy theoretical guarantee conditions, providing theoretical foundations for practical success

Limitations

Analysis is restricted to linear function approximation for the Critic
Requires satisfaction of standard assumptions such as Lipschitz continuity
Reward change speed must be strictly controlled

Future Directions

Extension to nonlinear function approximation, particularly neural networks
Exploration of insights from theoretical findings for designing more effective and provably stable reward shaping algorithms
Analysis of reinforcement learning under dynamic objectives (evolving rewards, changing initial distributions, or transition probabilities)

In-depth Evaluation

Strengths

Pioneering Contribution: First theoretical analysis of Actor-Critic algorithms with evolving rewards
Technical Rigor: Complete proofs, reasonable assumptions, and in-depth analysis
Practical Value: Provides theoretical support for widely-used RL techniques
Methodological Innovation: Improvements in distribution mismatch analysis have independent value

Weaknesses

Limited Scope: Restricted to linear function approximation; practical applications predominantly use deep neural networks
Assumption Constraints: Lipschitz continuity and other assumptions may be difficult to verify in practice
Experimental Validation: Lacks numerical experiments to validate theoretical results

Impact

Theoretical Contribution: Fills a gap in theoretical analysis of evolving reward RL
Practical Guidance: Provides theoretical principles for algorithm design
Future Research: Establishes foundations for extension to more complex settings

Applicable Scenarios

RL algorithm design requiring theoretical guarantees
Theoretical analysis of reward shaping and curriculum learning
Convergence analysis of adaptive entropy regularization algorithms

References

The paper cites important works in reinforcement learning theoretical analysis, including:

Sutton & Barto (1998): Foundational RL theory
Chen et al. (2021), Olshevsky & Gharesifard (2023): Single-timescale Actor-Critic analysis
Haarnoja et al. (2018): Soft Actor-Critic algorithm
Pathak et al. (2017): Curiosity-driven exploration

Overall Assessment: This is a high-quality theoretical paper that provides the first rigorous convergence analysis of Actor-Critic algorithms with evolving rewards. While it has certain limitations in scope, its theoretical contributions are significant and provide important theoretical foundations for understanding and designing practical RL algorithms.