2025-11-10T02:30:58.102691

Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Hu, Chen, Huang
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.
academic

Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Basic Information

  • Paper ID: 2510.12334
  • Title: Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
  • Authors: Rui Hu, Yu Chen, Longbo Huang (Tsinghua University IIIS)
  • Classification: cs.LG (Machine Learning), cs.AI (Artificial Intelligence)
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12334v1

Abstract

Many popular reinforcement learning algorithms employ evolving reward functions through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain incomplete. This paper provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving reward functions under Markovian sampling. The study considers settings where reward parameters may change at each time step, simultaneously affecting both policy optimization and value estimation. Under standard assumptions, non-asymptotic bounds on Actor and Critic errors are derived. The results demonstrate that an O(1/T)O(1/\sqrt{T}) convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the best-known rate for static rewards. This convergence rate is maintained when rewards are updated via gradient-based rules on the same timescale as the Actor and Critic, providing theoretical foundations for many popular reinforcement learning techniques.

Research Background and Motivation

Problem Background

  1. Theory-Practice Gap: Reinforcement learning theory is typically established on Markov Decision Processes (MDPs) with static reward functions, while practical applications widely employ evolving reward techniques
  2. Prevalence of Evolving Rewards: Practical RL algorithms commonly use reward shaping, entropy regularization, and curriculum learning to improve learning performance
  3. Design Challenges: Designing reward functions that are both learnable and aligned with desired tasks presents significant difficulties in real-world scenarios

Core Research Question

How rapidly can reward functions evolve while still guaranteeing convergence of RL algorithms?

Limitations of Existing Approaches

  1. Existing theoretical analyses primarily focus on static reward settings
  2. Lack of theoretical convergence guarantees for Actor-Critic algorithms under evolving rewards
  3. Distribution mismatch analysis under Markovian sampling requires improvement

Core Contributions

  1. Pioneering Theoretical Analysis: Provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving rewards
  2. Convergence Rate Guarantee: Proves that an O(1/T)O(1/\sqrt{T}) convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the static reward case
  3. Practical Validation: Demonstrates that gradient-based reward update rules satisfy convergence conditions, providing theoretical support for practical RL techniques
  4. Technical Improvements: Introduces novel distribution mismatch analysis under Markovian sampling, improving the convergence rate by a factor of log2T\log^2 T compared to the static reward case

Methodology Details

Task Definition

The paper studies infinite-horizon discounted Markov Decision Processes M=(S,A,P,r,γ)M = (S,A,P,r,\gamma) where the reward function rr may evolve over time. The objective is to analyze Actor-Critic algorithm convergence under evolving reward settings.

Model Architecture

1. Evolving Reward Framework

Introduces a generic reward parameter ϕ\phi encompassing all factors determining the regularized reward r~ϕ,θ(s,a)\tilde{r}_{\phi,\theta}(s,a): r~ϕ,θ(s,a)=r(s,a)αlogπθ(as)\tilde{r}_{\phi,\theta}(s,a) = r(s,a) - \alpha \log \pi_\theta(a|s)

where α0\alpha \geq 0 is the entropy regularization parameter.

2. Actor-Critic Update Rules

Actor Update: θt+1θt+ηtθδ^tθlogπθ(atst)\theta_{t+1} \leftarrow \theta_t + \eta_t^\theta \hat{\delta}_t \nabla_\theta \log \pi_\theta(a_t|s_t)

Critic Update: ωt+1ProjCω(ωt+ηtωδ^tϕ(st))\omega_{t+1} \leftarrow \text{Proj}_{C_\omega}(\omega_t + \eta_t^\omega \hat{\delta}_t \phi(s_t))

where the temporal difference error is: δ^t=r~ϕt,θt(st,at)+(γϕ(st)ϕ(st))ωt\hat{\delta}_t = \tilde{r}_{\phi_t,\theta_t}(s_t,a_t) + (\gamma\phi(s'_t) - \phi(s_t))^\top \omega_t

3. Markovian Sampling Strategy

Employs a sampling kernel P^(s,a)=γP(s,a)+(1γ)ρ()\hat{P}(\cdot|s,a) = \gamma P(\cdot|s,a) + (1-\gamma)\rho(\cdot) to ensure ergodicity.

Technical Innovations

1. Lipschitz Continuity Analysis of Evolving Rewards

Establishes Lipschitz continuity of the policy objective Jϕ(θ)J_\phi(\theta) and optimal Critic parameters ω(ϕ,θ)\omega^*(\phi,\theta) with respect to reward parameter ϕ\phi:

  • Jϕ(θ)J_\phi(\theta) is DJD_J-Lipschitz with respect to ϕ\phi
  • ω(ϕ,θ)\omega^*(\phi,\theta) is DωD_\omega-Lipschitz with respect to ϕ\phi

2. Novel Distribution Mismatch Analysis

Proposes key Proposition 4.8, directly exploiting the contraction property of the induced operator on state distributions: Eν^tνρπθt1LCδLνk=0t1γt1kηkθ+γtρνρπθ01E\|\hat{\nu}_t - \nu_\rho^{\pi_{\theta_t}}\|_1 \leq LC_\delta L_\nu \sum_{k=0}^{t-1} \gamma^{t-1-k}\eta_k^\theta + \gamma^t\|\rho - \nu_\rho^{\pi_{\theta_0}}\|_1

3. Systematic Inequality Resolution

Decouples Actor and Critic errors through the algebraic inequality 2GTWT1γ2LGT+2L1γWT2\sqrt{G_T W_T} \leq \frac{1-\gamma}{2L}G_T + \frac{2L}{1-\gamma}W_T.

Experimental Setup

Theoretical Analysis Framework

This paper primarily conducts theoretical analysis with the following setup:

Evaluation Metrics

  • Actor Error: GT=1T/2t=T/2T1EθJϕt(θt)22G_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\nabla_\theta J_{\phi_t}(\theta_t)\|_2^2
  • Critic Error: WT=1T/2t=T/2T1Eωtωt22W_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\omega_t - \omega_t^*\|_2^2
  • Reward Variation: FT=1T/2t=T/2T1Eϕt+1ϕt22F_T = \frac{1}{T/2}\sum_{t=T/2}^{T-1} E\|\phi_{t+1} - \phi_t\|_2^2

Key Assumptions

  1. Sufficient Exploration (Assumption 4.1): For any θΩ(θ)\theta \in \Omega(\theta), AθA_\theta is negative definite with singular values bounded by λ-\lambda
  2. Policy Lipschitz Continuity (Assumption 4.3): θlogπθ(as)2L\|\nabla_\theta \log \pi_\theta(a|s)\|_2 \leq L
  3. Regularized Reward Lipschitz Continuity (Assumption 4.5): Lipschitz constant with respect to ϕ\phi is DD

Experimental Results

Main Theoretical Results

Theorem 4.6 (Main Convergence Theorem)

Under step sizes ηtθ=cθt\eta_t^\theta = \frac{c_\theta}{\sqrt{t}} and ηtω=cωt\eta_t^\omega = \frac{c_\omega}{\sqrt{t}} with cθcωλLSω116LLω\frac{c_\theta}{c_\omega} \leq \frac{\lambda}{LS_\omega} \wedge \frac{1}{16LL_\omega}:

GT=O(1T)+O(FTT)+O(FTT)+O(ϵ)G_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)

WT=O(1T)+O(FTT)+O(FTT)+O(ϵ)W_T = O\left(\frac{1}{\sqrt{T}}\right) + O\left(F_T\sqrt{T}\right) + O\left(\sqrt{\frac{F_T}{T}}\right) + O(\epsilon)

Corollary 4.7 (Gradient Update Rule)

When reward parameters are updated via gradient rule ϕt+1ϕt+ηtϕhϕ(t)\phi_{t+1} \leftarrow \phi_t + \eta_t^\phi h_\phi(t) with Ehϕ(t)22Cϕ2E\|h_\phi(t)\|_2^2 \leq C_\phi^2 and ηtϕ=cϕt\eta_t^\phi = \frac{c_\phi}{t}:

FT=O(1T)GT=O(1T)+O(ϵ),WT=O(1T)+O(ϵ)F_T = O\left(\frac{1}{T}\right) \Rightarrow G_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon), \quad W_T = O\left(\frac{1}{\sqrt{T}}\right) + O(\epsilon)

Key Findings

1. Convergence Conditions

  • Asymptotic Convergence: Requires FT=o(1/T)F_T = o(1/\sqrt{T})
  • Maintaining O(1/T)O(1/\sqrt{T}) Convergence Rate: Requires FT=O(1/T)F_T = O(1/T)

2. Improvement for Static Rewards

When FT0F_T \equiv 0, the algorithm achieves the standard O(1/T)O(1/\sqrt{T}) convergence rate, eliminating the log2T\log^2 T factor compared to prior work.

3. Practical Validation

Demonstrates that a broad range of practical techniques, including curiosity-driven reward shaping, random network distillation, and Soft Actor-Critic automatic entropy adjustment, satisfy the theoretical guarantee conditions.

Finite-time Analysis of Policy Gradient Methods

  • Agarwal et al. (2021), Mei et al. (2020): Convergence guarantees under exact gradient oracle assumptions
  • Liu et al. (2020), Ding et al. (2022): Sample complexity under stochastic settings

Finite-time Analysis of Actor-Critic Methods

  • Two-loop Settings: Yang et al. (2019), Kumar et al. (2023)
  • Two-timescale: Wu et al. (2020), Xu et al. (2020b)
  • Single-timescale: Chen et al. (2021), Olshevsky & Gharesifard (2023), Chen & Zhao (2025)

Evolving Reward Techniques

  • Reward Shaping: Ng et al. (1999), Pathak et al. (2017), Burda et al. (2019)
  • Entropy/KL Regularization: Haarnoja et al. (2018a,b), Jaques et al. (2019)
  • Curriculum Learning: Narvekar et al. (2020)

Conclusions and Discussion

Main Conclusions

  1. Single-timescale Actor-Critic algorithms exhibit significant robustness to reward non-stationarity
  2. The standard O(1/T)O(1/\sqrt{T}) convergence rate can be maintained when reward parameter evolution is controlled
  3. Gradient-based reward updates satisfy theoretical guarantee conditions, providing theoretical foundations for practical success

Limitations

  1. Analysis is restricted to linear function approximation for the Critic
  2. Requires satisfaction of standard assumptions such as Lipschitz continuity
  3. Reward change speed must be strictly controlled

Future Directions

  1. Extension to nonlinear function approximation, particularly neural networks
  2. Exploration of insights from theoretical findings for designing more effective and provably stable reward shaping algorithms
  3. Analysis of reinforcement learning under dynamic objectives (evolving rewards, changing initial distributions, or transition probabilities)

In-depth Evaluation

Strengths

  1. Pioneering Contribution: First theoretical analysis of Actor-Critic algorithms with evolving rewards
  2. Technical Rigor: Complete proofs, reasonable assumptions, and in-depth analysis
  3. Practical Value: Provides theoretical support for widely-used RL techniques
  4. Methodological Innovation: Improvements in distribution mismatch analysis have independent value

Weaknesses

  1. Limited Scope: Restricted to linear function approximation; practical applications predominantly use deep neural networks
  2. Assumption Constraints: Lipschitz continuity and other assumptions may be difficult to verify in practice
  3. Experimental Validation: Lacks numerical experiments to validate theoretical results

Impact

  1. Theoretical Contribution: Fills a gap in theoretical analysis of evolving reward RL
  2. Practical Guidance: Provides theoretical principles for algorithm design
  3. Future Research: Establishes foundations for extension to more complex settings

Applicable Scenarios

  1. RL algorithm design requiring theoretical guarantees
  2. Theoretical analysis of reward shaping and curriculum learning
  3. Convergence analysis of adaptive entropy regularization algorithms

References

The paper cites important works in reinforcement learning theoretical analysis, including:

  • Sutton & Barto (1998): Foundational RL theory
  • Chen et al. (2021), Olshevsky & Gharesifard (2023): Single-timescale Actor-Critic analysis
  • Haarnoja et al. (2018): Soft Actor-Critic algorithm
  • Pathak et al. (2017): Curiosity-driven exploration

Overall Assessment: This is a high-quality theoretical paper that provides the first rigorous convergence analysis of Actor-Critic algorithms with evolving rewards. While it has certain limitations in scope, its theoretical contributions are significant and provide important theoretical foundations for understanding and designing practical RL algorithms.