Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Hu, Chen, Huang
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.
academic
Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
Many popular reinforcement learning algorithms employ evolving reward functions through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain incomplete. This paper provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving reward functions under Markovian sampling. The study considers settings where reward parameters may change at each time step, simultaneously affecting both policy optimization and value estimation. Under standard assumptions, non-asymptotic bounds on Actor and Critic errors are derived. The results demonstrate that an O(1/T) convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the best-known rate for static rewards. This convergence rate is maintained when rewards are updated via gradient-based rules on the same timescale as the Actor and Critic, providing theoretical foundations for many popular reinforcement learning techniques.
Theory-Practice Gap: Reinforcement learning theory is typically established on Markov Decision Processes (MDPs) with static reward functions, while practical applications widely employ evolving reward techniques
Prevalence of Evolving Rewards: Practical RL algorithms commonly use reward shaping, entropy regularization, and curriculum learning to improve learning performance
Design Challenges: Designing reward functions that are both learnable and aligned with desired tasks presents significant difficulties in real-world scenarios
Pioneering Theoretical Analysis: Provides the first finite-time convergence analysis of single-timescale Actor-Critic algorithms with evolving rewards
Convergence Rate Guarantee: Proves that an O(1/T) convergence rate can be achieved when reward parameters evolve sufficiently slowly, matching the static reward case
Practical Validation: Demonstrates that gradient-based reward update rules satisfy convergence conditions, providing theoretical support for practical RL techniques
Technical Improvements: Introduces novel distribution mismatch analysis under Markovian sampling, improving the convergence rate by a factor of log2T compared to the static reward case
The paper studies infinite-horizon discounted Markov Decision Processes M=(S,A,P,r,γ) where the reward function r may evolve over time. The objective is to analyze Actor-Critic algorithm convergence under evolving reward settings.
Proposes key Proposition 4.8, directly exploiting the contraction property of the induced operator on state distributions:
E∥ν^t−νρπθt∥1≤LCδLν∑k=0t−1γt−1−kηkθ+γt∥ρ−νρπθ0∥1
Demonstrates that a broad range of practical techniques, including curiosity-driven reward shaping, random network distillation, and Soft Actor-Critic automatic entropy adjustment, satisfy the theoretical guarantee conditions.
Haarnoja et al. (2018): Soft Actor-Critic algorithm
Pathak et al. (2017): Curiosity-driven exploration
Overall Assessment: This is a high-quality theoretical paper that provides the first rigorous convergence analysis of Actor-Critic algorithms with evolving rewards. While it has certain limitations in scope, its theoretical contributions are significant and provide important theoretical foundations for understanding and designing practical RL algorithms.