2025-11-25T11:58:17.992104

Multi Timescale Stochastic Approximation: Stability and Convergence

Deb, Ganesh, Bhatnagar

This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of multi-timescale stochastic approximation (SA) iterates. It extends the existing results on one-timescale and two-timescale SA iterates to general $N$-timescale stochastic recursions, for any $N \geq 1$, using the ordinary differential equation (ODE) method. As an application, we study SA algorithms augmented with heavy-ball momentum in the context of Gradient Temporal Difference (GTD) learning. The added momentum introduces an auxiliary state evolving on an intermediate timescale, yielding a three-timescale recursion. We show that with appropriate momentum parameters, the scheme fits within our framework and converges almost surely to the same fixed point as baseline GTD. The stability and convergence of all iterates including the momentum state follow from our main results without ad hoc bounds. We then study off-policy actor-critic algorithms with a baseline learner, actor, and critic updated on separate timescales. In contrast to prior work, we eliminate projection steps from the actor update and instead use our framework to guarantee stability and almost sure convergence of all components. Finally, we extend the analysis to constrained policy optimization in the average reward setting, where the actor, critic, and dual variables evolve on three distinct timescales, and we verify that the resulting dynamics satisfy the conditions of our general theorem. These examples show how diverse reinforcement learning algorithms covering momentum acceleration, off-policy learning, and primal-dual methods-fit naturally into the proposed multi-timescale framework.

academic

Multi Timescale Stochastic Approximation: Stability and Convergence

Basic Information

Paper ID: 2112.03515
Title: Multi Timescale Stochastic Approximation: Stability and Convergence
Authors: Rohan Deb (University of Illinois, Urbana-Champaign), Swetha Ganesh (Purdue University, West Lafayette), Shalabh Bhatnagar (Indian Institute of Science, Bengaluru)
Classification: eess.SY cs.SY
Publication Date: October 16, 2025 (arXiv version)
Paper Link: https://arxiv.org/abs/2112.03515v3

Abstract

This paper presents the first sufficient conditions guaranteeing stability and almost sure convergence of multi-timescale stochastic approximation (SA) iterations. The work extends existing single-timescale and two-timescale SA results to general N-timescale stochastic recursions for arbitrary N≥1, employing the ordinary differential equation (ODE) method. As an application, the authors investigate SA algorithms with heavy-ball momentum enhancement in gradient temporal-difference (GTD) learning. The added momentum introduces auxiliary states evolving on an intermediate timescale, yielding three-timescale recursions. Under appropriate momentum parameters, the scheme is shown to conform to the framework and converge almost surely to the same fixed point as baseline GTD.

Research Background and Motivation

Problem Definition

Stochastic approximation algorithms constitute a class of iterative processes for finding function zeros when the true function is unknown but noisy observations are available. In numerous optimization and control problems under uncertainty, algorithms involving three or more timescale recursions naturally arise.

Research Significance

Practical Demand: In reinforcement learning, multi-timescale algorithms naturally emerge in actor-critic algorithms for constrained Markov decision processes, hierarchical reinforcement learning, and related scenarios
Theoretical Gap: Existing literature provides stability and convergence conditions only for single-timescale and two-timescale SA, lacking general theory for N>2 cases

Limitations of Existing Methods

Stability Assumptions: Existing two-timescale analyses assume iterations remain stable (bounded), a non-trivial requirement
Verification Difficulty: Only recently have conditions emerged to verify such stability requirements
Restricted Applicability: Cannot handle complex algorithms with three or more timescales

Research Motivation

Provide the first set of sufficient conditions ensuring stability and convergence of general N-timescale stochastic recursions, filling theoretical gaps and supporting analysis of complex reinforcement learning algorithms.

Core Contributions

Theoretical Breakthrough: Presents the first sufficient conditions guaranteeing stability and almost sure convergence of N-timescale SA iterations
Method Extension: Generalizes Borkar-Meyn single-timescale and Lakshminarayanan-Bhatnagar two-timescale results to arbitrary N≥1
Application Verification: Validates framework effectiveness in three important reinforcement learning scenarios:
- Gradient temporal-difference (GTD) learning with momentum
- Off-policy actor-critic algorithms
- Constrained policy optimization
Technical Innovation: Eliminates projection steps in actor updates, relying solely on convergence framework to ensure stability

Methodology Details

Task Definition

Consider N coupled stochastic recursions:

x^(j)_{n+1} = x^(j)_n + α^(j)_n (h^(j)(x^(1)_n, x^(2)_n, ..., x^(N)_n) + M^(j)_{n+1})

where j = 1, 2, ..., N, requiring:

Stability: sup_n ||x^(j)_n|| < ∞ a.s. ∀j
Convergence: x^(j)n → x^(j)* a.s. ∀j

Core Theoretical Framework

Basic Assumptions

(A:1) h^(j) is Lipschitz continuous
(A:2) {M^(j)_{n+1}} is a martingale difference sequence with bounded conditional expectations
(A:3) Step-size sequences satisfy:

α^(j)_n > 0, Σα^(j)_n = ∞
Σ(α^(j)_n)² < ∞
α^(j)_n/α^(j-1)_n → 0 (timescale separation)

Stability Condition (B.N.i)

For scaled functions h^(i)_c(x^(i), x^(i+1), ..., x^(N)) = h^(i)(cy^(1), cy^(2), ..., cy^(N))/c, require:

h^(i)c → h^(i)∞ uniformly
Limiting ODE ẋ^(i)(t) = h^(i)_∞(x^(i)(t), x^(i+1), ..., x^(N)) possesses unique globally asymptotically stable equilibrium
Equilibrium point map λ^(i)_∞ is Lipschitz continuous

Convergence Condition (C.N.i)

Global asymptotic stability of the original ODE system under hierarchical fixed point structure.

Main Theorem

Theorem 1: Under assumptions (A:1)-(A:3), (B.N.i), and (C.N.i), the N-timescale iteration converges to the hierarchical fixed point:

x^(j)_n → x^(j)_* = λ^(j:N-1)(x^(N)_*)

Proof Strategy

Stability-Convergence Decomposition: First prove convergence assuming boundedness, then establish stability
Timescale Cascade: Analyze behavior starting from the fastest timescale, progressively through each scale
Recursive Scaling Argument: Use scaled iteration comparison with original iteration to prove boundedness

Experimental Setup

Application Scenarios

1. GTD Learning with Momentum

Datasets: 5-State Random Walk, 7-state Boyan Chain
Algorithms: GTD2-M-3TS, TDC-M-3TS (three-timescale), GTD2-M-4TS, TDC-M-4TS (four-timescale)
Parameter Settings:
- 5-State RW: α=0.4, β=0.4, ϱ^(1)=0.5, ϱ^(2)=0.25
- Boyan Chain: α=0.35, β=0.35, ϱ^(1)=0.45, ϱ^(2)=0.35

2. Off-Policy Actor-Critic

Setup: Gibbs policy parameterization, importance sampling ratios
Update Rules: Critic (fastest), actor (intermediate), baseline (slowest) timescales

3. Constrained Actor-Critic

Problem: Average reward constrained optimization
Timescales: Critic (fastest), actor (intermediate), dual variable (slowest)

Evaluation Metrics

GTD: Root Mean Square Projected Bellman Error (RMSPBE)
Actor-Critic: Policy performance and convergence
Theoretical Verification: Almost sure convergence proof

Experimental Results

Main Findings

GTD Momentum Experiments

Performance Improvement: Momentum versions outperform vanilla counterparts on both MDPs
Convergence Verification: All algorithms converge to theoretically predicted fixed point θ* = -Ā^(-1)b̄
Timescale Comparison:
- GTD2: 4-TS scheme performs better
- TDC: 3-TS version performs better

Theoretical Verification

Stability: All three applications satisfy assumptions (B.N.i) and (C.N.i)
Convergence: Proved almost sure convergence to expected hierarchical fixed points
Projection-Free: Successfully eliminated projection operations in actor updates

Technical Findings

Momentum Effect: Heavy-ball momentum significantly improves empirical convergence speed of GTD algorithms
Framework Generality: Single theoretical framework successfully handles momentum acceleration, off-policy learning, and primal-dual methods
Practical Value: Provides practical tools for verifying convergence of complex multi-timescale algorithms

Theoretical Foundations

Single-Timescale SA: Borkar-Meyn (2000) ODE method and stability conditions
Two-Timescale SA: Borkar (1997) convergence, Lakshminarayanan-Bhatnagar (2017) stability
Reinforcement Learning Applications: Actor-critic algorithms, GTD methods, constrained MDPs

Advantages of This Work

Theoretical Completeness: First complete stability and convergence theory for N-timescale case
Practicality: Verifiable conditions applicable to actual algorithm design
Breadth of Applications: Covers momentum methods, off-policy learning, constrained optimization and other important domains

Conclusions and Discussion

Main Conclusions

Successfully establishes the first complete theoretical framework for N-timescale SA
Proves stability and convergence for three important reinforcement learning algorithm classes
Demonstrates theoretical feasibility of momentum techniques in temporal-difference learning

Limitations

Markovian Noise: Current framework limited to martingale difference noise, not addressing more general Markovian noise
Step-Size Requirements: Theoretical analysis requires square-summable step-sizes, though experiments show non-square-summable step-sizes also work
Finite-Time Analysis: Lacks quantitative convergence rate analysis

Future Directions

Extension to Markovian Noise: Generalize Ramaswamy-Bhatnagar single-timescale results
Set-Valued Mappings: Handle RL algorithms under partial observation/information
Convergence Rates: Develop weak convergence rate theory for N-timescale case
Finite-Time Behavior: Quantify finite-time performance gains of momentum algorithms

In-Depth Evaluation

Strengths

Theoretical Breakthrough: Fills important gap in multi-timescale SA theory with landmark significance
Rigorous Methodology: Sophisticated proof techniques using innovative recursive scaling arguments
Application Value: Three application scenarios possess important practical significance, demonstrating framework generality
Clear Presentation: Well-structured exposition progressing from 3-timescale to N-timescale for accessibility

Weaknesses

Assumption Limitations: i.i.d. sampling assumption is restrictive in practical RL contexts
Experimental Scale: Experiments relatively simple, lacking verification in large-scale complex environments
Computational Complexity: Does not discuss computational complexity of verifying theoretical conditions
Practical Guidance: Lacks concrete guidance on selecting timescale separation parameters

Impact

Theoretical Contribution: Provides solid theoretical foundation for multi-timescale algorithm design
Practical Value: Enables convergence analysis of complex RL algorithms
Inspirational: Likely to inspire further research in multi-timescale algorithms
Reproducibility: Theoretical results are verifiable with clear experimental setup

Applicable Scenarios

Constrained Reinforcement Learning: Scenarios requiring primal-dual updates
Hierarchical Reinforcement Learning: Multi-level decision-making requiring different timescales
Momentum Acceleration Methods: Provides theoretical support for momentum techniques in RL
Algorithm Design: Offers convergence verification tool for new multi-timescale algorithms

References

This paper primarily builds upon the following important works:

Borkar, V.S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint
Lakshminarayanan, C. & Bhatnagar, S. (2017). A stability criterion for two-timescale stochastic approximation schemes
Sutton, R. et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation

Overall Assessment: This is a theoretically significant paper that successfully resolves stability and convergence issues in multi-timescale stochastic approximation, providing powerful theoretical tools for analyzing complex algorithms in reinforcement learning and related fields. While there remains room for improvement in practical applicability assumptions, its theoretical contributions and methodological innovations have lasting impact.