2025-11-17T07:58:12.711519

Posterior Sampling for Continuing Environments

Xu, Dong, Van Roy
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $γ$-discounted return in that model. At each time, with probability $1-γ$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(τS \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $τ$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
academic

Posterior Sampling for Continuing Environments

Basic Information

  • Paper ID: 2211.15931
  • Title: Posterior Sampling for Continuing Environments
  • Authors: Wanqiao Xu (Stanford University), Shi Dong (Google DeepMind), Benjamin Van Roy (Stanford University)
  • Classification: cs.LG stat.ML
  • Conference: RLJ | RLC 2024
  • Paper Link: https://arxiv.org/abs/2211.15931

Abstract

This paper proposes a posterior sampling reinforcement learning algorithm for continuing environments (Continuing PSRL) that naturally integrates into scalable agent designs. The algorithm maintains a statistically sound model of the environment and follows a policy that maximizes γ-discounted returns within this model. At each time step, the algorithm resamples the environment model from its posterior distribution with probability 1-γ. By appropriately selecting a discount factor dependent on the time horizon T, a Bayesian regret bound of Õ(τS√AT) is established, where S is the number of environment states, A is the number of actions, and τ denotes the reward averaging time.

Research Background and Motivation

Core Problem

Existing posterior sampling reinforcement learning algorithms are primarily designed for episodic environments and rely on maintaining state-action visitation counts, making them unsuitable for complex continuing environments with high-dimensional state spaces.

Problem Significance

  1. Learning in continuing environments is a fundamental problem in reinforcement learning, yet existing randomized exploration methods are largely limited to episodic settings
  2. Scalability requirements: Traditional methods depend on state-action visitation counts, which are infeasible in complex environments
  3. Theoretical gap: Lack of rigorous theoretical analysis for continuing environments

Limitations of Existing Methods

  1. TSDE (Ouyang et al., 2017): Requires complex resampling criteria including visitation count doubling conditions, infeasible in large state spaces
  2. DS-PSRL (Theocharous et al., 2018): While avoiding visitation counts, its analysis relies on strong technical assumptions; regret grows linearly without these assumptions
  3. Traditional PSRL: Only applicable to episodic environments and cannot be directly extended to continuing settings

Research Motivation

Propose a simple, scalable, and theoretically rigorous posterior sampling algorithm for continuing environments that can:

  • Avoid maintaining state-action visitation counts
  • Naturally integrate into existing function approximation methods
  • Provide theoretical guarantees matching state-of-the-art methods

Core Contributions

  1. First scalable continuing PSRL algorithm: Proposes Continuing PSRL based on a simple randomization scheme, avoiding complex resampling criteria
  2. Rigorous theoretical analysis: Establishes a Bayesian regret bound of Õ(τS√AT), matching existing best results
  3. Scalability breakthrough: The algorithm naturally extends to high-dimensional state spaces and function approximation settings
  4. New perspective on discount factors: Reinterprets the discount factor as an algorithm design tool rather than an environment property, providing new insights into its role

Method Details

Task Definition

Consider a Markov Decision Process modeled by an unknown environment E = (A,S,ρ), where:

  • A is a finite action space with |A| = A
  • S is a finite state space with |S| = S
  • ρ is the state transition probability function
  • The reward function r : S × A → 0,1 is deterministic and known

The objective is to minimize cumulative regret: Regret(T,π):=t=0T1(λ,ERt+1)\text{Regret}(T,π) := \sum_{t=0}^{T-1}(λ_{*,E} - R_{t+1})

where λ_{*,E} is the optimal average reward.

Model Architecture

Pseudo-Episode Construction

The algorithm decomposes the infinite time horizon learning problem into pseudo-episodes of random length:

  • At each time step t, sample a binary indicator X_t
  • When X_t = 0, initiate a new pseudo-episode and resample the environment model
  • When X_t = 1, continue the current pseudo-episode

Discounted Value Function

For environment E and policy π, the γ-discounted value function is defined as: Vπ,Eγ:=E[h=0H1PπhrπE]=E[h=0γhPπhrπE]V^γ_{π,E} := \mathbb{E}\left[\sum_{h=0}^{H-1} P^h_π r_π | E\right] = \mathbb{E}\left[\sum_{h=0}^{∞} γ^h P^h_π r_π | E\right]

where H is the pseudo-episode length, following a geometric distribution.

Reward Averaging Time

A key concept is the reward averaging time τ_{π,E}, defined as the minimum value τ such that: Eπ[t=0T1Rt+1E,S0=s]Tλπ,E(s)τ\left|\mathbb{E}_π\left[\sum_{t=0}^{T-1} R_{t+1} | E, S_0 = s\right] - T \cdot λ_{π,E}(s)\right| \leq τ

Algorithm Procedure

Algorithm 1: Continuing PSRL

Input: Prior distribution f, discount factor γ, total learning time T
1. Initialize t=1, k=1, X₁=0
2. for t ≤ T:
3.   if Xₜ = 0:
4.     tₖ ← t
5.     Sample Eₖ ~ f(·|H_tₖ)
6.     Compute πₖ = π^γ_Eₖ
7.     k ← k+1
8.   Sample and execute Aₜ ~ πₖ(·|Sₜ)
9.   Observe Rₜ₊₁ and Sₜ₊₁
10.  t ← t+1
11.  Sample Xₜ₊₁ ~ Bernoulli(γ)

Technical Innovations

  1. Simple resampling mechanism: Uses only a Bernoulli random number generator, avoiding complex visitation count conditions
  2. Connection between discount factor and resampling probability: Set γ = 1-p, where p is the resampling probability
  3. Policy-independent resampling: Resampling criterion is independent of the policy, simplifying the analysis
  4. Time-varying discount factor: Allows the discount factor to grow over time, achieving sublinear regret

Experimental Setup

Datasets

  1. Tabular RiverSwim environment:
    • Chain structure with 6 states
    • Left-end state reward 0.005, right-end state reward 1.0
    • Optimal policy is to always swim right
  2. Continuous feature RiverSwim environment:
    • Similar structure but using pixel feature observations
    • Feature mapping: φ(s_t) = 1{x ≤ s_t} ∈ 0,1^N
    • Tests algorithm performance under function approximation

Evaluation Metrics

  • Cumulative regret
  • Average regret over time

Comparison Methods

  1. TSDE (Ouyang et al., 2017): Thompson sampling based on visitation counts
  2. DS-PSRL (Theocharous et al., 2018): Fixed-interval resampling scheme
  3. Random agent: As baseline
  4. DQN with ε-greedy: Comparison in continuous feature environments

Implementation Details

  • Prior distributions: Dirichlet distribution (transitions) and Normal-Gamma distribution (rewards)
  • Hyperparameters: Pseudo-count n=1, α=1/S, μ=σ²=1
  • Bootstrapped DQN used in continuous environments with γ=0.99

Experimental Results

Main Results

  1. Tabular environment:
    • Continuing PSRL performance comparable to TSDE, despite the latter directly optimizing average reward
    • Significantly outperforms DS-PSRL
    • Validates theoretically predicted sublinear regret growth
  2. Continuous feature environment:
    • Bootstrapped DQN with random resampling achieves sublinear regret
    • Clearly outperforms vanilla DQN with ε-greedy exploration
    • Demonstrates scalability in complex environments

Experimental Findings

  1. Effectiveness of simple resampling: Despite the simplicity of the resampling mechanism, performance is comparable to complex methods
  2. Scalability advantages: In high-dimensional state spaces, traditional count-dependent methods fail while this method remains effective
  3. Theory-practice consistency: Experimental results validate the correctness of theoretical analysis

Thompson Sampling and Posterior Sampling

  • Classical Thompson sampling: Originally developed for multi-armed bandit problems
  • PSRL extensions: Osband et al. extended it to reinforcement learning, primarily for episodic environments
  • Bootstrapped DQN: Uses ensemble methods to approximate posterior distributions

Exploration in Continuing Environments

  • TSDE: First Thompson sampling method for continuing environments, but relies on complex resampling criteria
  • DS-PSRL: Simplified resampling but requires strong technical assumptions
  • This work: First to provide simple and rigorous random resampling method

Theoretical Analysis

  • Existing work mostly considers γ-discounted regret; this paper focuses on undiscounted regret
  • Use of reward averaging time τ replaces the traditional span concept
  • Weakly communicating MDP assumption is the most general setting for finite-time regret bounds

Conclusions and Discussion

Main Conclusions

  1. Theoretical contribution: Establishes a regret bound of Õ(τS√AT), matching state-of-the-art results
  2. Algorithm simplicity: Requires only a Bernoulli random number generator for effective exploration
  3. Practical value: Algorithm can be directly integrated into existing deep reinforcement learning methods
  4. New perspective on discount factors: Reinterprets discount factors as algorithm design tools rather than environment properties

Limitations

  1. Theoretical assumptions: Requires assumptions of weakly communicating MDPs and bounded reward averaging time
  2. Prior dependence: Performance depends on reasonable prior distribution settings
  3. Parameter tuning: Selection of discount factor γ requires knowledge of time horizon T
  4. Experimental scope: Experiments primarily conducted in relatively simple environments

Future Directions

  1. Prior-free settings: Investigate adaptive methods not requiring prior knowledge of T
  2. More complex environments: Validate the method in larger-scale and more complex environments
  3. Theoretical improvements: Relax assumptions such as weak communicability
  4. Practical applications: Test algorithm performance in real-world application scenarios

In-Depth Evaluation

Strengths

  1. Theoretical rigor: Provides complete theoretical analysis and proofs, filling the theoretical gap for continuing environment PSRL
  2. Algorithm simplicity: Compared to existing methods, the resampling mechanism is extremely simple and easy to implement and understand
  3. Scalability: Naturally supports function approximation and high-dimensional state spaces with strong practical value
  4. Novel perspective: Reinterprets discount factors as algorithm design tools, providing new theoretical insights

Weaknesses

  1. Insufficient experimental depth: Experiments primarily conducted in simple environments, lacking validation in large-scale complex environments
  2. Parameter sensitivity: Selection of discount factor γ depends on problem parameters, potentially requiring careful tuning in practical applications
  3. Incomplete comparisons: Lacks comparison with some related exploration methods (e.g., UCB-type methods)
  4. Lack of real-world applications: Primarily theoretical and simple simulations, lacking validation in real-world scenarios

Impact

  1. Theoretical contribution: Provides a new theoretical framework for exploration in continuing environments
  2. Practical value: Algorithm simplicity makes it easy to adopt and extend
  3. Inspirational significance: New interpretation of discount factors may influence future algorithm design
  4. Reproducibility: Clear algorithm description and complete theoretical analysis ensure good reproducibility

Applicable Scenarios

  1. Continuing reinforcement learning: Environments requiring long-term interaction without natural episode structure
  2. High-dimensional state spaces: Complex environments where traditional count-based methods are inapplicable
  3. Online learning: Scenarios requiring continuous learning and adaptation during interaction
  4. Deep reinforcement learning: Can be integrated into existing deep RL frameworks

References

The paper cites important works in reinforcement learning, including:

  • Classical work on Thompson sampling (Thompson, 1933)
  • Pioneering work on PSRL (Osband et al., 2013)
  • Related research on continuing environments (Ouyang et al., 2017; Theocharous et al., 2018)
  • Important advances in deep reinforcement learning (Mnih et al., 2015)

Overall Assessment: This is a high-quality theoretical reinforcement learning paper that makes important contributions to posterior sampling methods in continuing environments. The algorithm design is elegant and simple, the theoretical analysis is rigorous and complete, and it provides new perspectives and tools for the field. While there is room for improvement in experimental validation, its theoretical value and practical potential are both outstanding.