2025-11-17T07:58:12.711519

Posterior Sampling for Continuing Environments

Xu, Dong, Van Roy

We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $Î³$-discounted return in that model. At each time, with probability $1-Î³$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(ÏS \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $Ï$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.

academic

Posterior Sampling for Continuing Environments

Basic Information

Paper ID: 2211.15931
Title: Posterior Sampling for Continuing Environments
Authors: Wanqiao Xu (Stanford University), Shi Dong (Google DeepMind), Benjamin Van Roy (Stanford University)
Classification: cs.LG stat.ML
Conference: RLJ | RLC 2024
Paper Link: https://arxiv.org/abs/2211.15931

Abstract

This paper proposes a posterior sampling reinforcement learning algorithm for continuing environments (Continuing PSRL) that naturally integrates into scalable agent designs. The algorithm maintains a statistically sound model of the environment and follows a policy that maximizes γ-discounted returns within this model. At each time step, the algorithm resamples the environment model from its posterior distribution with probability 1-γ. By appropriately selecting a discount factor dependent on the time horizon T, a Bayesian regret bound of Õ(τS√AT) is established, where S is the number of environment states, A is the number of actions, and τ denotes the reward averaging time.

Research Background and Motivation

Core Problem

Existing posterior sampling reinforcement learning algorithms are primarily designed for episodic environments and rely on maintaining state-action visitation counts, making them unsuitable for complex continuing environments with high-dimensional state spaces.

Problem Significance

Learning in continuing environments is a fundamental problem in reinforcement learning, yet existing randomized exploration methods are largely limited to episodic settings
Scalability requirements: Traditional methods depend on state-action visitation counts, which are infeasible in complex environments
Theoretical gap: Lack of rigorous theoretical analysis for continuing environments

Limitations of Existing Methods

TSDE (Ouyang et al., 2017): Requires complex resampling criteria including visitation count doubling conditions, infeasible in large state spaces
DS-PSRL (Theocharous et al., 2018): While avoiding visitation counts, its analysis relies on strong technical assumptions; regret grows linearly without these assumptions
Traditional PSRL: Only applicable to episodic environments and cannot be directly extended to continuing settings

Research Motivation

Propose a simple, scalable, and theoretically rigorous posterior sampling algorithm for continuing environments that can:

Avoid maintaining state-action visitation counts
Naturally integrate into existing function approximation methods
Provide theoretical guarantees matching state-of-the-art methods

Core Contributions

First scalable continuing PSRL algorithm: Proposes Continuing PSRL based on a simple randomization scheme, avoiding complex resampling criteria
Rigorous theoretical analysis: Establishes a Bayesian regret bound of Õ(τS√AT), matching existing best results
Scalability breakthrough: The algorithm naturally extends to high-dimensional state spaces and function approximation settings
New perspective on discount factors: Reinterprets the discount factor as an algorithm design tool rather than an environment property, providing new insights into its role

Method Details

Task Definition

Consider a Markov Decision Process modeled by an unknown environment E = (A,S,ρ), where:

A is a finite action space with |A| = A
S is a finite state space with |S| = S
ρ is the state transition probability function
The reward function r : S × A → 0,1 is deterministic and known

The objective is to minimize cumulative regret: $\text{Regret}(T,π) := \sum_{t=0}^{T-1}(λ_{*,E} - R_{t+1})$

where λ_{*,E} is the optimal average reward.

Model Architecture

Pseudo-Episode Construction

The algorithm decomposes the infinite time horizon learning problem into pseudo-episodes of random length:

At each time step t, sample a binary indicator X_t
When X_t = 0, initiate a new pseudo-episode and resample the environment model
When X_t = 1, continue the current pseudo-episode

Discounted Value Function

For environment E and policy π, the γ-discounted value function is defined as: $V^γ_{π,E} := \mathbb{E}\left[\sum_{h=0}^{H-1} P^h_π r_π | E\right] = \mathbb{E}\left[\sum_{h=0}^{∞} γ^h P^h_π r_π | E\right]$

where H is the pseudo-episode length, following a geometric distribution.

Reward Averaging Time

A key concept is the reward averaging time τ_{π,E}, defined as the minimum value τ such that: $\left|\mathbb{E}_π\left[\sum_{t=0}^{T-1} R_{t+1} | E, S_0 = s\right] - T \cdot λ_{π,E}(s)\right| \leq τ$

Algorithm Procedure

Algorithm 1: Continuing PSRL

Input: Prior distribution f, discount factor γ, total learning time T
1. Initialize t=1, k=1, X₁=0
2. for t ≤ T:
3.   if Xₜ = 0:
4.     tₖ ← t
5.     Sample Eₖ ~ f(·|H_tₖ)
6.     Compute πₖ = π^γ_Eₖ
7.     k ← k+1
8.   Sample and execute Aₜ ~ πₖ(·|Sₜ)
9.   Observe Rₜ₊₁ and Sₜ₊₁
10.  t ← t+1
11.  Sample Xₜ₊₁ ~ Bernoulli(γ)

Technical Innovations

Simple resampling mechanism: Uses only a Bernoulli random number generator, avoiding complex visitation count conditions
Connection between discount factor and resampling probability: Set γ = 1-p, where p is the resampling probability
Policy-independent resampling: Resampling criterion is independent of the policy, simplifying the analysis
Time-varying discount factor: Allows the discount factor to grow over time, achieving sublinear regret

Experimental Setup

Datasets

Tabular RiverSwim environment:
- Chain structure with 6 states
- Left-end state reward 0.005, right-end state reward 1.0
- Optimal policy is to always swim right
Continuous feature RiverSwim environment:
- Similar structure but using pixel feature observations
- Feature mapping: φ(s_t) = 1{x ≤ s_t} ∈ 0,1^N
- Tests algorithm performance under function approximation

Evaluation Metrics

Cumulative regret
Average regret over time

Comparison Methods

TSDE (Ouyang et al., 2017): Thompson sampling based on visitation counts
DS-PSRL (Theocharous et al., 2018): Fixed-interval resampling scheme
Random agent: As baseline
DQN with ε-greedy: Comparison in continuous feature environments

Implementation Details

Prior distributions: Dirichlet distribution (transitions) and Normal-Gamma distribution (rewards)
Hyperparameters: Pseudo-count n=1, α=1/S, μ=σ²=1
Bootstrapped DQN used in continuous environments with γ=0.99

Experimental Results

Main Results

Tabular environment:
- Continuing PSRL performance comparable to TSDE, despite the latter directly optimizing average reward
- Significantly outperforms DS-PSRL
- Validates theoretically predicted sublinear regret growth
Continuous feature environment:
- Bootstrapped DQN with random resampling achieves sublinear regret
- Clearly outperforms vanilla DQN with ε-greedy exploration
- Demonstrates scalability in complex environments

Experimental Findings

Effectiveness of simple resampling: Despite the simplicity of the resampling mechanism, performance is comparable to complex methods
Scalability advantages: In high-dimensional state spaces, traditional count-dependent methods fail while this method remains effective
Theory-practice consistency: Experimental results validate the correctness of theoretical analysis

Thompson Sampling and Posterior Sampling

Classical Thompson sampling: Originally developed for multi-armed bandit problems
PSRL extensions: Osband et al. extended it to reinforcement learning, primarily for episodic environments
Bootstrapped DQN: Uses ensemble methods to approximate posterior distributions

Exploration in Continuing Environments

TSDE: First Thompson sampling method for continuing environments, but relies on complex resampling criteria
DS-PSRL: Simplified resampling but requires strong technical assumptions
This work: First to provide simple and rigorous random resampling method

Theoretical Analysis

Existing work mostly considers γ-discounted regret; this paper focuses on undiscounted regret
Use of reward averaging time τ replaces the traditional span concept
Weakly communicating MDP assumption is the most general setting for finite-time regret bounds

Conclusions and Discussion

Main Conclusions

Theoretical contribution: Establishes a regret bound of Õ(τS√AT), matching state-of-the-art results
Algorithm simplicity: Requires only a Bernoulli random number generator for effective exploration
Practical value: Algorithm can be directly integrated into existing deep reinforcement learning methods
New perspective on discount factors: Reinterprets discount factors as algorithm design tools rather than environment properties

Limitations

Theoretical assumptions: Requires assumptions of weakly communicating MDPs and bounded reward averaging time
Prior dependence: Performance depends on reasonable prior distribution settings
Parameter tuning: Selection of discount factor γ requires knowledge of time horizon T
Experimental scope: Experiments primarily conducted in relatively simple environments

Future Directions

Prior-free settings: Investigate adaptive methods not requiring prior knowledge of T
More complex environments: Validate the method in larger-scale and more complex environments
Theoretical improvements: Relax assumptions such as weak communicability
Practical applications: Test algorithm performance in real-world application scenarios

In-Depth Evaluation

Strengths

Theoretical rigor: Provides complete theoretical analysis and proofs, filling the theoretical gap for continuing environment PSRL
Algorithm simplicity: Compared to existing methods, the resampling mechanism is extremely simple and easy to implement and understand
Scalability: Naturally supports function approximation and high-dimensional state spaces with strong practical value
Novel perspective: Reinterprets discount factors as algorithm design tools, providing new theoretical insights

Weaknesses

Insufficient experimental depth: Experiments primarily conducted in simple environments, lacking validation in large-scale complex environments
Parameter sensitivity: Selection of discount factor γ depends on problem parameters, potentially requiring careful tuning in practical applications
Incomplete comparisons: Lacks comparison with some related exploration methods (e.g., UCB-type methods)
Lack of real-world applications: Primarily theoretical and simple simulations, lacking validation in real-world scenarios

Impact

Theoretical contribution: Provides a new theoretical framework for exploration in continuing environments
Practical value: Algorithm simplicity makes it easy to adopt and extend
Inspirational significance: New interpretation of discount factors may influence future algorithm design
Reproducibility: Clear algorithm description and complete theoretical analysis ensure good reproducibility

Applicable Scenarios

Continuing reinforcement learning: Environments requiring long-term interaction without natural episode structure
High-dimensional state spaces: Complex environments where traditional count-based methods are inapplicable
Online learning: Scenarios requiring continuous learning and adaptation during interaction
Deep reinforcement learning: Can be integrated into existing deep RL frameworks

References

The paper cites important works in reinforcement learning, including:

Classical work on Thompson sampling (Thompson, 1933)
Pioneering work on PSRL (Osband et al., 2013)
Related research on continuing environments (Ouyang et al., 2017; Theocharous et al., 2018)
Important advances in deep reinforcement learning (Mnih et al., 2015)

Overall Assessment: This is a high-quality theoretical reinforcement learning paper that makes important contributions to posterior sampling methods in continuing environments. The algorithm design is elegant and simple, the theoretical analysis is rigorous and complete, and it provides new perspectives and tools for the field. While there is room for improvement in experimental validation, its theoretical value and practical potential are both outstanding.