We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $γ$-discounted return in that model. At each time, with probability $1-γ$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(τ S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $τ$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
This paper proposes a posterior sampling reinforcement learning algorithm for continuing environments (Continuing PSRL) that integrates naturally into scalable agent designs. The algorithm maintains a statistically plausible model of the environment and follows a policy that maximizes expected γ-discounted return within this model. At each time step, with probability 1-γ, the algorithm replaces the model with a fresh sample from the posterior distribution over environments. With a discount factor chosen as a suitable function of the time horizon T, the paper establishes a Bayesian regret bound of Õ(τS√(AT)), where S is the number of environment states, A is the number of actions, and τ denotes the reward averaging time.
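The resampling rule described above can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the Dirichlet/Gaussian posterior, the value-iteration planner, the helper names (`sample_model`, `greedy_policy`, `continuing_psrl`), and the toy environment `toy_env_step` are all assumptions made for illustration. The counts here only parameterize the tabular posterior; the decision of when to resample depends solely on a coin flip with probability 1-γ.

```python
import numpy as np

def sample_model(counts, reward_sums, reward_counts, rng):
    """Draw one plausible MDP from an assumed Dirichlet/Gaussian posterior."""
    S, A, _ = counts.shape
    # Assumed prior: Dirichlet(1, ..., 1) over next-state distributions.
    P = np.array([[rng.dirichlet(1.0 + counts[s, a]) for a in range(A)]
                  for s in range(S)])
    # Assumed prior: N(0, 1) on mean rewards with unit observation noise.
    mean = reward_sums / (1.0 + reward_counts)
    std = 1.0 / np.sqrt(1.0 + reward_counts)
    R = rng.normal(mean, std)
    return P, R

def greedy_policy(P, R, gamma, iters=500):
    """Value iteration: policy maximizing gamma-discounted return in (P, R)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return (R + gamma * P @ V).argmax(axis=1)

def continuing_psrl(env_step, S, A, gamma, T, seed=0):
    """Run the sketch for T steps; returns (total reward, number of resamples)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    reward_counts = np.zeros((S, A))
    policy = greedy_policy(
        *sample_model(counts, reward_sums, reward_counts, rng), gamma)
    state, total, resamples = 0, 0.0, 0
    for _ in range(T):
        # With probability 1 - gamma, replace the model by a posterior sample.
        if rng.random() < 1.0 - gamma:
            policy = greedy_policy(
                *sample_model(counts, reward_sums, reward_counts, rng), gamma)
            resamples += 1
        action = int(policy[state])
        next_state, reward = env_step(state, action, rng)
        # Update the sufficient statistics of the assumed tabular posterior.
        counts[state, action, next_state] += 1
        reward_sums[state, action] += reward
        reward_counts[state, action] += 1
        total += reward
        state = next_state
    return total, resamples

def toy_env_step(state, action, rng):
    """Hypothetical 2-state environment: state 1 pays reward 1."""
    p_to_1 = 0.9 if action == 1 else 0.1
    next_state = int(rng.random() < p_to_1)
    return next_state, float(state == 1)

if __name__ == "__main__":
    total, resamples = continuing_psrl(toy_env_step, S=2, A=2, gamma=0.9, T=300)
    print(f"total reward: {total}, resamples: {resamples}")
```

In a larger agent, the tabular posterior and planner would presumably be replaced by approximate counterparts, while the coin-flip resampling rule could stay unchanged.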
Existing posterior sampling reinforcement learning algorithms are primarily designed for episodic environments and rely on maintaining state-action visitation counts, making them unsuitable for complex continuing environments with high-dimensional state spaces.
Learning in continuing environments is a fundamental problem in reinforcement learning, yet existing randomized exploration methods are largely limited to episodic settings
Scalability requirements: Traditional methods depend on state-action visitation counts, which are infeasible in complex environments
Theoretical gap: Lack of rigorous theoretical analysis for continuing environments
TSDE (Ouyang et al., 2017): Requires complex resampling criteria including visitation count doubling conditions, infeasible in large state spaces
DS-PSRL (Theocharous et al., 2018): While avoiding visitation counts, its analysis relies on strong technical assumptions; regret grows linearly without these assumptions
Traditional PSRL: Only applicable to episodic environments and cannot be directly extended to continuing settings
First scalable continuing PSRL algorithm: Proposes Continuing PSRL based on a simple randomization scheme, avoiding complex resampling criteria
Rigorous theoretical analysis: Establishes a Bayesian regret bound of Õ(τS√(AT)), matching the best existing results
Scalability breakthrough: The algorithm naturally extends to high-dimensional state spaces and function approximation settings
New perspective on discount factors: Reinterprets the discount factor as an algorithm design tool rather than an environment property, providing new insights into its role
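One way to see why the discount factor acts as a design parameter: because the model is resampled independently at each step with probability $1-γ$, the number of steps $N$ until the next resample is geometrically distributed. The following short calculation is a standard fact about geometric random variables rather than a result quoted from the paper; the symbol $N$ is introduced here only for illustration.

$$
\Pr(N = n) = γ^{\,n-1}(1-γ), \quad n = 1, 2, \dots, \qquad \mathbb{E}[N] = \sum_{n=1}^{\infty} n\, γ^{\,n-1}(1-γ) = \frac{1}{1-γ}.
$$

For example, $γ = 0.99$ gives an average of 100 steps between resamples, so a larger discount factor means the agent commits to each sampled model for longer.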
The paper cites important works in reinforcement learning, including:
Classical work on Thompson sampling (Thompson, 1933)
Pioneering work on PSRL (Osband et al., 2013)
Related research on continuing environments (Ouyang et al., 2017; Theocharous et al., 2018)
Important advances in deep reinforcement learning (Mnih et al., 2015)
Overall Assessment: This is a high-quality theoretical reinforcement learning paper that makes important contributions to posterior sampling methods in continuing environments. The algorithm design is elegant and simple, the theoretical analysis is rigorous and complete, and it provides new perspectives and tools for the field. While there is room for improvement in experimental validation, its theoretical value and practical potential are both outstanding.