Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
Boone, Tuynman
Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.
academic
Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
While average gain optimality is a commonly used performance measure in Markov Decision Processes (MDPs), it is often overly asymptotic. Further incorporating measures of immediate loss leads to a hierarchy of bias optimality criteria, extending all the way to Blackwell optimality. This paper investigates the problem of identifying policies of such optimality orders. To this end, for each order, we construct a learning algorithm with vanishing error probability. Furthermore, we characterize the class of MDPs for which the identification algorithm can terminate in finite time. This class corresponds to MDPs with a unique Bellman optimal policy and does not depend on the optimality order considered. Finally, we provide a tractable stopping rule that, when coupled with our learning algorithm, triggers in finite time whenever possible.
The core problem investigated in this paper is the identifiability problem of learning higher-order optimal policies in Markov Decision Processes. Specifically:
Limitations of Traditional Average Gain Optimality: In MDPs, traditional average gain optimality focuses only on long-term asymptotic performance, neglecting the importance of short-term immediate rewards.
Hierarchy of Higher-Order Optimality: From gain optimality to bias optimality, and further to Blackwell optimality, a complete hierarchy of optimality criteria is formed, with each level considering more refined performance distinctions.
Learning Challenge: The paper demonstrates through a simple yet profound example (Figure 1) that learning higher-order optimal policies faces fundamental difficulties even in the simplest cases.
The core motivation of the paper stems from an important observation: even in simple MDPs with only a single unknown parameter, learning higher-order optimal policies may be impossible. This seemingly paradoxical phenomenon reveals the fundamental difficulty of higher-order optimality learning.
Construction of Consistent Algorithm HOPE: For each optimality order n≥-1, proposes the Higher Order Policy iteration Epsilonized (HOPE) algorithm, which has vanishing error probability.
Complete Characterization of Non-Degenerate MDP Class: Proves that an MDP is non-degenerate (i.e., the identification algorithm can terminate in finite time) if and only if it has a unique Bellman optimal policy.
Degeneracy Collapse Theorem: Proves that the non-degenerate MDP classes for all optimality orders (from gain optimality to Blackwell optimality) are completely identical, a surprising result.
Provision of Computable Stopping Rule: Provides a tractable stopping rule that enables the HOPE algorithm to terminate in finite time whenever possible.
Input: Unknown communicating MDP M = (Z, ν, p), where Z is the state-action pair space, ν is the reward function, and p is the transition kernel
Output: n-th order optimal policy π ∈ Π*_n(M)
Objective: Construct an identification algorithm such that the error probability of the recommended policy approaches zero
Input: Desired optimality order n ≥ -1
Initialize: t ← 0
Loop:
1. Construct empirical MDP M̂_t
2. Set ε ← t^(-1/4)
3. Compute ∏_s A_n(s) ← HOPI_{n,ε}(M̂_t)
4. Recommend any policy in ∏_s A_n(s)
5. Uniformly sample actions, observe rewards and next states
6. t ← t + 1
Traditional policy iteration requires exact satisfaction of Bellman equations, which is impossible in learning environments. HOPI enables the algorithm to work correctly under noisy conditions by introducing a relaxation parameter ε.
Proposition 5: For any single-chain Bellman optimal policy π and precision ε > 0, there exists M' ∈ M such that d_∞(M,M') < ε and π is the unique gain optimal policy of M'.
This technique is realized in two steps:
First, makes π the unique Bellman optimal policy by penalizing non-optimal actions
Then, makes π the unique gain optimal policy through ergodic transformation
If M is non-degenerate, then there exists a policy π ∈ Π*(M) that remains optimal in a neighborhood of M. The proof uses continuity of algorithmic decisions.
Proves that MDPs with unique Bellman optimal policies are indeed non-degenerate through construction of explicit stopping rules and confidence intervals.
Time Complexity: The time complexity of determining unique Bellman optimality is O(|Z| + |S|^4)
Sample Complexity: Although the paper does not provide exact sample complexity bounds, it proves that the algorithm must converge in the non-degenerate case
Recent work on identifying gain optimal policies under (ε,δ)-PAC settings, but these works primarily focus on approximate optimality rather than exact optimality.
Work by Grand-Clément and Petrik (2023) and others studying computational complexity of Blackwell optimal policies based on historical definitions of discount factors.
Boone and Gaujal (2023) studied the learnability of Blackwell optimal policies in deterministic transition MDPs; this paper generalizes their results to the stochastic case.
The paper cites important literature in reinforcement learning and Markov Decision Processes, including:
Puterman (1994): Classical textbook on Markov Decision Processes
Blackwell (1962): Original definition of Blackwell optimality
Kaufmann et al. (2016): Complexity theory of optimal arm identification
Boone and Gaujal (2023): Learnability of Blackwell optimality in deterministic MDPs
Through rigorous theoretical analysis, this paper reveals a fundamental and profound phenomenon in reinforcement learning: the learning difficulty of higher-order optimality is completely determined by the structure of Bellman optimality. This result not only has important theoretical significance but also provides crucial guidance for practical algorithm design.