2025-11-20T13:28:14.804433

Towards Blackwell Optimality: Bellman Optimality Is All You Can Get

Boone, Tuynman

Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.

academic

Towards Blackwell Optimality: Bellman Optimality Is All You Can Get

Basic Information

Paper ID: 2510.13476
Title: Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
Authors: Victor Boone (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG & IRIT, Université de Toulouse), Adrienne Tuynman (Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189-CRIStAL)
Classification: cs.LG (Machine Learning)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13476v1

Abstract

While average gain optimality is a commonly used performance measure in Markov Decision Processes (MDPs), it is often overly asymptotic. Further incorporating measures of immediate loss leads to a hierarchy of bias optimality criteria, extending all the way to Blackwell optimality. This paper investigates the problem of identifying policies of such optimality orders. To this end, for each order, we construct a learning algorithm with vanishing error probability. Furthermore, we characterize the class of MDPs for which the identification algorithm can terminate in finite time. This class corresponds to MDPs with a unique Bellman optimal policy and does not depend on the optimality order considered. Finally, we provide a tractable stopping rule that, when coupled with our learning algorithm, triggers in finite time whenever possible.

Research Background and Motivation

Problem Definition

The core problem investigated in this paper is the identifiability problem of learning higher-order optimal policies in Markov Decision Processes. Specifically:

Limitations of Traditional Average Gain Optimality: In MDPs, traditional average gain optimality focuses only on long-term asymptotic performance, neglecting the importance of short-term immediate rewards.
Hierarchy of Higher-Order Optimality: From gain optimality to bias optimality, and further to Blackwell optimality, a complete hierarchy of optimality criteria is formed, with each level considering more refined performance distinctions.
Learning Challenge: The paper demonstrates through a simple yet profound example (Figure 1) that learning higher-order optimal policies faces fundamental difficulties even in the simplest cases.

Research Motivation

The core motivation of the paper stems from an important observation: even in simple MDPs with only a single unknown parameter, learning higher-order optimal policies may be impossible. This seemingly paradoxical phenomenon reveals the fundamental difficulty of higher-order optimality learning.

Limitations of Existing Approaches

Failure of Standard Learning Methods: Traditional empirical optimal policy selection no longer applies under higher-order optimality
Limitations of Statistical Testing: Inability to determine the exact value of parameters (e.g., θ=0) through statistical testing
Discontinuity Issues: The discontinuity of the optimal policy set in parameter space leads to learning difficulties

Core Contributions

Construction of Consistent Algorithm HOPE: For each optimality order n≥-1, proposes the Higher Order Policy iteration Epsilonized (HOPE) algorithm, which has vanishing error probability.
Complete Characterization of Non-Degenerate MDP Class: Proves that an MDP is non-degenerate (i.e., the identification algorithm can terminate in finite time) if and only if it has a unique Bellman optimal policy.
Degeneracy Collapse Theorem: Proves that the non-degenerate MDP classes for all optimality orders (from gain optimality to Blackwell optimality) are completely identical, a surprising result.
Provision of Computable Stopping Rule: Provides a tractable stopping rule that enables the HOPE algorithm to terminate in finite time whenever possible.

Methodology Details

Task Definition

Input: Unknown communicating MDP M = (Z, ν, p), where Z is the state-action pair space, ν is the reward function, and p is the transition kernel Output: n-th order optimal policy π ∈ Π*_n(M) Objective: Construct an identification algorithm such that the error probability of the recommended policy approaches zero

Core Algorithm Architecture

HOPE Algorithm (Algorithm 1)

Input: Desired optimality order n ≥ -1
Initialize: t ← 0
Loop:
    1. Construct empirical MDP M̂_t
    2. Set ε ← t^(-1/4)
    3. Compute ∏_s A_n(s) ← HOPI_{n,ε}(M̂_t)
    4. Recommend any policy in ∏_s A_n(s)
    5. Uniformly sample actions, observe rewards and next states
    6. t ← t + 1

Core Idea of HOPI Algorithm

HOPI is an "epsilonized" version of higher-order policy iteration, with key innovations including:

Soft Argmax Operation: Relaxes the exact Bellman equation constraint Δ^π_m(s,a) = 0 to Δ^π_m(s,a) ≤ ε
Two-Stage Policy Improvement:
- First stage: Seeks m+1-order improvement among actions known to be m-1 order optimal
- Second stage: Further refines to m+2-order optimality
Progressive Precision Control: Selects ε_t = t^(-1/4) to ensure algorithm consistency

Technical Innovations

1. Policy Iteration Under Noise

Traditional policy iteration requires exact satisfaction of Bellman equations, which is impossible in learning environments. HOPI enables the algorithm to work correctly under noisy conditions by introducing a relaxation parameter ε.

2. Shattering Technique

Proposition 5: For any single-chain Bellman optimal policy π and precision ε > 0, there exists M' ∈ M such that d_∞(M,M') < ε and π is the unique gain optimal policy of M'.

This technique is realized in two steps:

First, makes π the unique Bellman optimal policy by penalizing non-optimal actions
Then, makes π the unique gain optimal policy through ergodic transformation

3. Stopping Rule Design

Define threshold function:

β(M) := min{d_min(Δ*)/((1+4α)(2+sp(b*))), 1/α}

where α is the worst-case hitting time, providing a precise measure of neighborhood radius.

Theoretical Results

Main Theorems

Theorem 1 (Consistency)

For all n ≥ -1, the HOPE(n) algorithm is consistent for Π*_n.

Theorem 2 (Non-Degeneracy Characterization)

Let n ≥ -1. An MDP M is non-degenerate with respect to Π*_n(M) if and only if M has a unique Bellman optimal policy.

Corollary 3 (Degeneracy Collapse)

The sets of degenerate models with respect to Π_{-1}, Π0, Π_1, ..., Π∞ are completely identical.

Proof Strategy

Necessity of Non-Degeneracy (Theorem 4)

If M is non-degenerate, then there exists a policy π ∈ Π*(M) that remains optimal in a neighborhood of M. The proof uses continuity of algorithmic decisions.

Sufficiency (Theorems 10-11)

Proves that MDPs with unique Bellman optimal policies are indeed non-degenerate through construction of explicit stopping rules and confidence intervals.

Experimental Setup and Results

Theoretical Verification

The paper is primarily theoretical work, verifying all major results through rigorous mathematical proofs. Key verifications include:

Algorithm Correctness: Proves that HOPI returns the correct optimal policy set in the noise-free case
Consistency Guarantees: Proves that the error probability of the HOPE algorithm indeed approaches zero
Stopping Rule Validity: Proves that the proposed stopping rule indeed triggers in finite time

Complexity Analysis

Time Complexity: The time complexity of determining unique Bellman optimality is O(|Z| + |S|^4)
Sample Complexity: Although the paper does not provide exact sample complexity bounds, it proves that the algorithm must converge in the non-degenerate case

Optimal Arm Identification

Related to the optimal arm identification problem in multi-armed bandits, but state-dependency in the MDP setting makes the problem more complex.

Average Reward Reinforcement Learning

Recent work on identifying gain optimal policies under (ε,δ)-PAC settings, but these works primarily focus on approximate optimality rather than exact optimality.

Blackwell Optimality Computation

Work by Grand-Clément and Petrik (2023) and others studying computational complexity of Blackwell optimal policies based on historical definitions of discount factors.

Boone and Gaujal (2023) studied the learnability of Blackwell optimal policies in deterministic transition MDPs; this paper generalizes their results to the stochastic case.

Conclusions and Discussion

Main Conclusions

Bellman Optimality as Fundamental Limit: The learnability of all higher-order optimalities reduces to the uniqueness of Bellman optimality
Universality of Degeneracy: The degenerate MDP classes for different optimality orders are completely identical
Existence of Algorithms: Despite difficulties, consistent algorithms do exist

Limitations

Absence of PAC Setting: The paper primarily focuses on exact optimality, not addressing approximate optimality learning
Sample Complexity Bounds: No precise sample complexity analysis is provided
Practical Application: Theoretical results may limit applicability to real-world problems

Future Directions

PAC Framework Extension: Study learning of approximate optimal policies
Sample Complexity Analysis: Establish precise sample complexity bounds
Algorithm Optimization: Improve practical performance of the HOPE algorithm

In-Depth Evaluation

Strengths

Theoretical Depth: Provides complete theoretical characterization of the higher-order optimality learning problem
Surprising Results: The degeneracy collapse theorem is a surprising and profound result
Technical Innovation: The shattering technique and soft argmax policy iteration demonstrate technical innovation
Clear Writing: The paper is well-structured with rigorous mathematical proofs

Weaknesses

Limited Practical Applicability: Theoretical results suggest most practical MDPs may be degenerate
Computational Complexity: While polynomial-time algorithms are provided, constants may be large
Missing Experimental Validation: As purely theoretical work, lacks experimental verification

Impact

Theoretical Contribution: Provides important negative results for reinforcement learning theory
Methodological Inspiration: The shattering technique may have applications in other problems
Research Direction: Points to the importance of PAC settings for future research

Applicable Scenarios

Theoretical Research: Provides important reference for reinforcement learning theory research
Algorithm Design: Provides theoretical guidance for designing practical policy learning algorithms
Problem Analysis: Helps understand how MDP structure affects learning difficulty

References

The paper cites important literature in reinforcement learning and Markov Decision Processes, including:

Puterman (1994): Classical textbook on Markov Decision Processes
Blackwell (1962): Original definition of Blackwell optimality
Kaufmann et al. (2016): Complexity theory of optimal arm identification
Boone and Gaujal (2023): Learnability of Blackwell optimality in deterministic MDPs

Through rigorous theoretical analysis, this paper reveals a fundamental and profound phenomenon in reinforcement learning: the learning difficulty of higher-order optimality is completely determined by the structure of Bellman optimality. This result not only has important theoretical significance but also provides crucial guidance for practical algorithm design.