2025-11-15T23:58:12.055440

An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Liu, Wei, Zimmert

We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.

academic

An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Basic Information

Paper ID: 2510.08882
Title: An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
Authors: Haolin Liu (University of Virginia), Chen-Yu Wei (University of Virginia), Julian Zimmert (Google Research)
Classification: cs.LG (Machine Learning)
Publication Date: October 2025
Paper Link: https://arxiv.org/abs/2510.08882v1

Abstract

This paper investigates the structured observation decision-making problem (DMSO). Prior work characterized the complexity of DMSO through the decision-estimation coefficient (DEC), but left a gap between upper and lower regret bounds related to the model class size. Foster et al. (2023b) introduced optimistic DEC to narrow this gap, achieving bounds that depend only on the value function class size. However, it remains unclear whether exploration based on optimism can be extended to adversarial environments.

This paper proposes Dig-DEC, a model-free DEC method that removes optimism and drives exploration purely through information gain. Dig-DEC is always no larger than optimistic DEC and can be significantly smaller in special cases. Importantly, removing optimism enables it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, the paper obtains the first model-free regret bounds for hybrid MDPs with bandit feedback under various general transition structures.

Research Background and Motivation

Problem to be Solved: Existing decision-estimation coefficient (DEC) frameworks exhibit gaps between model class size and value function class size, and optimism-based methods cannot effectively handle adversarial environments.
Problem Significance:
- Online decision-making is a core problem in reinforcement learning
- Practical applications often face hybrid environments that are partially stochastic and partially adversarial
- Existing methods have gaps between theoretical guarantees and practical performance
Limitations of Existing Methods:
- Foster et al.'s model based on DEC/E2D incurs a log|M| model estimation cost
- While optimistic DEC improves complexity, it relies on the optimism principle and cannot handle adversarial settings
- Liu et al. (2025)'s hybrid MDP method only handles full-information feedback; the bandit case remains open
Research Motivation: Develop a unified framework that improves existing results in stochastic environments while simultaneously addressing the bandit feedback case for hybrid MDPs for the first time.

Core Contributions

Proposes Dig-DEC Complexity Measure: Introduces dual information gain decision-estimation coefficient that removes optimism and drives exploration purely through information gain
Unified Theoretical Framework: Constructs a general algorithmic framework capable of handling both stochastic and hybrid environments simultaneously
Improved Online Function Estimation:
- Average estimation error: improved from T^{3/4}/T^{5/6} to T^{2/3}/T^{7/9}
- Squared error: improved from T^{2/3} to √T, achieving the same performance as optimistic methods in Bellman-complete MDPs for the first time
Resolves Open Problem: Provides the first model-free regret bounds for hybrid MDPs under bandit feedback

Methodology Details

Task Definition

DMSO Framework: Given model space M, policy space Π, observation space O, and value function V. In each round t:

Environment selects model Mt ∈ M
Learner selects policy πt ∈ Π
Observation ot ~ Mt(·|πt)
Objective: minimize regret Reg(π*) = Σt(VMt(π*) - VMt(πt))

Φ-Restricted Environment: Partitions M×Π through information set Φ, where each information set ϕ contains a single policy πϕ.

Model Architecture

1. General Framework (Algorithm 1)

The core idea solves the following saddle-point problem:

min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} AIR^{Φ,D}_η(p,ν;ρt)

where the divergence measure is:

D^π(ν||ρ) = E_{M~ν}E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ) + E_{ϕ~ρ}[D^π(ϕ||M)]]

2. Dig-DEC Definition

dig-dec^{Φ,D}_η = max_{ρ∈Δ(Φ)} min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} 
E_{π~p}E_{(M,π*)~ν}[V_M(π*) - V_M(π) - (1/η)E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ)] - (1/η)E_{ϕ~ρ}[D^π(ϕ||M)]]

3. Posterior Update Mechanism

Depending on the choice of D:

Average Estimation Error: Uses batch algorithm (Algorithm 2)
Squared Estimation Error: Uses two-timescale learning algorithm (Algorithm 3)

Technical Innovations

Dual Information Gain Design:
- KL term for regularization, avoiding optimistic mechanisms
- D^π term captures distributional divergence, achieving strict improvements
Removal of Optimism: Replaces the V_ϕ(π_ϕ) term in optimistic DEC with regularization term KL(ν_{ϕ}, ρ)
Improved Estimation Procedures:
- Average error: Uses unbiased estimators replacing biased ones
- Squared error: Redesigns two-timescale procedure, improving Est bound from T^{1/3} to constant

Experimental Setup

Theoretical Validation Environments

Stochastic Settings:
- Bilinear class MDPs
- MDPs with bounded Bellman-eluder dimension
- MDPs with bounded coverability
Hybrid Settings:
- Hybrid bilinear classes
- Hybrid coverability MDPs
- Known linear reward features

Evaluation Metrics

Regret Bounds: Upper bounds on EReg(π*)
Complexity Comparison: dig-dec vs o-dec vs classical DEC
Convergence Rate: Power-law dependence on T

Comparison Methods

Foster et al. (2023b)'s optimistic DEC
Liu et al. (2025)'s AIR framework
Classical optimistic methods (Jin et al. 2021, Xie et al. 2023)

Experimental Results

Main Results

Improvements in Stochastic Environments (Table 1):

Setting Category	Foster et al. (2023b)	This Work
Bilinear/BE (Average Error)	T^{3/4}/T^{5/6}	T^{2/3}/T^{7/9}
Bellman-Complete (Squared Error)	T^{2/3}	√T

Breakthrough in Hybrid Environments (Table 2):

Setting Category	Liu et al. (2025)	This Work
Bilinear (Average Error)	Full-info only	T^{5/6}/T^{13/15}
Bellman-Complete (Squared Error)	Not applicable	T^{3/4}

Theoretical Advantage Verification

Theorem 13: In stochastic settings, dig-dec^{Φ,D}_η ≤ o-dec^{Φ,D}_η + η

Theorem 14: Constructs a three-arm bandit instance proving that in some cases:

Foster et al.'s method: EReg(a) ≥ Ω(√T)
This work: EReg(a) ≤ 1

Experimental Findings

Importance of Information Gain: KL term captures distributional information overlooked by optimistic DEC
Improvement in Estimation Procedures: Unbiased estimators significantly enhance concentration inequality effects
Advantages of Removing Optimism: Enables natural extension of algorithms to adversarial environments

Main Research Directions

DEC Framework Development:
- Foster et al. (2021b, 2023a): Establish foundational DEC theory
- Foster et al. (2023b): Introduce optimistic DEC
- Chen et al. (2025): Extend to other settings
Adversarial MDP Research:
- Neu et al. (2013): Tabular adversarial MDPs
- Luo et al. (2021): Linear adversarial MDPs
- Liu et al. (2025): Hybrid MDP theory
Model-Free Reinforcement Learning:
- Jin et al. (2021): Bellman-eluder dimension
- Xie et al. (2023): Coverability theory

Advantages of This Work

Theoretical Unification: First DEC framework simultaneously handling stochastic and hybrid environments
Technical Innovation: Dual information gain design, removing dependence on optimism
Problem Resolution: Addresses open problems left by Liu et al. (2025)

Conclusions and Discussion

Main Conclusions

Dig-DEC provides more precise complexity characterization, achieving arbitrarily large improvements in special cases
Removing optimism enables algorithms to naturally handle adversarial environments
Improved estimation procedures have important implications both theoretically and practically

Limitations

Assumption Constraints: Hybrid settings require known linear reward features (Assumption 4)
Technical Constraints: Certain low-rank MDP cases remain unaddressed
Computational Complexity: Practical computational efficiency of saddle-point optimization not discussed in detail

Future Directions

Relax constraints of Assumptions 3 and 4
Investigate fundamental limits of model-free learning
Develop more efficient computational algorithms

In-Depth Evaluation

Strengths

Significant Theoretical Contributions:
- Proposes new complexity measure dig-dec
- Unifies treatment of stochastic and adversarial environments
- Resolves important open problems
Strong Technical Innovation:
- Dual information gain design is elegant
- Estimation procedure improvements are effective
- Analysis techniques are advanced
Convincing Results:
- Theoretical bounds are tight and practical
- Constructed instances prove strict improvements
- Covers multiple classical MDP classes

Weaknesses

Limited Experimental Validation: Primarily theoretical analysis with lack of actual algorithm implementation and empirical verification
Restricted Applicability: Strong assumptions on hybrid MDPs limit generality of the method
Computational Complexity: Practical solvability and efficiency of saddle-point optimization requires further investigation

Impact

Theoretical Value: Provides new direction for DEC theory development, potentially inspiring subsequent research
Practical Potential: Offers theoretical guidance for practical reinforcement learning algorithm design
Field Advancement: Achieves breakthrough at the intersection of model-free learning and adversarial MDPs

Applicable Scenarios

Theoretical Research: DEC framework extension, complexity analysis
Algorithm Design: Reinforcement learning algorithms requiring handling of hybrid environments
Application Domains: Financial trading, recommendation systems, and other partially adversarial environments

References

Key references include:

Foster et al. (2021b, 2023a, 2023b): DEC theoretical foundations
Liu et al. (2025): Hybrid MDP research
Jin et al. (2021): Bellman-eluder dimension
Xie et al. (2023): Coverability theory
Xu and Zeevi (2023): AIR framework

This paper makes important progress in decision-estimation coefficient theory, solving key problems in the field through clever technical innovations and making valuable contributions to reinforcement learning theory development. While there is room for improvement in practical application validation, its theoretical value and innovation make it an important work in the field.