2025-11-15T23:58:12.055440

An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Liu, Wei, Zimmert
We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
academic

An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Basic Information

  • Paper ID: 2510.08882
  • Title: An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
  • Authors: Haolin Liu (University of Virginia), Chen-Yu Wei (University of Virginia), Julian Zimmert (Google Research)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: October 2025
  • Paper Link: https://arxiv.org/abs/2510.08882v1

Abstract

This paper investigates the structured observation decision-making problem (DMSO). Prior work characterized the complexity of DMSO through the decision-estimation coefficient (DEC), but left a gap between upper and lower regret bounds related to the model class size. Foster et al. (2023b) introduced optimistic DEC to narrow this gap, achieving bounds that depend only on the value function class size. However, it remains unclear whether exploration based on optimism can be extended to adversarial environments.

This paper proposes Dig-DEC, a model-free DEC method that removes optimism and drives exploration purely through information gain. Dig-DEC is always no larger than optimistic DEC and can be significantly smaller in special cases. Importantly, removing optimism enables it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, the paper obtains the first model-free regret bounds for hybrid MDPs with bandit feedback under various general transition structures.

Research Background and Motivation

  1. Problem to be Solved: Existing decision-estimation coefficient (DEC) frameworks exhibit gaps between model class size and value function class size, and optimism-based methods cannot effectively handle adversarial environments.
  2. Problem Significance:
    • Online decision-making is a core problem in reinforcement learning
    • Practical applications often face hybrid environments that are partially stochastic and partially adversarial
    • Existing methods have gaps between theoretical guarantees and practical performance
  3. Limitations of Existing Methods:
    • Foster et al.'s model based on DEC/E2D incurs a log|M| model estimation cost
    • While optimistic DEC improves complexity, it relies on the optimism principle and cannot handle adversarial settings
    • Liu et al. (2025)'s hybrid MDP method only handles full-information feedback; the bandit case remains open
  4. Research Motivation: Develop a unified framework that improves existing results in stochastic environments while simultaneously addressing the bandit feedback case for hybrid MDPs for the first time.

Core Contributions

  1. Proposes Dig-DEC Complexity Measure: Introduces dual information gain decision-estimation coefficient that removes optimism and drives exploration purely through information gain
  2. Unified Theoretical Framework: Constructs a general algorithmic framework capable of handling both stochastic and hybrid environments simultaneously
  3. Improved Online Function Estimation:
    • Average estimation error: improved from T^{3/4}/T^{5/6} to T^{2/3}/T^{7/9}
    • Squared error: improved from T^{2/3} to √T, achieving the same performance as optimistic methods in Bellman-complete MDPs for the first time
  4. Resolves Open Problem: Provides the first model-free regret bounds for hybrid MDPs under bandit feedback

Methodology Details

Task Definition

DMSO Framework: Given model space M, policy space Π, observation space O, and value function V. In each round t:

  • Environment selects model Mt ∈ M
  • Learner selects policy πt ∈ Π
  • Observation ot ~ Mt(·|πt)
  • Objective: minimize regret Reg(π*) = Σt(VMt(π*) - VMt(πt))

Φ-Restricted Environment: Partitions M×Π through information set Φ, where each information set ϕ contains a single policy πϕ.

Model Architecture

1. General Framework (Algorithm 1)

The core idea solves the following saddle-point problem:

min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} AIR^{Φ,D}_η(p,ν;ρt)

where the divergence measure is:

D^π(ν||ρ) = E_{M~ν}E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ) + E_{ϕ~ρ}[D^π(ϕ||M)]]

2. Dig-DEC Definition

dig-dec^{Φ,D}_η = max_{ρ∈Δ(Φ)} min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} 
E_{π~p}E_{(M,π*)~ν}[V_M(π*) - V_M(π) - (1/η)E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ)] - (1/η)E_{ϕ~ρ}[D^π(ϕ||M)]]

3. Posterior Update Mechanism

Depending on the choice of D:

  • Average Estimation Error: Uses batch algorithm (Algorithm 2)
  • Squared Estimation Error: Uses two-timescale learning algorithm (Algorithm 3)

Technical Innovations

  1. Dual Information Gain Design:
    • KL term for regularization, avoiding optimistic mechanisms
    • D^π term captures distributional divergence, achieving strict improvements
  2. Removal of Optimism: Replaces the V_ϕ(π_ϕ) term in optimistic DEC with regularization term KL(ν_{ϕ}, ρ)
  3. Improved Estimation Procedures:
    • Average error: Uses unbiased estimators replacing biased ones
    • Squared error: Redesigns two-timescale procedure, improving Est bound from T^{1/3} to constant

Experimental Setup

Theoretical Validation Environments

  1. Stochastic Settings:
    • Bilinear class MDPs
    • MDPs with bounded Bellman-eluder dimension
    • MDPs with bounded coverability
  2. Hybrid Settings:
    • Hybrid bilinear classes
    • Hybrid coverability MDPs
    • Known linear reward features

Evaluation Metrics

  • Regret Bounds: Upper bounds on EReg(π*)
  • Complexity Comparison: dig-dec vs o-dec vs classical DEC
  • Convergence Rate: Power-law dependence on T

Comparison Methods

  • Foster et al. (2023b)'s optimistic DEC
  • Liu et al. (2025)'s AIR framework
  • Classical optimistic methods (Jin et al. 2021, Xie et al. 2023)

Experimental Results

Main Results

Improvements in Stochastic Environments (Table 1):

Setting CategoryFoster et al. (2023b)This Work
Bilinear/BE (Average Error)T^{3/4}/T^{5/6}T^{2/3}/T^{7/9}
Bellman-Complete (Squared Error)T^{2/3}√T

Breakthrough in Hybrid Environments (Table 2):

Setting CategoryLiu et al. (2025)This Work
Bilinear (Average Error)Full-info onlyT^{5/6}/T^{13/15}
Bellman-Complete (Squared Error)Not applicableT^{3/4}

Theoretical Advantage Verification

Theorem 13: In stochastic settings, dig-dec^{Φ,D}_η ≤ o-dec^{Φ,D}_η + η

Theorem 14: Constructs a three-arm bandit instance proving that in some cases:

  • Foster et al.'s method: EReg(a) ≥ Ω(√T)
  • This work: EReg(a) ≤ 1

Experimental Findings

  1. Importance of Information Gain: KL term captures distributional information overlooked by optimistic DEC
  2. Improvement in Estimation Procedures: Unbiased estimators significantly enhance concentration inequality effects
  3. Advantages of Removing Optimism: Enables natural extension of algorithms to adversarial environments

Main Research Directions

  1. DEC Framework Development:
    • Foster et al. (2021b, 2023a): Establish foundational DEC theory
    • Foster et al. (2023b): Introduce optimistic DEC
    • Chen et al. (2025): Extend to other settings
  2. Adversarial MDP Research:
    • Neu et al. (2013): Tabular adversarial MDPs
    • Luo et al. (2021): Linear adversarial MDPs
    • Liu et al. (2025): Hybrid MDP theory
  3. Model-Free Reinforcement Learning:
    • Jin et al. (2021): Bellman-eluder dimension
    • Xie et al. (2023): Coverability theory

Advantages of This Work

  1. Theoretical Unification: First DEC framework simultaneously handling stochastic and hybrid environments
  2. Technical Innovation: Dual information gain design, removing dependence on optimism
  3. Problem Resolution: Addresses open problems left by Liu et al. (2025)

Conclusions and Discussion

Main Conclusions

  1. Dig-DEC provides more precise complexity characterization, achieving arbitrarily large improvements in special cases
  2. Removing optimism enables algorithms to naturally handle adversarial environments
  3. Improved estimation procedures have important implications both theoretically and practically

Limitations

  1. Assumption Constraints: Hybrid settings require known linear reward features (Assumption 4)
  2. Technical Constraints: Certain low-rank MDP cases remain unaddressed
  3. Computational Complexity: Practical computational efficiency of saddle-point optimization not discussed in detail

Future Directions

  1. Relax constraints of Assumptions 3 and 4
  2. Investigate fundamental limits of model-free learning
  3. Develop more efficient computational algorithms

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contributions:
    • Proposes new complexity measure dig-dec
    • Unifies treatment of stochastic and adversarial environments
    • Resolves important open problems
  2. Strong Technical Innovation:
    • Dual information gain design is elegant
    • Estimation procedure improvements are effective
    • Analysis techniques are advanced
  3. Convincing Results:
    • Theoretical bounds are tight and practical
    • Constructed instances prove strict improvements
    • Covers multiple classical MDP classes

Weaknesses

  1. Limited Experimental Validation: Primarily theoretical analysis with lack of actual algorithm implementation and empirical verification
  2. Restricted Applicability: Strong assumptions on hybrid MDPs limit generality of the method
  3. Computational Complexity: Practical solvability and efficiency of saddle-point optimization requires further investigation

Impact

  1. Theoretical Value: Provides new direction for DEC theory development, potentially inspiring subsequent research
  2. Practical Potential: Offers theoretical guidance for practical reinforcement learning algorithm design
  3. Field Advancement: Achieves breakthrough at the intersection of model-free learning and adversarial MDPs

Applicable Scenarios

  1. Theoretical Research: DEC framework extension, complexity analysis
  2. Algorithm Design: Reinforcement learning algorithms requiring handling of hybrid environments
  3. Application Domains: Financial trading, recommendation systems, and other partially adversarial environments

References

Key references include:

  • Foster et al. (2021b, 2023a, 2023b): DEC theoretical foundations
  • Liu et al. (2025): Hybrid MDP research
  • Jin et al. (2021): Bellman-eluder dimension
  • Xie et al. (2023): Coverability theory
  • Xu and Zeevi (2023): AIR framework

This paper makes important progress in decision-estimation coefficient theory, solving key problems in the field through clever technical innovations and making valuable contributions to reinforcement learning theory development. While there is room for improvement in practical application validation, its theoretical value and innovation make it an important work in the field.