An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
Liu, Wei, Zimmert
We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting.
We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025).
We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
academic
An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
This paper investigates the structured observation decision-making problem (DMSO). Prior work characterized the complexity of DMSO through the decision-estimation coefficient (DEC), but left a gap between upper and lower regret bounds related to the model class size. Foster et al. (2023b) introduced optimistic DEC to narrow this gap, achieving bounds that depend only on the value function class size. However, it remains unclear whether exploration based on optimism can be extended to adversarial environments.
This paper proposes Dig-DEC, a model-free DEC method that removes optimism and drives exploration purely through information gain. Dig-DEC is always no larger than optimistic DEC and can be significantly smaller in special cases. Importantly, removing optimism enables it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, the paper obtains the first model-free regret bounds for hybrid MDPs with bandit feedback under various general transition structures.
Problem to be Solved: Existing decision-estimation coefficient (DEC) frameworks exhibit gaps between model class size and value function class size, and optimism-based methods cannot effectively handle adversarial environments.
Problem Significance:
Online decision-making is a core problem in reinforcement learning
Practical applications often face hybrid environments that are partially stochastic and partially adversarial
Existing methods have gaps between theoretical guarantees and practical performance
Limitations of Existing Methods:
Foster et al.'s model based on DEC/E2D incurs a log|M| model estimation cost
While optimistic DEC improves complexity, it relies on the optimism principle and cannot handle adversarial settings
Liu et al. (2025)'s hybrid MDP method only handles full-information feedback; the bandit case remains open
Research Motivation: Develop a unified framework that improves existing results in stochastic environments while simultaneously addressing the bandit feedback case for hybrid MDPs for the first time.
Proposes Dig-DEC Complexity Measure: Introduces dual information gain decision-estimation coefficient that removes optimism and drives exploration purely through information gain
Unified Theoretical Framework: Constructs a general algorithmic framework capable of handling both stochastic and hybrid environments simultaneously
Improved Online Function Estimation:
Average estimation error: improved from T^{3/4}/T^{5/6} to T^{2/3}/T^{7/9}
Squared error: improved from T^{2/3} to √T, achieving the same performance as optimistic methods in Bellman-complete MDPs for the first time
Resolves Open Problem: Provides the first model-free regret bounds for hybrid MDPs under bandit feedback
Foster et al. (2021b, 2023a, 2023b): DEC theoretical foundations
Liu et al. (2025): Hybrid MDP research
Jin et al. (2021): Bellman-eluder dimension
Xie et al. (2023): Coverability theory
Xu and Zeevi (2023): AIR framework
This paper makes important progress in decision-estimation coefficient theory, solving key problems in the field through clever technical innovations and making valuable contributions to reinforcement learning theory development. While there is room for improvement in practical application validation, its theoretical value and innovation make it an important work in the field.