
An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Liu, Wei, Zimmert

We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.


Basic Information

Paper ID: 2510.08882
Title: An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
Authors: Haolin Liu (University of Virginia), Chen-Yu Wei (University of Virginia), Julian Zimmert (Google Research)
Category: cs.LG (Machine Learning)
Published: October 2025
Paper link: https://arxiv.org/abs/2510.08882v1

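Abstract

This paper studies decision making with structured observation (DMSO). Prior work characterized the complexity of DMSO through the decision-estimation coefficient (DEC) but left a gap between the regret upper and lower bounds that depends on the size of the model class. Foster et al. (2023b) introduced the optimistic DEC to narrow this gap, obtaining bounds that depend only on the size of the value-function class. However, optimism-based exploration is only known to handle stochastic environments, and whether it extends to adversarial environments remains unclear.

This paper proposes Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is never larger than the optimistic DEC and can be much smaller in special cases. Importantly, removing optimism lets it handle adversarial environments without explicit reward estimators. Applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards yields the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures.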

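Research Background and Motivation

Problem addressed: In the existing decision-estimation coefficient (DEC) framework, regret bounds that scale with the model-class size are separated by a gap from those that scale only with the value-function-class size, and optimism-based methods are not known to handle adversarial environments.
Why the problem matters:
- Online decision making is a core problem in reinforcement learning
- Practical applications often involve hybrid environments that are partly stochastic and partly adversarial
- A gap remains between the theoretical guarantees and the practical performance of existing methods
Limitations of existing methods:
- The model-based DEC/E2D approach of Foster et al. pays a model-estimation cost of log|M|
- The optimistic DEC improves the complexity but relies on the optimism principle and is not known to handle adversarial settings
- The hybrid-MDP method of Liu et al. (2025) handles only full-information feedback; the bandit case remained an open problem
Research motivation: Develop a unified framework that both improves existing results in stochastic environments and handles, for the first time, hybrid MDPs with bandit feedback.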

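Core Contributions

The Dig-DEC complexity measure: Introduces the dual information gain decision-estimation coefficient, which removes optimism and drives exploration purely by information gain
Unified theoretical framework: Builds a general algorithmic framework that handles both stochastic and hybrid environments
Improved online function estimation:
- Average estimation error: regret improves from T^{3/4} to T^{2/3} (on-policy) and from T^{5/6} to T^{7/9} (off-policy)
- Squared error: regret improves from T^{2/3} to √T in Bellman-complete MDPs, the first time a DEC-based method matches the performance of optimism-based approaches
Open problem resolved: Gives the first model-free regret bounds for hybrid MDPs under bandit feedback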
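
To make the rate improvements quoted above concrete, the following sketch compares the old and new exponents at a fixed horizon. It ignores constants and logarithmic factors, which is an assumption made purely for illustration; the dictionary `RATES`, the function `rate_ratio`, and the horizon value are hypothetical names and choices, not taken from the paper.

```python
# Compare the growth of the quoted regret rates, ignoring constants and
# logarithmic factors (an illustrative simplification, not the exact bounds).
RATES = {
    "average error, on-policy": (3 / 4, 2 / 3),
    "average error, off-policy": (5 / 6, 7 / 9),
    "squared error, Bellman-complete": (2 / 3, 1 / 2),
}


def rate_ratio(horizon: float, old_exp: float, new_exp: float) -> float:
    """Return T**old_exp / T**new_exp, the factor by which the rate shrinks."""
    return horizon ** (old_exp - new_exp)


if __name__ == "__main__":
    horizon = 1e6  # hypothetical number of rounds T
    for name, (old_exp, new_exp) in RATES.items():
        ratio = rate_ratio(horizon, old_exp, new_exp)
        print(f"{name}: T^{old_exp:.3f} -> T^{new_exp:.3f}, ratio {ratio:.2f}")
```

Under this simplification, the on-policy improvement corresponds to a factor of T^{1/12}, the off-policy improvement to T^{1/18}, and the Bellman-complete improvement to T^{1/6}.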

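Method Details

Task Definition

DMSO framework: Given a model space M, a policy space Π, an observation space O, and a value function V, in each round t:

The environment chooses a model M_t ∈ M
The learner chooses a policy π_t ∈ Π
An observation o_t ~ M_t(·|π_t) is drawn
Goal: minimize the regret Reg(π*) = Σ_t (V_{M_t}(π*) - V_{M_t}(π_t))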
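
The interaction protocol above can be sketched as a short simulation loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: models are toy two-armed Bernoulli bandits, policies are arm indices, and every name here (`run_dmso`, `alternating_environment`, `uniform_learner`, and so on) is hypothetical.

```python
import random


def run_dmso(horizon, choose_model, choose_policy, sample_observation,
             value, comparator, seed=0):
    """Simulate the DMSO protocol and return Reg(comparator)."""
    rng = random.Random(seed)
    history = []  # (pi_t, o_t) pairs available to the learner
    regret = 0.0
    for t in range(horizon):
        model = choose_model(t)                       # environment picks M_t
        policy = choose_policy(history, rng)          # learner picks pi_t
        obs = sample_observation(model, policy, rng)  # o_t ~ M_t(. | pi_t)
        history.append((policy, obs))
        regret += value(model, comparator) - value(model, policy)
    return regret


# Toy instance (assumption): a model is a tuple of Bernoulli means,
# a policy is an arm index, and V_M(pi) is the mean reward of that arm.
def value(model, policy):
    return model[policy]


def sample_observation(model, policy, rng):
    return 1 if rng.random() < model[policy] else 0


def alternating_environment(t):
    # The environment may change the model from round to round.
    return (0.9, 0.1) if t % 2 == 0 else (0.2, 0.6)


def uniform_learner(history, rng):
    # Placeholder learner that ignores the history.
    return rng.randrange(2)


if __name__ == "__main__":
    print(run_dmso(1000, alternating_environment, uniform_learner,
                   sample_observation, value, comparator=0))
```

The learner only sees `history`, never the model itself, which mirrors the fact that the observation o_t is the learner's only feedback in each round.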

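Φ-restricted environment: The collection of information sets Φ partitions M×Π, and each information set ϕ ∈ Φ contains a single policy π_ϕ.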

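Model Architecture

1. General framework (Algorithm 1)

The core idea is to solve the following saddle-point problem:

min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} AIR^{Φ,D}_η(p,ν;ρ_t)

where the divergence measure is: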

D^π(ν||ρ) = E_{M~ν}E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ) + E_{ϕ~ρ}[D^π(ϕ||M)]]
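
The KL term in the divergence above can be illustrated numerically for a finite toy problem. The sketch below is a simplification under loud assumptions: the policy π is fixed, there are finitely many models and observations, each model is treated as its own information set, the reference distribution ρ has full support, and the second term E_{ϕ~ρ}[D^π(ϕ||M)] is omitted because its form is not specified here. All function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for finite distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def posterior(prior, likelihoods, obs):
    """Bayes posterior over models given obs; likelihoods[m][o] = M(o | pi)."""
    weights = [w * lik[obs] for w, lik in zip(prior, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_kl_to_reference(nu, likelihoods, rho):
    """E_{M~nu} E_{o~M(.|pi)} [KL(nu(.|pi,o), rho)] for a fixed policy pi.

    Assumption: each model is its own information set, so the posterior
    over information sets is the posterior over models.
    """
    n_obs = len(likelihoods[0])
    gain = 0.0
    for o in range(n_obs):
        # Marginal probability of observing o when M ~ nu.
        prob_o = sum(w * lik[o] for w, lik in zip(nu, likelihoods))
        if prob_o > 0:
            gain += prob_o * kl(posterior(nu, likelihoods, o), rho)
    return gain


if __name__ == "__main__":
    nu = [0.5, 0.5]
    rho = [0.5, 0.5]
    print(expected_kl_to_reference(nu, [[1.0, 0.0], [0.0, 1.0]], rho))
    print(expected_kl_to_reference(nu, [[0.5, 0.5], [0.5, 0.5]], rho))
```

When the observation perfectly identifies the model, the term equals log 2 for a uniform two-model prior; when the models induce identical observation distributions, it is zero, since observing o does not move the posterior away from ρ.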

2. Definition of Dig-DEC

dig-dec^{Φ,D}_η = max_{ρ∈Δ(Φ)} min_{p∈Δ(Π)} max_{ν∈Δ(Ψ)} 
E_{π~p}E_{(M,π*)~ν}[V_M(π*) - V_M(π) - (1/η)E_{o~M(·|π)}[KL(ν_{ϕ}(·|π,o), ρ)] - (1/η)E_{ϕ~ρ}[D^π(ϕ||M)]]
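
The two penalty terms in this definition are exactly the two terms of the divergence D^π(ν||ρ) from the general framework, scaled by 1/η. Reading M~ν there as the M-marginal of the joint distribution ν over (M, π*), which is an interpretive assumption, and noting that E_{ϕ~ρ}[D^π(ϕ||M)] does not depend on the observation o, the definition can be written more compactly as:

```latex
\[
\mathrm{dig\text{-}dec}^{\Phi,D}_{\eta}
= \max_{\rho\in\Delta(\Phi)} \; \min_{p\in\Delta(\Pi)} \; \max_{\nu\in\Delta(\Psi)} \;
\mathbb{E}_{\pi\sim p}\Big[
  \mathbb{E}_{(M,\pi^\star)\sim\nu}\big[ V_M(\pi^\star) - V_M(\pi) \big]
  - \tfrac{1}{\eta}\, D^{\pi}(\nu \,\|\, \rho)
\Big],
\]
\[
D^{\pi}(\nu \,\|\, \rho)
= \mathbb{E}_{M\sim\nu}\Big[
  \mathbb{E}_{o\sim M(\cdot\mid\pi)}\big[ \mathrm{KL}\big(\nu_{\phi}(\cdot\mid\pi,o),\, \rho\big) \big]
  + \mathbb{E}_{\phi\sim\rho}\big[ D^{\pi}(\phi \,\|\, M) \big]
\Big].
\]
```

In this form the objective trades off the regret of p against the worst-case ν with the information gained about ν relative to the reference distribution ρ, with η controlling the exchange rate between the two.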