Distributionally Robust Markov Decision Processes and their Connection to Risk Measures

Bäuerle, Glauner

We consider robust Markov Decision Processes with Borel state and action spaces, unbounded cost and finite time horizon. Our formulation leads to a Stackelberg game against nature. Under integrability, continuity and compactness assumptions we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of deterministic optimal policies for both players. This is in contrast to classical zero-sum games. In case the state space is the real line we show under some convexity assumptions that the interchange of supremum and infimum is possible with the help of Sion's minimax Theorem. Further, we consider the problem with special ambiguity sets. In particular we are able to derive some cases where the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section we discuss two applications: A robust LQ problem and a robust problem for managing regenerative energy.

Basic Information

Paper ID: 2007.13103
Title: Distributionally Robust Markov Decision Processes and their Connection to Risk Measures
Authors: Nicole Bäuerle, Alexander Glauner
Categories: math.OC (Optimization and Control), q-fin.RM (Risk Management)
Published: July 26, 2020
Paper link: https://arxiv.org/abs/2007.13103

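Summary

This paper studies robust Markov decision processes with Borel state and action spaces, unbounded cost and a finite time horizon. The problem is formulated as a Stackelberg game against nature. Under integrability, continuity and compactness assumptions, the authors derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. They also prove that deterministic optimal policies exist for both players, in contrast to classical zero-sum games. When the state space is the real line, Sion's minimax theorem allows the supremum and infimum to be interchanged under certain convexity assumptions. The paper further considers special ambiguity sets and, in particular, identifies cases in which the robust optimization problem coincides with the minimization of a coherent risk measure.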

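Research Background and Motivation

Problem background

Classical Markov decision processes (MDPs) assume that all parameters and distributions are known or can be estimated precisely. In practice, however, a policy that is "optimal" under these assumptions can perform substantially worse when the true parameters or distributions deviate from them.

Research motivation

Model uncertainty: in reality, transition probabilities can rarely be obtained exactly, so there is model ambiguity
Aversion to ambiguity: the Ellsberg paradox suggests that decision makers tend to be ambiguity-averse
Theoretical limitations: existing research on robust MDPs is mostly restricted to finite state and action spaces
Application needs: practical problems require continuous state spaces and unbounded cost functions

Limitations of existing approaches

Most studies are confined to countable or finite state and action spaces
Continuous spaces and unbounded cost are not handled
The connection to risk measures is not explored in depth
There is no proof that deterministic optimal policies exist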

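Core Contributions

Extended theoretical framework: extends existing robust MDP theory from countable spaces to Borel spaces and handles unbounded cost functions
Game-theoretic modeling: formulates the problem as a Stackelberg game with the decision maker as leader and nature as follower
Existence of optimal policies: proves that deterministic optimal policies exist for both players, unlike in classical zero-sum games
Conditions for the minimax interchange: under convexity assumptions, uses Sion's minimax theorem to interchange supremum and infimum
Connection to risk measures: shows that, for special ambiguity sets, the robust optimization problem is equivalent to minimizing a coherent risk measure
Applications: presents two examples, a robust LQ problem and the management of regenerative energy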

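Method Details

Task definition

Consider a Markov decision process with finite time horizon N:

State space: E (Borel space)
Action space: A (Borel space)
Transition function: $T_n: D_n \times Z \to E$, where $D_n \subseteq E \times A$ is the set of admissible state-action pairs and $D_n(x)$ denotes the actions admissible in state $x$
Cost function: $c_n: D_n \times E \to \mathbb{R}$
Disturbances: $Z_1, \ldots, Z_N$, independent random elements

The goal is to minimize the worst-case expected cost: $V_0(x) = \inf_{\pi \in \Pi^R} \sup_{\gamma \in \Gamma} V_0^{\pi\gamma}(x)$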

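Model Structure

1. Ambiguity set modeling

Define the ambiguity set $\mathcal{Q}_n \subseteq M_q(\Omega_n, \mathcal{A}_n, P_n)$, where:

$M_q(\Omega_n, \mathcal{A}_n, P_n)$: the set of probability measures that are absolutely continuous with respect to $P_n$, with densities in $L^q$
The set is equipped with the weak* topology $\sigma(L^q, L^p)$, where $\frac{1}{p} + \frac{1}{q} = 1$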
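
As a concrete illustration of such an ambiguity set, consider a finite sample space and all measures $Q$ whose density with respect to the reference measure $P$ is bounded by $1/(1-\alpha)$. The worst-case expectation over this set is the Expected Shortfall (also called CVaR or Average Value-at-Risk) at level $\alpha$, a standard example of a coherent risk measure. The sketch below is an assumed toy instance, not a construction taken from the paper; the function name `worst_case_expectation` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def worst_case_expectation(costs, probs, alpha):
    """Hypothetical helper: sup of E_Q[cost] over all Q with dQ/dP <= 1/(1 - alpha).

    costs : outcomes of the cost on a finite sample space
    probs : reference probabilities P of those outcomes
    alpha : level in [0, 1); alpha = 0 leaves only Q = P
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    cap = 1.0 / (1.0 - alpha)          # density bound (assumed form of the ambiguity set)
    order = np.argsort(costs)[::-1]    # visit the most expensive outcomes first
    remaining = 1.0                    # probability mass that Q still has to place
    total = 0.0
    for i in order:
        mass = min(cap * probs[i], remaining)
        total += mass * costs[i]
        remaining -= mass
        if remaining <= 0.0:
            break
    return total

costs = [1.0, 2.0, 3.0, 4.0]
probs = [0.25, 0.25, 0.25, 0.25]
print(worst_case_expectation(costs, probs, alpha=0.0))  # plain expectation under P
print(worst_case_expectation(costs, probs, alpha=0.5))  # mean of the worst half
```

The greedy allocation works because the objective is linear in $Q$: nature pushes as much mass as the density bound allows onto the most expensive outcomes.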

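2. Stackelberg game structure

Decision maker: chooses a policy $\pi = (\pi_0, \pi_1, \ldots, \pi_{N-1})$
Nature: chooses $\gamma = (\gamma_0, \ldots, \gamma_{N-1})$ after observing the decision maker's action
Information structure: nature is the follower and can observe the decision maker's action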

3. Value function recursion

Under the stated assumptions, the value functions satisfy the Bellman equation: $J_n(x) = \inf_{a \in D_n(x)} \sup_{Q \in \mathcal{Q}_{n+1}} L_n J_{n+1}(x,a,Q)$

where: $L_n v(x,a,Q) = \int \big( c_n(x,a,T_n(x,a,z)) + v(T_n(x,a,z)) \big) \, Q(dz)$
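
On finite state, action and disturbance spaces with a finite ambiguity set, this recursion can be sketched in a few lines. This is a hedged toy discretization, not the Borel-space setting treated in the paper: the array layout, the zero terminal value, the use of the same data and ambiguity set at every stage, and the function name `robust_value_iteration` are all assumptions made for illustration.

```python
import numpy as np

def robust_value_iteration(cost, trans, ambiguity, horizon):
    """Backward recursion J_n(x) = min_a max_Q sum_z [c(x,a,T) + J_{n+1}(T)] Q(z).

    cost[x, a, y]   : one-stage cost when moving from x to y under action a
    trans[x, a, z]  : index of the next state T(x, a, z)
    ambiguity[k, z] : k-th candidate distribution Q_k of the disturbance
    """
    n_states, n_actions, _ = trans.shape
    x_idx = np.arange(n_states)[:, None, None]
    a_idx = np.arange(n_actions)[None, :, None]
    J = np.zeros(n_states)  # terminal value J_N = 0 (assumption)
    policy = np.zeros((horizon, n_states), dtype=int)
    for n in reversed(range(horizon)):
        # integrand[x, a, z] = c(x, a, T(x,a,z)) + J_{n+1}(T(x,a,z))
        integrand = cost[x_idx, a_idx, trans] + J[trans]
        # expected[x, a, k] = integral of the integrand under Q_k
        expected = integrand @ ambiguity.T
        worst = expected.max(axis=2)        # nature, the follower, picks the worst Q
        policy[n] = worst.argmin(axis=1)    # the decision maker minimizes the worst case
        J = worst.min(axis=1)
    return J, policy

# Toy data: the next state equals the disturbance index z.
trans = np.tile(np.array([0, 1]), (2, 2, 1))           # shape (2 states, 2 actions, 2 disturbances)
cost = np.empty((2, 2, 2))
cost[:, 0, :] = [0.0, 10.0]                            # action 0: cheap unless state 1 is hit
cost[:, 1, :] = [4.0, 4.0]                             # action 1: constant cost
ambiguity = np.array([[0.9, 0.1], [0.5, 0.5]])         # two candidate disturbance laws
J0, policy = robust_value_iteration(cost, trans, ambiguity, horizon=2)
print(J0, policy)
```

In this toy example action 0 looks attractive under the first distribution alone, but the second distribution makes it expensive, so the robust recursion selects the constant-cost action 1.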

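Technical Innovations

1. Application of a measurable selection theorem

Rieder's measurable selection theorem is used to handle measurability issues on continuous spaces, which ensures the existence of optimal policies.

2. Use of the weak* topology

The weak* topology $\sigma(L^q, L^p)$ is used instead of the topology of weak convergence, which makes it easier to establish the connection to recursive risk measures.

3. Bounding function technique

Upper and lower bounding functions $\bar{b}$ and $\underline{b}$ are introduced to handle unbounded cost and to ensure that the value functions are well defined.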

4. Convexity analysis

Under the assumptions of a convex model, Sion's minimax theorem yields: $\inf_{a \in D_n(x)} \sup_{Q \in \mathcal{Q}_{n+1}} L_n J_{n+1}(x,a,Q) = \sup_{Q \in \mathcal{Q}_{n+1}} \inf_{a \in D_n(x)} L_n J_{n+1}(x,a,Q)$
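
The role of convexity in this interchange can be checked numerically on a small example. The sketch below is an assumed toy instance, not taken from the paper: the quadratic cost $(a - z)^2$, the two-point disturbance $z \in \{-1, +1\}$, the grids, and the names `stage_cost` and `inf_sup_and_sup_inf` are all hypothetical. The expected cost is convex in the action and linear in the probability $p = Q(z = +1)$, so the two values agree on a convex action set, while restricting to two isolated actions opens a gap.

```python
import numpy as np

def stage_cost(a, p):
    """Expected value of (a - z)^2 when z = +1 with probability p and z = -1 otherwise."""
    return p * (a - 1.0) ** 2 + (1.0 - p) * (a + 1.0) ** 2

def inf_sup_and_sup_inf(actions, probs):
    """Return (min over a of max over p, max over p of min over a) on the given grids."""
    table = stage_cost(actions[:, None], probs[None, :])  # table[i, j] = cost(actions[i], probs[j])
    inf_sup = table.max(axis=1).min()
    sup_inf = table.min(axis=0).max()
    return inf_sup, sup_inf

probs = np.linspace(0.0, 1.0, 101)            # grid over the ambiguity set (all laws on {-1, +1})
interval_actions = np.linspace(-1.0, 1.0, 201)  # grid over the convex action set [-1, 1]
two_actions = np.array([-1.0, 1.0])             # non-convex action set

print(inf_sup_and_sup_inf(interval_actions, probs))  # the two values coincide
print(inf_sup_and_sup_inf(two_actions, probs))       # inf-sup is strictly larger than sup-inf
```

The inequality $\sup \inf \le \inf \sup$ always holds; the convexity and compactness assumptions are what rule out a strict gap.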