2025-11-17T12:28:12.099327

Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling

Tang, Cheng, Kumar

The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative with respect to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.

Basic Information

Paper ID: 2510.11877
Title: Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling
Authors: Xiaohang Tang (University College London), Zhuowen Cheng (Independent Researcher), Satyabrat Kumar (University College London)
Categories: cs.LG cs.GT
Publication date/venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Reliable ML
Paper link: https://arxiv.org/abs/2510.11877

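Abstract

The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to sequential decision-making problems, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. However, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. This paper introduces the Conservative Adversarially Robust Decision Transformer (CART), to the authors' knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. The interaction between the protagonist and the adversary at each stage is modeled as a stage game whose payoff is defined as the expected maximum value over subsequent states, which explicitly incorporates stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART produces policies that are both less exploitable (adversarially robust) and conservative with respect to transition uncertainty.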

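Research Background and Motivation

Problem Definition

The core problem this work addresses is improving the adversarial robustness of the Decision Transformer in stochastic games. Specifically: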

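Vulnerability of the Decision Transformer: Although DT performs well on sequential decision-making tasks, it is easily exploited in adversarial settings. Because it learns its policy in an imitation-learning fashion, a high return in the data may simply reflect weaknesses in the adversary's policy rather than genuine robustness.
Limitations of existing methods: The Adversarially Robust Decision Transformer (ARDT) mitigates this issue by conditioning on minimax returns, but it applies only to adversarial reinforcement learning with deterministic state transitions and can be overly optimistic in games with stochastic transitions.
Challenge of handling stochasticity: In stochastic games, state transitions are inherently probabilistic. Because ARDT conditions only on the minimax return, it may ignore transition probabilities and therefore misestimate the probability of reaching high-return subgames.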
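
The over-optimism described above can be illustrated with a small sketch. This is not the paper's algorithm: the one-step game, the names TRANSITIONS, best_case, expected, and maximin, and all numbers are hypothetical, and the stage game is solved with a pure-strategy maximin rather than a full NashQ (mixed-strategy) solution. The sketch only contrasts a target that ignores transition probabilities with one that weights outcomes by them.

```python
from typing import Callable, Dict, List, Tuple

# Each outcome is a (probability, value of the next state) pair.
Outcomes = List[Tuple[float, float]]

# Hypothetical toy game: TRANSITIONS[a][b] lists the stochastic outcomes
# when the protagonist plays a and the adversary plays b.
TRANSITIONS: Dict[int, Dict[int, Outcomes]] = {
    0: {0: [(0.1, 10.0), (0.9, 0.0)], 1: [(0.1, 10.0), (0.9, 0.0)]},
    1: {0: [(1.0, 3.0)], 1: [(1.0, 2.0)]},
}


def best_case(outcomes: Outcomes) -> float:
    """Optimistic payoff: the best reachable value, ignoring probabilities."""
    return max(value for _, value in outcomes)


def expected(outcomes: Outcomes) -> float:
    """Probability-weighted payoff over the next states."""
    return sum(prob * value for prob, value in outcomes)


def maximin(
    transitions: Dict[int, Dict[int, Outcomes]],
    payoff: Callable[[Outcomes], float],
) -> Tuple[int, float]:
    """Pure-strategy maximin: pick the protagonist action whose worst-case
    payoff over adversary actions is largest."""
    best_action, best_value = -1, float("-inf")
    for action, row in transitions.items():
        worst = min(payoff(outcomes) for outcomes in row.values())
        if worst > best_value:
            best_action, best_value = action, worst
    return best_action, best_value


if __name__ == "__main__":
    print("ignoring probabilities:", maximin(TRANSITIONS, best_case))
    print("using expectations:    ", maximin(TRANSITIONS, expected))
```

In this toy game the probability-blind target prefers action 0 because a value of 10 is reachable, even though it occurs only 10% of the time; the expectation-based target prefers the safer action 1. Conditioning a policy on the first kind of value would overstate how often the high-return subgame is actually reached.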