Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling

Tang, Cheng, Kumar
The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.

Basic Information

  • Paper ID: 2510.11877
  • Title: Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling
  • Authors: Xiaohang Tang (University College London), Zhuowen Cheng (Independent Researcher), Satyabrat Kumar (University College London)
  • Classification: cs.LG cs.GT
  • Publication Time/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Reliable ML
  • Paper Link: https://arxiv.org/abs/2510.11877

Abstract

Transformers, as highly expressive architectures for sequence modeling, have recently been adapted to solve sequential decision-making problems, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. However, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. This paper introduces the Conservative Adversarially Robust Decision Transformer (CART), to the authors' knowledge the first framework designed to enhance DT's robustness in adversarial stochastic games. The interaction between the protagonist and the adversary at each stage is modeled as a stage game, where payoffs are defined as the expected maximum value of subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning the Transformer policy on NashQ values derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative with respect to transition uncertainty.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is to improve the adversarial robustness of Decision Transformer in stochastic game environments. Specifically:

  1. Vulnerability of Decision Transformer: Although DT demonstrates excellent performance in sequential decision-making tasks, it is susceptible to exploitation in adversarial environments because it learns policies through imitation learning, where high returns may be attributable solely to opponent strategy weaknesses rather than genuine robustness.
  2. Limitations of Existing Methods: While Adversarially Robust Decision Transformer (ARDT) mitigates this issue by conditioning on minimax returns, its applicability is limited to adversarial reinforcement learning with deterministic state transitions. It may exhibit excessive optimism in games with stochastic state transitions.
  3. Challenges in Handling Stochasticity: In stochastic games, state transitions are inherently probabilistic. ARDT may overlook transition probabilities by conditioning solely on minimax returns, leading to misestimation of the probability of visiting high-return subgames.

Research Significance

The importance of this problem is reflected in:

  • Practical Value: Real-world multi-agent systems often involve uncertainty and adversarial interactions
  • Theoretical Significance: Fills a research gap in sequence modeling's adversarial robustness
  • Safety: Enhances the reliability of AI systems in adversarial environments

Core Contributions

  1. First Robust Decision Transformer Framework for Stochastic Games: Proposes CART, the first method specifically designed to enhance DT's robustness in adversarial stochastic games.
  2. Stage Game Modeling: Models protagonist-adversary interactions at each time step as stage games, with payoff functions defined as the expected maximum value of subsequent states, explicitly considering stochastic state transitions.
  3. NashQ Value Estimation Algorithm: Combines Expectile Regression and Temporal Difference (TD) learning to solve optimal minimax Q-values across all stages.
  4. Empirical Validation: Validates CART's superiority in minimax value estimation accuracy and worst-case returns across multiple synthetic stochastic games.

Methodology Details

Task Definition

A stochastic game is defined as a tuple $(S, A, \bar{A}, T, R)$, where:

  • $S$: State space
  • $A, \bar{A}$: Action spaces of the protagonist and the adversary
  • $T$: Transition probability distribution, $s_{t+1} \sim T(\cdot \mid s_t, a_t, \bar{a}_t)$
  • $R$: Reward function

The objective is to learn a protagonist policy that is robust to adaptive adversaries: $(\pi^*, \bar{\pi}^*) = \max_{\pi} \min_{\bar{\pi}} E_{\tau \sim \rho^{\pi,\bar{\pi}}}\left[\sum_t r_t\right]$

Model Architecture

1. Stage Game Modeling

The interaction at each time step is modeled as a stage game with payoff $\bar{Q}(s,a,\bar{a}) = E_{s' \sim T(\cdot \mid s,a,\bar{a})}[r + V(s')]$, where $V(s') = \max_{a'} Q(s',a')$.

Here, $V(s')$ is the value of executing the best protagonist action at the next-stage state $s'$. Taking the expectation over $s'$ is what accounts for stochastic transitions.

2. NashQ Value Computation

The NashQ value for sequential games is defined as the worst case of the stage-game payoff over adversary actions: $Q_{CART}(s,a) = \min_{\bar{a}} \bar{Q}(s,a,\bar{a})$
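The stage-game payoff and the NashQ value can be sketched in tabular form. The following is a minimal illustration, assuming a single hypothetical decision state, transitions that depend on both players' actions (as in the task definition), and next-stage values that are already known. All names and numbers are made up for illustration and do not come from the paper.

```python
# Hypothetical single decision state with two protagonist actions and two
# adversary actions. All numbers are illustrative, not taken from the paper.
PROTAGONIST_ACTIONS = ["a0", "a1"]
ADVERSARY_ACTIONS = ["b0", "b1"]

# V_NEXT[s'] = max_a' Q(s', a'), assumed already solved for the next stage.
V_NEXT = {"good": 10.0, "bad": 0.0}

# TRANSITIONS[(a, a_bar)] = {next_state: probability}
TRANSITIONS = {
    ("a0", "b0"): {"good": 0.25, "bad": 0.75},
    ("a0", "b1"): {"good": 0.50, "bad": 0.50},
    ("a1", "b0"): {"good": 0.75, "bad": 0.25},
    ("a1", "b1"): {"good": 0.50, "bad": 0.50},
}
# REWARDS[(a, a_bar)] = immediate reward r
REWARDS = {("a0", "b0"): 4.0, ("a0", "b1"): 1.0,
           ("a1", "b0"): 0.0, ("a1", "b1"): 3.0}


def stage_payoff(a, a_bar):
    """Q_bar(s, a, a_bar) = r + E_{s'}[V(s')] for the single state s."""
    expected_next = sum(p * V_NEXT[s2]
                        for s2, p in TRANSITIONS[(a, a_bar)].items())
    return REWARDS[(a, a_bar)] + expected_next


def nash_q(a):
    """Q_CART(s, a): worst stage payoff over adversary actions."""
    return min(stage_payoff(a, a_bar) for a_bar in ADVERSARY_ACTIONS)


def state_value():
    """V(s) = max_a Q_CART(s, a), the value passed to the previous stage."""
    return max(nash_q(a) for a in PROTAGONIST_ACTIONS)
```

In this toy table, `nash_q("a0")` is 6.0 and `nash_q("a1")` is 7.5, so the protagonist prefers `a1` and `state_value()` is 7.5. Applying the same three functions stage by stage, from the last stage backwards, would solve a multi-stage game exactly when the model is known.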

3. Practical Algorithm Implementation

Because direct min/max operations are inefficient, expectile regression is used to approximate them:

Step 1: Learn Stage Game Payoffs
$L(\bar{Q}) = E_{(s,a,\bar{a},r,s') \sim D}\left[\left(\bar{Q}(s,a,\bar{a}) - V(s') - r\right)^2\right]$

Step 2: Estimate NashQ Values
$L(Q) = E_{(s,a,\bar{a},r,s') \sim D}\left[L^{\alpha \to 0}_{ER}\left(Q(s,a) - \bar{Q}(s,a,\bar{a})\right)\right]$

Step 3: Approximate Optimal State Value Function
$L(V) = E_{(s',a') \sim D}\left[L^{\alpha \to 1}_{ER}\left(V(s') - Q(s',a')\right)\right]$

where the expectile regression objective is defined as: $L^{\alpha}_{ER}(u) = E_u\left[\,|\alpha - \mathbf{1}(u > 0)| \cdot u^2\,\right]$
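To see why an expectile loss can stand in for min and max, the sketch below fits a single scalar to a handful of targets by gradient descent. It assumes the residual convention u = prediction - target and the weight |alpha - 1(u > 0)|. The function names, targets, learning rate, and step count are illustrative choices, not the paper's.

```python
def expectile_loss(u, alpha):
    """|alpha - 1(u > 0)| * u**2 for one residual u = prediction - target."""
    weight = abs(alpha - (1.0 if u > 0 else 0.0))
    return weight * u * u


def fit_expectile(targets, alpha, lr=0.05, steps=5000):
    """Minimise the mean expectile loss over a scalar prediction q."""
    q = sum(targets) / len(targets)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            u = q - t
            weight = abs(alpha - (1.0 if u > 0 else 0.0))
            grad += 2.0 * weight * u  # derivative of weight * u**2 w.r.t. q
        q -= lr * grad / len(targets)
    return q


targets = [2.0, 5.0, 9.0]
q_low = fit_expectile(targets, alpha=0.01)   # small alpha: pulled toward min
q_mid = fit_expectile(targets, alpha=0.50)   # alpha = 0.5: the mean
q_high = fit_expectile(targets, alpha=0.99)  # large alpha: pulled toward max
```

With a small alpha, overestimates (u > 0) are penalised far more heavily than underestimates, so `q_low` settles just above the smallest target. With alpha near 1 the asymmetry flips and `q_high` settles just below the largest one. Unlike a hard min or max, the fit only involves targets that actually appear in the data.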

Technical Innovations

  1. Explicit Stochasticity Handling: By introducing an additional state value function $V$, CART explicitly accounts for randomness in state transitions and avoids ARDT's excessive optimism.
  2. Integration of Expectile Regression and TD Learning: Expectile regression is used to approximate the min/max operations, enabling more efficient learning on trajectory data.
  3. Balancing Conservatism and Robustness: By conditioning on NashQ values, CART generates policies that are simultaneously adversarially robust and conservative with respect to transition uncertainty.

Experimental Setup

Datasets

Experiments are conducted on synthetic stochastic games, including:

  1. Two-Stage Stochastic Games: Primary illustrative examples
  2. Three-Stage Stochastic Games: More complex sequential interactions
  3. Five Game Variants: Test robustness under different stochasticity settings

Data are collected with uniform random behavior policies. The dataset contains $10^5$ trajectories and covers all possible trajectories.
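Collecting such a dataset can be sketched as follows, assuming a hypothetical two-stage game stored as a dictionary. The game table, the action names, and the `collect` helper are illustrative assumptions, and the paper's actual games are not reproduced here.

```python
import random

# Hypothetical two-stage game: state -> {(a, a_bar): (reward, {next_state: prob})}.
# An empty successor dict marks the end of the episode.
GAME = {
    "s0": {("a0", "b0"): (0.0, {"s1": 0.5, "s2": 0.5}),
           ("a0", "b1"): (0.0, {"s1": 0.25, "s2": 0.75}),
           ("a1", "b0"): (2.0, {"s2": 1.0}),
           ("a1", "b1"): (3.0, {"s2": 1.0})},
    "s1": {("a0", "b0"): (8.0, {}), ("a0", "b1"): (6.0, {}),
           ("a1", "b0"): (10.0, {}), ("a1", "b1"): (0.0, {})},
    "s2": {("a0", "b0"): (2.0, {}), ("a0", "b1"): (4.0, {}),
           ("a1", "b0"): (1.0, {}), ("a1", "b1"): (1.0, {})},
}
ACTIONS = ["a0", "a1"]
ADVERSARY_ACTIONS = ["b0", "b1"]


def sample_trajectory(rng, start="s0"):
    """Roll out one episode with both players acting uniformly at random."""
    trajectory, state = [], start
    while state in GAME:
        a = rng.choice(ACTIONS)
        a_bar = rng.choice(ADVERSARY_ACTIONS)
        reward, successors = GAME[state][(a, a_bar)]
        if successors:
            states = list(successors)
            weights = [successors[s] for s in states]
            next_state = rng.choices(states, weights=weights)[0]
        else:
            next_state = None  # terminal
        trajectory.append((state, a, a_bar, reward, next_state))
        state = next_state
    return trajectory


def collect(num_trajectories, seed=0):
    """Return a list of trajectories, each a list of (s, a, a_bar, r, s') tuples."""
    rng = random.Random(seed)
    return [sample_trajectory(rng) for _ in range(num_trajectories)]
```

Because both players act uniformly at random, every action pair at every reachable state shows up with high probability once the number of trajectories is large relative to the size of the game. This is the coverage property the dataset relies on.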

Evaluation Metrics

  • Worst-Case Return: Policy performance against optimal adversaries
  • Minimax Value Estimation Accuracy: Deviation from theoretical values
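The worst-case return of a fixed policy can be computed exactly in a small game by backward induction. The sketch below assumes a deterministic protagonist policy, an adversary that best-responds at every state, and a hypothetical two-stage game table. None of the numbers come from the paper.

```python
# Hypothetical game: state -> {(a, a_bar): (reward, {next_state: prob})}.
# States that do not appear as keys are terminal.
GAME = {
    "s0": {("a0", "b0"): (0.0, {"s1": 0.5, "s2": 0.5}),
           ("a0", "b1"): (0.0, {"s1": 0.25, "s2": 0.75}),
           ("a1", "b0"): (2.0, {"s2": 1.0}),
           ("a1", "b1"): (3.0, {"s2": 1.0})},
    "s1": {("a0", "b0"): (8.0, {}), ("a0", "b1"): (6.0, {}),
           ("a1", "b0"): (10.0, {}), ("a1", "b1"): (0.0, {})},
    "s2": {("a0", "b0"): (2.0, {}), ("a0", "b1"): (4.0, {}),
           ("a1", "b0"): (1.0, {}), ("a1", "b1"): (1.0, {})},
}


def worst_case_return(state, policy):
    """Expected return of `policy` from `state` against a best-responding adversary."""
    if state not in GAME:
        return 0.0  # terminal state
    chosen = policy[state]
    outcomes = []
    for (a, a_bar), (reward, successors) in GAME[state].items():
        if a != chosen:
            continue
        # Expectation over stochastic next states for this adversary action.
        future = sum(p * worst_case_return(s2, policy)
                     for s2, p in successors.items())
        outcomes.append(reward + future)
    return min(outcomes)  # adversary picks the most damaging action


cautious = {"s0": "a1", "s1": "a0", "s2": "a0"}
greedy = {"s0": "a0", "s1": "a1", "s2": "a0"}  # chases the reward of 10 at s1
```

Here `worst_case_return("s0", cautious)` is 4.0 while `worst_case_return("s0", greedy)` is 1.0. The greedy policy is tempted by the largest reward in the table, but an optimal adversary never lets it collect that reward.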

Baseline Methods

  • Decision Transformer (DT): Original decision transformer
  • Adversarially Robust Decision Transformer (ARDT): Existing adversarial robustness method

Implementation Details

  • Test-time adversary assumed to be optimal
  • Decoding with high target returns
  • Alternating optimization of three loss functions until convergence
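The alternating optimization can be illustrated with lookup tables in place of neural networks. This is a rough sketch under several assumptions: tabular values, full-batch gradient steps, alpha values of 0.01 and 0.99 as finite stand-ins for the limits, and a hypothetical dataset in which stochastic transitions appear as repeated tuples. It is not the authors' implementation.

```python
from collections import defaultdict

ALPHA_MIN, ALPHA_MAX = 0.01, 0.99  # finite stand-ins for alpha -> 0 and alpha -> 1


def build_dataset():
    """Hypothetical offline data: (s, a, a_bar, r, s_next); s_next=None is terminal."""
    spec = {  # (s, a, a_bar): (reward, {s_next: number of occurrences})
        ("s0", "a0", "b0"): (0.0, {"s1": 2, "s2": 2}),
        ("s0", "a0", "b1"): (0.0, {"s1": 1, "s2": 3}),
        ("s0", "a1", "b0"): (2.0, {"s2": 4}),
        ("s0", "a1", "b1"): (3.0, {"s2": 4}),
        ("s1", "a0", "b0"): (8.0, {None: 4}),
        ("s1", "a0", "b1"): (6.0, {None: 4}),
        ("s1", "a1", "b0"): (10.0, {None: 4}),
        ("s1", "a1", "b1"): (0.0, {None: 4}),
        ("s2", "a0", "b0"): (2.0, {None: 4}),
        ("s2", "a0", "b1"): (4.0, {None: 4}),
        ("s2", "a1", "b0"): (1.0, {None: 4}),
        ("s2", "a1", "b1"): (1.0, {None: 4}),
    }
    data = []
    for (s, a, b), (r, successors) in spec.items():
        for s_next, count in successors.items():
            data.extend([(s, a, b, r, s_next)] * count)
    return data


def expectile_step(table, pairs, alpha, lr):
    """One full-batch gradient step on the mean of |alpha - 1(u>0)| * u**2 per key."""
    grads, counts = defaultdict(float), defaultdict(int)
    for key, target in pairs:
        u = table[key] - target
        weight = abs(alpha - (1.0 if u > 0 else 0.0))
        grads[key] += 2.0 * weight * u
        counts[key] += 1
    for key in grads:
        table[key] -= lr * grads[key] / counts[key]


def train(data, sweeps=2000, lr=0.5):
    q_bar, q, v = defaultdict(float), defaultdict(float), defaultdict(float)
    for _ in range(sweeps):
        # Step 1: stage payoffs. alpha = 0.5 gives a loss proportional to
        # the squared TD error, so Q_bar tracks the mean of r + V(s').
        td = [((s, a, b), r + (v[s2] if s2 is not None else 0.0))
              for s, a, b, r, s2 in data]
        expectile_step(q_bar, td, 0.5, lr)
        # Step 2: NashQ values, an approximate min over adversary actions.
        expectile_step(q, [((s, a), q_bar[(s, a, b)])
                           for s, a, b, _, _ in data], ALPHA_MIN, lr)
        # Step 3: state values, an approximate max over protagonist actions.
        expectile_step(v, [(s, q[(s, a)]) for s, a, _, _, _ in data],
                       ALPHA_MAX, lr)
    return q_bar, q, v


q_bar, q, v = train(build_dataset())
```

For this toy dataset the exact minimax values at the first stage are Q(s0, a0) = 3 and Q(s0, a1) = 4, and the learned tables land close to them. The small remaining gap comes from using finite alpha values rather than true min and max.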

Experimental Results

Main Results

Two-Stage Stochastic Game Results

In the illustrative two-stage stochastic game:

  • CART: 8.0 (worst-case return)
  • ARDT: 5.7
  • DT: 6.0

Average Performance Across Five Games

Average performance across five synthetic adversarial stochastic games:

  • CART: 8.115 (lowest variance of the three methods)
  • ARDT: 5.948
  • DT: 6.421

Key Findings

  1. Target Return Sensitivity: CART maintains the highest worst-case return across different target return settings, while ARDT and DT achieve lower returns under adversarial attacks.
  2. Excessive Optimism Issue: ARDT is easily misled by rare high-reward trajectories, overestimating action values while neglecting true transition probabilities, losing robustness at high target returns.
  3. Conservatism Advantage: CART jointly considers rewards and state transition stochasticity, focusing on feasible policies that maximize worst-case expected returns.

Case Analysis

In the illustrative example of Figure 1:

  • ARDT ignores how small the probability of reaching the desired state $s'_2$ is, resulting in overly optimistic state and action value estimates
  • CART handles stochasticity by allocating expected maximum values, yielding more conservative and accurate value estimates
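A small worked calculation shows the size of this effect. The numbers are hypothetical and are not those of Figure 1: a risky action $a_1$ reaches $s'_2$ (value 10) with probability 0.1 and $s'_1$ (value 0) otherwise, a safe action $a_2$ always reaches a state of value 4, immediate rewards are zero, and the adversary is held fixed for simplicity. The first line stands in for any estimator that ignores transition probabilities. It is not a reproduction of ARDT's actual estimates.

```latex
\begin{align*}
\text{max over observed next states:}\quad
  & \hat{Q}(s,a_1) = \max\{10,\, 0\} = 10 \;>\; \hat{Q}(s,a_2) = 4 \\
\text{expectation over next states:}\quad
  & \bar{Q}(s,a_1) = 0.1 \cdot 10 + 0.9 \cdot 0 = 1 \;<\; \bar{Q}(s,a_2) = 4
\end{align*}
```

Under the first estimate a policy conditioned on a high target return picks $a_1$ and usually ends in $s'_1$. Under the second it picks $a_2$ and secures a return of 4.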

Related Work

Stochastic Game Solving

Solving two-player games has been studied extensively in the online setting, where self-play with regret minimization converges to a Nash equilibrium. This work instead focuses on the offline learning setting.

Offline Reinforcement Learning

  • Conservative Q-Learning (CQL): Mitigates Q-value overestimation through pessimistic objectives
  • Implicit Q-Learning (IQL): Achieves value stabilization through expectile regression for implicit value function learning
  • ARDT: Achieves adversarial robustness in static zero-sum games through minimax expectile regression

Decision Transformer Extensions

  • Trajectory Transformer: Captures trajectory stochasticity through latent variables
  • Online Decision Transformer: Integrates hybrid offline-online reinforcement learning
  • Multi-Game Decision Transformer: Supports transfer learning and few-shot adaptation

Conclusions and Discussion

Main Conclusions

CART successfully addresses the adversarial robustness problem of DT in stochastic games through:

  1. Modeling interactions as stage games, explicitly considering stochastic transitions
  2. Using NashQ values for conditioning, generating policies that are both robust and conservative
  3. Achieving superior worst-case performance across multiple stochastic games

Limitations

  1. Experimental Scale: Currently validated only on short-horizon synthetic games
  2. Computational Complexity: Alternating optimization of three objectives may increase computational overhead
  3. Theoretical Analysis: Lacks theoretical guarantees for convergence and robustness

Future Directions

  1. Extension to Complex Environments: Such as poker variants (Kuhn and Leduc poker) and more complex multi-agent competitive environments
  2. Long-Horizon Planning: Exploring larger-scale games and longer planning horizons
  3. Theoretical Refinement: Providing theoretical analysis of convergence and robustness

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to introduce adversarial robustness into sequence modeling for stochastic games, filling an important research gap
  2. Sound Methodology: Elegantly addresses the dual challenges of stochasticity and adversariality through stage game modeling and expectile regression
  3. Comprehensive Experiments: Although in synthetic environments, multiple variants are designed to validate method effectiveness
  4. Important Problem: Addresses a problem with significant practical and theoretical value

Weaknesses

  1. Experimental Limitations: Validated only in simple synthetic environments, lacking verification on real-world applications
  2. Theoretical Gaps: Lacks theoretical analysis of convergence, complexity, and robustness
  3. Method Complexity: Requires alternating optimization of multiple objectives, potentially affecting practicality
  4. Limited Comparisons: Compares only with ARDT and DT, lacking comparison with other robust reinforcement learning methods

Impact

  1. Academic Contribution: Opens new directions for sequence modeling applications in adversarial environments
  2. Practical Value: Provides new insights for developing more robust multi-agent systems
  3. Reproducibility: Clear method description and simple experimental setup facilitate reproduction

Applicable Scenarios

  1. Multi-Agent Systems: Environments with adversarial and uncertain interactions
  2. Safety-Critical Applications: Scenarios requiring worst-case performance guarantees
  3. Offline Learning: Environments where online interaction is infeasible

References

This paper cites important works from reinforcement learning, game theory, and sequence modeling, including:

  • Chen et al. (2021) - Original Decision Transformer work
  • Tang et al. (2024a) - ARDT method
  • Hu and Wellman (2003) - Nash Q-Learning
  • Vaswani et al. (2017) - Transformer architecture

Overall Assessment: This is a high-quality research paper addressing an important and challenging problem. While there is room for improvement in experimental validation and theoretical analysis, its innovation and methodological soundness make it a valuable contribution to the field.