Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling

Tang, Cheng, Kumar

The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative with respect to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.

Basic Information

Paper ID: 2510.11877
Title: Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling
Authors: Xiaohang Tang (University College London), Zhuowen Cheng (Independent Researcher), Satyabrat Kumar (University College London)
Classification: cs.LG cs.GT
Publication Time/Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Reliable ML
Paper Link: https://arxiv.org/abs/2510.11877

Abstract

Transformers, as highly expressive architectures for sequence modeling, have recently been adapted to solve sequential decision-making problems, most notably through Decision Transformer (DT), which learns policies by conditioning on desired returns. However, the adversarial robustness of sequence modeling-based reinforcement learning methods remains largely unexplored. This paper introduces Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance DT's robustness in adversarial stochastic games. We model the interaction between the protagonist and adversary at each stage as a stage game, where payoffs are defined as the expected maximum value of subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning the Transformer policy on NashQ values derived from these stage games, CART generates policies that are simultaneously hard to exploit (adversarially robust) and conservative with respect to transition uncertainty.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is to improve the adversarial robustness of Decision Transformer in stochastic game environments. Specifically:

Vulnerability of Decision Transformer: Although DT performs well in sequential decision-making tasks, it is exploitable in adversarial environments because it learns policies by imitation: a high return in the data may be due solely to a weak opponent strategy rather than to a genuinely robust protagonist policy.
Limitations of Existing Methods: While Adversarially Robust Decision Transformer (ARDT) mitigates this issue by conditioning on minimax returns, its applicability is limited to adversarial reinforcement learning with deterministic state transitions. It may exhibit excessive optimism in games with stochastic state transitions.
Challenges in Handling Stochasticity: In stochastic games, state transitions are inherently probabilistic. ARDT may overlook transition probabilities by conditioning solely on minimax returns, leading to misestimation of the probability of visiting high-return subgames.

Research Significance

The importance of this problem is reflected in:

Practical Value: Real-world multi-agent systems often involve uncertainty and adversarial interactions
Theoretical Significance: Fills a research gap in sequence modeling's adversarial robustness
Safety: Enhances the reliability of AI systems in adversarial environments

Core Contributions

First Robust Decision Transformer Framework for Stochastic Games: Proposes CART, the first method specifically designed to enhance DT's robustness in adversarial stochastic games.
Stage Game Modeling: Models protagonist-adversary interactions at each time step as stage games, with payoff functions defined as the expected maximum value of subsequent states, explicitly considering stochastic state transitions.
NashQ Value Estimation Algorithm: Combines Expectile Regression and Temporal Difference (TD) learning to solve optimal minimax Q-values across all stages.
Empirical Validation: Validates CART's superiority in minimax value estimation accuracy and worst-case returns across multiple synthetic stochastic games.

Methodology Details

Task Definition

A stochastic game is defined as the tuple $(S,A,\bar{A},T,R)$, where:

$S$: State space
$A,\bar{A}$: Action spaces for protagonist and adversary
$T$: Transition probability distribution, $s_{t+1} \sim T(\cdot|s_t,a_t,\bar{a}_t)$
$R$: Reward function

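The objective is to learn a protagonist policy that is robust to an adaptive adversary, i.e. the maximin solution $\pi^* \in \arg\max_\pi \min_{\bar{\pi}} E_{\tau\sim\rho^{\pi,\bar{\pi}}}[\sum_t r_t]$, where $\bar{\pi}^*$ denotes the adversary policy attaining the inner minimum.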

Model Architecture

1. Stage Game Modeling

The interaction at each time step is modeled as a stage game with payoff $\bar{Q}(s,a,\bar{a}) = E_{s'\sim T(\cdot|s,a,\bar{a})}[r + V(s')]$, where $V(s') = \max_{a'} Q(s',a')$.

Here, the $V$ function represents the value of executing the optimal protagonist action at the next-stage state $s'$; the expectation over $s'$ is what accounts for stochastic transitions.

2. NashQ Value Computation

The NashQ value for sequential games is the stage-game payoff under the worst-case adversary action: $Q_{CART}(s,a) = \min_{\bar{a}} \bar{Q}(s,a,\bar{a})$
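The two definitions above can be evaluated exactly on a small tabular stage game. The following is a minimal sketch, assuming the transition probabilities and next-state values are known; the game, its numbers, and the function names (`stage_payoffs`, `nash_q`, `state_value`) are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

Joint = Tuple[int, int]              # (protagonist action, adversary action)
Branch = Tuple[float, float, float]  # (probability, reward, V(next state))


def stage_payoffs(outcomes: Dict[Joint, List[Branch]]) -> Dict[Joint, float]:
    """Stage-game payoff: expectation of r + V(s') over next states."""
    return {
        joint: sum(p * (r + v_next) for p, r, v_next in branches)
        for joint, branches in outcomes.items()
    }


def nash_q(q_bar: Dict[Joint, float]) -> Dict[int, float]:
    """NashQ value: worst case over adversary actions for each protagonist action."""
    q: Dict[int, float] = {}
    for (a, _a_bar), payoff in q_bar.items():
        q[a] = min(q.get(a, float("inf")), payoff)
    return q


def state_value(q: Dict[int, float]) -> float:
    """V(s): best protagonist action under the NashQ values."""
    return max(q.values())


# Hypothetical stage game: action 1 reaches a high-value state only 25% of the time
# when the adversary plays 0.
outcomes = {
    (0, 0): [(1.0, 1.0, 3.0)],
    (0, 1): [(1.0, 1.0, 4.0)],
    (1, 0): [(0.25, 0.0, 12.0), (0.75, 0.0, 0.0)],
    (1, 1): [(1.0, 0.0, 6.0)],
}
q_bar = stage_payoffs(outcomes)
q_cart = nash_q(q_bar)
print(q_bar[(1, 0)], q_cart, state_value(q_cart))
```

In this toy game the expectation pulls the payoff of the risky joint action `(1, 0)` down to 3.0, so the conservative choice is action 0 with a NashQ value of 4.0.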

3. Practical Algorithm Implementation

Because direct min/max operations are inefficient, expectile regression is used to approximate them:

Step 1: Learn Stage Game Payoffs $L(\bar{Q}) = E_{(s,a,\bar{a},r,s')\sim D}[(\bar{Q}(s,a,\bar{a}) - r - V(s'))^2]$

Step 2: Estimate NashQ Values $L(Q) = E_{(s,a,\bar{a},r,s')\sim D}[L^{\alpha\to0}_{ER}(Q(s,a) - \bar{Q}(s,a,\bar{a}))]$

Step 3: Approximate Optimal State Value Function $L(V) = E_{(s',a')\sim D}[L^{\alpha\to1}_{ER}(V(s') - Q(s',a'))]$

where the expectile regression objective is defined as: $L^\alpha_{ER}(u) = E_u[\,|\alpha - \mathbf{1}(u>0)| \cdot u^2\,]$
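To see why expectile regression can stand in for min and max, it helps to fit a single scalar to a handful of targets. The sketch below is illustrative only: it assumes the residual convention $u = q - \text{target}$ used in Steps 2 and 3, weights each squared residual by $|\alpha - \mathbf{1}(u>0)|$, and uses plain gradient descent; `expectile_loss`, `fit_expectile`, and the hyperparameters are hypothetical names and values.

```python
from typing import Sequence


def expectile_loss(u: float, alpha: float) -> float:
    """Asymmetric squared loss |alpha - 1(u > 0)| * u^2 for one residual u."""
    indicator = 1.0 if u > 0 else 0.0
    return abs(alpha - indicator) * u * u


def fit_expectile(targets: Sequence[float], alpha: float,
                  lr: float = 0.05, steps: int = 5000) -> float:
    """Gradient descent on a scalar q minimizing the mean expectile loss of q - target."""
    q = sum(targets) / len(targets)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            u = q - t
            weight = abs(alpha - (1.0 if u > 0 else 0.0))
            grad += 2.0 * weight * u  # derivative of weight * u^2 w.r.t. q
        q -= lr * grad / len(targets)
    return q


targets = [1.0, 4.0, 9.0]
print(fit_expectile(targets, alpha=0.01))  # close to min(targets)
print(fit_expectile(targets, alpha=0.5))   # the mean
print(fit_expectile(targets, alpha=0.99))  # close to max(targets)
```

With this sign convention, a small $\alpha$ penalizes estimates that sit above a target, so the fit is pushed toward the minimum; a large $\alpha$ does the opposite and approaches the maximum.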

Technical Innovations

Explicit Stochasticity Handling: By introducing an additional state value function $V$, CART explicitly accounts for randomness in state transitions and avoids ARDT's excessive optimism.
Integration of Expectile Regression and TD Learning: Expectile regression approximates the min/max operations, enabling more efficient learning from trajectory data.
Balancing Conservatism and Robustness: By conditioning on NashQ values, CART generates policies that are simultaneously adversarially robust and conservative with respect to transition uncertainty.

Experimental Setup

Datasets

Experiments are conducted on synthetic stochastic games, including:

Two-Stage Stochastic Games: Primary illustrative examples
Three-Stage Stochastic Games: More complex sequential interactions
Five Game Variants: Test robustness under different stochasticity settings

Data are collected with uniform random behavior policies. Each dataset contains $10^5$ trajectories and covers all possible trajectories of the game.
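A minimal sketch of this kind of data collection, assuming a tabular game stored as a dictionary from states to joint-action outcomes. The game format, the toy game, and `collect_uniform` are hypothetical and only illustrate sampling both players' actions uniformly at random.

```python
import random
from typing import Dict, List, Tuple

Branch = Tuple[float, float, str]           # (probability, reward, next state)
Game = Dict[str, Dict[Tuple[int, int], List[Branch]]]
Step = Tuple[str, int, int, float, str]     # (s, a, a_bar, r, s')


def collect_uniform(game: Game, start: str, n_traj: int, seed: int = 0) -> List[List[Step]]:
    """Roll out trajectories with both players acting uniformly at random."""
    rng = random.Random(seed)
    dataset: List[List[Step]] = []
    for _ in range(n_traj):
        state, traj = start, []
        while state in game:  # states missing from `game` are terminal
            prot_actions = sorted({a for a, _ in game[state]})
            adv_actions = sorted({ab for _, ab in game[state]})
            a, ab = rng.choice(prot_actions), rng.choice(adv_actions)
            branches = game[state][(a, ab)]
            weights = [p for p, _, _ in branches]
            _, reward, nxt = rng.choices(branches, weights=weights)[0]
            traj.append((state, a, ab, reward, nxt))
            state = nxt
        dataset.append(traj)
    return dataset


# Hypothetical one-stage game with one stochastic joint action.
game: Game = {
    "s0": {
        (0, 0): [(1.0, 2.0, "end")],
        (0, 1): [(1.0, 3.0, "end")],
        (1, 0): [(0.5, 8.0, "win"), (0.5, 0.0, "lose")],
        (1, 1): [(1.0, 1.0, "end")],
    }
}
data = collect_uniform(game, "s0", n_traj=1000)
seen = {(a, ab) for traj in data for (_, a, ab, _, _) in traj}
print(len(data), sorted(seen))
```

The sketch assumes every protagonist/adversary action pair is defined at each non-terminal state; with enough rollouts, uniform sampling then visits every joint action and every transition branch.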

Evaluation Metrics

Worst-Case Return: Policy performance against optimal adversaries
Minimax Value Estimation Accuracy: Deviation from theoretical values
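For a small tabular game, the worst-case return of a fixed protagonist policy can be computed exactly by letting the adversary minimize at every state. The sketch below assumes a deterministic, state-based protagonist policy and a known model; a Transformer policy that conditions on history would need a different evaluation. The game format and `worst_case_return` are hypothetical.

```python
from typing import Dict, List, Tuple

Branch = Tuple[float, float, str]  # (probability, reward, next state)
Game = Dict[str, Dict[Tuple[int, int], List[Branch]]]


def worst_case_return(game: Game, policy: Dict[str, int], state: str) -> float:
    """Expected return of `policy` from `state` against a best-responding adversary."""
    if state not in game:  # terminal state
        return 0.0
    a = policy[state]
    adversary_actions = {ab for (act, ab) in game[state] if act == a}
    return min(
        sum(p * (r + worst_case_return(game, policy, nxt))
            for p, r, nxt in game[state][(a, ab)])
        for ab in adversary_actions
    )


# Hypothetical one-stage game.
game: Game = {
    "s0": {
        (0, 0): [(1.0, 2.0, "end")],
        (0, 1): [(1.0, 3.0, "end")],
        (1, 0): [(0.5, 8.0, "end"), (0.5, 0.0, "end")],
        (1, 1): [(1.0, 1.0, "end")],
    }
}
print(worst_case_return(game, {"s0": 0}, "s0"))  # adversary picks a_bar = 0
print(worst_case_return(game, {"s0": 1}, "s0"))  # adversary picks a_bar = 1
```

In this toy game action 1 has the higher best-case outcome, but its worst-case return is lower than that of action 0, which is the distinction the metric is meant to capture.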

Baseline Methods

Decision Transformer (DT): Original decision transformer
Adversarially Robust Decision Transformer (ARDT): Existing adversarial robustness method

Implementation Details

Test-time adversary assumed to be optimal
Decoding with high target returns
Alternating optimization of three loss functions until convergence
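The alternating optimization can be sketched in tabular form, with dictionaries in place of neural networks. This is an illustration under stated assumptions, not the paper's implementation: the tables, the averaged gradient steps, the hyperparameters (`alpha_lo`, `alpha_hi`, `lr`, `epochs`), and the toy dataset are all hypothetical. Step 1 reuses the expectile helper with $\alpha = 0.5$, which is a squared-error regression up to a constant factor.

```python
from collections import defaultdict


def expectile_step(table, pairs, alpha, lr):
    """One averaged gradient step on |alpha - 1(u>0)| * u^2 with u = table[key] - target."""
    grad, count = defaultdict(float), defaultdict(int)
    for key, target in pairs:
        u = table[key] - target
        weight = abs(alpha - (1.0 if u > 0 else 0.0))
        grad[key] += 2.0 * weight * u
        count[key] += 1
    for key in grad:
        table[key] -= lr * grad[key] / count[key]


def fit_cart_values(data, alpha_lo=0.01, alpha_hi=0.99, lr=0.5, epochs=3000):
    """Alternate the three regressions over tuples (s, a, a_bar, r, s_next or None)."""
    q_bar, q, v = defaultdict(float), defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        # Step 1: stage-game payoff, regress Q_bar(s, a, a_bar) onto r + V(s').
        pairs = [((s, a, ab), r + (v[s2] if s2 is not None else 0.0))
                 for s, a, ab, r, s2 in data]
        expectile_step(q_bar, pairs, 0.5, lr)
        # Step 2: NashQ value, low expectile of Q_bar over adversary actions (approx. min).
        pairs = [((s, a), q_bar[(s, a, ab)]) for s, a, ab, _, _ in data]
        expectile_step(q, pairs, alpha_lo, lr)
        # Step 3: state value, high expectile of Q over protagonist actions (approx. max).
        pairs = [(s, q[(s, a)]) for s, a, _, _, _ in data]
        expectile_step(v, pairs, alpha_hi, lr)
    return q_bar, q, v


def toy_dataset():
    """Hypothetical two-stage game; (a=1, a_bar=0) reaches 'good' in 1 of 4 samples."""
    first = {(0, 0): ["mid"] * 4, (0, 1): ["mid"] * 4,
             (1, 0): ["good"] + ["bad"] * 3, (1, 1): ["good"] * 4}
    data = [("s0", a, ab, 0.0, s2) for (a, ab), nexts in first.items() for s2 in nexts]
    final = {"mid": [[4, 5], [3, 6]], "good": [[12, 13], [10, 14]], "bad": [[0, 1], [0, 2]]}
    for s, rewards in final.items():
        for a in (0, 1):
            for ab in (0, 1):
                data.append((s, a, ab, float(rewards[a][ab]), None))
    return data


q_bar, q, v = fit_cart_values(toy_dataset())
print(round(q[("s0", 0)], 2), round(q[("s0", 1)], 2), round(v["s0"], 2))
```

In this toy game the exact minimax values at `s0` are 4 for action 0 and 3 for action 1. The learned values are only approximately equal to these because finite expectiles (0.01 and 0.99) soften the min and max.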

Experimental Results

Main Results

Two-Stage Stochastic Game Results

In the illustrative two-stage stochastic game:

CART: 8.0 (worst-case return)
ARDT: 5.7
DT: 6.0

Average Performance Across Five Games

Average performance across five synthetic adversarial stochastic games:

CART: 8.115 (lowest variance of the three methods)
ARDT: 5.948
DT: 6.421

Key Findings

Target Return Sensitivity: CART maintains the highest worst-case return across different target return settings, while ARDT and DT achieve lower returns under adversarial attacks.
Excessive Optimism Issue: ARDT is easily misled by rare high-reward trajectories, overestimating action values while neglecting true transition probabilities, losing robustness at high target returns.
Conservatism Advantage: CART jointly considers rewards and state transition stochasticity, focusing on feasible policies that maximize worst-case expected returns.

Case Analysis

In the illustrative example of Figure 1:

ARDT ignores the small probability of reaching the desired state $s'_2$, resulting in overly optimistic state and action value estimates
CART handles stochasticity by assigning expected maximum values, yielding more conservative and accurate value estimates
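The contrast can be made concrete with a two-action toy comparison. The sketch below is a deliberately simplified caricature, not ARDT's or CART's actual estimator: `optimistic_backup` values an action by the best next state that appears at all, while `expected_backup` weights next-state values by their probabilities. All numbers and names are hypothetical.

```python
from typing import List, Tuple

Branch = Tuple[float, float]  # (probability, value of the next state)


def optimistic_backup(branches: List[Branch]) -> float:
    """Ignore probabilities: take the best next-state value that can occur."""
    return max(value for _, value in branches)


def expected_backup(branches: List[Branch]) -> float:
    """Weight next-state values by their transition probabilities."""
    return sum(prob * value for prob, value in branches)


risky = [(0.125, 10.0), (0.875, 1.0)]  # rarely reaches the high-value state
safe = [(1.0, 3.0)]

print(optimistic_backup(risky), optimistic_backup(safe))  # risky looks better
print(expected_backup(risky), expected_backup(safe))      # safe is better
```

Under the optimistic backup the risky action is valued at 10.0 against 3.0 for the safe one; under the expectation it drops to 2.125, so the preference flips.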

Related Work

Stochastic Game Solving

Solving two-player games online has been studied extensively, typically through self-play with regret minimization that converges to a Nash equilibrium. This work instead focuses on the offline learning setting.

Offline Reinforcement Learning

Conservative Q-Learning (CQL): Mitigates Q-value overestimation through pessimistic objectives
Implicit Q-Learning (IQL): Achieves value stabilization through expectile regression for implicit value function learning
ARDT: Achieves adversarial robustness in zero-sum games with deterministic transitions through minimax expectile regression

Decision Transformer Extensions

Trajectory Transformer: Captures trajectory stochasticity through latent variables
Online Decision Transformer: Integrates hybrid offline-online reinforcement learning
Multi-Game Decision Transformer: Supports transfer learning and few-shot adaptation

Conclusions and Discussion

Main Conclusions

CART successfully addresses the adversarial robustness problem of DT in stochastic games through:

Modeling interactions as stage games, explicitly considering stochastic transitions
Using NashQ values for conditioning, generating policies that are both robust and conservative
Achieving superior worst-case performance across multiple stochastic games

Limitations

Experimental Scale: Currently validated only on short-horizon synthetic games
Computational Complexity: Alternating optimization of three objectives may increase computational overhead
Theoretical Analysis: Lacks theoretical guarantees for convergence and robustness

Future Directions

Extension to Complex Environments: Such as poker variants (Kuhn and Leduc poker) and more complex multi-agent competitive environments
Long-Horizon Planning: Exploring larger-scale games and longer planning horizons
Theoretical Refinement: Providing theoretical analysis of convergence and robustness

In-Depth Evaluation

Strengths

Strong Innovation: First to introduce adversarial robustness into sequence modeling for stochastic games, filling an important research gap
Sound Methodology: Elegantly addresses the dual challenges of stochasticity and adversariality through stage game modeling and expectile regression
Comprehensive Experiments: Although limited to synthetic environments, the experiments cover multiple game variants to validate the method's effectiveness
Important Problem: Addresses a problem with significant practical and theoretical value

Weaknesses

Experimental Limitations: Validated only in simple synthetic environments, lacking verification on real-world applications
Theoretical Gaps: Lacks theoretical analysis of convergence, complexity, and robustness
Method Complexity: Requires alternating optimization of multiple objectives, potentially affecting practicality
Limited Comparisons: Compares only with ARDT and DT, lacking comparison with other robust reinforcement learning methods

Impact

Academic Contribution: Opens new directions for sequence modeling applications in adversarial environments
Practical Value: Provides new insights for developing more robust multi-agent systems
Reproducibility: Clear method description and simple experimental setup facilitate reproduction

Applicable Scenarios

Multi-Agent Systems: Environments with adversarial and uncertain interactions
Safety-Critical Applications: Scenarios requiring worst-case performance guarantees
Offline Learning: Environments where online interaction is infeasible

References

This paper cites important works from reinforcement learning, game theory, and sequence modeling, including:

Chen et al. (2021) - Original Decision Transformer work
Tang et al. (2024a) - ARDT method
Hu and Wellman (2003) - Nash Q-Learning
Vaswani et al. (2017) - Transformer architecture

Overall Assessment: This is a high-quality research paper addressing an important and challenging problem. While there is room for improvement in experimental validation and theoretical analysis, its innovation and methodological soundness make it a valuable contribution to the field.