
MADiff: Offline Multi-agent Learning with Diffusion Models

Zhu, Liu, Mao et al.
Offline reinforcement learning (RL) aims to learn policies from pre-existing datasets without further interactions, making it a challenging task. Q-learning algorithms struggle with extrapolation errors in offline settings, while supervised learning methods are constrained by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Generating trajectories for each agent with independent DMs may impede coordination, while concatenating all agents' information can lead to low sample efficiency. Accordingly, we propose MADiff, which is realized with an attention-based diffusion model to model the complex coordination among behaviors of multiple agents. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework, functioning as both a decentralized policy and a centralized controller. During decentralized executions, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied in multi-agent trajectory predictions. Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions. Our code is available at https://github.com/zbzhu99/madiff.

Basic Information

  • Paper ID: 2305.17330
  • Title: MADiff: Offline Multi-agent Learning with Diffusion Models
  • Authors: Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, Weinan Zhang
  • Classification: cs.AI cs.LG
  • Publication Time/Conference: NeurIPS 2024 (38th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2305.17330

Abstract

Offline reinforcement learning (Offline RL) aims to learn policies from pre-existing datasets without further interaction, which is a challenging task. Q-learning algorithms suffer from extrapolation error in offline settings, while supervised learning methods are limited by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Using independent DMs for each agent to generate trajectories may hinder coordination, while concatenating all agent information leads to low sample efficiency. Therefore, this paper proposes MADiff, which models complex coordination between multiple agent behaviors through attention-based diffusion models. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework that functions both as a decentralized policy and as a centralized controller. During decentralized execution, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied to multi-agent trajectory prediction. Experiments demonstrate that MADiff outperforms baseline algorithms on various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions.

Research Background and Motivation

Problem Background

  1. Challenges in Offline Multi-agent Reinforcement Learning: Compared to single-agent learning, offline multi-agent learning (MAL) has received less research attention and is more challenging. Since the behaviors of all agents are interdependent, each agent must model inter-agent interactions and coordination while making decisions in a decentralized manner to achieve objectives.
  2. Limitations of Existing Methods:
    • Q-learning Methods: Suffer from extrapolation error in offline settings; incorrect centralized value functions lead to significant extrapolation errors
    • Sequence Modeling Methods: Limited by model expressiveness; difficult to handle diverse datasets; suffer from compounding errors in autoregressive generation
    • Independent Diffusion Models: Using independent DMs for each agent may result in severe inconsistency due to lack of proper credit assignment
    • Simple Concatenation Methods: Concatenating all agent information as DM input/output ignores important characteristics of multi-agent systems
  3. Research Motivation:
    • Diffusion models demonstrate superior modeling capabilities in single-agent offline RL
    • Multi-agent systems require effective coordination mechanisms
    • Need for a unified framework supporting the centralized training with decentralized execution (CTDE) paradigm

Core Contributions

  1. First Diffusion-based Multi-agent Learning Framework: Proposes MADiff, which unifies decentralized policies, centralized controllers, teammate modeling, and trajectory prediction functionalities
  2. Novel Attention-based Diffusion Model Architecture: Specifically designed for multi-agent learning, enabling inter-agent coordination at each denoising step
  3. Superior Experimental Performance: Achieves excellent results on various offline multi-agent problems, including offline MARL and trajectory prediction tasks

Methodology Details

Task Definition

This paper considers partially observable and fully cooperative multi-agent learning problems, formalized as a Dec-POMDP: $G = \langle S, A, P, r, \Omega, O, N, U, \gamma \rangle$

Where:

  • $S$ and $A$ denote the state and action spaces, respectively
  • $N$ agents $\{1, 2, \ldots, N\}$ act at discrete time steps
  • Each agent $i$ observes only a local observation $o^i \in \Omega$
  • The optimization objective is to learn policies $\pi^i$ that maximize discounted cumulative rewards
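
As a small illustration of the objective in the last bullet, the discounted cumulative reward of one episode can be computed as follows. This is a minimal sketch: the function name `discounted_return` and the example rewards are illustrative and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode of shared team rewards."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```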

Model Architecture

Overall Design

MADiff employs an attention-based diffusion network: each agent's trajectory is modeled by a U-Net, and cross-agent attention is computed at the decoder layers so that agents exchange information at each denoising step.

Core Components

  1. U-Net Foundation: Adopts U-Net as the base structure for modeling trajectories of all agents, containing repeated one-dimensional convolutional residual blocks
  2. Attention Mechanism:
    • Applies attention layers before decoder blocks in all agents' U-Nets
    • Attention operations are performed on skip connection features $c^i_l$ from encoder layers
    • Uses multi-head attention mechanism to fuse encoded features
  3. Mathematical Expression:
    $q^i = f_{\text{query}}(c^i), \quad k^i = f_{\text{key}}(c^i), \quad v^i = f_{\text{value}}(c^i)$
    $\alpha_{ij} = \frac{\exp(q^i k^j / \sqrt{d_k})}{\sum_{p} \exp(q^i k^p / \sqrt{d_k})}$
    $\hat{c}^i = \sum_{j} \alpha_{ij} v^j$
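
The attention equations above can be sketched in plain Python. This is a simplified, single-head illustration, not the authors' implementation: the learned projections f_query, f_key and f_value are assumed to be identity maps, each agent's feature is a short vector, and the function name `cross_agent_attention` is hypothetical.

```python
import math


def cross_agent_attention(queries, keys, values):
    """Scaled dot-product attention across agents.

    queries, keys, values: one vector (list of floats) per agent.
    Returns (fused, weights), where fused[i] = sum_j weights[i][j] * values[j].
    """
    d_k = len(keys[0])
    fused, weights = [], []
    for q in queries:
        # Score of agent i against every agent j: q^i . k^j / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Softmax over agents (subtract the max for numerical stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        weights.append(alpha)
        # Weighted sum of value vectors gives the fused feature for agent i.
        fused.append([sum(a * v[d] for a, v in zip(alpha, values))
                      for d in range(len(values[0]))])
    return fused, weights


# Three agents with 2-dimensional skip-connection features (toy numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_agent_attention(features, features, features)
```

Each row of `weights` sums to one, so every fused feature is a convex combination of all agents' features. This is how information from teammates enters each agent's decoder input.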
    

Training Objective

Centralized training uses a joint loss function: $L(\theta, \phi) = \sum_i \mathbb{E}_{(o^i, a^i, o'^i) \in D}\left[\|a^i - I^i_\phi(o^i, o'^i)\|^2\right] + \mathbb{E}_{k, \tau_0 \in D, \beta}\left[\|\epsilon - \epsilon_\theta(\hat{\tau}_k, (1-\beta)\, y(\tau_0) + \beta \emptyset, k)\|^2\right]$
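
The two terms of this loss can be sketched as follows. This is a toy illustration under stated assumptions, not the released training code: trajectories and noise are flattened into lists of floats, the inverse dynamics models and the denoiser are passed in as plain callables, `beta` is treated as a 0/1 mask (how it is sampled is not specified in this summary), and all names are hypothetical.

```python
def squared_error(x, y):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def joint_loss(transitions, inverse_dynamics, noise, noisy_traj, cond, beta, k, denoiser):
    """Inverse-dynamics loss summed over agents plus the noise-prediction loss.

    transitions[i] is a list of (o, a, o_next) tuples for agent i.
    inverse_dynamics[i](o, o_next) predicts agent i's action.
    beta is 0 or 1; beta == 1 replaces the condition with the null token (None).
    """
    inv_loss = 0.0
    for agent_id, batch in enumerate(transitions):
        errors = [squared_error(a, inverse_dynamics[agent_id](o, o_next))
                  for o, a, o_next in batch]
        inv_loss += sum(errors) / len(errors)  # batch mean approximates the expectation

    # (1 - beta) * y(tau_0) + beta * null: keep or drop the condition.
    effective_cond = None if beta == 1 else cond
    predicted_noise = denoiser(noisy_traj, effective_cond, k)
    diffusion_loss = squared_error(noise, predicted_noise)
    return inv_loss + diffusion_loss


# Toy stand-ins: perfect inverse dynamics, and a denoiser that always predicts zero noise.
transitions = [[([0.0], [1.0], [1.0])], [([0.0], [2.0], [2.0])]]
inverse_dynamics = [lambda o, o_next: [o_next[0] - o[0]]] * 2
zero_denoiser = lambda traj, cond, k: [0.0 for _ in traj]
loss = joint_loss(transitions, inverse_dynamics, noise=[0.5, -0.5],
                  noisy_traj=[1.0, 1.0], cond=1.0, beta=1, k=3, denoiser=zero_denoiser)
print(loss)  # 0.5: the inverse-dynamics term is 0, the noise term is 0.25 + 0.25
```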

Execution Modes

Centralized Control

  • Accesses current local observations of all agents
  • Generates trajectories for all agents and predicts actions
  • Applicable to multi-agent trajectory prediction and team games

Decentralized Execution with Teammate Modeling

  • Each agent uses only its own local observation for planning
  • Simultaneously infers observation sequences of other agents (teammate modeling)
  • Achieves effective coordination through attention mechanisms
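
The difference between the two execution modes can be illustrated by what the sampler is allowed to condition on. The sketch below assumes that conditioning works by re-imposing the known current observations after every denoising step, as is common in diffusion-based planners. That mechanism, the toy `toy_denoise_step` placeholder for the learned network, and all names are assumptions made for illustration only.

```python
import random


def sample_joint_plan(known_obs, num_agents, horizon, obs_dim,
                      denoise_step, num_steps, seed=0):
    """Sample observation sequences for all agents.

    known_obs maps agent index -> current observation available to the planner:
    all agents for centralized control, a single agent for decentralized execution.
    """
    rng = random.Random(seed)
    # plan[agent][t] is an observation vector; start from Gaussian noise.
    plan = [[[rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
             for _ in range(horizon)] for _ in range(num_agents)]
    for k in reversed(range(num_steps)):
        plan = denoise_step(plan, k)
        for agent, obs in known_obs.items():
            plan[agent][0] = list(obs)  # re-impose what the planner actually observes
    return plan


def toy_denoise_step(plan, k):
    """Placeholder for the learned attention-based denoiser: it only shrinks the noise."""
    return [[[0.5 * x for x in obs] for obs in agent] for agent in plan]


# Centralized control: the current observations of all three agents are known.
centralized = sample_joint_plan({0: [0.0, 0.0], 1: [1.0, 2.0], 2: [3.0, 4.0]},
                                3, 4, 2, toy_denoise_step, 5)
# Decentralized execution for agent 1: only its own observation is known;
# the rows for agents 0 and 2 play the role of inferred teammate observations.
decentralized = sample_joint_plan({1: [1.0, 2.0]}, 3, 4, 2, toy_denoise_step, 5)
```

In both calls the same joint model produces a plan for every agent. Only the set of fixed entries changes, which is why one trained network can serve both as a centralized controller and as a decentralized policy with teammate modeling.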

Experimental Setup

Datasets

  1. Multi-Agent Particle Environment (MPE):
    • Spread: Three agents cover three landmarks
    • Tag: Three predators capture a pre-trained prey
    • World: Predators capture prey in a map with forests
    • Datasets: Expert, Medium-Replay, Medium, Random
  2. Multi-Agent Mujoco (MA Mujoco):
    • 2halfcheetah, 2ant, 4ant configurations
    • Datasets: Good, Medium, Poor
  3. StarCraft Multi-Agent Challenge (SMAC):
    • Maps: 3m, 2s3z, 5m_vs_6m, 8m
    • Datasets: Good, Medium, Poor
  4. NBA Dataset:
    • Basketball player trajectories from 631 games in the 2015-16 season
    • Used for multi-agent trajectory prediction tasks

Evaluation Metrics

  • Offline MARL: Episode rewards obtained from online rollouts
  • Trajectory Prediction: Distance-based metrics including ADE, FDE, minADE20, minFDE20
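
For reference, the distance-based metrics follow their standard definitions: ADE averages the Euclidean error over all predicted time steps, FDE takes the error at the final step, and the min variants keep the best of several sampled predictions (20 samples for minADE20 and minFDE20). The sketch below shows them for a single agent's 2D trajectory; the function names and toy coordinates are illustrative, and how the paper aggregates over players and scenes is not covered here.

```python
import math


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last time step."""
    return math.dist(pred[-1], truth[-1])


def min_ade(samples, truth):
    """Best ADE among several sampled predictions."""
    return min(ade(s, truth) for s in samples)


def min_fde(samples, truth):
    """Best FDE among several sampled predictions."""
    return min(fde(s, truth) for s in samples)


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
print(ade(pred, truth))  # (1 + 1 + 2) / 3
print(fde(pred, truth))  # 2.0
```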

Baseline Methods

  • Offline MARL: MA-ICQ, MA-CQL, OMAR, MA-TD3+BC, MADT, BC
  • Trajectory Prediction: Baller2Vec++

Experimental Results

Main Results

Offline MARL Performance

MADiff achieves the best results on most datasets:

Task       | Dataset | BC       | MA-CQL    | OMAR       | MADIFF-D   | MADIFF-C
MPE Spread | Expert  | 35.0±2.6 | 98.2±5.2  | 114.9±2.6  | 95.0±5.3   | 116.7±3.0
MPE Tag    | Expert  | 40.0±9.6 | 93.9±14.0 | 116.2±19.8 | 120.9±14.6 | 167.6±18.6

Trajectory Prediction Performance

On the NBA dataset, MADIFF-C significantly outperforms the Baller2Vec++ baseline (lower is better for both metrics):

Trajectory Length | Metric | Baller2Vec++ | MADIFF-C
20                | ADE    | 15.15±0.38   | 7.92±0.86
20                | FDE    | 24.91±0.68   | 14.06±1.16

Ablation Studies

Validates the importance of the attention mechanism:

  • MADIFF-D with attention significantly outperforms the independent version
  • Advantages are more pronounced in more challenging tasks (e.g., World)
  • Parameter sharing strategy effectively reduces the number of parameters

Teammate Modeling Analysis

Visualization analysis on the Spread task demonstrates:

  • MADiff can correct teammate behavior predictions during rollout
  • Consistency ratio increases over time steps, eventually exceeding true rollout trajectories
  • Validates the effectiveness of teammate modeling

Related Work

Multi-agent Offline RL

  • Q-learning Extensions: Methods like MA-BCQ, MA-ICQ suffer from extrapolation error
  • Sequence Modeling: MADT uses transformers but lacks inter-agent interaction modeling

Decision Diffusion Models

  • Single-agent Methods: Diffuser and Decision Diffuser achieve success in single-agent tasks
  • Contribution of This Work: First extension of diffusion models to multi-agent scenarios

Opponent Modeling

  • Rich literature on opponent modeling in online MARL
  • MADiff provides an effective offline teammate modeling solution

Conclusions and Discussion

Main Conclusions

  1. MADiff successfully extends diffusion models to multi-agent learning
  2. Attention mechanisms effectively achieve inter-agent coordination
  3. Unified framework supports multiple application scenarios
  4. Achieves excellent performance on various tasks

Limitations

  1. Scalability: Not suitable for scenarios with tens or hundreds of agents
  2. Stochastic Environments: May perform poorly in highly stochastic environments
  3. Computational Complexity: Requires inferring all teammate trajectories for each agent

Future Directions

  1. Explore latent representations to improve scalability
  2. Improve performance in stochastic environments
  3. Optimize computational efficiency

In-Depth Evaluation

Strengths

  1. Strong Innovation: First successful application of diffusion models to multi-agent learning
  2. Sophisticated Technical Design: Attention mechanism elegantly addresses agent coordination
  3. Comprehensive Experiments: Covers multiple domains and task types
  4. High Practical Value: Unified framework supports multiple application scenarios

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical guarantees for convergence and complexity
  2. Scalability Limitations: Limited applicability in large-scale multi-agent systems
  3. Sensitivity to Stochasticity: Performance degradation in high-randomness environments

Impact

  1. Academic Contribution: Provides new technical pathways for multi-agent learning
  2. Practical Value: Potential applications in robot coordination, game AI, and other domains
  3. Reproducibility: Provides complete code and experimental settings

Applicable Scenarios

  1. Offline multi-agent reinforcement learning tasks
  2. Multi-agent trajectory prediction
  3. Decision problems requiring agent coordination
  4. Cooperative tasks with medium scale (2-8 agents)

References

The paper cites multiple important works, including:

  • Foundational diffusion model work: Ho et al. (2020), Song and Ermon (2019)
  • Single-agent diffusion RL: Janner et al. (2022), Ajay et al. (2023)
  • Multi-agent RL baselines: Rashid et al. (2020), Meng et al. (2021)

Overall Assessment: This is a high-quality research paper that successfully introduces diffusion models to the multi-agent learning domain with significant technical innovation and comprehensive experimental validation. Despite some limitations, it opens new research directions in the field with important academic value and practical prospects.