
MADiff: Offline Multi-agent Learning with Diffusion Models

Zhu, Liu, Mao et al.

Offline reinforcement learning (RL) aims to learn policies from pre-existing datasets without further interactions, making it a challenging task. Q-learning algorithms struggle with extrapolation errors in offline settings, while supervised learning methods are constrained by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Generating trajectories for each agent with independent DMs may impede coordination, while concatenating all agents' information can lead to low sample efficiency. Accordingly, we propose MADiff, which is realized with an attention-based diffusion model to model the complex coordination among behaviors of multiple agents. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework, functioning as both a decentralized policy and a centralized controller. During decentralized executions, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied in multi-agent trajectory predictions. Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions. Our code is available at https://github.com/zbzhu99/madiff.


Basic Information

Paper ID: 2305.17330
Title: MADiff: Offline Multi-agent Learning with Diffusion Models
Authors: Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, Weinan Zhang
Classification: cs.AI cs.LG
Publication Time/Conference: NeurIPS 2024 (38th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2305.17330

Abstract

Offline reinforcement learning (Offline RL) aims to learn policies from pre-existing datasets without further interaction, which is a challenging task. Q-learning algorithms suffer from extrapolation error in offline settings, while supervised learning methods are limited by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Using independent DMs for each agent to generate trajectories may hinder coordination, while concatenating all agent information leads to low sample efficiency. Therefore, this paper proposes MADiff, which models complex coordination between multiple agent behaviors through attention-based diffusion models. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework that functions both as a decentralized policy and as a centralized controller. During decentralized execution, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied to multi-agent trajectory prediction. Experiments demonstrate that MADiff outperforms baseline algorithms on various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions.

Research Background and Motivation

Problem Background

Challenges in Offline Multi-agent Reinforcement Learning: Compared to single-agent learning, offline multi-agent learning (MAL) has received less research attention and is more challenging. Since the behaviors of all agents are interdependent, each agent must model inter-agent interactions and coordination while making decisions in a decentralized manner to achieve objectives.
Limitations of Existing Methods:
- Q-learning Methods: Suffer from extrapolation error in offline settings; incorrect centralized value functions lead to significant extrapolation errors
- Sequence Modeling Methods: Limited by model expressiveness; difficult to handle diverse datasets; suffer from compounding errors in autoregressive generation
- Independent Diffusion Models: Using independent DMs for each agent may result in severe inconsistency due to lack of proper credit assignment
- Simple Concatenation Methods: Concatenating all agent information as DM input/output ignores the structure of multi-agent systems and leads to low sample efficiency
Research Motivation:
- Diffusion models demonstrate superior modeling capabilities in single-agent offline RL
- Multi-agent systems require effective coordination mechanisms
- Need for a unified framework supporting the centralized training with decentralized execution (CTDE) paradigm

Core Contributions

First Diffusion-based Multi-agent Learning Framework: Proposes MADiff, which unifies decentralized policies, centralized controllers, teammate modeling, and trajectory prediction functionalities
Novel Attention-based Diffusion Model Architecture: Specifically designed for multi-agent learning, enabling inter-agent coordination at each denoising step
Superior Experimental Performance: Achieves excellent results on various offline multi-agent problems, including offline MARL and trajectory prediction tasks

Methodology Details

Task Definition

This paper considers partially observable and fully cooperative multi-agent learning problems, formalized as a Dec-POMDP: $G = \langle S, A, P, r, \Omega, O, N, U, \gamma \rangle$

Where:

$S$ and $A$ denote state and action spaces respectively
$N$ agents $\{1, 2, ..., N\}$ act at discrete time steps
Each agent $i$ observes only local observation $o^i \in \Omega$
The optimization objective is to learn policies $π^i$ that maximize discounted cumulative rewards
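
As a concrete reading of the objective, the discounted cumulative reward of one episode can be computed as in the following minimal sketch. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the paper or its code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode of team rewards."""
    total = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# In a fully cooperative setting all agents receive the same team reward,
# so a single return value is shared by every agent's policy.
team_rewards = [1.0, 0.0, 2.0]
episode_return = discounted_return(team_rewards, gamma=0.5)
print(episode_return)
```

With $\gamma = 0.5$ the third reward contributes $0.5^2 \cdot 2.0 = 0.5$, which shows how later rewards are down-weighted.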

Model Architecture

Overall Design

MADiff employs an attention-based diffusion network framework, performing cross-agent attention computation at the decoder layers of each agent.

Core Components

U-Net Foundation: Adopts U-Net as the base structure for modeling trajectories of all agents, containing repeated one-dimensional convolutional residual blocks
Attention Mechanism:
- Applies attention layers before decoder blocks in all agents' U-Nets
- Attention operations are performed on skip connection features $c^i_l$ from encoder layers
- Uses multi-head attention mechanism to fuse encoded features

Mathematical Expression:

$q^i = f_{\text{query}}(c^i), \quad k^i = f_{\text{key}}(c^i), \quad v^i = f_{\text{value}}(c^i)$
$\alpha_{ij} = \frac{\exp(q^i k^j / \sqrt{d_k})}{\sum_p \exp(q^i k^p / \sqrt{d_k})}$
$\hat{c}^i = \sum_j \alpha_{ij} v^j$
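
The attention equations above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses a single head rather than multi-head attention, treats each agent's feature $c^i$ as a flat vector, and assumes that $f_{query}$, $f_{key}$, and $f_{value}$ are plain linear projections (the matrices `w_q`, `w_k`, `w_v` are hypothetical).

```python
import numpy as np


def cross_agent_attention(c, w_q, w_k, w_v):
    """Fuse per-agent features with single-head scaled dot-product attention.

    c:   (N, d) array, one feature vector c^i per agent.
    w_*: (d, d_k) matrices standing in for f_query, f_key, f_value (assumed linear).
    Returns (c_hat, alpha) with shapes (N, d_k) and (N, N).
    """
    q, k, v = c @ w_q, c @ w_k, c @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # scores[i, j] = q^i . k^j / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # normalise over agents p
    c_hat = alpha @ v                             # c_hat^i = sum_j alpha_ij v^j
    return c_hat, alpha


rng = np.random.default_rng(0)
n_agents, d, d_k = 3, 8, 4
c = rng.normal(size=(n_agents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d_k)) for _ in range(3))
c_hat, alpha = cross_agent_attention(c, w_q, w_k, w_v)
print(c_hat.shape, alpha.sum(axis=-1))
```

Each row of `alpha` is a distribution over agents, so every agent's fused feature is a weighted mix of all agents' values, which is how information is exchanged between agents.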

Training Objective

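Centralized training uses a joint loss function: $L(\theta, \phi) = \sum_i \mathbb{E}_{(o^i, a^i, o'^i) \in D}\left[\|a^i - I^i_\phi(o^i, o'^i)\|^2\right] + \mathbb{E}_{k, \tau_0 \in D, \beta}\left[\|\epsilon - \epsilon_\theta(\hat{\tau}_k, (1-\beta) y(\tau_0) + \beta \emptyset, k)\|^2\right]$

The first term trains each agent's inverse dynamics model $I^i_\phi$ to recover the action from consecutive observations; the second is the noise-prediction loss of the conditional diffusion model $\epsilon_\theta$, in which $\beta$ switches the condition $y(\tau_0)$ to the null condition $\emptyset$.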
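
The two terms of the joint loss above can be sketched as below. Several parts are assumptions made only so the example runs: the standard DDPM forward process is used to produce $\hat{\tau}_k$, the null condition $\emptyset$ is encoded as zero, trajectories are taken to be observation sequences, `y` is a scalar stand-in for $y(\tau_0)$, and `inv_dyn` and `eps_model` are hypothetical placeholders for neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)


def inverse_dynamics_loss(inv_dyn, obs, actions, next_obs):
    """First term: squared error between a^i and I^i(o^i, o'^i), summed over agents."""
    # obs, next_obs: (N, B, obs_dim); actions: (N, B, act_dim)
    loss = 0.0
    for i in range(obs.shape[0]):
        pred = inv_dyn[i](obs[i], next_obs[i])
        loss += np.mean(np.sum((actions[i] - pred) ** 2, axis=-1))
    return loss


def diffusion_loss(eps_model, tau_0, y, alpha_bar, p_uncond):
    """Second term: noise-prediction error with the condition randomly dropped."""
    k = rng.integers(len(alpha_bar))                  # random diffusion step
    eps = rng.normal(size=tau_0.shape)                # target noise
    # Standard DDPM forward process (assumed in this sketch).
    tau_k = np.sqrt(alpha_bar[k]) * tau_0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    beta = float(rng.random() < p_uncond)             # beta = 1 drops the condition
    cond = (1.0 - beta) * y                           # null condition encoded as 0
    # Averaging instead of summing only rescales the squared norm.
    return np.mean((eps - eps_model(tau_k, cond, k)) ** 2)


# Hypothetical stand-ins; real models would be trained networks.
n_agents, batch, horizon, obs_dim, act_dim = 2, 4, 8, 3, 2
inv_dyn = [lambda o, o2: (o2 - o)[:, :act_dim] for _ in range(n_agents)]
eps_model = lambda tau_k, cond, k: np.zeros_like(tau_k)

obs = rng.normal(size=(n_agents, batch, obs_dim))
next_obs = rng.normal(size=(n_agents, batch, obs_dim))
actions = rng.normal(size=(n_agents, batch, act_dim))
tau_0 = rng.normal(size=(batch, n_agents, horizon, obs_dim))
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 100))

total = (inverse_dynamics_loss(inv_dyn, obs, actions, next_obs)
         + diffusion_loss(eps_model, tau_0, y=1.0, alpha_bar=alpha_bar, p_uncond=0.25))
print(total)
```

Dropping the condition with some probability lets one network learn both the conditional and the unconditional noise prediction, which is what classifier-free guidance needs at sampling time.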

Execution Modes

Centralized Control

Accesses current local observations of all agents
Generates trajectories for all agents and predicts actions
Applicable to multi-agent trajectory prediction and team games

Decentralized Execution with Teammate Modeling

Each agent uses only its own local observation for planning
Simultaneously infers observation sequences of other agents (teammate modeling)
Achieves effective coordination through attention mechanisms
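
One way to picture the difference between the two modes is through which observations are fixed before denoising. The sketch below assumes inpainting-style conditioning, as used in single-agent diffusion planners; the helper names `known_mask` and `apply_conditions` are hypothetical and do not come from the MADiff code base.

```python
import numpy as np


def known_mask(n_agents, horizon, agent=None):
    """Boolean (n_agents, horizon) mask of observations fixed before denoising.

    agent=None -> centralized control: every agent's current observation is known.
    agent=i    -> decentralized execution: only agent i's current observation is known.
    """
    mask = np.zeros((n_agents, horizon), dtype=bool)
    if agent is None:
        mask[:, 0] = True
    else:
        mask[agent, 0] = True
    return mask


def apply_conditions(traj, current_obs, mask):
    """Overwrite the known entries of a noisy trajectory with real observations."""
    traj = traj.copy()
    agent_idx = np.nonzero(mask)[0]       # agent index of every known entry
    traj[mask] = current_obs[agent_idx]
    return traj


rng = np.random.default_rng(0)
n_agents, horizon, obs_dim = 3, 5, 2
noisy = rng.normal(size=(n_agents, horizon, obs_dim))
obs_now = rng.normal(size=(n_agents, obs_dim))

central = apply_conditions(noisy, obs_now, known_mask(n_agents, horizon))
decentral = apply_conditions(noisy, obs_now, known_mask(n_agents, horizon, agent=1))
print(known_mask(n_agents, horizon).sum(), known_mask(n_agents, horizon, agent=1).sum())
```

In the decentralized case the other agents' rows stay unconstrained, so the model has to generate them itself, which corresponds to the teammate-modeling behavior described above.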

Experimental Setup

Datasets

Multi-Agent Particle Environment (MPE):
- Spread: Three agents cover three landmarks
- Tag: Three predators capture a pre-trained prey
- World: Predators capture prey in a map with forests
- Datasets: Expert, Medium-Replay, Medium, Random
Multi-Agent Mujoco (MA Mujoco):
- 2halfcheetah, 2ant, 4ant configurations
- Datasets: Good, Medium, Poor
StarCraft Multi-Agent Challenge (SMAC):
- Maps: 3m, 2s3z, 5m_vs_6m, 8m
- Datasets: Good, Medium, Poor
NBA Dataset:
- Basketball player trajectories from 631 games in the 2015-16 season
- Used for multi-agent trajectory prediction tasks

Evaluation Metrics

Offline MARL: Episode rewards obtained from online rollouts
Trajectory Prediction: Distance-based metrics including ADE, FDE, minADE20, minFDE20
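
For reference, the displacement metrics can be computed as in the sketch below, using their common definitions. The array layout (agents, time steps, 2D position) and the per-scene minimum in `min_ade` and `min_fde` are assumptions; the paper's evaluation code may aggregate differently.

```python
import numpy as np


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over agents and steps."""
    return np.linalg.norm(pred - truth, axis=-1).mean()


def fde(pred, truth):
    """Final displacement error: mean Euclidean distance at the last step."""
    return np.linalg.norm(pred[..., -1, :] - truth[..., -1, :], axis=-1).mean()


def min_ade(samples, truth):
    """Best ADE among K sampled predictions (K = 20 for minADE20)."""
    return min(ade(s, truth) for s in samples)


def min_fde(samples, truth):
    """Best FDE among K sampled predictions (K = 20 for minFDE20)."""
    return min(fde(s, truth) for s in samples)


# Toy scene: 2 agents, 3 time steps, 2D positions.
truth = np.zeros((2, 3, 2))
far = truth + np.array([3.0, 4.0])    # every point is 5 units off
near = truth + np.array([0.0, 1.0])   # every point is 1 unit off
print(ade(far, truth), fde(far, truth), min_ade([far, near], truth))
```

The `min` variants reward a model for producing at least one good sample among K, which suits generative predictors that output diverse futures.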

Baseline Methods

Offline MARL: MA-ICQ, MA-CQL, OMAR, MA-TD3+BC, MADT, BC
Trajectory Prediction: Baller2Vec++

Experimental Results

Main Results

Offline MARL Performance

MADiff achieves the best results on most datasets. Selected rows (MADIFF-D: decentralized execution, MADIFF-C: centralized control):

Task	Dataset	BC	MA-CQL	OMAR	MADIFF-D	MADIFF-C
MPE Spread	Expert	35.0±2.6	98.2±5.2	114.9±2.6	95.0±5.3	116.7±3.0
MPE Tag	Expert	40.0±9.6	93.9±14.0	116.2±19.8	120.9±14.6	167.6±18.6

Trajectory Prediction Performance

On the NBA dataset, MADIFF-C clearly outperforms Baller2Vec++: at trajectory length 20, ADE drops from 15.15 to 7.92 and FDE from 24.91 to 14.06.

Trajectory Length	Metric	Baller2Vec++	MADIFF-C
20	ADE	15.15±0.38	7.92±0.86
20	FDE	24.91±0.68	14.06±1.16

Ablation Studies

Validates the importance of the attention mechanism:

MADIFF-D with attention significantly outperforms the independent version
Advantages are more pronounced in more challenging tasks (e.g., World)
Parameter sharing strategy effectively reduces the number of parameters

Teammate Modeling Analysis

Visualization analysis on the Spread task demonstrates:

MADiff can correct teammate behavior predictions during rollout
Consistency ratio increases over time steps, eventually exceeding true rollout trajectories
Validates the effectiveness of teammate modeling

Related Work

Multi-agent Offline RL

Q-learning Extensions: Methods such as MA-BCQ and MA-ICQ suffer from extrapolation error
Sequence Modeling: MADT uses transformers but lacks inter-agent interaction modeling

Decision Diffusion Models

Single-agent Methods: Diffuser and Decision Diffuser achieve success in single-agent tasks
Contribution of This Work: First extension of diffusion models to multi-agent scenarios

Opponent Modeling

Rich literature on opponent modeling in online MARL
MADiff provides an effective offline teammate modeling solution

Conclusions and Discussion

Main Conclusions

MADiff successfully extends diffusion models to multi-agent learning
Attention mechanisms effectively achieve inter-agent coordination
Unified framework supports multiple application scenarios
Achieves excellent performance on various tasks

Limitations

Scalability: Not suitable for scenarios with tens or hundreds of agents
Stochastic Environments: May perform poorly in highly stochastic environments
Computational Complexity: Requires inferring all teammate trajectories for each agent

Future Directions

Explore latent representations to improve scalability
Improve performance in stochastic environments
Optimize computational efficiency

In-Depth Evaluation

Strengths

Strong Innovation: First successful application of diffusion models to multi-agent learning
Sophisticated Technical Design: Attention mechanism elegantly addresses agent coordination
Comprehensive Experiments: Covers multiple domains and task types
High Practical Value: Unified framework supports multiple application scenarios

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical guarantees for convergence and complexity
Scalability Limitations: Limited applicability in large-scale multi-agent systems
Sensitivity to Stochasticity: Performance degradation in high-randomness environments

Impact

Academic Contribution: Provides new technical pathways for multi-agent learning
Practical Value: Potential applications in robot coordination, game AI, and other domains
Reproducibility: Provides complete code and experimental settings

Applicable Scenarios

Offline multi-agent reinforcement learning tasks
Multi-agent trajectory prediction
Decision problems requiring agent coordination
Medium-scale cooperative tasks (2-8 agents)

References

The paper cites multiple important works, including:

Foundational diffusion model work: Ho et al. (2020), Song and Ermon (2019)
Single-agent diffusion RL: Janner et al. (2022), Ajay et al. (2023)
Multi-agent RL baselines: Rashid et al. (2020), Meng et al. (2021)

Overall Assessment: This is a high-quality research paper that successfully introduces diffusion models to the multi-agent learning domain with significant technical innovation and comprehensive experimental validation. Despite some limitations, it opens new research directions in the field with important academic value and practical prospects.