IntersectioNDE: Learning Complex Urban Traffic Dynamics based on Interaction Decoupling Strategy

Lin, Yang, Lu et al.

Realistic traffic simulation is critical for ensuring the safety and reliability of autonomous vehicles (AVs), especially in complex and diverse urban traffic environments. However, existing data-driven simulators face two key challenges: a limited focus on modeling dense, heterogeneous interactions at urban intersections - which are prevalent, crucial, and practically significant in countries like China, featuring diverse agents including motorized vehicles (MVs), non-motorized vehicles (NMVs), and pedestrians - and the inherent difficulty in robustly learning high-dimensional joint distributions for such high-density scenes, often leading to mode collapse and long-term simulation instability. We introduce City Crossings Dataset (CiCross), a large-scale dataset collected from a real-world urban intersection, uniquely capturing dense, heterogeneous multi-agent interactions, particularly with a substantial proportion of MVs, NMVs and pedestrians. Based on this dataset, we propose IntersectioNDE (Intersection Naturalistic Driving Environment), a data-driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), a training paradigm that learns compositional dynamics from agent subsets, enabling the marginal-to-joint simulation. Integrated into a scene-aware Transformer network with specialized training techniques, IDS significantly enhances simulation robustness and long-term stability for modeling heterogeneous interactions. Experiments on CiCross show that IntersectioNDE outperforms baseline methods in simulation fidelity, stability, and its ability to replicate complex, distribution-level urban traffic dynamics.

Basic Information

Paper ID: 2510.11534
Title: IntersectioNDE: Learning Complex Urban Traffic Dynamics based on Interaction Decoupling Strategy
Authors: Enli Lin, Ziyuan Yang, Qiujing Lu, Jianming Hu, Shuo Feng (Tsinghua University)
Categories: cs.RO (Robotics), cs.SY (Systems and Control), eess.SY (Systems and Control)
Publication Date: October 13, 2025
Paper Link: https://arxiv.org/abs/2510.11534

Abstract

Realistic traffic simulation is crucial for ensuring the safety and reliability of autonomous vehicles (AVs), particularly in complex and diverse urban traffic environments. However, existing data-driven simulators face two critical challenges: limited focus on modeling dense heterogeneous interactions at urban intersections, and inherent difficulties in robustly learning high-dimensional joint distributions in high-density scenarios. This paper introduces the City Crossings Dataset (CiCross), a large-scale dataset collected from a real-world urban intersection that uniquely captures dense heterogeneous multi-agent interactions. Based on this dataset, the authors propose IntersectioNDE, a data-driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), a training paradigm that learns compositional dynamics from agent subsets to achieve marginal-to-joint simulation.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is high-fidelity traffic simulation for complex urban intersections, particularly in dense heterogeneous interaction scenarios involving motorized vehicles (MVs), non-motorized vehicles (NMVs), and pedestrians.

Problem Significance

Autonomous Driving Safety Verification Needs: Simulation testing is widely adopted due to its scalability, cost-effectiveness, and ability to explore safety-critical edge cases
Complex Urban Environment Challenges: Urban intersections in countries like China exhibit dense and heterogeneous traffic patterns that existing methods struggle to model effectively
Practical Value: Accurate traffic simulation is critical for safe deployment of AV systems

Limitations of Existing Methods

Insufficient Scenario Coverage: Existing data-driven simulators have limited focus on modeling dense heterogeneous urban intersection interactions
Technical Challenges: Direct learning of full-scene high-dimensional joint distributions faces inherent difficulties, often resulting in mode collapse and long-term simulation instability
Dataset Limitations: Existing datasets lack sufficient representation of dense interactions among MVs, NMVs, and pedestrians

Research Motivation

The work aims to address the specific needs of complex urban traffic environments in countries like China by developing a traffic simulation system that robustly models heterogeneous interactions while maintaining long-term stability.

Core Contributions

Proposed the CiCross Dataset: A large-scale real urban intersection dataset that uniquely captures dense heterogeneous multi-agent interactions
Designed the IntersectioNDE Simulator: A data-driven scenario-level simulator specifically tailored for complex urban intersection scenarios
Innovated the Interaction Decoupling Strategy (IDS): A training paradigm that enables marginal-to-joint simulation by learning compositional dynamics from agent subsets
Constructed a Scene-Aware Transformer Network: Integrating specialized training techniques to significantly enhance simulation robustness and long-term stability

Methodology Details

Task Definition

The traffic simulation task is modeled as learning a generative model capable of producing realistic future scene states within a prediction time horizon $T_{pred}$.

Let $A_τ = \{a_1, ..., a_{N_τ}\}$ denote the set of $N_τ$ agents present at time $τ$. The state of agent $a_j$ at time $τ$ is $s_{j,τ} ∈ S_{agent}$. A complete scene instance $G_τ$ contains agent states $S_τ$, static map information $M$, and dynamic traffic light states $L_τ$.

The objective is to learn the conditional probability distribution: $P_{data}(G_{t+1:t+T_{pred}} | G_{t-T_{hist}+1:t})$
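
To make the notation concrete, the following Python sketch mirrors a scene instance $G_τ$ and the history/future split used in the conditional distribution. The class and function names (AgentState, Scene, split_window) and the choice of state fields are hypothetical, chosen only for illustration; the summary does not describe the authors' data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentState:
    """State s_{j,tau} of one agent; the exact fields are an assumption."""
    x: float
    y: float
    vx: float
    vy: float


@dataclass
class Scene:
    """One scene instance G_tau = (S_tau, M, L_tau)."""
    agents: Dict[int, AgentState]   # S_tau, keyed by agent id
    static_map: str                 # M, placeholder for real map data
    lights: Tuple[int, ...]         # L_tau, one code per traffic light


def split_window(scenes: List[Scene], t: int, t_hist: int, t_pred: int):
    """Return (G_{t-T_hist+1:t}, G_{t+1:t+T_pred}) for a 0-based index t."""
    if t - t_hist + 1 < 0 or t + t_pred >= len(scenes):
        raise ValueError("window does not fit inside the recording")
    history = scenes[t - t_hist + 1 : t + 1]
    future = scenes[t + 1 : t + t_pred + 1]
    return history, future
```

A generative simulator in this framing is any model that maps the returned history to a distribution over the returned future.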

Interaction Decoupling Strategy (IDS)

IDS Training Process

Agent Grouping: Partition the agent set $A_t$ into $k$ disjoint interaction groups based on predefined spatial and behavioral criteria (e.g., TTC): $A_t = \{A_{t,1}, A_{t,2}, ..., A_{t,k}\}$
Subset Sampling: Randomly sample a subset of group indices $I ⊆ \{1, ..., k\}$ to construct scene instances containing sampled agents
Conditional Probability Learning: Train a neural network model $F_θ$ to predict the conditional probability distribution of sampled future scene instances: $P_{model}(\hat{G}_{t+1:t+T_{pred}}(I) | G^{GT}_{t-T_{hist}+1:t}(I); θ)$
Training Objective: Minimize the expected negative log-likelihood: $L(θ) = -E_{\hat{G}∼D_{data}} E_{I∼P_{sample}(I)}[\log P_{model}(\hat{G}_{t+1:t+T_{pred}}(I) | G^{GT}_{t-T_{hist}+1:t}(I); θ)]$
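
The grouping and subset-sampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: TTC is computed here with a constant-velocity model and a hypothetical 1 m contact radius, groups are taken as connected components of the "TTC below threshold" relation, and each group is kept with a fixed probability, since the summary does not specify $P_{sample}(I)$. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Agent = Tuple[float, float, float, float]  # (x, y, vx, vy), assumed layout


def ttc(a: Agent, b: Agent, radius: float = 1.0) -> float:
    """Constant-velocity time until a and b come within `radius` (assumed metres)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    dvx, dvy = b[2] - a[2], b[3] - a[3]
    c = dx * dx + dy * dy - radius * radius
    if c <= 0.0:
        return 0.0                       # already within the radius
    qa = dvx * dvx + dvy * dvy
    qb = 2.0 * (dx * dvx + dy * dvy)
    disc = qb * qb - 4.0 * qa * c
    if qa == 0.0 or disc < 0.0 or qb >= 0.0:
        return math.inf                  # no relative motion, a miss, or diverging
    return (-qb - math.sqrt(disc)) / (2.0 * qa)


def group_agents(agents: Dict[int, Agent], ttc_max: float) -> List[List[int]]:
    """Partition agents into disjoint groups linked by TTC <= ttc_max."""
    parent = {i: i for i in agents}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    ids = sorted(agents)
    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if ttc(agents[i], agents[j]) <= ttc_max:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def sample_subset(n_groups: int, rng: random.Random, keep_prob: float = 0.5) -> List[int]:
    """Sample a non-empty subset I of group indices (assumed sampling rule)."""
    chosen = [k for k in range(n_groups) if rng.random() < keep_prob]
    return chosen or [rng.randrange(n_groups)]


agents = {0: (0.0, 0.0, 5.0, 0.0), 1: (10.0, 0.0, -5.0, 0.0), 2: (100.0, 100.0, 0.0, 0.0)}
groups = group_agents(agents, ttc_max=1.0)      # head-on pair vs. a distant agent
subset = sample_subset(len(groups), random.Random(0))
training_agents = [i for k in subset for i in groups[k]]
```

Only the agents in training_agents would appear in the scene instance passed to the model for that training step; the loss is then the negative log-likelihood of their future states.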

Marginal-to-Joint Simulation

During inference, the model generalizes from the partial scenes seen in training to complete scenes through the following mechanism:

Interaction Primitive Learning: IDS training enables the model to acquire a diverse set of conditional interaction primitives $P = \{p_1, p_2, ..., p_L\}$
Primitive Recognition and Synthesis: For any scene $G_t$, the model first identifies the combination of learned interaction primitives in the current configuration, then synthesizes their future states
Robustness Enhancement: By mastering fundamental building blocks, the model can coherently predict complex scene dynamics, even for interaction combinations not explicitly seen during training

Network Architecture

Scene-Aware Interaction Transformer

A multi-input Transformer network with an encoder-interaction-prediction structure:

Multimodal Input Encoding:
- Historical agent trajectories: $H_{t-T_{hist}+1:t} ∈ R^{N×T_{hist}×6}$
- Agent static attributes: $A_s ∈ R^{N×6}$
- Route information: $M_r ∈ R^{N_R×D_R}$
- Traffic light states: $M_d ∈ R^{T_{hist}×N_L×3}$
Dual Cross-Attention Module: Combines agent features with scene context features to produce environment-aware enhanced agent features
Transformer Interaction Network: Models complex inter-agent dependencies
Specialized Prediction Heads: Predicts future kinematic state distribution parameters for different agent categories
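
As a rough illustration of the dual cross-attention idea, the sketch below lets each agent feature attend separately to route features and to traffic-light features, then adds both results back to the agent feature. How the paper actually fuses the two attention outputs is not stated in this summary, so the residual sum is an assumption, as are the omission of learned projections and the premise that all inputs are already embedded to a common width. Function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention with keys = values = context.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[c] for w, k in zip(weights, context)) for c in range(d)])
    return out


def dual_cross_attention(agents: Matrix, routes: Matrix, lights: Matrix) -> Matrix:
    """Environment-aware agent features: agent + route context + light context."""
    from_routes = cross_attention(agents, routes)
    from_lights = cross_attention(agents, lights)
    return [
        [a + r + l for a, r, l in zip(agent_row, route_row, light_row)]
        for agent_row, route_row, light_row in zip(agents, from_routes, from_lights)
    ]


agent_feats = [[1.0, 0.0], [0.0, 1.0]]             # N = 2 agents, width 2
route_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]] # N_R = 3 route tokens
light_feats = [[0.5, 0.5]]                         # one traffic-light token
enhanced = dual_cross_attention(agent_feats, route_feats, light_feats)
```

The output keeps one row per agent, so it can be passed directly to a subsequent inter-agent Transformer layer.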

Experimental Setup

CiCross Dataset

Data Scale: Approximately 700 hours of recorded data, with 23.6 hours used in experiments
Data Characteristics: 212,344 frames (sampled at 2.5 Hz), 56,578 unique agent instances
Agent Distribution: 54.2% motorized vehicles, 43.3% non-motorized vehicles, 2.5% pedestrians
Scene Characteristics: High agent density, TTC distribution peak around 2 seconds, reflecting high-risk interactions
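
As a quick consistency check on the figures listed above, the frame count, sampling rate, and experiment duration can be related arithmetically. The snippet below uses only the numbers given in this summary; treating "approximately 700 hours" as exactly 700 is an assumption made for the ratio.

```python
HOURS_USED = 23.6          # hours used in experiments
FRAME_RATE_HZ = 2.5        # sampling rate
REPORTED_FRAMES = 212_344  # frames reported
TOTAL_HOURS = 700.0        # "approximately 700 hours", assumed exact here

expected_frames = HOURS_USED * 3600 * FRAME_RATE_HZ      # duration x rate
implied_hours = REPORTED_FRAMES / FRAME_RATE_HZ / 3600   # frames back to hours
share_used = HOURS_USED / TOTAL_HOURS                    # fraction of the recording

print(f"expected frames: {expected_frames:.0f}")
print(f"implied hours:   {implied_hours:.2f}")
print(f"share used:      {share_used:.1%}")
```

The duration and rate imply about 212,400 frames, which matches the reported 212,344 to within rounding of the hour figure, and the experiments use roughly 3% of the full recording.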

Evaluation Metrics

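ADE (Average Displacement Error): Mean Euclidean distance between predicted and ground-truth positions over the prediction horizon
FDE (Final Displacement Error): Euclidean distance between predicted and ground-truth positions at the final predicted step
Missing Rate: Rate at which agents disappear from the simulated scene
Collapse Time: Elapsed simulation time before the closed-loop simulation collapses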
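
The two displacement metrics follow their standard definitions and can be computed for a single trajectory as in the sketch below. How the paper aggregates over agents and rollouts is not stated in this summary, so the functions here (hypothetical names) handle one predicted trajectory against one ground-truth trajectory.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], truth: Sequence[Point]) -> List[float]:
    """Per-step Euclidean distance between predicted and true positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average displacement error over the whole horizon."""
    errs = displacement_errors(pred, truth)
    return sum(errs) / len(errs)


def fde(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Displacement error at the final predicted step."""
    return displacement_errors(pred, truth)[-1]


truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]   # drifts sideways by 0, 1, 2
print(ade(pred, truth), fde(pred, truth))      # prints 1.0 2.0
```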

Implementation Details

Hardware: Single NVIDIA RTX 4090 GPU
History length: $T_{hist} = 10$
Prediction horizon: $T_{pred} = 10$
Data augmentation: Translation, rotation, displacement, trajectory error injection
Closed-loop simulation: Autoregressive execution with 1-frame step size
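
The closed-loop setting can be sketched as a sliding-window loop: the model predicts $T_{pred}$ frames from the last $T_{hist}$ frames, only the first predicted frame is executed, and that frame becomes part of the next history. In the sketch below the learned model is replaced by a constant-velocity stand-in and the scene is reduced to one agent's position; both simplifications, and all names, are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

State = Tuple[float, float]                 # (x, y) of a single agent, for brevity
Predictor = Callable[[Sequence[State]], List[State]]


def constant_velocity(history: Sequence[State], t_pred: int = 10) -> List[State]:
    """Stand-in predictor: extrapolate the last observed displacement."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, t_pred + 1)]


def rollout(initial: Sequence[State], predict: Predictor,
            steps: int, t_hist: int = 10) -> List[State]:
    """Autoregressive simulation that executes one frame per model call."""
    if len(initial) < t_hist:
        raise ValueError("need at least t_hist frames of history")
    window: Deque[State] = deque(list(initial)[-t_hist:], maxlen=t_hist)
    simulated: List[State] = []
    for _ in range(steps):
        plan = predict(list(window))        # T_pred future frames
        next_state = plan[0]                # execute only the first frame
        simulated.append(next_state)
        window.append(next_state)           # oldest frame drops out automatically
    return simulated


history = [(float(i), 0.0) for i in range(10)]
trajectory = rollout(history, constant_velocity, steps=5)
```

Because each step feeds the model its own output, small errors can compound over time, which is why long-horizon stability is evaluated separately from open-loop displacement error.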

Experimental Results

Main Results

All IDS-based models outperform baseline methods, validating the overall effectiveness of the strategy:

Method	Agent Type	ADE↓	FDE↓	Missing Rate↓
No IDS	Motorized Vehicles	0.9047	1.6526	0.2086
No IDS	Non-motorized Vehicles	1.2864	2.4415	0.4553
No IDS	Pedestrians	1.2197	2.0536	0.3732
IDS(TTC=1s)	Motorized Vehicles	0.6693	1.2496	0.1750
IDS(TTC=1s)	Non-motorized Vehicles	0.9869	1.9694	0.3310
IDS(TTC=1s)	Pedestrians	1.0086	1.6150	0.2386
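
To put the table in relative terms, the snippet below computes the fractional reduction of each metric when moving from No IDS to IDS(TTC=1s). It uses only the values in the table above; the helper name relative_reduction is introduced here for illustration.

```python
RESULTS = {
    # agent type: {metric: (No IDS, IDS with TTC = 1 s)}
    "Motorized Vehicles": {
        "ADE": (0.9047, 0.6693), "FDE": (1.6526, 1.2496), "Missing Rate": (0.2086, 0.1750),
    },
    "Non-motorized Vehicles": {
        "ADE": (1.2864, 0.9869), "FDE": (2.4415, 1.9694), "Missing Rate": (0.4553, 0.3310),
    },
    "Pedestrians": {
        "ADE": (1.2197, 1.0086), "FDE": (2.0536, 1.6150), "Missing Rate": (0.3732, 0.2386),
    },
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


for agent, metrics in RESULTS.items():
    cells = ", ".join(
        f"{name} -{relative_reduction(*pair):.1%}" for name, pair in metrics.items()
    )
    print(f"{agent}: {cells}")
```

Across the nine cells the reductions range from roughly 16% to 36%, with the largest relative gain in the pedestrian missing rate.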

Ablation Studies

TTC Threshold Sensitivity: Testing thresholds of 0s, 1s, 2s, and 4s, with 1s achieving optimal balance
Attention Mechanism Comparison: Dual cross-attention outperforms single cross-attention variants
Long-term Stability: IDS substantially extends collapse time (895 s with IDS vs 15 s without)

Distribution Fidelity Assessment

Validates the model's ability to replicate distribution-level urban traffic dynamics by comparing simulated and real data velocity distributions and nearest-distance distributions.
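
One way to quantify such a comparison is to bin both samples into normalized histograms and measure the distance between them. The sketch below uses the Hellinger distance, which is a choice made here for illustration; this summary does not say which divergence, if any, the paper reports. Function names and the sample speed values are hypothetical.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram over [lo, hi); out-of-range samples are clipped."""
    if not values:
        raise ValueError("need at least one sample")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    return [c / len(values) for c in counts]


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


real_speeds = [0.5, 1.5, 1.5, 2.5]        # made-up values, not from the paper
sim_speeds = [0.4, 1.6, 2.4, 2.6]         # made-up values, not from the paper
p = histogram(real_speeds, 0.0, 3.0, bins=3)
q = histogram(sim_speeds, 0.0, 3.0, bins=3)
distance = hellinger(p, q)
```

The same two functions apply unchanged to nearest-distance samples; only the bin range needs to change.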

Case Analysis

Demonstrates three typical interaction scenarios:

Non-motorized vehicle running red light encountering obstruction and decelerating
Motorized vehicle yielding and decelerating
Motorized vehicle turning right while encountering non-motorized vehicle flow and passing quickly

Related Work

Traffic Datasets

While existing datasets (Waymo, nuScenes, Argoverse, etc.) are large-scale and valuable, they have limitations in representing dense interactions at complex urban intersections.

Traffic Simulation Methods

Rule-Based: SUMO, VISSIM, etc., rely on predefined parameters and struggle to reproduce the diversity of real driving behaviors
Data-Driven:
- Agent-centric approaches: Learn individual behaviors but are inefficient and struggle to coordinate complex interactions
- Scene-level approaches: Directly output the next state of entire scenes but face high-dimensional distribution learning challenges

Conclusions and Discussion

Main Conclusions

The CiCross dataset successfully captures heterogeneous interaction characteristics of complex urban intersections
The IDS strategy effectively addresses the challenge of learning high-dimensional joint distributions
IntersectioNDE significantly outperforms baseline methods in simulation fidelity, stability, and distribution replication capability

Limitations

Dataset Geographic Specificity: Data come from a single real-world urban intersection, so the results may exhibit geographic bias
Computational Complexity: Transformer architecture incurs computational overhead in large-scale scenarios
Interaction Definition: TTC-based interaction grouping may oversimplify complex interaction patterns
Long-term Evaluation: While stability is improved, very long-term simulation performance requires further validation

Future Directions

Extend to more geographic regions and traffic patterns
Optimize computational efficiency
Explore more fine-grained interaction modeling methods
Integrate additional sensor modalities

In-Depth Evaluation

Strengths

Strong Problem Targeting: Focuses on practical needs of complex urban traffic in countries like China
High Methodological Innovation: IDS strategy cleverly addresses high-dimensional distribution learning challenges
Significant Dataset Value: CiCross fills a gap in dense heterogeneous interaction data
Comprehensive Experiments: Includes detailed ablation studies and case analyses
Strong Practical Value: Significantly improves long-term simulation stability

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical convergence analysis of the IDS strategy
Limited Comparison Scope: Primarily compares against self-built baselines, lacking comparison with other SOTA methods
Unknown Generalization Ability: Validated only on single intersection data; cross-scene generalization capability remains uncertain
Unreported Computational Overhead: Lacks detailed analysis of training and inference time

Impact

Academic Contribution: Provides new insights for complex urban traffic simulation
Practical Value: Significant for safety verification of AV systems in complex urban environments
Data Contribution: CiCross dataset can promote related research development
Reproducibility: The method is described clearly enough to support reimplementation

Applicable Scenarios

Urban Intersection Simulation: Particularly suitable for high-density, multi-type agent interaction scenarios
Autonomous Driving Testing: Provides tools for safety verification of AV systems in complex urban environments
Traffic Planning: Can be used for urban traffic flow analysis and optimization
Research Platform: Provides a foundational platform for traffic behavior modeling research

References

The paper cites prior work in traffic simulation, autonomous driving, and deep learning, including the Waymo dataset, NeuralNDE, and various Transformer architectures.