Realistic traffic simulation is critical for ensuring the safety and reliability of autonomous vehicles (AVs), especially in complex and diverse urban traffic environments. However, existing data-driven simulators face two key challenges. First, they pay limited attention to dense, heterogeneous interactions at urban intersections, which are prevalent and practically significant in countries like China and involve diverse agents including motorized vehicles (MVs), non-motorized vehicles (NMVs), and pedestrians. Second, robustly learning high-dimensional joint distributions for such high-density scenes is inherently difficult, often leading to mode collapse and long-term simulation instability. We introduce the City Crossings Dataset (CiCross), a large-scale dataset collected from a real-world urban intersection that captures dense, heterogeneous multi-agent interactions with a substantial proportion of MVs, NMVs, and pedestrians. Based on this dataset, we propose IntersectioNDE (Intersection Naturalistic Driving Environment), a data-driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), a training paradigm that learns compositional dynamics from agent subsets, enabling marginal-to-joint simulation. Integrated into a scene-aware Transformer network with specialized training techniques, IDS significantly enhances simulation robustness and long-term stability for modeling heterogeneous interactions. Experiments on CiCross show that IntersectioNDE outperforms baseline methods in simulation fidelity, stability, and its ability to replicate complex, distribution-level urban traffic dynamics.
- Paper ID: 2510.11534
- Title: IntersectioNDE: Learning Complex Urban Traffic Dynamics based on Interaction Decoupling Strategy
- Authors: Enli Lin, Ziyuan Yang, Qiujing Lu, Jianming Hu, Shuo Feng (Tsinghua University)
- Categories: cs.RO (Robotics), cs.SY (Systems and Control), eess.SY (Systems and Control)
- Publication Date: October 13, 2025
- Paper Link: https://arxiv.org/abs/2510.11534
Realistic traffic simulation is crucial for ensuring the safety and reliability of autonomous vehicles (AVs), particularly in complex and diverse urban traffic environments. However, existing data-driven simulators face two critical challenges: limited focus on modeling dense heterogeneous interactions at urban intersections, and inherent difficulties in robustly learning high-dimensional joint distributions in high-density scenarios. This paper introduces the City Crossings Dataset (CiCross), a large-scale dataset collected from a real-world urban intersection that captures dense heterogeneous multi-agent interactions. Based on this dataset, the authors propose IntersectioNDE, a data-driven simulator tailored for complex urban intersection scenarios. Its core component is the Interaction Decoupling Strategy (IDS), which learns compositional dynamics from agent subsets to achieve marginal-to-joint simulation.
The core problem addressed by this research is high-fidelity traffic simulation for complex urban intersections, particularly in dense heterogeneous interaction scenarios involving motorized vehicles (MVs), non-motorized vehicles (NMVs), and pedestrians.
- Autonomous Driving Safety Verification Needs: Simulation testing is widely adopted due to its scalability, cost-effectiveness, and ability to explore safety-critical edge cases
- Complex Urban Environment Challenges: Urban intersections in countries like China exhibit dense and heterogeneous traffic patterns that existing methods struggle to model effectively
- Practical Value: Accurate traffic simulation is critical for safe deployment of AV systems
- Insufficient Scenario Coverage: Existing data-driven simulators have limited focus on modeling dense heterogeneous urban intersection interactions
- Technical Challenges: Direct learning of full-scene high-dimensional joint distributions faces inherent difficulties, often resulting in mode collapse and long-term simulation instability
- Dataset Limitations: Existing datasets lack sufficient representation of dense interactions among MVs, NMVs, and pedestrians
The goal is to develop a traffic simulation system that robustly models heterogeneous interactions while maintaining long-term stability, addressing the specific needs of complex urban traffic environments in countries like China.
- Proposed the CiCross Dataset: A large-scale real urban intersection dataset that uniquely captures dense heterogeneous multi-agent interactions
- Designed the IntersectioNDE Simulator: A data-driven scenario-level simulator specifically tailored for complex urban intersection scenarios
- Introduced the Interaction Decoupling Strategy (IDS): A training paradigm that enables marginal-to-joint simulation by learning compositional dynamics from agent subsets
- Constructed a Scene-Aware Transformer Network: Integrating specialized training techniques to significantly enhance simulation robustness and long-term stability
The traffic simulation task is modeled as learning a generative model capable of producing realistic future scene states within a prediction time horizon $T_{\text{pred}}$.
Let $A_\tau = \{a_1, \dots, a_{N_\tau}\}$ denote the set of $N_\tau$ agents present at time $\tau$. The state of agent $a_j$ at time $\tau$ is $s_{j,\tau} \in S_{\text{agent}}$. A complete scene instance $G_\tau$ contains agent states $S_\tau$, static map information $M$, and dynamic traffic light states $L_\tau$.
The objective is to learn the conditional probability distribution:
$$P_{\text{data}}\left(G_{t+1:t+T_{\text{pred}}} \mid G_{t-T_{\text{hist}}+1:t}\right)$$
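- Agent Grouping: Partition the agent set $A_t$ into $k$ disjoint interaction groups based on predefined spatial and behavioral criteria (e.g., time-to-collision, TTC):
$$A_t = \{A_{t,1}, A_{t,2}, \dots, A_{t,k}\}$$
- Subset Sampling: Randomly sample a subset of group indices $I \subseteq \{1, \dots, k\}$ and construct a scene instance containing only the agents in the sampled groups
- Conditional Probability Learning: Train a neural network model $F_\theta$ to predict the conditional probability distribution of the sampled future scene instance:
$$P_{\text{model}}\left(\hat{G}^{(I)}_{t+1:t+T_{\text{pred}}} \mid G^{\text{GT}(I)}_{t-T_{\text{hist}}+1:t};\ \theta\right)$$
- Training Objective: Minimize the expected negative log-likelihood:
$$L(\theta) = -\mathbb{E}_{\hat{G} \sim D_{\text{data}}}\, \mathbb{E}_{I \sim P_{\text{sample}}(I)}\left[\log P_{\text{model}}\left(\hat{G}^{(I)}_{t+1:t+T_{\text{pred}}} \mid G^{\text{GT}(I)}_{t-T_{\text{hist}}+1:t};\ \theta\right)\right]$$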
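The grouping and subset-sampling steps can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes agents are discs moving at constant velocity, that two agents interact when their time-to-collision falls at or below a threshold, and that groups are the connected components of that interaction graph. The names `time_to_collision`, `group_agents`, `sample_group_subset`, the `collision_dist` parameter, and the Bernoulli sampling rule standing in for $P_{\text{sample}}(I)$ are all hypothetical.

```python
import math
import random
from itertools import combinations


def time_to_collision(a, b, collision_dist=1.0):
    """Constant-velocity TTC between agents a and b, each given as (x, y, vx, vy).

    Returns 0.0 if the agents are already within collision_dist of each other
    and math.inf if they never come that close.
    """
    px, py = b[0] - a[0], b[1] - a[1]      # relative position
    vx, vy = b[2] - a[2], b[3] - a[3]      # relative velocity
    c = px * px + py * py - collision_dist ** 2
    if c <= 0:
        return 0.0
    vv = vx * vx + vy * vy
    pv = px * vx + py * vy
    if vv == 0 or pv >= 0:                 # no relative motion, or moving apart
        return math.inf
    disc = pv * pv - vv * c                # discriminant of |p + v t| = collision_dist
    if disc < 0:
        return math.inf
    return (-pv - math.sqrt(disc)) / vv


def group_agents(agents, ttc_threshold=1.0):
    """Partition agent ids into disjoint groups linked by TTC <= ttc_threshold."""
    parent = {i: i for i in agents}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(agents, 2):
        if time_to_collision(agents[i], agents[j]) <= ttc_threshold:
            parent[find(i)] = find(j)      # merge the two groups

    groups = {}
    for i in agents:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


def sample_group_subset(groups, keep_prob=0.5, rng=random):
    """Keep each group independently with probability keep_prob (at least one group)."""
    kept = [g for g in groups if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(groups)]
    return sorted(agent for g in kept for agent in g)


if __name__ == "__main__":
    scene = {"A": (0, 0, 5, 0), "B": (10, 0, -5, 0), "C": (100, 100, 0, 0)}
    groups = group_agents(scene, ttc_threshold=1.0)
    print(groups)
    print(sample_group_subset(groups))
```

Because whole groups are kept or dropped together, every sampled training scene preserves the interactions inside each retained group while varying which groups appear together.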
During inference, the model achieves prediction from partial to complete scenes through the following mechanism:
- Interaction Primitive Learning: IDS training enables the model to acquire a diverse set of conditional interaction primitives $P = \{p_1, p_2, \dots, p_L\}$
- Primitive Recognition and Synthesis: For any scene $G_t$, the model first identifies the combination of learned interaction primitives in the current configuration, then synthesizes their future states
- Robustness Enhancement: By mastering fundamental building blocks, the model can coherently predict complex scene dynamics, even for interaction combinations not explicitly seen during training
A multi-input Transformer network with encoder-interaction-prediction structure:
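- Multimodal Input Encoding:
  - Historical agent trajectories: $H_{t-T_{\text{hist}}+1:t} \in \mathbb{R}^{N \times T_{\text{hist}} \times 6}$
  - Agent static attributes: $A_s \in \mathbb{R}^{N \times 6}$
  - Route information: $M_r \in \mathbb{R}^{N_R \times D_R}$
  - Traffic light states: $M_d \in \mathbb{R}^{T_{\text{hist}} \times N_L \times 3}$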
- Dual Cross-Attention Module: Combines agent features with scene context features to produce environment-aware enhanced agent features
- Transformer Interaction Network: Models complex inter-agent dependencies
- Specialized Prediction Heads: Predict the parameters of the future kinematic state distribution separately for each agent category
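To make the idea of fusing agent features with scene context concrete, here is a toy sketch of a dual cross-attention step in plain Python. It is an assumption-heavy illustration, not the paper's module. It assumes the two branches attend to route tokens and traffic-light tokens respectively, uses single-head scaled dot-product attention with no learned projections, and merges the branches with a residual sum. The function names and the choice of branches are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention; context serves as keys and values.

    queries: list of d-dimensional vectors (one per agent).
    context: list of d-dimensional vectors (one per context token).
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        )
    return attended


def dual_cross_attention(agent_feats, route_feats, light_feats):
    """Assumed fusion: agent features plus one attention branch per context type."""
    from_routes = cross_attention(agent_feats, route_feats)
    from_lights = cross_attention(agent_feats, light_feats)
    return [
        [a + r + l for a, r, l in zip(af, rf, lf)]
        for af, rf, lf in zip(agent_feats, from_routes, from_lights)
    ]


if __name__ == "__main__":
    agents = [[1.0, 0.0], [0.0, 1.0]]
    routes = [[0.5, 0.5], [1.0, -1.0]]
    lights = [[0.0, 2.0]]
    print(dual_cross_attention(agents, routes, lights))
```

A real implementation would add learned query, key, and value projections, multiple heads, and normalization; the sketch only shows how each agent's features can be enriched by two separate context sources.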
- Data Scale: Approximately 700 hours of recorded data, with 23.6 hours used in experiments
- Data Characteristics: 212,344 frames (2.5 Hz), 56,578 unique agent instances
- Agent Distribution: 54.2% motorized vehicles, 43.3% non-motorized vehicles, 2.5% pedestrians
- Scene Characteristics: High agent density, TTC distribution peak around 2 seconds, reflecting high-risk interactions
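- ADE (Average Displacement Error): Mean Euclidean distance between predicted and ground-truth positions over all predicted time steps
- FDE (Final Displacement Error): Euclidean distance between predicted and ground-truth positions at the final predicted time step
- Missing Rate: Rate at which agents disappear during simulation
- Collapse Time: Time until the simulation collapses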
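The two displacement metrics follow their standard definitions and can be computed for a single agent as in the sketch below. The function name `displacement_errors` and the trajectory representation (a list of `(x, y)` positions per time step) are illustrative assumptions; the paper's exact aggregation across agents and scenes is not specified here.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one agent.

    predicted, ground_truth: equal-length, non-empty lists of (x, y) positions,
    one entry per predicted time step.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    ade = sum(distances) / len(distances)  # mean error over the horizon
    fde = distances[-1]                    # error at the final step
    return ade, fde


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0)]
    truth = [(0.0, 0.0), (0.0, 0.0)]
    print(displacement_errors(pred, truth))  # (2.5, 5.0)
```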
- Hardware: Single NVIDIA RTX 4090 GPU
- History length: $T_{\text{hist}} = 10$
- Prediction horizon: $T_{\text{pred}} = 10$
- Data augmentation: Translation, rotation, displacement, trajectory error injection
- Closed-loop simulation: Autoregressive execution with 1-frame step size
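Autoregressive closed-loop execution with a 1-frame step can be sketched as a sliding-window loop: the model predicts a full horizon, only the first predicted frame is committed, and that frame is fed back as the newest history entry. The sketch below is illustrative only. `closed_loop_rollout` and the `constant_velocity` stand-in predictor are hypothetical names, and frames are simplified to single floats rather than full scene states.

```python
from collections import deque


def closed_loop_rollout(predict, history, num_steps, t_hist=10):
    """Roll a predictor forward one frame at a time.

    predict: callable taking a list of the latest t_hist frames and returning
             a list of predicted future frames (the full prediction horizon).
    history: list of observed frames, oldest first.
    """
    window = deque(history[-t_hist:], maxlen=t_hist)  # sliding history window
    rollout = []
    for _ in range(num_steps):
        future = predict(list(window))
        next_frame = future[0]      # commit only the first predicted frame
        rollout.append(next_frame)
        window.append(next_frame)   # oldest frame drops out automatically
    return rollout


def constant_velocity(frames, t_pred=10):
    """Toy stand-in for the learned model: extrapolate the last displacement."""
    step = frames[-1] - frames[-2]
    return [frames[-1] + step * (k + 1) for k in range(t_pred)]


if __name__ == "__main__":
    observed = [float(i) for i in range(10)]
    print(closed_loop_rollout(constant_velocity, observed, num_steps=3))
    # [10.0, 11.0, 12.0]
```

Because each committed frame becomes model input at the next step, prediction errors can compound over time, which is why long-horizon stability is evaluated separately from open-loop displacement errors.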
All IDS-based models outperform baseline methods, validating the overall effectiveness of the strategy:
| Method | Agent Type | ADE↓ | FDE↓ | Missing Rate↓ |
|---|---|---|---|---|
| No IDS | Motorized Vehicles | 0.9047 | 1.6526 | 0.2086 |
| No IDS | Non-motorized Vehicles | 1.2864 | 2.4415 | 0.4553 |
| No IDS | Pedestrians | 1.2197 | 2.0536 | 0.3732 |
| IDS(TTC=1s) | Motorized Vehicles | 0.6693 | 1.2496 | 0.1750 |
| IDS(TTC=1s) | Non-motorized Vehicles | 0.9869 | 1.9694 | 0.3310 |
| IDS(TTC=1s) | Pedestrians | 1.0086 | 1.6150 | 0.2386 |
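The gains in the table are easier to compare across agent types as relative reductions. The short script below computes them from the tabulated values; the variable and function names are illustrative, and the numbers are copied directly from the table.

```python
# Values copied from the results table: (ADE, FDE, Missing Rate).
NO_IDS = {
    "Motorized Vehicles": (0.9047, 1.6526, 0.2086),
    "Non-motorized Vehicles": (1.2864, 2.4415, 0.4553),
    "Pedestrians": (1.2197, 2.0536, 0.3732),
}
IDS_TTC_1S = {
    "Motorized Vehicles": (0.6693, 1.2496, 0.1750),
    "Non-motorized Vehicles": (0.9869, 1.9694, 0.3310),
    "Pedestrians": (1.0086, 1.6150, 0.2386),
}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


def reductions_by_agent_type():
    """Map agent type -> (ADE, FDE, Missing Rate) reductions in percent."""
    return {
        agent: tuple(
            round(relative_reduction(b, i), 1)
            for b, i in zip(NO_IDS[agent], IDS_TTC_1S[agent])
        )
        for agent in NO_IDS
    }


if __name__ == "__main__":
    for agent, (ade, fde, miss) in reductions_by_agent_type().items():
        print(f"{agent}: ADE -{ade}%, FDE -{fde}%, Missing Rate -{miss}%")
```

With these numbers, IDS (TTC = 1 s) lowers ADE by about 26.0% for motorized vehicles, 23.3% for non-motorized vehicles, and 17.3% for pedestrians, while the largest missing-rate reduction (about 36.1%) occurs for pedestrians.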
- TTC Threshold Sensitivity: Thresholds of 0 s, 1 s, 2 s, and 4 s were tested, with 1 s achieving the best balance
- Attention Mechanism Comparison: Dual cross-attention outperforms single cross-attention variants
- Long-term Stability: IDS significantly extends collapse time (895 s with IDS vs. 15 s without)
Comparing the velocity distributions and nearest-distance distributions of simulated and real data validates the model's ability to replicate distribution-level urban traffic dynamics.
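One way to quantify how closely a simulated distribution matches the real one is to histogram both samples over shared bins and compute a distance between the histograms. The sketch below uses the Hellinger distance purely as an illustrative choice; the summary does not state which measure the paper uses, and the function names, bin settings, and sample values are assumptions.

```python
import math


def histogram(samples, lo, hi, bins):
    """Normalized histogram of samples over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        if lo <= s <= hi:
            idx = min(int((s - lo) / width), bins - 1)  # put s == hi in last bin
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi]")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


if __name__ == "__main__":
    # Made-up speed samples in m/s, for illustration only.
    real_speeds = [0.5, 1.5, 1.5, 3.9]
    sim_speeds = [0.4, 1.6, 2.5, 3.8]
    p = histogram(real_speeds, 0.0, 4.0, 4)
    q = histogram(sim_speeds, 0.0, 4.0, 4)
    print(hellinger(p, q))
```

The same procedure applies to nearest-distance samples by changing the bin range.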
Case studies demonstrate three typical interaction scenarios:
- A non-motorized vehicle runs a red light, encounters an obstruction, and decelerates
- A motorized vehicle yields and decelerates
- A motorized vehicle turns right, encounters a flow of non-motorized vehicles, and passes through quickly
While existing datasets (Waymo, nuScenes, Argoverse, etc.) are large-scale and valuable, they have limitations in representing dense interactions at complex urban intersections.
- Rule-Based: SUMO, VISSIM, etc., rely on predefined parameters and struggle to reproduce the diversity of real driving behaviors
- Data-Driven:
  - Agent-centric approaches: Learn individual behaviors but are inefficient and struggle to coordinate complex interactions
  - Scene-level approaches: Directly output the next state of entire scenes but face high-dimensional distribution learning challenges
- The CiCross dataset successfully captures heterogeneous interaction characteristics of complex urban intersections
- The IDS strategy effectively addresses the challenge of learning high-dimensional joint distributions
- IntersectioNDE significantly outperforms baseline methods in simulation fidelity, stability, and distribution replication capability
- Dataset Geographic Specificity: Primarily based on Chinese urban intersections, potentially exhibiting geographic bias
- Computational Complexity: Transformer architecture incurs computational overhead in large-scale scenarios
- Interaction Definition: TTC-based interaction grouping may oversimplify complex interaction patterns
- Long-term Evaluation: While stability is improved, very long-term simulation performance requires further validation
- Extend to more geographic regions and traffic patterns
- Optimize computational efficiency
- Explore more fine-grained interaction modeling methods
- Integrate additional sensor modalities
- Strong Problem Targeting: Focuses on practical needs of complex urban traffic in countries like China
- High Methodological Innovation: IDS strategy cleverly addresses high-dimensional distribution learning challenges
- Significant Dataset Value: CiCross fills a gap in dense heterogeneous interaction data
- Comprehensive Experiments: Includes detailed ablation studies and case analyses
- Strong Practical Value: Significantly improves long-term simulation stability
- Insufficient Theoretical Analysis: Lacks theoretical convergence analysis of the IDS strategy
- Limited Comparison Scope: Primarily compares against self-built baselines, lacking comparison with other SOTA methods
- Unknown Generalization Ability: Validated only on single intersection data; cross-scene generalization capability remains uncertain
- Unreported Computational Overhead: Lacks detailed analysis of training and inference time
- Academic Contribution: Provides new insights for complex urban traffic simulation
- Practical Value: Significant for safety verification of AV systems in complex urban environments
- Data Contribution: CiCross dataset can promote related research development
- Reproducibility: The method is described clearly, which supports reproducibility
- Urban Intersection Simulation: Particularly suitable for high-density, multi-type agent interaction scenarios
- Autonomous Driving Testing: Provides tools for safety verification of AV systems in complex urban environments
- Traffic Planning: Can be used for urban traffic flow analysis and optimization
- Research Platform: Provides a foundational platform for traffic behavior modeling research
The paper cites key works in traffic simulation, autonomous driving, and deep learning, including the Waymo dataset, NeuralNDE, and various Transformer architectures.