Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution

Zhang, Song, Li et al.

End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions such as planned trajectories, bypassing traditional modular pipelines. While these approaches have shown promise, they often operate under a one-shot paradigm that relies heavily on the current scene context, potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation restricts the model's ability to make informed and adaptive decisions in complex driving scenarios. We propose a new perspective: the future trajectory of an autonomous vehicle is closely intertwined with the evolving dynamics of its environment, and conversely, the vehicle's own future states can influence how the surrounding scene unfolds. Motivated by this bidirectional relationship, we introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene, then leverages this foresight to generate future-context-aware trajectories. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning, which refines both future scene prediction and trajectory generation through collaborative optimization. Extensive experiments on the NAVSIM and nuScenes benchmarks show that SeerDrive significantly outperforms existing state-of-the-art methods.

Basic Information

Paper ID: 2510.11092
Title: Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution
Authors: Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, Li Zhang
Category: cs.CV
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2510.11092
Code Link: https://github.com/LogosRoboticsGroup/SeerDrive

Abstract

End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions (e.g., planned trajectories), bypassing traditional modular pipelines. While these methods show promise, they typically operate under a one-shot paradigm, heavily relying on current scene context and potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation constrains the model's ability to make informed and adaptive decisions in complex driving scenarios. This paper proposes a novel perspective: the future trajectory of an autonomous vehicle is closely related to the evolutionary dynamics of its environment, and conversely, the vehicle's own future state can influence the unfolding of the surrounding scene. Based on this bidirectional relationship, the authors introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner.

Research Background and Motivation

Problem Definition

Existing end-to-end autonomous driving methods primarily adopt a "one-shot paradigm," which directly predicts future trajectories spanning several seconds based on sensor observations at the current moment. This approach has the following key limitations:

Static Scene Assumption: Relies excessively on the current scene to infer the ego vehicle's future motion, neglecting how the scene evolves over time, which is a critical factor
Unidirectional Modeling: Fails to consider the impact of the ego vehicle's future behavior on the surrounding scene's evolution
Lack of Temporal Dynamics Modeling: In dynamic interactive driving environments, this approach limits the model's adaptive decision-making capabilities

Research Motivation

The authors observe two important bidirectional dependencies:

Future traffic dynamics influence the ego vehicle's motion planning
The ego vehicle's planned behavior reciprocally shapes the future scene

Based on this insight, the authors argue that the bidirectional interaction between scene evolution and trajectory planning should be modeled explicitly.

Core Contributions

Novel Paradigm: Proposes a new end-to-end driving paradigm that explicitly captures bidirectional interactions between scene dynamics and the ego vehicle's future behavior, challenging traditional one-shot planning approaches
Unified Framework Design: Instantiates the SeerDrive framework, jointly modeling future BEV scene representations and vehicle trajectories through future-aware and iterative interaction mechanisms
Performance Breakthrough: Achieves state-of-the-art performance on NAVSIM and nuScenes benchmarks, validating the design's effectiveness

Methodology Details

Task Definition

The end-to-end autonomous driving task maps sensor inputs (cameras and LiDAR) to future ego vehicle trajectories, typically using multimodal outputs to capture diverse possible futures. World models in autonomous driving aim to predict future scene evolution based on current observations.

Model Architecture

1. Feature Encoding

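Given multi-view images $I$ and LiDAR features $P$, the encoder transforms these multimodal sensor inputs into the current BEV feature map $F^{curr}_{bev} \in \mathbb{R}^{H \times W \times C}$. An ego encoder produces the current ego feature, and a BEV decoder maps the BEV feature to the current BEV semantic map:

$$F^{curr}_{bev} = \mathrm{TransFuser}(I, P)$$
$$F^{curr}_{ego} = \mathrm{EgoEncoder}(T, E)$$
$$B^{curr} = \mathrm{BEVDecoder}(F^{curr}_{bev})$$

where $T$ represents the anchored multimodal trajectories, $E$ represents the ego vehicle state, and $B^{curr}$ is the decoded current BEV semantic map.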

2. Future BEV World Modeling

Rather than performing complex image generation, the BEV world model predicts the future scene directly in a structured BEV representation:

$$F^{fut}_{scene} = \mathrm{BEVWorldModel}(F^{curr}_{scene})$$
$$B^{fut} = \mathrm{BEVDecoder}(F^{fut}_{bev})$$

3. Future-Aware End-to-End Planning

The planning network jointly reasons about current and future scenes to generate planned trajectories. A decoupled strategy is employed where ego features interact separately with current and future BEV features:

$$F^{curr}_{ego} = \mathrm{TransformerDecoder}(F^{curr}_{ego}, F^{curr}_{bev})$$
$$F^{fut}_{ego} = \mathrm{TransformerDecoder}(F^{fut}_{ego}, F^{fut}_{bev})$$
$$T_a = \mathrm{EgoDecoder}(F^{curr}_{ego})$$
$$T_b = \mathrm{EgoDecoder}(F^{fut}_{ego})$$
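
The decoupled interaction can be illustrated with two independent cross-attention calls, one per time context. This is a rough sketch rather than the authors' implementation: the helper `cross_attend`, the toy two-dimensional features, and the omission of learned projections, residuals, and feed-forward layers are all assumptions made to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head cross-attention of one ego query over a set of BEV tokens.

    Hypothetical simplification: no learned Q/K/V projections, no residual, no FFN.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# Toy inputs (assumed shapes): one ego query per context, three BEV tokens each.
ego_curr = [1.0, 0.0]
ego_fut = [0.0, 1.0]
bev_curr = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bev_fut = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Decoupled: the current ego feature only attends to current BEV tokens,
# and the future ego feature only attends to future BEV tokens.
ego_curr_out = cross_attend(ego_curr, bev_curr)
ego_fut_out = cross_attend(ego_fut, bev_fut)
```

Because the two calls never mix current and future BEV tokens, each ego feature stays tied to a single time context until the later fusion step.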

Final fusion is achieved through Motion-aware Layer Normalization (MLN):

$$F^{curr}_{ego} = \mathrm{MLN}(F^{curr}_{ego}, F^{fut}_{ego})$$
$$T^{final} = \mathrm{EgoDecoder}(F^{curr}_{ego})$$
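
A minimal sketch of the fusion step, assuming MLN follows the usual conditional layer-norm pattern: normalize the current ego feature, then apply a scale and shift predicted from the future ego feature. The summary does not give the exact formulation, so the function `mln`, the weight matrices `w_gamma` and `w_beta`, and the `1 + gamma` scaling are assumptions for illustration only.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, vec):
    """Matrix-vector product; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def mln(curr_ego, fut_ego, w_gamma, w_beta):
    """Modulate the normalized current feature with a scale and shift
    computed from the future feature (assumed form of MLN)."""
    gamma = linear(w_gamma, fut_ego)
    beta = linear(w_beta, fut_ego)
    normed = layer_norm(curr_ego)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]

curr = [0.5, -1.0, 2.0, 0.0]
fut = [1.0, 0.0, -1.0, 0.5]
zeros = [[0.0] * 4 for _ in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# With zero modulation weights, the fusion reduces to plain layer norm.
fused_neutral = mln(curr, fut, zeros, zeros)
# With an identity shift, the future feature is added to the normalized current feature.
fused_shifted = mln(curr, fut, zeros, identity)
```

In this form the future feature acts as a conditioning signal on the current one instead of being concatenated with it, which keeps the output the same size as the current ego feature.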

4. Iterative Scene Modeling and Vehicle Planning

The BEV world modeling network and end-to-end planning network operate iteratively, progressively improving planning performance. After N iterations, N pairs of predicted future semantic maps and ego vehicle trajectories are produced.
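
The alternation can be sketched as a simple loop that returns one (future scene, trajectory) pair per iteration. The callables `world_model` and `planner`, their signatures, and the way the latest plan is passed back into the scene prediction are hypothetical stand-ins, not the networks described in the paper.

```python
from typing import Callable, List, Tuple

Scene = List[float]
Trajectory = List[float]

def iterate_scene_and_plan(
    curr_scene: Scene,
    init_traj: Trajectory,
    world_model: Callable[[Scene, Trajectory], Scene],
    planner: Callable[[Scene, Scene], Trajectory],
    num_iters: int = 2,
) -> List[Tuple[Scene, Trajectory]]:
    """Alternate future-scene prediction and planning for `num_iters` rounds."""
    outputs = []
    traj = init_traj
    for _ in range(num_iters):
        fut_scene = world_model(curr_scene, traj)  # scene conditioned on the latest plan
        traj = planner(curr_scene, fut_scene)      # plan conditioned on the predicted scene
        outputs.append((fut_scene, traj))
    return outputs

# Toy stand-ins so the loop can be executed end to end (assumptions, not real models).
def toy_world_model(scene, traj):
    return [s + 0.5 * t for s, t in zip(scene, traj)]

def toy_planner(curr_scene, fut_scene):
    return [0.5 * (c + f) for c, f in zip(curr_scene, fut_scene)]

pairs = iterate_scene_and_plan([1.0, 2.0], [0.0, 0.0], toy_world_model, toy_planner, num_iters=2)
```

With `num_iters=2` the loop yields two pairs, matching the N-pairs-after-N-iterations description above.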

Technical Innovations

Bidirectional Modeling: First to explicitly model bidirectional dependencies between scene evolution and trajectory planning in end-to-end driving
Decoupled Interaction Strategy: Avoids representation entanglement caused by direct interaction between current and future BEV features
Iterative Optimization: Progressively refines scene prediction and trajectory generation through collaborative optimization
Motion-Aware Fusion: Effectively fuses current and future ego representations using MLN

Experimental Setup

Datasets

NAVSIM: Built on nuPlan, containing 1,192 training/validation scenes and 136 test scenes, 8 cameras + LiDAR, 2Hz
nuScenes: 1,000 scenes, 6 cameras + LiDAR, 2Hz, using standard 700/150 training/validation split

Evaluation Metrics

NAVSIM: PDM Score (PDMS), including No Collision (NC), Drivable Area Compliance (DAC), Time-to-Collision (TTC), Comfort (Comf.), Ego Progress (EP)
nuScenes: L2 displacement error and collision rate
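
For orientation, the two metric families can be sketched as follows. The PDMS aggregation formula and its weights (5, 5, 2) are recalled from the NAVSIM benchmark definition and are not stated in this summary, so treat them as an assumption; the L2 function uses the error at each horizon's waypoint, which is only one of the conventions used in nuScenes planning evaluations. Function names and toy inputs are illustrative.

```python
import math

def pdm_score(nc, dac, ep, ttc, comfort):
    """Per-scenario PDM score: NC and DAC act as multiplicative penalties on a
    weighted average of EP, TTC, and comfort (assumed weights 5, 5, 2).
    All inputs are expected in [0, 1]."""
    return nc * dac * (5 * ep + 5 * ttc + 2 * comfort) / 12

def l2_at_horizons(pred, gt, hz=2, horizons=(1, 2, 3)):
    """L2 displacement error (m) at each horizon for (x, y) waypoints sampled at `hz`."""
    errors = []
    for h in horizons:
        idx = h * hz - 1  # index of the waypoint reached after h seconds
        errors.append(math.dist(pred[idx], gt[idx]))
    return errors

# Two toy scenarios: a clean drive and one with a collision (NC = 0).
scenarios = [
    dict(nc=1.0, dac=1.0, ep=0.8, ttc=1.0, comfort=1.0),
    dict(nc=0.0, dac=1.0, ep=0.9, ttc=1.0, comfort=1.0),
]
scores = [pdm_score(**s) for s in scenarios]
mean_pdms = 100 * sum(scores) / len(scores)  # the benchmark score averages per-scenario scores

# Toy 3-second trajectories at 2 Hz (6 waypoints) with growing lateral error.
gt = [(0.5 * i, 0.0) for i in range(1, 7)]
pred = [(0.5 * i, 0.1 * i) for i in range(1, 7)]
l2 = l2_at_horizons(pred, gt)
```

Because the score is computed per scenario and then averaged, plugging a table row's averaged sub-metrics into the formula will generally not reproduce the reported PDMS.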

Implementation Details

NAVSIM: ResNet34 backbone, 3 viewpoints, 1024×256 resolution, 256 trajectory modes, 4-second planning horizon
nuScenes: ResNet50 backbone, 6 viewpoints, 640×360 resolution, 6 trajectory modes, 3-second planning horizon
Training: 8 RTX 3090 GPUs, AdamW optimizer

Experimental Results

Main Results

NAVSIM Dataset Performance Comparison

Method	NC ↑	DAC ↑	TTC ↑	Comf. ↑	EP ↑	PDMS ↑
DiffusionDrive	98.2	96.2	94.7	100	82.2	88.1
WoTE	98.5	96.8	94.9	99.9	81.9	88.3
Hydra-NeXt	98.1	97.7	94.6	100	81.8	88.6
SeerDrive	98.4	97.0	94.9	99.9	83.2	88.9

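SeerDrive achieves the highest PDMS among the listed methods (88.9), 0.3 points above Hydra-NeXt (88.6), with the best Ego Progress (83.2). It does not lead on every sub-metric: WoTE has a slightly higher NC (98.5 vs. 98.4) and Hydra-NeXt a higher DAC (97.7 vs. 97.0).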

nuScenes Dataset Performance Comparison

Method	L2 (m) ↓	Col. Rate (%) ↓
	1s/2s/3s/Avg.	1s/2s/3s/Avg.
SparseDrive	0.29/0.58/0.96/0.61	0.01/0.05/0.18/0.08
SeerDrive	0.20/0.39/0.69/0.43	0.00/0.05/0.14/0.06

On nuScenes, relative to SparseDrive, SeerDrive lowers the average L2 error from 0.61 m to 0.43 m (roughly 30% lower) and the average collision rate from 0.08% to 0.06%.
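
As a quick sanity check on the table, the snippet below recomputes the Avg. columns as the mean of the 1s/2s/3s entries and derives the relative L2 reduction. The numbers are copied from the table above; the variable names are illustrative.

```python
def mean3(values):
    """Arithmetic mean of the per-horizon values."""
    return sum(values) / len(values)

# Per-horizon values (1s, 2s, 3s) copied from the nuScenes table.
l2 = {"SparseDrive": [0.29, 0.58, 0.96], "SeerDrive": [0.20, 0.39, 0.69]}
col = {"SparseDrive": [0.01, 0.05, 0.18], "SeerDrive": [0.00, 0.05, 0.14]}

l2_avg = {name: round(mean3(v), 2) for name, v in l2.items()}
col_avg = {name: round(mean3(v), 2) for name, v in col.items()}

# Relative reduction in average L2 error of SeerDrive with respect to SparseDrive.
l2_rel_reduction = 1 - l2_avg["SeerDrive"] / l2_avg["SparseDrive"]
```

The recomputed averages agree with the reported Avg. entries (0.61 and 0.43 m; 0.08% and 0.06%), and the relative L2 reduction comes out just under 30%.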

Ablation Studies

Core Component Analysis

Future-aware plan	Iter. S&V	PDMS ↑
		87.1
✓		87.9
	✓	88.1
✓	✓	88.9

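Each component helps on its own over the 87.1 baseline (+0.8 PDMS for future-aware planning, +1.0 for iterative scene modeling and vehicle planning), and combining them yields the largest gain (+1.8, reaching 88.9).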

Iteration Count Analysis

Iterations	PDMS ↑
1	88.1
2	88.9
3	88.7

Two iterations give the highest PDMS (88.9); a third iteration lowers it slightly to 88.7, so two iterations offer the best balance between efficiency and performance.

Qualitative Results

The paper presents visualizations of right-turn and left-turn scenarios, demonstrating that the model can:

Accurately predict future BEV semantic maps
Generate planned trajectories highly consistent with ground truth
Capture multimodal possible future motions

Related Work

End-to-End Autonomous Driving

Early Methods: Direct inference of trajectories or actions from sensor data
Unified Frameworks: UniAD unifies perception, prediction, and planning; VAD employs vectorized representations
Recent Progress: DiffusionDrive uses truncated diffusion strategies; DriveTransformer explores scaling laws

World Models in Autonomous Driving

Video Generation Methods: DriveDreamer, Drive-WM, etc., generate realistic videos
BEV Modeling: SLEDGE, GUMP, Scenario Dreamer, etc., model in BEV space
Joint Modeling: OccWorld, Drive-OccWorld, etc., jointly generate occupancy and actions

This paper differs from existing methods in that world modeling and planning interact in both directions and are refined together over multiple iterations.

Conclusions and Discussion

Main Conclusions

Proposes a novel paradigm for bidirectional modeling of scene evolution and trajectory planning
SeerDrive framework effectively implements future-aware end-to-end driving
Achieves state-of-the-art performance on two benchmark datasets

Limitations

Foundation Model Constraints: The BEV world model employs a specially designed transformer architecture and therefore does not leverage the generalization capabilities of foundation models
Inference Speed: Using off-the-shelf foundation models as the world model would instead bring slow inference and make joint optimization with the planner difficult
Complex Scene Handling: Failure cases persist in certain complex scenarios, such as incorrect lane selection and driving intent inference errors

Future Directions

Develop paradigms with tightly integrated planning and world modeling
Explore foundation model applications in end-to-end driving
Incorporate high-level driving intent to improve planning accuracy

In-Depth Evaluation

Strengths

Strong Innovation: First to systematically model bidirectional relationships between scene evolution and trajectory planning, breaking through traditional one-shot paradigms
Reasonable Technical Design: Decoupled interaction strategies, iterative optimization, and other design choices effectively address practical challenges
Comprehensive Experiments: Thorough evaluation across multiple datasets with detailed ablation studies
Significant Performance Gains: Demonstrates clear improvements on challenging NAVSIM and nuScenes benchmarks

Weaknesses

Computational Complexity: Iterative modeling increases computational overhead, requiring efficiency considerations for practical deployment
Generalization Capability: Specially designed architecture may limit generalization across different scenarios
Insufficient Failure Analysis: Deeper investigation into the root causes of model failures is needed

Impact

Academic Contribution: Provides new research paradigms and perspectives for the end-to-end autonomous driving field
Practical Value: Demonstrates good performance on benchmarks built from real-world driving data (NAVSIM and nuScenes), indicating application potential
Reproducibility: Provides detailed implementation details and open-source code, facilitating reproduction and future research

Applicable Scenarios

Complex urban driving environments
Scenarios requiring multi-agent interaction consideration
Autonomous driving systems with high planning accuracy requirements
End-to-end learning research in autonomous driving

References

The paper cites 58 relevant references covering key works in end-to-end autonomous driving, world models, and joint modeling, providing a solid theoretical foundation for this research.

Overall Assessment: This is a high-quality autonomous driving research paper that proposes an innovative bidirectional modeling paradigm with well-designed technical solutions and comprehensive experimental evaluation. It achieves significant performance improvements on important benchmarks and opens new research directions for end-to-end autonomous driving, demonstrating both substantial academic value and practical significance.