Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution

Zhang, Song, Li et al.
End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions such as planned trajectories, bypassing traditional modular pipelines. While these approaches have shown promise, they often operate under a one-shot paradigm that relies heavily on the current scene context, potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation restricts the model's ability to make informed and adaptive decisions in complex driving scenarios. We propose a new perspective: the future trajectory of an autonomous vehicle is closely intertwined with the evolving dynamics of its environment, and conversely, the vehicle's own future states can influence how the surrounding scene unfolds. Motivated by this bidirectional relationship, we introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene, then leverages this foresight to generate future-context-aware trajectories. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning, which refines both future scene prediction and trajectory generation through collaborative optimization. Extensive experiments on the NAVSIM and nuScenes benchmarks show that SeerDrive significantly outperforms existing state-of-the-art methods.

Basic Information

  • Paper ID: 2510.11092
  • Title: Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution
  • Authors: Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, Li Zhang
  • Category: cs.CV
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.11092
  • Code Link: https://github.com/LogosRoboticsGroup/SeerDrive

Abstract

End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions (e.g., planned trajectories), bypassing traditional modular pipelines. While these methods show promise, they typically operate under a one-shot paradigm, heavily relying on current scene context and potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation constrains the model's ability to make informed and adaptive decisions in complex driving scenarios. This paper proposes a novel perspective: the future trajectory of an autonomous vehicle is closely related to the evolutionary dynamics of its environment, and conversely, the vehicle's own future state can influence the unfolding of the surrounding scene. Based on this bidirectional relationship, the authors introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner.

Research Background and Motivation

Problem Definition

Existing end-to-end autonomous driving methods primarily adopt a "one-shot paradigm," which directly predicts future trajectories spanning several seconds based on sensor observations at the current moment. This approach has the following key limitations:

  1. Static Scene Assumption: Relies too heavily on the current scene to infer the ego vehicle's future motion and neglects how the scene evolves over time, which is a critical factor
  2. Unidirectional Modeling: Fails to consider the impact of the ego vehicle's future behavior on the surrounding scene's evolution
  3. Lack of Temporal Dynamics Modeling: In dynamic interactive driving environments, this approach limits the model's adaptive decision-making capabilities

Research Motivation

The authors observe two important bidirectional dependencies:

  • Future traffic dynamics influence the ego vehicle's motion planning
  • The ego vehicle's planned behavior reciprocally shapes the future scene

Based on this insight, the authors propose the need to explicitly model the bidirectional interaction between scene evolution and trajectory planning.

Core Contributions

  1. Novel Paradigm: Proposes a new end-to-end driving paradigm that explicitly captures bidirectional interactions between scene dynamics and the ego vehicle's future behavior, challenging traditional one-shot planning approaches
  2. Unified Framework Design: Instantiates the SeerDrive framework, jointly modeling future BEV scene representations and vehicle trajectories through future-aware and iterative interaction mechanisms
  3. Performance Breakthrough: Achieves state-of-the-art performance on NAVSIM and nuScenes benchmarks, validating the design's effectiveness

Methodology Details

Task Definition

The end-to-end autonomous driving task maps sensor inputs (cameras and LiDAR) to future ego vehicle trajectories, typically producing several candidate trajectories (multi-modal outputs) to capture diverse possible futures. World models in autonomous driving aim to predict future scene evolution based on current observations.

Model Architecture

1. Feature Encoding

Given multi-view images I and LiDAR features P, the encoder transforms these multimodal sensor inputs into the current BEV feature map F^curr_bev ∈ R^{H×W×C}:

F^curr_bev = TransFuser(I, P)
F^curr_ego = EgoEncoder(T, E)
B^curr = BEVDecoder(F^curr_bev)

where T represents the anchored multimodal trajectories, E represents the ego vehicle state, and B^curr is the BEV semantic map decoded from the current BEV features.

2. Future BEV World Modeling

The BEV world model predicts the future scene in a structured BEV feature space rather than through complex image generation:

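F^fut_scene = BEVWorldModel(F^curr_scene)
B^fut = BEVDecoder(F^fut_bev)

where the scene feature F^scene presumably bundles the BEV and ego features (F^bev and F^ego), so that F^fut_bev and F^fut_ego are read out of F^fut_scene, and B^fut is the predicted future BEV semantic map.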

3. Future-Aware End-to-End Planning

The planning network jointly reasons about current and future scenes to generate planned trajectories. A decoupled strategy is employed where ego features interact separately with current and future BEV features:

F^curr_ego = TransformerDecoder(F^curr_ego, F^curr_bev)
F^fut_ego = TransformerDecoder(F^fut_ego, F^fut_bev)
Ta = EgoDecoder(F^curr_ego)
Tb = EgoDecoder(F^fut_ego)
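
The decoupled interaction above can be illustrated with a minimal sketch. It assumes each TransformerDecoder is reduced to a single dot-product cross-attention step with a residual update and no learned projections; the function names and toy features are hypothetical and are not taken from the SeerDrive code. The point is only that the current ego feature never reads the future BEV map, and vice versa.

```python
import math

def cross_attend(query, keys):
    """Single-head dot-product attention of one query vector over key vectors.

    Keys double as values and there are no learned projections (a simplification).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]

def decoupled_interaction(ego_curr, ego_fut, bev_curr, bev_fut):
    """Each ego feature attends only to its own BEV map (residual update)."""
    upd_curr = cross_attend(ego_curr, bev_curr)
    upd_fut = cross_attend(ego_fut, bev_fut)
    new_curr = [e + u for e, u in zip(ego_curr, upd_curr)]
    new_fut = [e + u for e, u in zip(ego_fut, upd_fut)]
    return new_curr, new_fut

# Toy data: 2-dim features, BEV maps flattened to a list of cell vectors.
bev_curr = [[1.0, 0.0], [0.0, 1.0]]
bev_fut = [[2.0, 2.0], [2.0, 2.0]]
ego_curr, ego_fut = decoupled_interaction(
    [0.0, 0.0], [1.0, -1.0], bev_curr, bev_fut
)
```

Because the two branches share no inputs, changing `bev_fut` leaves the updated current ego feature untouched, which is the property the decoupled strategy relies on.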

Final fusion is achieved through Motion-aware Layer Normalization (MLN):

F^curr_ego = MLN(F^curr_ego, F^fut_ego)
T^final = EgoDecoder(F^curr_ego)
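
As a rough illustration of the fusion step, the sketch below assumes MLN follows the usual conditional layer-norm pattern: the current ego feature is normalized without learned affine terms, and a scale and shift are predicted from the future ego feature. The linear maps are hand-set placeholders and every name is hypothetical; the paper's actual layer may differ in detail.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, x):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def mln_fuse(ego_curr, ego_fut, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift the normalized current feature using the future feature."""
    normed = layer_norm(ego_curr)
    gamma = linear(w_gamma, b_gamma, ego_fut)  # scale predicted from the future
    beta = linear(w_beta, b_beta, ego_fut)     # shift predicted from the future
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Placeholder weights: gamma is constant 1, beta equals the future ego feature.
fused = mln_fuse([1.0, 3.0], [0.5, -0.5],
                 zeros, [1.0, 1.0], identity, [0.0, 0.0])
```

With these placeholder weights the output is simply the normalized current feature shifted by the future feature, which makes the conditioning path easy to inspect.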

4. Iterative Scene Modeling and Vehicle Planning

The BEV world modeling network and end-to-end planning network operate iteratively, progressively improving planning performance. After N iterations, N pairs of predicted future semantic maps and ego vehicle trajectories are produced.
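
The alternation described above can be sketched as a simple loop. This is a control-flow sketch only: `world_model` and `planner` are hypothetical callables standing in for the two networks, and feeding the refined ego feature back into the next world-model pass is an assumption about how the iteration is wired.

```python
def iterate_scene_and_plan(bev_curr, ego_curr, world_model, planner, num_iters=2):
    """Alternate world modeling and planning.

    Returns one (future_map, trajectory) pair per iteration.
    """
    outputs = []
    ego = ego_curr
    for _ in range(num_iters):
        # Scene evolution conditioned on the latest ego feature.
        bev_fut, ego_fut, future_map = world_model(bev_curr, ego)
        # Planning conditioned on the predicted future scene.
        ego, trajectory = planner(ego, ego_fut, bev_curr, bev_fut)
        outputs.append((future_map, trajectory))
    return outputs

# Scalar stand-ins for the networks, just to exercise the loop.
def toy_world_model(bev, ego):
    bev_fut = bev + ego
    return bev_fut, ego + 1.0, bev_fut

def toy_planner(ego, ego_fut, bev_curr, bev_fut):
    return 0.5 * (ego + ego_fut), [ego, ego_fut]

pairs = iterate_scene_and_plan(1.0, 0.0, toy_world_model, toy_planner, num_iters=2)
```

Each pass yields its own map and trajectory, so N iterations give N pairs, matching the description above.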

Technical Innovations

  1. Bidirectional Modeling: First to explicitly model bidirectional dependencies between scene evolution and trajectory planning in end-to-end driving
  2. Decoupled Interaction Strategy: Avoids representation entanglement caused by direct interaction between current and future BEV features
  3. Iterative Optimization: Progressively refines scene prediction and trajectory generation through collaborative optimization
  4. Motion-Aware Fusion: Effectively fuses current and future ego representations using MLN

Experimental Setup

Datasets

  • NAVSIM: Built on nuPlan, containing 1,192 training/validation scenes and 136 test scenes, 8 cameras + LiDAR, 2Hz
  • nuScenes: 1,000 scenes, 6 cameras + LiDAR, 2Hz, using standard 700/150 training/validation split

Evaluation Metrics

  • NAVSIM: PDM Score (PDMS), including No Collision (NC), Drivable Area Compliance (DAC), Time-to-Collision (TTC), Comfort (Comf.), Ego Progress (EP)
  • nuScenes: L2 displacement error and collision rate
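
For orientation, the sketch below shows how the PDMS sub-scores are commonly combined. The formula is not stated in this summary; it is the NAVSIM benchmark's definition as generally reported (NC and DAC act as multiplicative penalties, while EP, TTC, and Comfort are averaged with weights 5, 5, and 2), so treat it as an assumption here. The function name is hypothetical.

```python
def pdm_score(nc, dac, ep, ttc, comfort):
    """Combine per-scenario sub-scores, each in [0, 1], into a PDM score.

    Assumed form: NC * DAC * (5*EP + 5*TTC + 2*Comfort) / 12.
    """
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted

perfect = pdm_score(1.0, 1.0, 1.0, 1.0, 1.0)
collided = pdm_score(0.0, 1.0, 1.0, 1.0, 1.0)
slow = pdm_score(1.0, 1.0, 0.5, 1.0, 1.0)
```

Because the score is computed per scenario and then averaged, plugging dataset-level sub-score averages into this function will not in general reproduce a reported PDMS value.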

Implementation Details

  • NAVSIM: ResNet34 backbone, 3 viewpoints, 1024×256 resolution, 256 trajectory modes, 4-second planning horizon
  • nuScenes: ResNet50 backbone, 6 viewpoints, 640×360 resolution, 6 trajectory modes, 3-second planning horizon
  • Training: 8 RTX 3090 GPUs, AdamW optimizer
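
The two setups can be written down as a small configuration sketch. The class and field names are hypothetical, the resolution is assumed to be width × height, and the waypoint count assumes the planner emits poses at the 2 Hz rate listed for the datasets; none of this is taken from the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlannerConfig:
    backbone: str
    num_views: int
    image_size: tuple  # assumed (width, height) in pixels
    num_modes: int     # number of candidate trajectories
    horizon_s: float   # planning horizon in seconds

    def num_waypoints(self, rate_hz=2.0):
        """Waypoints per trajectory, assuming a fixed output rate."""
        return int(round(self.horizon_s * rate_hz))

NAVSIM = PlannerConfig("ResNet34", 3, (1024, 256), 256, 4.0)
NUSCENES = PlannerConfig("ResNet50", 6, (640, 360), 6, 3.0)
```

Under the 2 Hz assumption, `NAVSIM.num_waypoints()` and `NUSCENES.num_waypoints()` give the per-trajectory output lengths for the 4-second and 3-second horizons.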

Experimental Results

Main Results

Method          NC ↑   DAC ↑  TTC ↑  Comf. ↑  EP ↑   PDMS ↑
DiffusionDrive  98.2   96.2   94.7   100      82.2   88.1
WoTE            98.5   96.8   94.9   99.9     81.9   88.3
Hydra-NeXt      98.1   97.7   94.6   100      81.8   88.6
SeerDrive       98.4   97.0   94.9   99.9     83.2   88.9

SeerDrive achieves the highest PDMS among the listed methods on NAVSIM (88.9), 0.3 points above Hydra-NeXt (88.6), and the highest EP (83.2); Hydra-NeXt remains ahead on DAC (97.7) and WoTE on NC (98.5).

nuScenes Dataset Performance Comparison

Method       L2 (m) ↓ (1s/2s/3s/Avg.)   Col. Rate (%) ↓ (1s/2s/3s/Avg.)
SparseDrive  0.29/0.58/0.96/0.61        0.01/0.05/0.18/0.08
SeerDrive    0.20/0.39/0.69/0.43        0.00/0.05/0.14/0.06

On nuScenes, SeerDrive improves on SparseDrive in both metrics: average L2 error drops from 0.61 m to 0.43 m, and average collision rate drops from 0.08% to 0.06%.
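
The Avg. entries and the relative gains can be checked directly from the per-horizon values in the nuScenes table. The snippet below is a small arithmetic check, assuming Avg. is the plain mean of the 1s, 2s, and 3s values; the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def relative_reduction(baseline, ours):
    """Fractional reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline

# Per-horizon values (1s, 2s, 3s) copied from the table.
l2 = {"SparseDrive": [0.29, 0.58, 0.96], "SeerDrive": [0.20, 0.39, 0.69]}
col = {"SparseDrive": [0.01, 0.05, 0.18], "SeerDrive": [0.00, 0.05, 0.14]}

l2_avg = {name: round(mean(vals), 2) for name, vals in l2.items()}
col_avg = {name: round(mean(vals), 2) for name, vals in col.items()}

l2_gain = relative_reduction(l2_avg["SparseDrive"], l2_avg["SeerDrive"])
col_gain = relative_reduction(col_avg["SparseDrive"], col_avg["SeerDrive"])
```

The recomputed means match the reported Avg. columns, and the reductions come out to roughly 30% for L2 error and 25% for collision rate.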

Ablation Studies

Core Component Analysis

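Future-aware plan   Iter. S&V   PDMS ↑
?                   ?           87.1
?                   ?           87.9
?                   ?           88.1
?                   ?           88.9

(The per-row marks showing which component is enabled are missing from the source.)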

PDMS rises from 87.1 to 88.9 across the four configurations, indicating that both core components contribute to the improvement.

Iteration Count Analysis

Iterations   PDMS ↑
1            88.1
2            88.9
3            88.7

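Two iterations give the highest PDMS (88.9); a third iteration lowers it slightly to 88.7 while adding computation, so two iterations offer the best balance between efficiency and performance.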

Qualitative Results

The paper presents visualizations of right-turn and left-turn scenarios, demonstrating that the model can:

  • Accurately predict future BEV semantic maps
  • Generate planned trajectories highly consistent with ground truth
  • Capture multimodal possible future motions

Related Work

End-to-End Autonomous Driving

  • Early Methods: Direct inference of trajectories or actions from sensor data
  • Unified Frameworks: UniAD unifies perception, prediction, and planning; VAD employs vectorized representations
  • Recent Progress: DiffusionDrive uses truncated diffusion strategies; DriveTransformer explores scaling laws

World Models in Autonomous Driving

  • Video Generation Methods: DriveDreamer, Drive-WM, etc., generate realistic videos
  • BEV Modeling: SLEDGE, GUMP, Scenario Dreamer, etc., model in BEV space
  • Joint Modeling: OccWorld, Drive-OccWorld, etc., jointly generate occupancy and actions

This paper differs from existing methods by coupling world modeling and planning bidirectionally and iteratively, rather than treating them as separate or one-way stages.

Conclusions and Discussion

Main Conclusions

  1. Proposes a novel paradigm for bidirectional modeling of scene evolution and trajectory planning
  2. SeerDrive framework effectively implements future-aware end-to-end driving
  3. Achieves state-of-the-art performance on two benchmark datasets

Limitations

  1. Foundation Model Constraints: The BEV world model employs a specially designed transformer architecture, failing to leverage the generalization capabilities of foundation models
  2. Inference Speed: Swapping in an off-the-shelf foundation model as the world model would, in turn, bring slow inference and make joint optimization difficult
  3. Complex Scene Handling: Failure cases persist in certain complex scenarios, such as incorrect lane selection and driving intent inference errors

Future Directions

  • Develop paradigms with tightly integrated planning and world modeling
  • Explore foundation model applications in end-to-end driving
  • Incorporate high-level driving intent to improve planning accuracy

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to systematically model bidirectional relationships between scene evolution and trajectory planning, breaking through traditional one-shot paradigms
  2. Reasonable Technical Design: Decoupled interaction strategies, iterative optimization, and other design choices effectively address practical challenges
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets with detailed ablation studies
  4. Significant Performance Gains: Demonstrates clear improvements on challenging NAVSIM and nuScenes benchmarks

Weaknesses

  1. Computational Complexity: Iterative modeling increases computational overhead, requiring efficiency considerations for practical deployment
  2. Generalization Capability: Specially designed architecture may limit generalization across different scenarios
  3. Insufficient Failure Analysis: Deeper investigation into the root causes of model failures is needed

Impact

  1. Academic Contribution: Provides new research paradigms and perspectives for the end-to-end autonomous driving field
  2. Practical Value: Demonstrates good performance on benchmarks built from real-world driving logs, suggesting application potential
  3. Reproducibility: Provides detailed implementation details and open-source code, facilitating reproduction and future research

Applicable Scenarios

  • Complex urban driving environments
  • Scenarios requiring multi-agent interaction consideration
  • Autonomous driving systems with high planning accuracy requirements
  • End-to-end learning research in autonomous driving

References

The paper cites 58 relevant references covering key works in end-to-end autonomous driving, world models, and joint modeling, providing a solid theoretical foundation for this research.


Overall Assessment: This is a high-quality autonomous driving research paper that proposes an innovative bidirectional modeling paradigm with well-designed technical solutions and comprehensive experimental evaluation. It achieves significant performance improvements on important benchmarks and opens new research directions for end-to-end autonomous driving, demonstrating both substantial academic value and practical significance.