Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution

Zhang, Song, Li et al.

End-to-end autonomous driving methods aim to directly map raw sensor inputs to future driving actions such as planned trajectories, bypassing traditional modular pipelines. While these approaches have shown promise, they often operate under a one-shot paradigm that relies heavily on the current scene context, potentially underestimating the importance of scene dynamics and their temporal evolution. This limitation restricts the model's ability to make informed and adaptive decisions in complex driving scenarios. We propose a new perspective: the future trajectory of an autonomous vehicle is closely intertwined with the evolving dynamics of its environment, and conversely, the vehicle's own future states can influence how the surrounding scene unfolds. Motivated by this bidirectional relationship, we introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene, then leverages this foresight to generate future-context-aware trajectories. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning, which refines both future scene prediction and trajectory generation through collaborative optimization. Extensive experiments on the NAVSIM and nuScenes benchmarks show that SeerDrive significantly outperforms existing state-of-the-art methods.

Basic Information

Paper ID: 2510.11092
Title: Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution
Authors: Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, Li Zhang
Category: cs.CV
Venue: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper link: https://arxiv.org/abs/2510.11092
Code link: https://github.com/LogosRoboticsGroup/SeerDrive

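Summary

End-to-end autonomous driving methods map raw sensor inputs directly to future driving actions such as planned trajectories, bypassing the traditional modular pipeline. Although promising, they usually follow a one-shot paradigm that leans heavily on the current scene context and may underestimate scene dynamics and their temporal evolution, which limits informed, adaptive decision-making in complex driving scenarios. The paper offers a new perspective: the future trajectory of an autonomous vehicle is closely tied to the evolving dynamics of its environment, and the vehicle's own future states in turn influence how the surrounding scene unfolds. Building on this bidirectional relationship, the authors introduce SeerDrive, an end-to-end framework that jointly models future scene evolution and trajectory planning in a closed-loop manner.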

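Background and Motivation

Problem Definition

Existing end-to-end autonomous driving methods mostly follow a one-shot paradigm: they predict the trajectory for the next few seconds directly from the sensor observations at the current time step. This has three key problems:

Static-scene assumption: the ego vehicle's future motion is inferred almost entirely from the current scene, ignoring how the scene evolves over time
One-way modeling: the effect of the ego vehicle's future behavior on how the surrounding scene unfolds is not considered
No temporal dynamics modeling: in dynamic, interactive driving environments this limits the model's ability to make adaptive decisions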

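Motivation

The authors observe a dependency that runs in both directions:

Future traffic dynamics influence the ego vehicle's motion planning
The ego vehicle's planned behavior in turn shapes the future scene

Based on this insight, they argue that the bidirectional interaction between scene evolution and trajectory planning should be modeled explicitly.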

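Core Contributions

New paradigm: an end-to-end driving paradigm that explicitly captures the bidirectional interaction between scene dynamics and the ego vehicle's future behavior, in contrast to conventional one-shot planning
Unified framework: SeerDrive instantiates this paradigm, jointly modeling future BEV scene representations and vehicle trajectories through future-aware planning and an iterative interaction mechanism
Performance: state-of-the-art results on the NAVSIM and nuScenes benchmarks, supporting the effectiveness of the design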

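Method

Task Definition

The end-to-end autonomous driving task maps sensor inputs (cameras and LiDAR) to the ego vehicle's future trajectory, usually with multi-modal outputs that cover several plausible futures. A world model for autonomous driving predicts how the scene will evolve in the future, given the current observations.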

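Model Architecture

1. Feature Encoding

Given multi-view images I and LiDAR features P, the encoder converts these multi-modal sensor inputs into the current BEV feature map $F^{curr}_{bev} \in \mathbb{R}^{H \times W \times C}$: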

F^curr_bev = TransFuser(I, P)
F^curr_ego = EgoEncoder(T, E)
B^curr = BEVDecoder(F^curr_bev)

Here T denotes the anchored multi-modal trajectories and E the ego state.

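2. Future BEV World Modeling

The BEV world model predicts future BEV representations. It works on a structured BEV representation instead of performing costly image generation: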

F^fut_scene = BEVWorldModel(F^curr_scene)
B^fut = BEVDecoder(F^fut_bev)

Here F_scene is taken to denote the joint scene feature that combines the BEV and ego features, so that F^fut_bev and F^fut_ego used below are the corresponding parts of F^fut_scene.

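3. Future-Aware End-to-End Planning

The planning network reasons jointly over the current scene and its predicted evolution to produce the planned trajectory. It uses a decoupled strategy in which the ego features interact with the current and the future BEV features separately: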

F^curr_ego = TransformerDecoder(F^curr_ego, F^curr_bev)
F^fut_ego = TransformerDecoder(F^fut_ego, F^fut_bev)
Ta = EgoDecoder(F^curr_ego)
Tb = EgoDecoder(F^fut_ego)

The two ego representations are finally fused through Motion-aware Layer Normalization (MLN):

F^curr_ego = MLN(F^curr_ego, F^fut_ego)
T^final = EgoDecoder(F^curr_ego)
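
The fusion step can be illustrated with a minimal sketch of a conditional layer normalization, assuming that MLN normalizes the current ego feature and then applies a per-channel scale and shift predicted from the future ego feature. The notes do not give the exact formulation, so the functions `layer_norm`, `linear`, and `mln_fuse` and all weight arguments are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, x):
    """Plain dense layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mln_fuse(f_curr_ego, f_fut_ego, w_gamma, b_gamma, w_beta, b_beta):
    """Assumed MLN form: scale/shift from the future feature modulate the
    normalized current feature. Not confirmed by the source."""
    normed = layer_norm(f_curr_ego)
    gamma = linear(w_gamma, b_gamma, f_fut_ego)  # per-channel scale
    beta = linear(w_beta, b_beta, f_fut_ego)     # per-channel shift
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]
```

With zero weight matrices, a scale bias of 1 and a shift bias of 0, `mln_fuse` reduces to ordinary layer normalization of the current ego feature, which makes the conditioning role of the future feature easy to see.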

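4. Iterative Scene Modeling and Vehicle Planning

The BEV world modeling network and the end-to-end planning network run iteratively, improving planning quality step by step. After N iterations the model has produced N pairs of predicted future semantic maps and ego trajectories.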
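
The control flow of this iteration can be sketched as follows. This is only an outline under stated assumptions: `world_model`, `planner`, and `bev_decoder` are hypothetical callables standing in for the networks above, and feeding the refined ego feature back into the next round is an assumed way of carrying state between iterations that the notes do not specify.

```python
from typing import Callable

def iterative_plan(f_bev, f_ego,
                   world_model: Callable, planner: Callable,
                   bev_decoder: Callable, num_iters: int = 2):
    """Alternate world modeling and planning for num_iters rounds.

    Returns a list of (future_semantic_map, trajectory) pairs, one per round.
    """
    outputs = []
    for _ in range(num_iters):
        # Predict future BEV and ego features from the current state.
        f_fut_bev, f_fut_ego = world_model(f_bev, f_ego)
        # Plan with both current and future context; keep the refined ego
        # feature for the next round (assumption, see lead-in).
        f_ego, traj = planner(f_ego, f_bev, f_fut_bev, f_fut_ego)
        outputs.append((bev_decoder(f_fut_bev), traj))
    return outputs
```

Each round yields one future map and one trajectory, so a training setup could supervise all N pairs, while inference would typically use the last trajectory.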

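Technical Innovations

Bidirectional modeling: the first work to explicitly model the two-way dependency between scene evolution and trajectory planning in end-to-end driving
Decoupled interaction strategy: avoids the representation entanglement caused by letting current and future BEV features interact directly
Iterative optimization: progressively refines scene prediction and trajectory generation through collaborative optimization
Motion-aware fusion: uses MLN to fuse the current and future ego representations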

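Experimental Setup

Datasets

NAVSIM: built on nuPlan, with 1,192 training/validation scenarios and 136 test scenarios; 8 cameras + LiDAR, 2 Hz
nuScenes: 1,000 scenes; 6 cameras + LiDAR, 2 Hz; standard 700/150 training/validation split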

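Evaluation Metrics

NAVSIM: PDM Score (PDMS), composed of no at-fault collisions (NC), drivable area compliance (DAC), time-to-collision (TTC), comfort (Comf.), and ego progress (EP)
nuScenes: L2 displacement error and collision rate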
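
As a rough illustration of how these metrics are computed, the sketch below follows the NAVSIM v1 definition of PDMS for a single scenario: the product of the two penalty terms (NC, DAC) and a weighted average of EP, TTC, and comfort with weights 5, 5, and 2. The weights come from the NAVSIM benchmark definition, not from these notes, and the function names are illustrative. The L2 helper averages over all supplied waypoints, which is only one of several averaging conventions used on nuScenes.

```python
import math

def pdm_score(nc, dac, ep, ttc, comfort):
    """PDM Score for one scenario; all sub-scores lie in [0, 1]."""
    weighted = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted

def l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth waypoints.

    pred and gt are equal-length sequences of (x, y) positions in meters.
    """
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)
```

Because PDMS is computed per scenario and then averaged, plugging the dataset-level sub-scores from a results table into `pdm_score` will generally not reproduce the reported PDMS column.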

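Implementation Details

NAVSIM: ResNet34 backbone, 3 camera views, 1024×256 resolution, 256 trajectory modes, 4-second planning horizon
nuScenes: ResNet50 backbone, 6 camera views, 640×360 resolution, 6 trajectory modes, 3-second planning horizon
Training: 8 RTX 3090 GPUs, AdamW optimizer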

Experimental Results

Main Results

Performance comparison on NAVSIM

Method	NC ↑	DAC ↑	TTC ↑	Comf. ↑	EP ↑	PDMS ↑
DiffusionDrive	98.2	96.2	94.7	100	82.2	88.1
WoTE	98.5	96.8	94.9	99.9	81.9	88.3
Hydra-NeXt	98.1	97.7	94.6	100	81.8	88.6
SeerDrive	98.4	97.0	94.9	99.9	83.2	88.9