2025-11-11T08:37:09.146501

VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting

Cho, Kang, Lee et al.

End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robustness to varying camera viewpoints, a common real-world challenge due to diverse vehicle configurations, remains an open problem. In this work, we propose VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further improve viewpoint consistency, we introduce a viewpoint-mixed memory bank that facilitates temporal interaction across multiple viewpoints and a viewpoint-consistent distillation strategy that transfers knowledge from original to synthesized views. Trained in a fully end-to-end manner, VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts. In addition, we release a new benchmark dataset to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis. Our results demonstrate that VR-Drive is a scalable and robust solution for the real-world deployment of end-to-end autonomous driving systems.


Basic Information

Paper ID: 2510.23205
Title: VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting
Authors: Hoonhee Cho, Jae-Young Kang, Giwon Lee, Hyemin Yang, Heejun Park, Seokwoo Jung, Kuk-Jin Yoon
Classification: cs.CV
Publication Time/Venue: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2510.23205

Abstract

End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic data-driven framework. However, achieving robustness across different camera viewpoints—a common practical challenge arising from vehicle configuration diversity—remains an open problem. This work proposes VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further enhance viewpoint consistency, a viewpoint-mixed memory bank is introduced to facilitate temporal interactions across multiple viewpoints, along with a viewpoint-consistent distillation strategy that transfers knowledge from original views to synthesized views. Through fully end-to-end training, VR-Drive effectively mitigates synthesis-induced noise and improves planning performance under viewpoint variations. Additionally, a new benchmark dataset is released to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis.

Research Background and Motivation

Problem Definition

Existing end-to-end autonomous driving systems face a critical challenge: performance degradation caused by camera viewpoint variations. In practical deployment, camera configurations differ significantly across different vehicle types and manufacturers, including variations in mounting height, angle, and position parameters.

Problem Significance

Practical Requirements: Autonomous driving systems must adapt to various vehicle models without requiring retraining for each configuration
Cost Considerations: Collecting annotated data for each camera configuration is prohibitively expensive and impractical
Safety Requirements: Viewpoint changes may lead to perception failures; as shown in Figure 1, existing methods fail to detect vehicles ahead when camera height is lowered

Limitations of Existing Approaches

Data Dependency: Requires collecting large amounts of annotated data for each camera configuration
Scene-Specific: Existing novel view synthesis methods are typically optimized for specific scenes with high computational overhead
Poor Generalization: Performance significantly degrades on out-of-distribution (OOD) data

Research Motivation

The goal is an end-to-end autonomous driving framework that is trained on a single camera configuration yet remains robust to unseen camera viewpoints at test time.

Core Contributions

First Systematic Study: First systematic investigation of camera viewpoint robustness in end-to-end autonomous driving
Unified Framework: Proposes VR-Drive, which jointly learns 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis
Technical Innovations:
- Viewpoint-Mixed Memory Bank enabling cross-viewpoint feature interaction
- Viewpoint-Consistent Distillation strategy for knowledge transfer
Benchmark Contribution: Constructs a new evaluation benchmark supporting E2E-AD performance assessment under novel camera viewpoints

Method Details

Task Definition

Input: Multi-view camera image sequences
Output: Ego-vehicle motion planning trajectory
Constraint: Training uses only original viewpoint data; testing requires robustness to unseen viewpoints

Model Architecture

VR-Drive comprises three main components:

1. Original-view Learning

Extracts multi-view feature maps using ResNet-50: $I \in \mathbb{R}^{N \times C \times H \times W}$
Performs scene reconstruction based on feed-forward 3D Gaussian Splatting (3DGS)
Gaussian primitive definition: $g = (\mu, \Sigma, \alpha, c)$, comprising position, covariance, opacity, and color
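
To make the primitive definition concrete, the following is a minimal sketch of a container for one Gaussian $g = (\mu, \Sigma, \alpha, c)$. The class name, the RGB color representation, and the `isotropic` helper are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class GaussianPrimitive:
    """One 3D Gaussian g = (mu, Sigma, alpha, c). Hypothetical container."""

    mu: Vec3      # 3D position (mean)
    sigma: Mat3   # 3x3 covariance matrix
    alpha: float  # opacity, assumed to lie in [0, 1]
    color: Vec3   # color, assumed here to be plain RGB in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("opacity must lie in [0, 1]")
        # A valid covariance matrix is symmetric.
        for r in range(3):
            for c in range(3):
                if abs(self.sigma[r][c] - self.sigma[c][r]) > 1e-9:
                    raise ValueError("covariance must be symmetric")


def isotropic(mu: Vec3, scale: float, alpha: float, color: Vec3) -> GaussianPrimitive:
    """Hypothetical helper: build a Gaussian with covariance scale^2 * I."""
    s2 = scale * scale
    sigma = ((s2, 0.0, 0.0), (0.0, s2, 0.0), (0.0, 0.0, s2))
    return GaussianPrimitive(mu, sigma, alpha, color)
```

A feed-forward model would predict many such primitives per frame in one pass, rather than optimizing them per scene.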

2. Novel-view Learning

Randomly samples camera extrinsics to generate novel viewpoints
Extracts novel-view features using shared encoder: $\tilde{I} \in \mathbb{R}^{N \times C \times H \times W}$
Employs cyclic reconstruction loss to train the model to regenerate original viewpoints
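
The random sampling of camera extrinsics can be sketched as below. This is an illustration under stated assumptions, not the paper's sampler: the ego frame is assumed to be x forward, y left, z up; pitch is modeled as a rotation about the y axis (the sign convention is arbitrary); and the default ranges simply reuse the test-time perturbation magnitudes listed later in this summary, since the training ranges are not given here.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def sample_novel_extrinsic(
    cam_to_ego: Matrix,
    rng: random.Random,
    pitch_deg: Tuple[float, float] = (-10.0, 5.0),  # assumed range
    height_m: Tuple[float, float] = (-0.7, 1.0),    # assumed range
    depth_m: Tuple[float, float] = (0.0, 1.0),      # assumed range
) -> Matrix:
    """Return a perturbed copy of a 4x4 camera-to-ego transform."""
    pitch = math.radians(rng.uniform(*pitch_deg))
    dz = rng.uniform(*height_m)  # height offset along z
    dx = rng.uniform(*depth_m)   # forward ("depth") offset along x

    c, s = math.cos(pitch), math.sin(pitch)
    rot_y = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

    out = [row[:] for row in cam_to_ego]  # copy; the input is left untouched
    # Rotate the camera orientation in place (its position is not rotated).
    for i in range(3):
        for j in range(3):
            out[i][j] = sum(rot_y[i][k] * cam_to_ego[k][j] for k in range(3))
    # Shift the camera position.
    out[0][3] += dx
    out[2][3] += dz
    return out
```

Passing a seeded `random.Random` keeps the augmentation reproducible across runs.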

3. Perception-planning Learning

Randomly selects original or novel viewpoint as input during training
Integrates 3D object detection and mapping tasks
Adopts sparse architecture for improved efficiency

Key Technical Components

Viewpoint-Mixed Memory Bank

$\tilde{F} = \mathrm{CrossAttention}(Q = F,\ K = F',\ V = F')$

Stores and updates instance features from different viewpoints
Fuses current viewpoint and memory bank features through cross-attention mechanism
Employs FIFO strategy for updating high-confidence instances
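
The three properties above (storage across viewpoints, cross-attention fusion, FIFO updates for high-confidence instances) can be illustrated with a small standard-library sketch. It is a simplification under assumptions: the class name, the confidence threshold, and the capacity are hypothetical, and the attention is single-head scaled dot-product without the learned projections a real model would use.

```python
import math
from collections import deque
from typing import Deque, List, Sequence

Vector = List[float]


class ViewpointMixedMemoryBank:
    """Hypothetical FIFO bank of instance features from any viewpoint."""

    def __init__(self, capacity: int, min_confidence: float) -> None:
        # A bounded deque discards the oldest entry on overflow (FIFO).
        self._bank: Deque[Vector] = deque(maxlen=capacity)
        self._min_confidence = min_confidence

    def __len__(self) -> int:
        return len(self._bank)

    def snapshot(self) -> List[Vector]:
        return [list(v) for v in self._bank]

    def update(self, features: Sequence[Vector], confidences: Sequence[float]) -> None:
        """Store only instances whose confidence reaches the threshold."""
        for feat, conf in zip(features, confidences):
            if conf >= self._min_confidence:
                self._bank.append(list(feat))

    def attend(self, queries: Sequence[Vector]) -> List[Vector]:
        """Cross-attention with Query = current features, Key = Value = bank."""
        if not self._bank:
            return [list(q) for q in queries]  # nothing stored yet
        fused = []
        for q in queries:
            scale = math.sqrt(len(q))
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in self._bank]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            norm = sum(weights)
            fused.append(
                [sum(w * k[d] for w, k in zip(weights, self._bank)) / norm
                 for d in range(len(q))]
            )
        return fused
```

Because features from original and synthesized views go into the same bank, a query from one viewpoint can attend to instances observed from another.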

Viewpoint-Consistent Distillation

Core idea: Use reliable features from original viewpoints to guide novel viewpoint feature learning

Keypoint Sampling:
```
p*_{i,j} = p_{i,j} + position(B_i)
```
Feature Aggregation:
```
S_i = Σ_n Σ_j w_{n,i,j} · f_{n,i,j}
```

Distillation Loss:

$\mathcal{L}_{\text{distill}} = \frac{1}{|I^*|} \sum_{i \in I^*} \left\| \tilde{S}_i - \mathrm{stopgrad}(S_i) \right\|_2^2$
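
As a worked illustration of the aggregation and loss formulas, the sketch below computes $S_i$ as a weighted sum of sampled features and then the mean squared L2 distance between student (novel-view) and teacher (original-view) aggregates. The function names are hypothetical. Plain Python has no autodiff, so the stop-gradient appears only as a comment: in a real framework the teacher term would be detached.

```python
from typing import List, Sequence

Vector = List[float]


def aggregate(weights: Sequence[float], feats: Sequence[Vector]) -> Vector:
    """S_i = sum of w * f over all sampled (view, keypoint) pairs of one instance."""
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]


def distillation_loss(student: Sequence[Vector], teacher: Sequence[Vector]) -> float:
    """Mean over matched instances of the squared L2 distance.

    The teacher (original-view) features are treated as constants here,
    which is the role of stopgrad in the formula.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must contain the same instances")
    if not student:
        return 0.0
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student)
```

For example, with two instances whose squared distances are 1 and 4, the loss is their mean, 2.5.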

Loss Function

Total loss comprises multiple components:

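$\mathcal{L} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{map}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{motion}} + \mathcal{L}_{\text{plan}} + \mathcal{L}_{\text{render}}$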

Where rendering loss includes:

Original Reconstruction Loss: Reconstructs adjacent timestep views
Cyclic Reconstruction Loss: Reconstructs original viewpoints from novel viewpoints
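
A minimal sketch of how the terms combine is given below. The per-term weights are an assumption (all default to 1.0, matching the unweighted sum as written); the summary does not state the actual weighting, and the function names are hypothetical.

```python
from typing import Dict, Optional

LOSS_TERMS = ("det", "map", "depth", "motion", "plan", "render")


def render_loss(original_recon: float, cyclic_recon: float) -> float:
    """Rendering loss as the sum of its two reconstruction parts."""
    return original_recon + cyclic_recon


def total_loss(terms: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of all loss terms; unspecified weights default to 1.0."""
    missing = [name for name in LOSS_TERMS if name not in terms]
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    weights = weights or {}
    return sum(weights.get(name, 1.0) * terms[name] for name in LOSS_TERMS)
```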

Experimental Setup

Datasets

nuScenes: Widely-used autonomous driving benchmark dataset
CARLA: Simulation environment for closed-loop evaluation
New Benchmark: Viewpoint variation evaluation set constructed from nuScenes, containing 146 test sequences

Viewpoint Variation Configurations

Camera parameter changes introduced during testing:

Pitch angle: +5°, -10°
Height: +1.0m, -0.7m
Depth: +1.0m

Evaluation Metrics

L2 Distance: Average Displacement Error (ADE) over 1s/2s/3s time horizons
Collision Rate: Percentage of planning trajectories with collisions
Driving Score (DS) and Route Completion (RC): CARLA closed-loop evaluation metrics
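
The open-loop metrics can be sketched as follows. This assumes waypoints sampled at 2 Hz (six waypoints over 3 s) and that the error at each horizon is averaged over all waypoints up to that horizon; other papers report the error at the horizon itself, so the convention should be checked against the evaluation code. Function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def l2_by_horizon(pred: Sequence[Point], gt: Sequence[Point],
                  hz: int = 2,
                  horizons_s: Sequence[int] = (1, 2, 3)) -> Dict[int, float]:
    """Average displacement error (m) over the first h seconds, per horizon."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    result = {}
    for h in horizons_s:
        n = h * hz  # number of waypoints inside the horizon
        if len(errors) < n:
            raise ValueError(f"need {n} waypoints for the {h}s horizon")
        result[h] = sum(errors[:n]) / n
    return result


def collision_rate(collided: Sequence[bool]) -> float:
    """Percentage of planned trajectories flagged as colliding."""
    return 100.0 * sum(collided) / len(collided)
```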

Comparison Methods

AD-MLP
BEV-Planner
VAD
SparseDrive
DiffusionDrive

Experimental Results

Main Results

Open-loop planning performance comparison on nuScenes dataset:

| Camera Setting | Method | L2 Distance (m) ↓ | Collision Rate (%) ↓ |
| --- | --- | --- | --- |
| Original | DiffusionDrive | 0.57 | 0.08 |
| Original | VR-Drive | 0.60 | 0.06 |
| Pitch -10° | DiffusionDrive | 0.96 | 0.24 |
| Pitch -10° | VR-Drive | 0.70 | 0.11 |
| Height +1.0m | DiffusionDrive | 1.46 | 0.81 |
| Height +1.0m | VR-Drive | 0.69 | 0.11 |

Key Findings:

VR-Drive maintains competitive performance on original viewpoints
Significantly outperforms existing methods on novel viewpoints, reducing average L2 distance from 1.17m to 0.68m
Collision rate reduced from 0.41% to 0.11%
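
Expressed as relative reductions, the two novel-view figures above work out as follows. This is only arithmetic on the reported averages, with a hypothetical helper name.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


l2_drop = relative_reduction(1.17, 0.68)  # average L2 distance (m)
cr_drop = relative_reduction(0.41, 0.11)  # collision rate (%)
print(f"L2: {l2_drop:.1f}% lower, collision rate: {cr_drop:.1f}% lower")
# prints "L2: 41.9% lower, collision rate: 73.2% lower"
```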

Ablation Study

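| Component | Original L2 (m) ↓ | Novel L2 (m) ↓ | Original CR (%) ↓ | Novel CR (%) ↓ |
| --- | --- | --- | --- | --- |
| Baseline | 0.63 | 0.91 | 0.14 | 0.30 |
| +Scene Reconstruction | 0.59 | 0.90 | 0.07 | 0.26 |
| +Memory Bank | 0.62 | 0.73 | 0.09 | 0.17 |
| +Cyclic Reconstruction | 0.59 | 0.68 | 0.09 | 0.16 |
| +Distillation | 0.61 | 0.73 | 0.08 | 0.14 |
| Complete Model | 0.60 | 0.68 | 0.06 | 0.11 |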

Important Findings:

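Adding scene reconstruction alone mainly helps the original viewpoint (L2 0.63 → 0.59, collision rate 0.14 → 0.07) while leaving novel-view L2 almost unchanged (0.91 → 0.90)
Components work synergistically; the complete model achieves the lowest collision rates (0.06 original, 0.11 novel) and ties for the lowest novel-view L2 (0.68)
Novel-view robustness does not come at the cost of original-view performance: relative to the baseline, the complete model improves both original L2 (0.63 → 0.60) and novel L2 (0.91 → 0.68)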

CARLA Closed-loop Evaluation

Results on Town05-Nov benchmark:

| Method | Original DS ↑ | Novel Avg DS ↑ | Original RC ↑ | Novel Avg RC ↑ |
| --- | --- | --- | --- | --- |
| BEV-Planner | 17.25 | 7.80 | 28.70 | 28.86 |
| Baseline | 76.47 | 48.25 | 99.20 | 94.87 |
| VR-Drive | 84.04 | 88.25 | 99.04 | 98.28 |

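VR-Drive stays robust in closed-loop testing: under novel viewpoints the baseline's driving score falls from 76.47 to 48.25, while VR-Drive scores 88.25 (84.04 on the original viewpoint) and keeps route completion above 98.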

Related Work

End-to-End Autonomous Driving

Existing research primarily follows two directions:

Architecture and Task Exploration: Optimizing sub-modules to improve planning performance
High-level Information Distillation: Leveraging expert knowledge from rules or reinforcement learning

Viewpoint-Robust Representations and Scene Reconstruction

Early Research: Demonstrates neural network vulnerability to viewpoint changes
Novel View Synthesis: NeRF and 3DGS-based methods, though mostly scene-specific optimization
Feed-forward Methods: Generalizable approaches supporting real-time inference

This paper is the first to systematically study viewpoint robustness in E2E-AD.

Conclusions and Discussion

Main Conclusions

VR-Drive successfully addresses the viewpoint robustness problem in E2E-AD
Joint learning of 3D reconstruction as an auxiliary task significantly enhances system robustness
Proposed technical components effectively mitigate synthesis noise and improve planning performance

Limitations

Camera Calibration Dependency: Performance is affected by camera calibration accuracy
Computational Overhead: 3D reconstruction introduces additional computational cost
Evaluation Scope: Currently validated only within limited viewpoint variation ranges

Future Directions

Improve robustness to camera calibration errors
Optimize computational efficiency for real-time deployment
Extend to larger ranges of viewpoint variations and sensor configurations

In-Depth Evaluation

Strengths

Problem Significance: Addresses a critical challenge in practical deployment
Method Innovation: Cleverly combines 3D reconstruction with E2E-AD with well-designed technical components
Comprehensive Experiments: Includes both open-loop and closed-loop evaluation with detailed ablation studies
Benchmark Contribution: Provides new evaluation standards for the field

Weaknesses

Calibration Assumption: Assumes perfect camera calibration; real-world applications may have errors
Viewpoint Range: Tested viewpoint variations are relatively limited
Computational Analysis: Lacks detailed computational overhead analysis

Impact

Academic Value: Pioneering study of viewpoint robustness in E2E-AD
Practical Value: Directly addresses real-world problems in industrial deployment
Reproducibility: Detailed method description facilitates follow-up research

Applicable Scenarios

Multi-vehicle Deployment: Scenarios requiring rapid adaptation across different vehicle configurations
Sensor Upgrades: System migration when vehicle sensor configurations change
Cross-domain Applications: Adapting to vehicle standard differences across regions or countries

References

The paper cites 75 relevant references covering multiple domains including end-to-end autonomous driving, 3D reconstruction, and novel view synthesis, providing a solid theoretical foundation for this research.

Overall Assessment: This is a high-quality research paper that systematically addresses viewpoint robustness in end-to-end autonomous driving for the first time. The method design is sound, experiments are comprehensive, and the work has significant value for advancing practical applications of autonomous driving technology.