2025-11-11T08:37:09.146501

VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting

Cho, Kang, Lee et al.
End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robustness to varying camera viewpoints, a common real-world challenge due to diverse vehicle configurations, remains an open problem. In this work, we propose VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further improve viewpoint consistency, we introduce a viewpoint-mixed memory bank that facilitates temporal interaction across multiple viewpoints and a viewpoint-consistent distillation strategy that transfers knowledge from original to synthesized views. Trained in a fully end-to-end manner, VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts. In addition, we release a new benchmark dataset to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis. Our results demonstrate that VR-Drive is a scalable and robust solution for the real-world deployment of end-to-end autonomous driving systems.

Basic Information

  • Paper ID: 2510.23205
  • Title: VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting
  • Authors: Hoonhee Cho, Jae-Young Kang, Giwon Lee, Hyemin Yang, Heejun Park, Seokwoo Jung, Kuk-Jin Yoon
  • Classification: cs.CV
  • Publication Time/Venue: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.23205

Abstract

End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic data-driven framework. However, achieving robustness across different camera viewpoints—a common practical challenge arising from vehicle configuration diversity—remains an open problem. This work proposes VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further enhance viewpoint consistency, a viewpoint-mixed memory bank is introduced to facilitate temporal interactions across multiple viewpoints, along with a viewpoint-consistent distillation strategy that transfers knowledge from original views to synthesized views. Through fully end-to-end training, VR-Drive effectively mitigates synthesis-induced noise and improves planning performance under viewpoint variations. Additionally, a new benchmark dataset is released to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis.

Research Background and Motivation

Problem Definition

Existing end-to-end autonomous driving systems face a critical challenge: performance degradation caused by camera viewpoint variations. In practical deployment, camera configurations differ significantly across different vehicle types and manufacturers, including variations in mounting height, angle, and position parameters.

Problem Significance

  1. Practical Requirements: Autonomous driving systems must adapt to various vehicle models without requiring retraining for each configuration
  2. Cost Considerations: Collecting annotated data for each camera configuration is prohibitively expensive and impractical
  3. Safety Requirements: Viewpoint changes may lead to perception failures; as shown in Figure 1, existing methods fail to detect vehicles ahead when camera height is lowered

Limitations of Existing Approaches

  1. Data Dependency: Requires collecting large amounts of annotated data for each camera configuration
  2. Scene-Specific: Existing novel view synthesis methods are typically optimized for specific scenes with high computational overhead
  3. Poor Generalization: Performance significantly degrades on out-of-distribution (OOD) data

Research Motivation

The goal is an end-to-end autonomous driving framework that is trained with only a single camera configuration yet remains robust to a range of unseen camera viewpoints at test time.

Core Contributions

  1. First Systematic Study: First systematic investigation of camera viewpoint robustness in end-to-end autonomous driving
  2. Unified Framework: Proposes VR-Drive, which jointly learns 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis
  3. Technical Innovations:
    • Viewpoint-Mixed Memory Bank enabling cross-viewpoint feature interaction
    • Viewpoint-Consistent Distillation strategy for knowledge transfer
  4. Benchmark Contribution: Constructs a new evaluation benchmark supporting E2E-AD performance assessment under novel camera viewpoints

Method Details

Task Definition

  • Input: Multi-view camera image sequences
  • Output: Ego-vehicle motion planning trajectory
  • Constraint: Training uses only original viewpoint data; testing requires robustness to unseen viewpoints

Model Architecture

VR-Drive comprises three main components:

1. Original-view Learning

  • Extracts multi-view feature maps using ResNet50: $I \in \mathbb{R}^{N \times C \times H \times W}$
  • Performs scene reconstruction based on feed-forward 3D Gaussian Splatting (3DGS)
  • Gaussian primitive definition: $g = (\mu, \Sigma, \alpha, c)$, including position, covariance, opacity, and color
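To make the primitive g = (μ, Σ, α, c) concrete, here is a minimal sketch of one Gaussian as a plain Python record. The covariance is built as Σ = R S Sᵀ Rᵀ from a scale vector and a unit quaternion, which is the usual 3DGS parameterization; this summary does not say whether VR-Drive predicts Σ this way, so treat the helper names (`quat_to_rotmat`, `covariance_from_scale_rotation`, `Gaussian`) and the parameterization as assumptions.

```python
import math
from dataclasses import dataclass


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize to a unit quaternion
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance_from_scale_rotation(scale, quat):
    """Sigma = R S S^T R^T (assumed parameterization); symmetric PSD by construction."""
    R = quat_to_rotmat(quat)
    # M = R @ diag(scale)
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


@dataclass
class Gaussian:
    mu: tuple      # 3D position
    sigma: list    # 3x3 covariance
    alpha: float   # opacity in [0, 1]
    color: tuple   # RGB color


g = Gaussian(
    mu=(0.0, 0.0, 0.0),
    sigma=covariance_from_scale_rotation((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0)),
    alpha=0.5,
    color=(1.0, 0.0, 0.0),
)
```

With the identity rotation, the covariance reduces to a diagonal matrix holding the squared scales.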

2. Novel-view Learning

  • Randomly samples camera extrinsics to generate novel viewpoints
  • Extracts novel-view features using shared encoder: $\tilde{I} \in \mathbb{R}^{N \times C \times H \times W}$
  • Employs cyclic reconstruction loss to train the model to regenerate original viewpoints
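The random extrinsic sampling in the first step could look roughly like the sketch below. The sampling ranges are not given in this summary; the defaults here simply reuse the test-time perturbation bounds (pitch, height, depth) as placeholders. The axis convention (x forward, y left, z up, pitch about the y-axis, depth along x) and the function name `sample_novel_extrinsic` are likewise assumptions.

```python
import math
import random


def sample_novel_extrinsic(cam_to_ego, rng,
                           pitch_range_deg=(-10.0, 5.0),
                           height_range_m=(-0.7, 1.0),
                           depth_range_m=(0.0, 1.0)):
    """Perturb a 4x4 camera-to-ego matrix with a random pitch, height, and depth shift.

    All ranges are illustrative placeholders, not values from the paper.
    """
    pitch = rng.uniform(*pitch_range_deg)
    height = rng.uniform(*height_range_m)
    depth = rng.uniform(*depth_range_m)

    a = math.radians(pitch)
    c, s = math.cos(a), math.sin(a)
    # Rotation about the ego y-axis (assumed pitch axis; sign convention assumed).
    r_pitch = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

    new = [[0.0] * 4 for _ in range(4)]
    # Rotate the camera orientation in place: R_new = R_pitch @ R_cam.
    for i in range(3):
        for j in range(3):
            new[i][j] = sum(r_pitch[i][k] * cam_to_ego[k][j] for k in range(3))
    # Shift the camera position: depth along x (forward), height along z (up).
    new[0][3] = cam_to_ego[0][3] + depth
    new[1][3] = cam_to_ego[1][3]
    new[2][3] = cam_to_ego[2][3] + height
    new[3][3] = 1.0
    return new, {"pitch_deg": pitch, "height_m": height, "depth_m": depth}


rng = random.Random(0)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
novel_extrinsic, params = sample_novel_extrinsic(identity, rng)
```

The sampled matrix would then drive the feed-forward renderer to produce the novel-view images that the shared encoder consumes.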

3. Perception-planning Learning

  • Randomly selects original or novel viewpoint as input during training
  • Integrates 3D object detection and mapping tasks
  • Adopts sparse architecture for improved efficiency

Key Technical Components

Viewpoint-Mixed Memory Bank

F̃ = Cross-Attention(Query = F, Key = F', Value = F')
  • Stores and updates instance features from different viewpoints
  • Fuses current viewpoint and memory bank features through cross-attention mechanism
  • Employs FIFO strategy for updating high-confidence instances
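The three bullets above can be sketched as a small FIFO buffer plus a single-head attention step. This is a simplified illustration only: it omits learned query/key/value projections, multi-head splitting, and residual connections, and the class name `ViewpointMixedMemoryBank`, the `capacity`, and the `conf_threshold` value are hypothetical choices rather than details from the paper.

```python
import math
from collections import deque


def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out


class ViewpointMixedMemoryBank:
    """FIFO store of instance features, regardless of which viewpoint produced them."""

    def __init__(self, capacity=4, conf_threshold=0.5):
        self.bank = deque(maxlen=capacity)  # oldest entries are evicted first
        self.conf_threshold = conf_threshold

    def update(self, features, confidences):
        # Keep only high-confidence instances.
        for f, c in zip(features, confidences):
            if c >= self.conf_threshold:
                self.bank.append(f)

    def fuse(self, features):
        # F_tilde = CrossAttention(Query=F, Key=F', Value=F'), with F' the memory.
        if not self.bank:
            return [list(f) for f in features]
        memory = list(self.bank)
        return cross_attention(features, memory, memory)


bank = ViewpointMixedMemoryBank(capacity=2, conf_threshold=0.5)
bank.update([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]], [0.9, 0.8, 0.7, 0.1])
fused = bank.fuse([[1.0, 1.0]])
```

Because the buffer holds features from both original and synthesized views, each query attends over a mixture of viewpoints, which is the interaction the memory bank is meant to provide.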

Viewpoint-Consistent Distillation

Core idea: Use reliable features from original viewpoints to guide novel viewpoint feature learning

  1. Keypoint Sampling:
    p*_{i,j} = p_{i,j} + position(B_i)
    
  2. Feature Aggregation:
    S_i = Σ_n Σ_j w_{n,i,j} · f_{n,i,j}
    
  3. Distillation Loss:
    L_distill = 1/|I*| Σ_{i∈I*} ||S̃_i - stopgrad(S_i)||²_2
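A numeric sketch of steps 2 and 3 follows, assuming the keypoint features f and weights w have already been sampled. The instance set I* is passed in as already-paired lists, since its selection rule is not spelled out here. Plain Python has no autograd, so stopgrad is only noted in a comment; in a real framework the original-view features would be detached so that gradients flow only into the novel-view branch.

```python
def aggregate_instance_feature(weights, feats):
    """S_i = sum_n sum_j w[n][j] * f[n][j] for one instance i.

    weights[n][j] is a scalar; feats[n][j] is a feature vector.
    """
    dim = len(feats[0][0])
    s = [0.0] * dim
    for w_n, f_n in zip(weights, feats):
        for w, f in zip(w_n, f_n):
            for c in range(dim):
                s[c] += w * f[c]
    return s


def distillation_loss(novel_feats, original_feats):
    """Mean over paired instances of the squared L2 distance.

    original_feats plays the role of stopgrad(S_i): it is treated as a constant target.
    """
    total = 0.0
    for s_tilde, s in zip(novel_feats, original_feats):
        total += sum((a - b) ** 2 for a, b in zip(s_tilde, s))
    return total / len(novel_feats)


# One view (n = 1) with two keypoints (j = 2), two feature channels.
s_i = aggregate_instance_feature([[0.5, 0.5]], [[[2.0, 0.0], [0.0, 2.0]]])
loss = distillation_loss([[1.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [3.0, 4.0]])
```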
    

Loss Function

Total loss comprises multiple components:

L = L_det + L_map + L_depth + L_motion + L_plan + L_render

Where rendering loss includes:

  • Original Reconstruction Loss: Reconstructs adjacent timestep views
  • Cyclic Reconstruction Loss: Reconstructs original viewpoints from novel viewpoints
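As a small illustration of how the terms combine, the sketch below sums the named losses with unit weights, matching the unweighted formula above. Any per-term weights and the equal weighting of the two rendering terms are assumptions, and the numeric values are arbitrary.

```python
def render_loss(original_recon, cyclic_recon):
    """L_render as the sum of the two reconstruction terms (equal weights assumed)."""
    return original_recon + cyclic_recon


def total_loss(terms, weights=None):
    """Weighted sum of named loss terms; every weight defaults to 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


terms = {
    "det": 1.0,
    "map": 0.5,
    "depth": 0.25,
    "motion": 0.5,
    "plan": 1.0,
    "render": render_loss(0.25, 0.25),
}
loss_value = total_loss(terms)
```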

Experimental Setup

Datasets

  1. nuScenes: Widely-used autonomous driving benchmark dataset
  2. CARLA: Simulation environment for closed-loop evaluation
  3. New Benchmark: Viewpoint variation evaluation set constructed from nuScenes, containing 146 test sequences

Viewpoint Variation Configurations

Camera parameter changes introduced during testing:

  • Pitch angle: +5°, -10°
  • Height: +1.0m, -0.7m
  • Depth: +1.0m

Evaluation Metrics

  • L2 Distance: Average Displacement Error (ADE) over 1s/2s/3s time horizons
  • Collision Rate: Percentage of planning trajectories with collisions
  • Driving Score (DS) and Route Completion (RC): CARLA closed-loop evaluation metrics
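The two open-loop metrics can be sketched as follows. The code assumes waypoints at 2 Hz (the nuScenes keyframe rate), so the 1s/2s/3s horizons cover the first 2/4/6 waypoints, and it averages the error over all waypoints up to each horizon. Some papers instead report the error at the horizon itself, so check which convention a given table uses. The per-sample collision flags are assumed to come from an external check.

```python
import math


def l2_at_horizons(pred, gt, hz=2.0, horizons=(1.0, 2.0, 3.0)):
    """Average displacement error (m) over all waypoints up to each horizon."""
    errs = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred, gt)]
    out = {}
    for h in horizons:
        n = int(round(h * hz))  # number of waypoints within the horizon
        out[h] = sum(errs[:n]) / n
    return out


def collision_rate(collided_flags):
    """Percentage of planned trajectories flagged as colliding."""
    return 100.0 * sum(collided_flags) / len(collided_flags)


gt = [(float(i), 0.0) for i in range(1, 7)]
offsets = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
pred = [(x + dx, y) for (x, y), dx in zip(gt, offsets)]
l2 = l2_at_horizons(pred, gt)
cr = collision_rate([True, False, False, False])
```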

Comparison Methods

  • AD-MLP
  • BEV-Planner
  • VAD
  • SparseDrive
  • DiffusionDrive

Experimental Results

Main Results

Open-loop planning performance comparison on nuScenes dataset:

Camera Setting | Method | L2 Distance (m) ↓ | Collision Rate (%) ↓
Original | DiffusionDrive | 0.57 | 0.08
Original | VR-Drive | 0.60 | 0.06
Pitch -10° | DiffusionDrive | 0.96 | 0.24
Pitch -10° | VR-Drive | 0.70 | 0.11
Height +1.0m | DiffusionDrive | 1.46 | 0.81
Height +1.0m | VR-Drive | 0.69 | 0.11

Key Findings:

  • VR-Drive maintains competitive performance on original viewpoints
  • Significantly outperforms existing methods on novel viewpoints, reducing average L2 distance from 1.17m to 0.68m
  • Collision rate reduced from 0.41% to 0.11%
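For scale, the averages quoted above correspond to the following relative reductions (simple arithmetic on the reported numbers):

```python
def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


l2_reduction = relative_reduction(1.17, 0.68)   # average L2 on novel viewpoints
cr_reduction = relative_reduction(0.41, 0.11)   # average collision rate
print(f"L2: {l2_reduction:.1f}%  collision rate: {cr_reduction:.1f}%")
# prints "L2: 41.9%  collision rate: 73.2%"
```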

Ablation Study

Component | Original L2 (m) ↓ | Novel L2 (m) ↓ | Original CR (%) ↓ | Novel CR (%) ↓
Baseline | 0.63 | 0.91 | 0.14 | 0.30
+Scene Reconstruction | 0.59 | 0.90 | 0.07 | 0.26
+Memory Bank | 0.62 | 0.73 | 0.09 | 0.17
+Cyclic Reconstruction | 0.59 | 0.68 | 0.09 | 0.16
+Distillation | 0.61 | 0.73 | 0.08 | 0.14
Complete Model | 0.60 | 0.68 | 0.06 | 0.11

Important Findings:

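  1. Adding scene reconstruction alone improves original-viewpoint performance (L2 0.63 → 0.59, collision rate 0.14 → 0.07) but barely changes novel-viewpoint L2 (0.91 → 0.90)
  2. Components work synergistically; the complete model achieves the lowest collision rates (0.06 original, 0.11 novel) and matches the best novel-viewpoint L2 (0.68)
  3. No meaningful trade-off appears between original-viewpoint performance and novel-viewpoint robustness: the complete model's original L2 (0.60) stays within 0.01 of the best variant (0.59)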

CARLA Closed-loop Evaluation

Results on Town05-Nov benchmark:

Method | Original DS | Novel Avg DS | Original RC | Novel Avg RC
BEV-Planner | 17.25 | 7.80 | 28.70 | 28.86
Baseline | 76.47 | 48.25 | 99.20 | 94.87
VR-Drive | 84.04 | 88.25 | 99.04 | 98.28

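In closed-loop testing, VR-Drive's Driving Score does not degrade under novel viewpoints (84.04 original vs. 88.25 novel average), whereas the baseline drops from 76.47 to 48.25; Route Completion stays above 98 for VR-Drive in both settings.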

Related Work

End-to-End Autonomous Driving

Existing research primarily follows two directions:

  1. Architecture and Task Exploration: Optimizing sub-modules to improve planning performance
  2. High-level Information Distillation: Leveraging expert knowledge from rules or reinforcement learning

Viewpoint-Robust Representations and Scene Reconstruction

  1. Early Research: Demonstrates neural network vulnerability to viewpoint changes
  2. Novel View Synthesis: NeRF and 3DGS-based methods, though mostly scene-specific optimization
  3. Feed-forward Methods: Generalizable approaches supporting real-time inference

This paper is the first to systematically study viewpoint robustness in E2E-AD.

Conclusions and Discussion

Main Conclusions

  1. VR-Drive successfully addresses the viewpoint robustness problem in E2E-AD
  2. Joint learning of 3D reconstruction as an auxiliary task significantly enhances system robustness
  3. Proposed technical components effectively mitigate synthesis noise and improve planning performance

Limitations

  1. Camera Calibration Dependency: Performance is affected by camera calibration accuracy
  2. Computational Overhead: 3D reconstruction introduces additional computational cost
  3. Evaluation Scope: Currently validated only within limited viewpoint variation ranges

Future Directions

  1. Improve robustness to camera calibration errors
  2. Optimize computational efficiency for real-time deployment
  3. Extend to larger ranges of viewpoint variations and sensor configurations

In-Depth Evaluation

Strengths

  1. Problem Significance: Addresses a critical challenge in practical deployment
  2. Method Innovation: Combines 3D reconstruction with E2E-AD through well-designed technical components
  3. Comprehensive Experiments: Includes both open-loop and closed-loop evaluation with detailed ablation studies
  4. Benchmark Contribution: Provides new evaluation standards for the field

Weaknesses

  1. Calibration Assumption: Assumes perfect camera calibration; real-world applications may have errors
  2. Viewpoint Range: Tested viewpoint variations are relatively limited
  3. Computational Analysis: Lacks detailed computational overhead analysis

Impact

  1. Academic Value: Pioneering study of viewpoint robustness in E2E-AD
  2. Practical Value: Directly addresses real-world problems in industrial deployment
  3. Reproducibility: Detailed method description facilitates follow-up research

Applicable Scenarios

  1. Multi-vehicle Deployment: Scenarios requiring rapid adaptation across different vehicle configurations
  2. Sensor Upgrades: System migration when vehicle sensor configurations change
  3. Cross-domain Applications: Adapting to vehicle standard differences across regions or countries

References

The paper cites 75 relevant references covering multiple domains including end-to-end autonomous driving, 3D reconstruction, and novel view synthesis, providing a solid theoretical foundation for this research.


Overall Assessment: This is a high-quality research paper that systematically addresses viewpoint robustness in end-to-end autonomous driving for the first time. The method design is sound, experiments are comprehensive, and the work has significant value for advancing practical applications of autonomous driving technology.