XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

Sun, Wang, Peng et al.
Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE by 7.97% compared with direct training and deliver competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.

Basic Information

  • Paper ID: 2510.13565
  • Title: XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation
  • Authors: Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille
  • Institutions: Technical University of Munich & Infineon Technologies AG
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 15, 2025
  • Paper Link: https://arxiv.org/abs/2510.13565

Abstract

This paper proposes XD-RCDepth, a lightweight radar-camera depth estimation architecture that reduces parameters by 29.7% compared to state-of-the-art lightweight baselines while maintaining comparable accuracy. To preserve performance under model compression and enhance interpretability, the authors introduce two knowledge distillation strategies: explainability-aligned distillation (transferring saliency structures from teacher to student models) and depth distribution distillation (reformulating depth regression as soft classification over discretized bins). These components reduce MAE by 7.97% compared to direct training and achieve competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.

Research Background and Motivation

Problem Definition

Depth estimation remains a core task in autonomous driving. Existing methods primarily include:

  1. Monocular Camera Methods: Suffer from inherent ill-posedness due to RGB images lacking direct geometric measurements
  2. LiDAR-Camera Fusion: Achieves high accuracy but suffers from expensive LiDAR costs and large data bandwidth, affecting real-time performance
  3. Radar-Camera Fusion: Radar is relatively cost-effective and more robust in adverse weather, but faces sparsity and noise challenges

Limitations of Existing Methods

Current radar-camera depth estimation methods exhibit the following problems:

  1. High Computational Complexity: Most employ two-stage pipelines that first densify sparse radar point clouds, then predict depth
  2. Flawed Distillation Design: Methods like LiRCDepth require channel alignment for cross-modal feature distillation, limiting student network design flexibility
  3. Lack of Interpretability: Existing distillation signals are superficial, without addressing model interpretability

Research Motivation

The work is motivated by three goals:

  1. Developing more lightweight radar-camera fusion architectures to meet real-time deployment requirements
  2. Designing more effective knowledge distillation strategies that maintain performance during model compression
  3. Introducing interpretability into knowledge distillation for dense prediction tasks

Core Contributions

  1. Proposes a Lightweight Radar-Camera Depth Estimation Framework: Employs efficient FiLM fusion modules with 29.7% fewer parameters than LiRCDepth
  2. Innovative Knowledge Distillation Methods:
    • Explainability-aligned saliency map distillation (X-KD)
    • Depth distribution distillation (D2-KD)
  3. First Introduction of Interpretability into Dense Prediction Knowledge Distillation: Generates saliency maps via Grad-CAM for distillation
  4. Achieves Real-Time Performance: Maintains competitive accuracy while reaching 15 FPS

Methodology Details

Task Definition

  • Input: RGB image and sparse radar point cloud
  • Output: Dense depth map
  • Constraints: Real-time performance requirements and limited computational resources

Model Architecture

Teacher Network (CaFNet)

  • Image Stream: ResNet-34 backbone extracting features at 5 spatial scales
  • Radar Stream: Two-stage processing generating coarse depth and confidence maps
  • Fusion: Confidence-aware gated fusion (CaGF) module
  • Decoder: BTS-style decoder

Student Network (XD-RCDepth)

  • Backbone Network: Dual-modal MobileNetV2 processing image and radar features separately
  • FiLM Fusion Module:
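    $$\gamma = \mathrm{Conv}_{1\times 1}(f_r), \qquad \beta = \mathrm{Conv}_{1\times 1}(f_r)$$
    $$f_{\mathrm{fuse}} = (1 + \gamma) \odot f_i + \beta$$

    where $f_r$ and $f_i$ are the radar and image features respectively, and $\gamma$ and $\beta$ are channel-wise scaling and offset coefficients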
  • Point-wise DASPP: Extended dense atrous spatial pyramid pooling using pointwise convolution branches and dilated sampling at various dilation rates
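The FiLM fusion rule above can be illustrated with a minimal pure-Python sketch. It assumes that γ and β come from two separate 1×1 convolutions with their own weights, and that feature maps are flattened into per-pixel channel vectors; the function names and toy weights are hypothetical and not taken from the paper.

```python
def conv1x1(pixel, weight, bias):
    """Apply a 1x1 convolution at one pixel: a linear map over channels."""
    return [sum(w * x for w, x in zip(row, pixel)) + b
            for row, b in zip(weight, bias)]


def film_fuse(f_img, f_radar, w_gamma, b_gamma, w_beta, b_beta):
    """Fuse per-pixel channel vectors as (1 + gamma) * f_img + beta."""
    fused = []
    for img_px, radar_px in zip(f_img, f_radar):
        # Assumption: gamma and beta use separate 1x1 convolutions on radar features.
        gamma = conv1x1(radar_px, w_gamma, b_gamma)
        beta = conv1x1(radar_px, w_beta, b_beta)
        fused.append([(1.0 + g) * x + b for g, x, b in zip(gamma, img_px, beta)])
    return fused


# Toy input: two pixels with two channels each (hypothetical values).
f_img = [[1.0, 2.0], [3.0, 4.0]]
f_radar = [[0.5, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
fused = film_fuse(f_img, f_radar, identity, zeros, identity, zeros)
```

In this toy setup the second pixel has an all-zero radar feature, so γ and β vanish and the image feature passes through unchanged. This is the practical effect of writing the scale as (1 + γ) rather than γ.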

Technical Innovations

1. Explainability-Aligned Distillation (X-KD)

X-KD generates Grad-CAM saliency maps for both networks and trains the student to reproduce the teacher's attention pattern:

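Saliency Map Generation:

$$\alpha^{(\cdot)}_{l,c} = \frac{1}{H_l W_l} \sum_{i=1}^{H_l} \sum_{j=1}^{W_l} \frac{\partial \phi^{(\cdot)}}{\partial F^{(\cdot)}_{l,c}(i,j)}$$
$$\mathrm{Map}^{(\cdot)}_{l} = \mathrm{ReLU}\Big(\sum_{c} \alpha^{(\cdot)}_{l,c} \, F^{(\cdot)}_{l,c}\Big)$$

Distillation Loss:

$$\mathcal{L}_{\text{X-KD}} = \frac{1}{|L|} \sum_{l \in L} \big(1 - \langle \tilde{a}^{S}_{l}, \tilde{a}^{T}_{l} \rangle\big)$$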
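A minimal sketch of the saliency computation and the X-KD loss follows. It assumes that the gradients of the Grad-CAM target with respect to each feature map are already available (a real implementation would obtain them from an autograd framework), and that ã denotes the flattened, L2-normalized saliency map, so the inner product is a cosine similarity. All function names are hypothetical.

```python
import math


def gradcam_map(features, grads):
    """Grad-CAM saliency for one layer; inputs are [C][H*W] nested lists."""
    # alpha_c: spatial mean of the target's gradient w.r.t. channel c.
    alphas = [sum(g) / len(g) for g in grads]
    n_pixels = len(features[0])
    # Weighted sum over channels, followed by ReLU.
    return [max(0.0, sum(a * ch[k] for a, ch in zip(alphas, features)))
            for k in range(n_pixels)]


def cosine_distance(map_s, map_t, eps=1e-12):
    """1 - cosine similarity between two flattened saliency maps."""
    dot = sum(s * t for s, t in zip(map_s, map_t))
    norm_s = math.sqrt(sum(s * s for s in map_s))
    norm_t = math.sqrt(sum(t * t for t in map_t))
    return 1.0 - dot / (norm_s * norm_t + eps)


def xkd_loss(student_maps, teacher_maps):
    """Average the cosine distance over the distilled layers."""
    pairs = list(zip(student_maps, teacher_maps))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)


# Teacher layer with 2 channels, student layer with 1 channel, 3 pixels each.
teacher_map = gradcam_map([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
                          [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
student_map = gradcam_map([[2.0, 4.0, 6.0]], [[0.5, 0.5, 0.5]])
loss = xkd_loss([student_map], [teacher_map])
```

Because the channel dimension is summed out before the maps are compared, teacher and student may have different channel counts, as in the toy example. Only the spatial sizes need to match.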

2. Depth Distribution Distillation (D2-KD)

Discretizes continuous depth ranges into B bins, performing distillation through soft classification:

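Bin Assignment:

$$\Delta^{(\cdot)}_{i}(p) = \big| d^{(\cdot)}(p) - c_i \big|, \qquad z^{(\cdot)}_{i}(p) = -\Delta^{(\cdot)}_{i}(p)$$

Probability Distribution:

$$p^{S}(p) = \mathrm{softmax}\big(z^{S}(p)/\tau\big), \qquad q^{T}(p) = \mathrm{softmax}\big(z^{T}(p)/\tau\big)$$

KL Divergence Loss:

$$\mathcal{L}_{\text{D2-KD}} = \frac{\tau^{2}}{|\Omega|} \sum_{p \in \Omega} \sum_{i=1}^{B} q^{T}_{i}(p) \, \log \frac{q^{T}_{i}(p)}{p^{S}_{i}(p)}$$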
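The three steps above (distance-based logits, temperature softmax, KL divergence) can be sketched in plain Python as follows. The uniform bin centers, the choice of B = 8, and τ = 2.0 are illustrative assumptions; the summary does not state how the bins are spaced or which temperature is used.

```python
import math


def soft_bin_probs(depth, centers, tau):
    """Softmax over logits z_i = -|depth - c_i|, scaled by temperature tau."""
    logits = [-abs(depth - c) / tau for c in centers]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def d2kd_loss(student_depth, teacher_depth, centers, tau=1.0):
    """tau^2-scaled KL(teacher || student), averaged over pixels."""
    kl_sum = 0.0
    for d_s, d_t in zip(student_depth, teacher_depth):
        p_s = soft_bin_probs(d_s, centers, tau)
        q_t = soft_bin_probs(d_t, centers, tau)
        kl_sum += sum(q * math.log(q / p) for q, p in zip(q_t, p_s))
    return tau ** 2 * kl_sum / len(teacher_depth)


# Assumption: 8 uniformly spaced bin centers covering 0-80 m.
centers = [10.0 * i + 5.0 for i in range(8)]
teacher = [12.0, 33.0, 61.0]
loss_same = d2kd_loss(teacher, teacher, centers, tau=2.0)
loss_off = d2kd_loss([15.0, 30.0, 70.0], teacher, centers, tau=2.0)
```

When the student reproduces the teacher's depths the loss is zero, and it grows as the predicted depths drift toward other bins. With a very small τ or a very wide depth range the softmax can underflow, so a production version would likely work in log space.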

Overall Loss Function

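$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{Depth}} + \lambda_2 \mathcal{L}_{\text{X-KD}} + \lambda_3 \mathcal{L}_{\text{D2-KD}}$$

where $\mathcal{L}_{\text{Depth}}$ is the depth supervision loss, with $\lambda_1 = 1.0$, $\lambda_2 = 0.5$, and $\lambda_3 = 0.5$.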

Experimental Setup

Datasets

  1. nuScenes: Multi-modal autonomous driving dataset with 3D radar data
  2. ZJU-4DRadarCam: 4D radar dataset providing higher-resolution radar information

Evaluation Metrics

  • Error Metrics: MAE (Mean Absolute Error), RMSE (Root Mean Square Error), AbsRel (Absolute Relative Error), log10 (mean absolute log10 error)
  • Accuracy Metrics: δ1, δ2, δ3 (threshold accuracy)
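For reference, the metrics listed above follow the definitions commonly used in depth estimation. The sketch below assumes that predictions and ground truth are paired lists of positive depths in metres, already restricted to pixels with valid ground truth. The function name is hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Compute standard depth metrics over paired positive depths (metres)."""
    n = len(gt)
    abs_err = [abs(p - g) for p, g in zip(pred, gt)]
    # Threshold accuracy uses the symmetric ratio max(pred/gt, gt/pred).
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
        "AbsRel": sum(e / g for e, g in zip(abs_err, gt)) / n,
        "log10": sum(abs(math.log10(p) - math.log10(g))
                     for p, g in zip(pred, gt)) / n,
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
    }


metrics = depth_metrics(pred=[10.0, 22.0, 40.0], gt=[10.0, 20.0, 20.0])
```

Lower is better for the four error metrics, while δ1 to δ3 report the fraction of pixels whose prediction lies within a factor of 1.25, 1.25², and 1.25³ of the ground truth, so higher is better.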

Comparison Methods

  • RadarNet: Early radar-camera fusion method
  • CaFNet: Teacher network
  • LiRCDepth: Current state-of-the-art lightweight baseline

Implementation Details

  • Hardware: Single NVIDIA L40 GPU
  • Batch Size: 8
  • Distillation Layers: 1/16 scale layers of image encoder, radar encoder, and decoder

Experimental Results

Main Results

nuScenes Dataset Performance Comparison (80m evaluation distance)

Method                        Parameters  Runtime  MAE↓   RMSE↓  AbsRel↓  δ1↑
RadarNet                      22.8M       0.378s   2.179  4.899  0.106    0.894
CaFNet (Teacher)              62.25M      0.132s   1.763  4.184  0.083    0.921
LiRCDepth                     12.65M      0.069s   2.152  4.801  0.105    0.892
XD-RCDepth (no distillation)  8.89M       0.015s   2.232  4.897  0.114    0.887
XD-RCDepth (XD2-KD)           8.89M       0.015s   2.054  4.676  0.102    0.901

Key Findings

  1. Parameter Efficiency: XD-RCDepth reduces parameters by 29.7% compared to LiRCDepth
  2. Speed Improvement: Per-frame runtime drops from 0.069s (LiRCDepth) to 0.015s, roughly a 4.6× speedup; the reported real-time figure is 15 FPS
  3. Distillation Effect: Relative to the non-distilled student, distillation lowers MAE by 7.91%, 7.96%, and 7.97% at evaluation distances of 50m, 70m, and 80m respectively
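The headline percentages can be checked directly against the nuScenes table above. The short calculation below uses only the tabulated values. The helper name is hypothetical.

```python
def percent_reduction(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


param_cut = percent_reduction(12.65, 8.89)  # LiRCDepth vs XD-RCDepth parameters (M)
mae_cut = percent_reduction(2.232, 2.054)   # no distillation vs XD2-KD, MAE at 80 m
speedup = 0.069 / 0.015                     # LiRCDepth vs XD-RCDepth runtime (s)

print(f"parameters: -{param_cut:.1f}%")
print(f"MAE (80 m): -{mae_cut:.2f}%")
print(f"speedup: {speedup:.1f}x")
```

The results round to 29.7%, 7.97%, and 4.6×. The first two match the figures quoted in the abstract.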

Ablation Studies

Fusion Method Comparison

Fusion Method   Parameters  MAE    RMSE   AbsRel  δ1
Addition        8.74M       2.248  4.903  0.115   0.886
Concatenation   10.94M      2.208  4.802  0.114   0.888
Attention       9.48M       2.266  4.901  0.115   0.885
FiLM            8.89M       2.232  4.897  0.114   0.887

Distillation Component Analysis

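Configuration                           MAE    RMSE   AbsRel  δ1
Neither X-KD nor D2-KD                  2.232  4.897  0.114   0.887
One of X-KD / D2-KD (marker missing)    2.114  4.756  0.108   0.892
One of X-KD / D2-KD (marker missing)    2.132  4.781  0.107   0.891
Both X-KD and D2-KD                     2.054  4.676  0.102   0.901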

Qualitative Analysis

  1. Depth Map Quality: Distilled models produce clearer object boundaries and cleaner depth discontinuities
  2. Saliency Map Alignment: Student networks trained with X-KD exhibit sharper saliency maps more focused on depth-relevant structures

Related Work

Evolution of Depth Estimation Methods

  1. Monocular Depth Estimation: Predicting dense depth maps from RGB images, but suffering from scale ambiguity
  2. LiDAR-Camera Fusion: Leveraging sparse LiDAR point clouds as geometric priors
  3. Radar-Camera Fusion: Utilizing more cost-effective and weather-robust millimeter-wave radar

Knowledge Distillation Development

  1. Classical Distillation: Soft label distillation proposed by Hinton et al.
  2. Feature Distillation: Intermediate layer feature alignment
  3. Interpretability Distillation: First introduction in dense prediction tasks in this work

Advantages of This Work

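Compared to existing work, this paper improves on three fronts: model size (29.7% fewer parameters than LiRCDepth), runtime (0.015s versus 0.069s per frame on nuScenes), and interpretability (the student is trained to match the teacher's saliency maps).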

Conclusions and Discussion

Main Conclusions

  1. Successful Lightweight Implementation: Significantly reduces parameters and computation time while maintaining competitive performance
  2. Effective Distillation Strategies: X-KD and D2-KD complement each other, substantially improving student network performance
  3. Practical Value: Achieves real-time performance requirements suitable for actual deployment

Limitations

  1. Radar Data Quality Dependency: Performance remains constrained by radar point cloud sparsity and noise
  2. Distillation Target Selection: Choice of Grad-CAM targets (e.g., image-level average depth) may affect results
  3. Generalization Capability: Primarily validated on specific datasets; cross-domain generalization requires further investigation

Future Directions

The authors propose investigating the impact of Grad-CAM target selection and alternative attribution targets on distillation interpretability quality and downstream performance.

In-Depth Evaluation

Strengths

  1. Strong Technical Innovation: First introduction of interpretability into dense prediction knowledge distillation with novel technical approach
  2. Comprehensive Experiments: Thorough comparative and ablation studies on two datasets
  3. High Practical Value: Significant parameter and speed optimization meeting actual deployment requirements
  4. Reasonable Method Design: FiLM fusion is simple and effective; Point-wise DASPP demonstrates clever lightweight design

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why interpretability distillation is effective
  2. Limited Ablation Studies: Insufficient analysis of different Grad-CAM targets and temperature parameter effects
  3. Narrow Comparison Scope: Primarily compares with radar-camera methods; lacks comparison with other lightweight depth estimation approaches

Impact

  1. Academic Contribution: Opens new directions for knowledge distillation in dense prediction tasks
  2. Practical Value: Provides feasible solutions for real-time depth estimation in autonomous driving
  3. Reproducibility: Clear method description with sufficient implementation details

Applicable Scenarios

  1. Autonomous Driving: Real-time depth estimation in resource-constrained vehicle systems
  2. Mobile Robotics: Scenarios requiring lightweight multi-modal perception
  3. Edge Computing: Applications with limited computational resources but requiring accurate depth information

References

The paper cites important works in depth estimation, knowledge distillation, and explainable AI, including:

  • Hinton et al. (2015): Foundational work on knowledge distillation
  • Selvaraju et al. (2019): Grad-CAM visualization method
  • Caesar et al. (2020): nuScenes dataset
  • Multiple recent studies on radar-camera fusion

Overall Assessment: The paper makes a solid contribution to lightweight multi-modal depth estimation. The distillation scheme is novel, the experiments cover two datasets with ablations, and the parameter and runtime savings are directly relevant to deployment.