
XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

Sun, Wang, Peng et al.

Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE by 7.97% compared with direct training and deliver competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.


Basic Information

Paper ID: 2510.13565
Title: XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation
Authors: Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille
Institutions: Technical University of Munich & Infineon Technologies AG
Category: cs.CV (Computer Vision)
Publication Date: October 15, 2025
Paper Link: https://arxiv.org/abs/2510.13565

Abstract

This paper proposes XD-RCDepth, a lightweight radar-camera depth estimation architecture that reduces parameters by 29.7% compared to state-of-the-art lightweight baselines while maintaining comparable accuracy. To preserve performance under model compression and enhance interpretability, the authors introduce two knowledge distillation strategies: explainability-aligned distillation (transferring saliency structures from teacher to student models) and depth distribution distillation (reformulating depth regression as soft classification over discretized bins). These components reduce MAE by 7.97% compared to direct training and achieve competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.

Research Background and Motivation

Problem Definition

Depth estimation remains a core task in autonomous driving. Existing methods primarily include:

Monocular Camera Methods: Suffer from inherent ill-posedness due to RGB images lacking direct geometric measurements
LiDAR-Camera Fusion: Achieves high accuracy, but LiDAR sensors are expensive and generate high data bandwidth, which hurts real-time performance
Radar-Camera Fusion: Radar is relatively cost-effective and more robust in adverse weather, but faces sparsity and noise challenges

Limitations of Existing Methods

Current radar-camera depth estimation methods exhibit the following problems:

High Computational Complexity: Most employ two-stage pipelines that first densify sparse radar point clouds, then predict depth
Flawed Distillation Design: Methods like LiRCDepth require channel alignment for cross-modal feature distillation, limiting student network design flexibility
Lack of Interpretability: Existing distillation signals are superficial, without addressing model interpretability

Research Motivation

The work is motivated by three goals:

Developing more lightweight radar-camera fusion architectures to meet real-time deployment requirements
Designing more effective knowledge distillation strategies that maintain performance during model compression
Introducing interpretability into knowledge distillation for dense prediction tasks

Core Contributions

Proposes a Lightweight Radar-Camera Depth Estimation Framework: Employs efficient FiLM fusion modules with 29.7% fewer parameters than LiRCDepth
Innovative Knowledge Distillation Methods:
- Explainability-aligned saliency map distillation (X-KD)
- Depth distribution distillation (D2-KD)
First Introduction of Interpretability into Dense Prediction Knowledge Distillation: Generates saliency maps via Grad-CAM for distillation
Achieves Real-Time Performance: Maintains competitive accuracy at a reported runtime of 0.015s per frame (about 66 FPS)

Methodology Details

Task Definition

Input: RGB image and sparse radar point cloud
Output: Dense depth map
Constraints: Real-time performance requirements and limited computational resources

Model Architecture

Teacher Network (CaFNet)

Image Stream: ResNet-34 backbone extracting features at 5 spatial scales
Radar Stream: Two-stage processing generating coarse depth and confidence maps
Fusion: Confidence-aware gated fusion (CaGF) module
Decoder: BTS-style decoder

Student Network (XD-RCDepth)

Backbone Network: Dual-modal MobileNetV2 processing image and radar features separately
FiLM Fusion Module:
```
γ = Conv1×1(fr), β = Conv1×1(fr)
ffuse = (1 + γ) ⊙ fi + β
```
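where fr and fi are the radar and image features, respectively, and γ and β are the channel-wise scale and offset predicted from the radar features. The (1 + γ) form means the fusion reduces to the unmodified image features when γ = β = 0.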
Point-wise DASPP: Extended dense atrous spatial pyramid pooling using pointwise convolution branches and dilated sampling at various dilation rates
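
The FiLM fusion rule above can be illustrated with a minimal sketch in plain Python on nested lists. It assumes two separate 1×1 convolutions (one for γ, one for β) and channel-first feature maps; the function and parameter names (`conv1x1`, `film_fuse`, `w_gamma`, `w_beta`) are hypothetical and not taken from the paper.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution on a [C_in][H][W] nested list.

    A 1x1 convolution is a per-pixel linear map over channels:
    weight has shape [C_out][C_in], bias has shape [C_out].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * x[c][i][j] for c in range(len(x)))
          for j in range(width)] for i in range(height)]
        for o in range(len(weight))
    ]


def film_fuse(f_img, f_radar, w_gamma, b_gamma, w_beta, b_beta):
    """Compute f_fuse = (1 + gamma) * f_img + beta element-wise.

    gamma and beta are predicted from the radar features (assumed here
    to come from two independent 1x1 convolutions).
    """
    gamma = conv1x1(f_radar, w_gamma, b_gamma)
    beta = conv1x1(f_radar, w_beta, b_beta)
    height, width = len(f_img[0]), len(f_img[0][0])
    return [
        [[(1.0 + gamma[c][i][j]) * f_img[c][i][j] + beta[c][i][j]
          for j in range(width)] for i in range(height)]
        for c in range(len(f_img))
    ]


# Toy example: 2 image channels, 1 radar channel, a 1x2 feature map.
f_img = [[[1.0, 2.0]], [[3.0, 4.0]]]
f_radar = [[[0.5, -1.0]]]
fused = film_fuse(
    f_img, f_radar,
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 0.0],  # radar scales channel 0 only
    w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0],    # radar shifts channel 1 only
)
```

Because the radar branch only produces per-channel scale and offset values, the image features keep their own channel count and spatial layout, which is one plausible reason this fusion adds few parameters.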

Technical Innovations

1. Explainability-Aligned Distillation (X-KD)

Generates saliency maps via Grad-CAM to enable student networks to learn teacher network attention patterns:

Saliency Map Generation:

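$$\alpha^{(\cdot)}_{l,c} = \frac{1}{H_l W_l} \sum_{i=1}^{H_l} \sum_{j=1}^{W_l} \frac{\partial \phi^{(\cdot)}}{\partial F^{(\cdot)}_{l,c}(i,j)}$$
$$\mathrm{Map}^{(\cdot)}_{l} = \mathrm{ReLU}\Big(\sum_{c} \alpha^{(\cdot)}_{l,c}\, F^{(\cdot)}_{l,c}\Big)$$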

Distillation Loss:

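$$\mathcal{L}_{\text{X-KD}} = \frac{1}{|L|} \sum_{l \in L} \Big(1 - \big\langle \tilde{a}^{S}_{l},\, \tilde{a}^{T}_{l} \big\rangle\Big)$$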
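
As an illustration of the two steps above, the following sketch builds a Grad-CAM-style map from given activations and gradients, then scores student/teacher agreement with one minus the inner product. It assumes the gradients of the target with respect to the features have already been obtained from an autograd framework, that the student and teacher maps share the same spatial size, and that ã denotes the flattened, L2-normalized map (the summary does not define ã, so that normalization is an assumption). All function names are hypothetical.

```python
import math


def grad_cam_map(features, grads):
    """Grad-CAM map for one layer.

    features, grads: [C][H][W] nested lists holding the activations F and
    the gradients d(phi)/dF of a scalar target phi.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # alpha_c: spatial mean of the gradient of channel c
    alpha = [sum(sum(row) for row in grads[k]) / (height * width)
             for k in range(channels)]
    # ReLU of the alpha-weighted sum over channels
    return [[max(0.0, sum(alpha[k] * features[k][i][j] for k in range(channels)))
             for j in range(width)] for i in range(height)]


def l2_normalize(sal_map, eps=1e-8):
    """Flatten an [H][W] map and scale it to (approximately) unit L2 norm."""
    flat = [v for row in sal_map for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) + eps
    return [v / norm for v in flat]


def xkd_loss(student_maps, teacher_maps):
    """Mean over layers of 1 - <a_S, a_T> on normalized saliency vectors."""
    total = 0.0
    for s_map, t_map in zip(student_maps, teacher_maps):
        a_s, a_t = l2_normalize(s_map), l2_normalize(t_map)
        total += 1.0 - sum(x * y for x, y in zip(a_s, a_t))
    return total / len(student_maps)


# Toy check: channel weights alpha = [1, -1] give the map [[0.0, 6.0]].
toy_map = grad_cam_map(
    features=[[[1.0, 2.0]], [[3.0, -4.0]]],
    grads=[[[1.0, 1.0]], [[-1.0, -1.0]]],
)
```

Since the channel dimension is summed out before the comparison, a loss of this form only needs matching spatial sizes, not matching channel counts, between student and teacher layers.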

2. Depth Distribution Distillation (D2-KD)

Discretizes continuous depth ranges into B bins, performing distillation through soft classification:

Bin Assignment:

$$\Delta^{(\cdot)}_{i}(p) = \big| d^{(\cdot)}(p) - c_i \big|, \qquad z^{(\cdot)}_{i}(p) = -\Delta^{(\cdot)}_{i}(p)$$

Probability Distribution:

$$p^{S}(p) = \mathrm{softmax}\big(z^{S}(p)/\tau\big), \qquad q^{T}(p) = \mathrm{softmax}\big(z^{T}(p)/\tau\big)$$

KL Divergence Loss:

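$$\mathcal{L}_{\text{D2-KD}} = \frac{\tau^{2}}{|\Omega|} \sum_{p \in \Omega} \sum_{i=1}^{B} q^{T}_{i}(p)\, \log \frac{q^{T}_{i}(p)}{p^{S}_{i}(p)}$$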
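
The bin assignment, softmax, and KL steps above can be sketched per pixel in plain Python. The uniform bin centers over 0 to 80 m and the temperature of 1.0 used in the toy call are assumptions for illustration only; the summary does not state the bin layout or τ. Function names are hypothetical.

```python
import math


def log_soft_bins(depth, centers, tau):
    """Log-probabilities of softmax(-|d - c_i| / tau) over the bin centers."""
    logits = [-abs(depth - c) / tau for c in centers]
    peak = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_z for z in logits]


def d2kd_loss(student_depth, teacher_depth, centers, tau=1.0):
    """tau^2 / |Omega| * sum over pixels of KL(q_T(p) || p_S(p)).

    student_depth, teacher_depth: flat lists of predicted depths for the
    same set of pixels Omega.
    """
    total = 0.0
    for d_s, d_t in zip(student_depth, teacher_depth):
        log_p = log_soft_bins(d_s, centers, tau)
        log_q = log_soft_bins(d_t, centers, tau)
        total += sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))
    return tau ** 2 * total / len(student_depth)


# Assumed layout: 8 uniform bins covering 0-80 m (centers at 5, 15, ..., 75).
centers = [5.0 + 10.0 * i for i in range(8)]
close = d2kd_loss([12.0], [13.0], centers)  # student near the teacher
far = d2kd_loss([12.0], [40.0], centers)    # student far from the teacher
```

Unlike a plain L1 term on the depth values, the soft assignment tells the student how the teacher's prediction is spread across neighbouring bins, and the loss grows as the two distributions drift apart.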

Overall Loss Function

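$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{Depth}} + \lambda_2 \mathcal{L}_{\text{X-KD}} + \lambda_3 \mathcal{L}_{\text{D2-KD}}$$

where $\mathcal{L}_{\text{Depth}}$ is the depth supervision loss and $\lambda_1 = 1.0$, $\lambda_2 = 0.5$, $\lambda_3 = 0.5$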

Experimental Setup

Datasets

nuScenes: Multi-modal autonomous driving dataset with 3D radar data
ZJU-4DRadarCam: 4D radar dataset providing higher-resolution radar information

Evaluation Metrics

Error Metrics: MAE (Mean Absolute Error), RMSE (Root Mean Square Error), AbsRel (Absolute Relative Error), log10 (mean absolute log10 error)
Accuracy Metrics: δ1, δ2, δ3 (threshold accuracy)
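
For reference, these metrics follow the standard depth-estimation definitions and can be computed as in the sketch below. It assumes flat lists of positive depths in metres at valid ground-truth pixels; masking and the evaluation distance cap are left out, and the function name is hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth metrics over paired predicted / ground-truth depths."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    mae = sum(abs(p - g) for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    # Threshold accuracy: share of pixels with max(pred/gt, gt/pred) < 1.25^k
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {f"delta{k}": sum(r < 1.25 ** k for r in ratios) / n
              for k in (1, 2, 3)}
    return {"MAE": mae, "RMSE": rmse, "AbsRel": abs_rel, "log10": log10, **deltas}


# Toy example with three pixels.
metrics = depth_metrics(pred=[10.0, 20.0, 40.0], gt=[10.0, 25.0, 20.0])
```

Lower is better for the four error metrics, while the δ values are fractions in [0, 1] where higher is better.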

Comparison Methods

RadarNet: Early radar-camera fusion method
CaFNet: Teacher network
LiRCDepth: Current state-of-the-art lightweight baseline

Implementation Details

Hardware: Single NVIDIA L40 GPU
Batch Size: 8
Distillation Layers: 1/16 scale layers of image encoder, radar encoder, and decoder

Experimental Results

Main Results

nuScenes Dataset Performance Comparison (80m evaluation distance)

Method	Parameters	Runtime	MAE↓	RMSE↓	AbsRel↓	δ1↑
RadarNet	22.8M	0.378s	2.179	4.899	0.106	0.894
CaFNet (Teacher)	62.25M	0.132s	1.763	4.184	0.083	0.921
LiRCDepth	12.65M	0.069s	2.152	4.801	0.105	0.892
XD-RCDepth (no distillation)	8.89M	0.015s	2.232	4.897	0.114	0.887
XD-RCDepth (XD2-KD)	8.89M	0.015s	2.054	4.676	0.102	0.901

Key Findings

Parameter Efficiency: XD-RCDepth reduces parameters by 29.7% compared to LiRCDepth
Speed Improvement: Runtime reduced from 0.069s to 0.015s per frame, roughly 4.6× faster and equivalent to about 66 FPS
Distillation Effect: Compared to non-distilled version, MAE improvements of 7.91%, 7.96%, and 7.97% at 50m, 70m, and 80m distances respectively
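
The headline figures can be re-derived from the nuScenes table above with a few lines of arithmetic. The sketch uses only values printed in that table; the frame rates are simply the reciprocal of the reported per-frame runtime, which assumes the runtime column is a per-frame latency.

```python
# Values copied from the nuScenes comparison table (80 m evaluation distance).
params_lircdepth, params_xd = 12.65, 8.89      # parameters, in millions
runtime_lircdepth, runtime_xd = 0.069, 0.015   # seconds per frame
mae_no_kd, mae_kd = 2.232, 2.054               # XD-RCDepth without / with distillation

param_reduction_pct = (params_lircdepth - params_xd) / params_lircdepth * 100
mae_reduction_pct = (mae_no_kd - mae_kd) / mae_no_kd * 100
speedup = runtime_lircdepth / runtime_xd
fps_xd = 1.0 / runtime_xd
fps_lircdepth = 1.0 / runtime_lircdepth

print(f"parameter reduction: {param_reduction_pct:.1f}%")
print(f"MAE reduction from distillation: {mae_reduction_pct:.2f}%")
print(f"speedup over LiRCDepth: {speedup:.1f}x")
print(f"implied frame rate: {fps_xd:.1f} FPS vs {fps_lircdepth:.1f} FPS")
```

The parameter and MAE percentages reproduce the 29.7% and 7.97% quoted in the text, and the runtime column implies a frame rate well above that of LiRCDepth.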

Ablation Studies

Fusion Method Comparison

Fusion Method	Parameters	MAE	RMSE	AbsRel	δ1
Addition	8.74M	2.248	4.903	0.115	0.886
Concatenation	10.94M	2.208	4.802	0.114	0.888
Attention	9.48M	2.266	4.901	0.115	0.885
FiLM	8.89M	2.232	4.897	0.114	0.887

Distillation Component Analysis

X-KD	D2-KD	MAE	RMSE	AbsRel	δ1
-	-	2.232	4.897	0.114	0.887
✓	-	2.114	4.756	0.108	0.892
-	✓	2.132	4.781	0.107	0.891
✓	✓	2.054	4.676	0.102	0.901
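
To read the distillation ablation more easily, the MAE values in the table above can be converted into gains over the non-distilled student. This is a small arithmetic sketch on the printed numbers only; the dictionary keys are illustrative labels.

```python
# MAE values copied from the distillation component table.
baseline_mae = 2.232
mae_by_setting = {"X-KD only": 2.114, "D2-KD only": 2.132, "X-KD + D2-KD": 2.054}

# Absolute and relative MAE reduction for each configuration.
abs_gain = {name: baseline_mae - mae for name, mae in mae_by_setting.items()}
rel_gain_pct = {name: gain / baseline_mae * 100 for name, gain in abs_gain.items()}

# Compare the combined gain with the sum of the two individual gains.
sum_of_individual = abs_gain["X-KD only"] + abs_gain["D2-KD only"]
combined = abs_gain["X-KD + D2-KD"]

for name, pct in rel_gain_pct.items():
    print(f"{name}: {pct:.2f}% lower MAE")
print(f"sum of individual gains: {sum_of_individual:.3f}, combined: {combined:.3f}")
```

Each loss helps on its own (about 5.29% and 4.48% lower MAE), and using both gives the largest gain (about 7.97%), although the combined gain is smaller than the sum of the two individual gains, so the two signals overlap to some degree.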

Qualitative Analysis

Depth Map Quality: Distilled models produce clearer object boundaries and cleaner depth discontinuities
Saliency Map Alignment: Student networks trained with X-KD exhibit sharper saliency maps more focused on depth-relevant structures

Evolution of Depth Estimation Methods

Monocular Depth Estimation: Predicting dense depth maps from RGB images, but suffering from scale ambiguity
LiDAR-Camera Fusion: Leveraging sparse LiDAR point clouds as geometric priors
Radar-Camera Fusion: Utilizing more cost-effective and weather-robust millimeter-wave radar

Knowledge Distillation Development

Classical Distillation: Soft label distillation proposed by Hinton et al.
Feature Distillation: Intermediate layer feature alignment
Interpretability Distillation: First introduction in dense prediction tasks in this work

Advantages of This Work

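Compared to existing work, the paper improves on three fronts: a lighter model (8.89M parameters versus 12.65M for LiRCDepth), faster inference (0.015s versus 0.069s per frame on nuScenes), and a saliency-based distillation signal that brings interpretability into training.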

Conclusions and Discussion

Main Conclusions

Successful Lightweight Implementation: Significantly reduces parameters and computation time while maintaining competitive performance
Effective Distillation Strategies: X-KD and D2-KD complement each other, substantially improving student network performance
Practical Value: Achieves real-time performance requirements suitable for actual deployment

Limitations

Radar Data Quality Dependency: Performance remains constrained by radar point cloud sparsity and noise
Distillation Target Selection: Choice of Grad-CAM targets (e.g., image-level average depth) may affect results
Generalization Capability: Primarily validated on specific datasets; cross-domain generalization requires further investigation

Future Directions

The authors propose investigating the impact of Grad-CAM target selection and alternative attribution targets on distillation interpretability quality and downstream performance.

In-Depth Evaluation

Strengths

Strong Technical Innovation: First introduction of interpretability into dense prediction knowledge distillation with novel technical approach
Comprehensive Experiments: Thorough comparative and ablation studies on two datasets
High Practical Value: Significant parameter and speed optimization meeting actual deployment requirements
Reasonable Method Design: FiLM fusion is simple and effective; Point-wise DASPP demonstrates clever lightweight design

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why interpretability distillation is effective
Limited Ablation Studies: Insufficient analysis of different Grad-CAM targets and temperature parameter effects
Narrow Comparison Scope: Primarily compares with radar-camera methods; lacks comparison with other lightweight depth estimation approaches

Impact

Academic Contribution: Opens new directions for knowledge distillation in dense prediction tasks
Practical Value: Provides feasible solutions for real-time depth estimation in autonomous driving
Reproducibility: Clear method description with sufficient implementation details

Applicable Scenarios

Autonomous Driving: Real-time depth estimation in resource-constrained vehicle systems
Mobile Robotics: Scenarios requiring lightweight multi-modal perception
Edge Computing: Applications with limited computational resources but requiring accurate depth information

References

The paper cites important works in depth estimation, knowledge distillation, and explainable AI, including:

Hinton et al. (2015): Foundational work on knowledge distillation
Selvaraju et al. (2019): Grad-CAM visualization method
Caesar et al. (2020): nuScenes dataset
Multiple recent studies on radar-camera fusion

Overall Assessment: This is a high-quality technical paper that makes valuable contributions to lightweight multi-modal depth estimation. The methodology is novel, the experiments are comprehensive, and the practical value is clear, making it a useful reference for related research and applications.