Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
Xu, Lin, Zhou et al.
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Most existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, which yield good yet limited performance. Few studies approach the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantic information from 2D images. Existing methods primarily improve performance through structural modifications (such as lightweight backbone networks and complex cascaded frameworks), but the gains are limited. Few studies approach the problem from the perspective of representation fusion, so the rich feature diversity in 2D images remains underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphic, and depth features from input images and introduces a deformable multi-level fusion mechanism to fuse these three types of multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark.
The core problem addressed in this paper is camera-based 3D semantic scene completion (SSC), specifically how to accurately reconstruct occluded regions from 2D images while maintaining cross-camera geometric consistency.
Structural Optimization Limitations: Existing methods primarily focus on network architecture optimization, overlooking thorough exploration and utilization of image information
Insufficient Feature Utilization: Emphasis on graphic features (position, size, color, shape) provides only partial semantic information
Missing Multi-Level Fusion: Lack of research on enhancing model understanding of 2D images through multi-level representation fusion
CIGOcc Framework: Proposes a novel two-stage framework utilizing multi-level representation fusion to address low target accuracy, achieving accurate 2D-to-3D reconstruction, particularly in distant scenes
Deformable Multi-Level Fusion Mechanism: Introduces a new fusion mechanism that adaptively and effectively fuses depth and semantic information, ensuring more comprehensive and accurate 3D reconstruction
State-of-the-Art Performance: Achieves state-of-the-art performance on camera-based SSC tasks, demonstrating effectiveness and robustness in complex real-world scenarios
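To make the idea of adaptive fusion concrete, the following is a minimal sketch and not the paper's mechanism: this summary does not specify how the deformable multi-level fusion works, so the sketch swaps in a much simpler technique, per-pixel softmax-gated fusion of several feature maps, and omits deformable sampling entirely. The function names, tensor shapes, and the gate logits (which a real model would presumably predict with a learned layer) are all assumptions for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gated_fusion(features, gate_logits):
    """Fuse K feature maps with per-pixel softmax weights.

    features:    array of shape (K, D, H, W), one D-channel map per branch
                 (hypothetically: segmentation, graphic, depth).
    gate_logits: array of shape (K, H, W), unnormalised per-pixel scores
                 (assumed to come from a learned layer in a real model).
    Returns an array of shape (D, H, W).
    """
    if features.ndim != 4 or gate_logits.ndim != 3:
        raise ValueError("expected features (K, D, H, W) and gate_logits (K, H, W)")
    if features.shape[0] != gate_logits.shape[0] or features.shape[2:] != gate_logits.shape[1:]:
        raise ValueError("features and gate_logits disagree on K, H, or W")
    weights = softmax(gate_logits, axis=0)           # (K, H, W), sums to 1 over K
    # Broadcast the weights over the channel axis and sum over the K branches.
    return (weights[:, None, :, :] * features).sum(axis=0)


# Three hypothetical 2-channel feature maps on a 3x3 grid.
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 2, 3, 3))
gate_logits = rng.standard_normal((3, 3, 3))
fused = gated_fusion(features, gate_logits)
print(fused.shape)  # (2, 3, 3)
```

Because the weights are normalised per pixel, each location can lean on a different branch; equal gate logits reduce the fusion to a plain average of the branches.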
Input: Single RGB image I ∈ R^(C×H×W)
Output: Semantic voxel map Y ∈ R^(C×X×Y×Z), where each voxel is classified into one of 20 semantic categories
Objective: Infer complete 3D scene geometry and semantic information from 2D images
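The input and output shapes above can be illustrated with a short sketch. It assumes that the leading C of the output tensor is the number of semantic classes (20, per the text) and that the final label for each voxel is taken as the arg-max over that axis; the summary does not state either point explicitly. The tiny image and grid sizes, the random values, and the helper name `logits_to_labels` are hypothetical and chosen only to keep the example fast.

```python
import numpy as np

NUM_CLASSES = 20  # from the text: each voxel falls into one of 20 categories


def logits_to_labels(logits):
    """Collapse per-class scores of shape (C, X, Y, Z) into a label grid (X, Y, Z).

    Assumes the first axis indexes the semantic classes.
    """
    if logits.ndim != 4 or logits.shape[0] != NUM_CLASSES:
        raise ValueError(f"expected shape ({NUM_CLASSES}, X, Y, Z), got {logits.shape}")
    return logits.argmax(axis=0)


rng = np.random.default_rng(0)

# Input: a single RGB image I with shape (C, H, W); here C = 3 colour channels.
image = rng.random((3, 8, 16))

# Output: per-class scores over a voxel grid, standing in for a model's prediction.
logits = rng.standard_normal((NUM_CLASSES, 4, 4, 2))

labels = logits_to_labels(logits)
print(image.shape, logits.shape, labels.shape)  # (3, 8, 16) (20, 4, 4, 2) (4, 4, 2)
```

Note that C plays two roles in the notation: colour channels for the input image and, under the assumption above, semantic classes for the output map.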
Summary: CIGOcc combines a multi-level feature fusion strategy with knowledge distilled from a large model (SAM) to reach state-of-the-art performance on the SemanticKITTI benchmark without increasing training costs. The work points to representation fusion, rather than further structural modification, as a research direction for vision-based 3D perception.