2025-11-23T02:55:16.956845

Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion

Xu, Lin, Zhou et al.

Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies approach the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc

Basic Information

Paper ID: 2510.13198
Title: Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
Authors: Rongtao Xu, Jinzhou Lin, Jialei Zhou, Jiahua Dong, Changwei Wang, Ruisheng Wang, Li Guo, Shibiao Xu, Xiaodan Liang
Category: cs.CV (Computer Vision)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13198v1
Code Link: https://github.com/VitaLemonTea1/CIGOcc

Abstract

Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantic information from 2D images. Existing methods primarily improve performance through structural modifications (such as lightweight backbone networks and complex cascaded frameworks), with limited effectiveness. Few studies explore representation fusion perspectives, resulting in insufficient utilization of rich feature diversity in 2D images. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphic, and depth features from input images and introduces a deformable multi-level fusion mechanism to fuse these three types of multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark.

Research Background and Motivation

Research Problem

The core problem addressed in this paper is camera-based 3D semantic scene completion (SSC), specifically how to accurately reconstruct occluded regions from 2D images while maintaining cross-camera geometric consistency.

Problem Significance

Autonomous Driving Requirements: SSC is a critical solution for 3D perception in autonomous driving and robotics
Cost-Effectiveness: Camera-based methods offer superior cost-effectiveness compared to sensors like LiDAR
Technical Challenges: Accurate reconstruction of occluded regions and maintenance of geometric consistency remain technical bottlenecks

Limitations of Existing Methods

Structural Optimization Limitations: Existing methods primarily focus on network architecture optimization, overlooking thorough exploration and utilization of image information
Insufficient Feature Utilization: Emphasis on graphic features (position, size, color, shape) provides only partial semantic information
Missing Multi-Level Fusion: Lack of research on enhancing model understanding of 2D images through multi-level representation fusion

Research Motivation

The authors argue that the core of 3D perception lies in understanding three-dimensional spatial relationships, requiring:

Depth Features: As low-level features, carrying distortion and depth information to enhance spatial relationship understanding
Segmentation Features: Leveraging strong semantic representation capabilities of large foundation models (e.g., SAM)
Complementary Fusion: Effective fusion of different-level features to enhance 2D image understanding

Core Contributions

CIGOcc Framework: Proposes a novel two-stage framework utilizing multi-level representation fusion to address low target accuracy, achieving accurate 2D-to-3D reconstruction, particularly in distant scenes
Deformable Multi-Level Fusion Mechanism: Introduces a new fusion mechanism that adaptively and effectively fuses depth and semantic information, ensuring more comprehensive and accurate 3D reconstruction
State-of-the-Art Performance: Achieves state-of-the-art performance on camera-based SSC tasks, demonstrating effectiveness and robustness in complex real-world scenarios

Methodology Details

Task Definition

Input: Single RGB image I ∈ R^(C×H×W)
Output: Semantic voxel map Y ∈ R^(C×X×Y×Z), where each voxel is classified into one of 20 semantic categories
Objective: Infer complete 3D scene geometry and semantic information from 2D images
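
To make the output format concrete, the sketch below turns per-voxel class scores into a label per voxel by taking the highest-scoring class. It is only an illustration of the task definition: the flattened layout `scores[c][v]` and the helper name `voxel_labels` are assumptions, not part of the paper.

```python
from typing import List

NUM_CLASSES = 20  # number of semantic categories stated in the task definition


def voxel_labels(scores: List[List[float]]) -> List[int]:
    """scores[c][v] is the score of class c for flattened voxel v (assumed layout).

    Returns, for each voxel, the index of the highest-scoring class.
    """
    num_voxels = len(scores[0])
    labels = []
    for v in range(num_voxels):
        best = max(range(len(scores)), key=lambda c: scores[c][v])
        labels.append(best)
    return labels


# Toy example: 20 classes, 3 voxels (a real grid would hold X*Y*Z voxels).
toy = [[0.0] * 3 for _ in range(NUM_CLASSES)]
toy[5][0] = 1.0   # voxel 0 scores highest for class 5
toy[0][1] = 2.0   # voxel 1 scores highest for class 0
toy[19][2] = 0.5  # voxel 2 scores highest for class 19
print(voxel_labels(toy))
```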

Model Architecture

CIGOcc employs a two-stage architecture:

Stage One: Deformable Multimodal Fusion Network (DMFNet)

Feature Extraction:
- Generate depth map D_i ∈ R^(C×H×W) using MobileStereoNet
- Extract semantic features F_i ∈ R^(C×H×W) using Grounded-SAM
Initial Voxel Space Construction:
```
F_raw = DMF(F_i^(C×H×W), D_i^(C×H×W))
```
where DMF is an improved fusion method based on LMSCNet
Segmentation Head Prediction:
```
F_seg = SegHead(F_raw)
```
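
This summary does not spell out how DMF consumes the depth map. A common first step in depth-guided occupancy pipelines is to lift each depth pixel to a 3D point with the pinhole camera model before voxelizing, and the sketch below shows only that step. The intrinsics, the function names, and the assumption that CIGOcc lifts depth this way are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point.

    Pinhole model with x to the right, y down, z forward.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_depth_map(depth_map: List[List[float]],
                   fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Back-project every pixel that has a positive depth value."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:  # skip pixels without a valid depth estimate
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points


# Hypothetical intrinsics and a 2x2 depth map, for illustration only.
pts = lift_depth_map([[10.0, 0.0], [5.0, 20.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(pts)
```

The resulting points would then be binned into the voxel grid, where they can be paired with per-pixel semantic features.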

Stage Two: Complementary Information Guided Voxel Generation Network (CIGNet)

Image Feature Extraction: Extract features F_2D ∈ R^(×H×W×D) using ResNet50
Deformable Cross-Attention:
```
Q̂_s^3d = DCA(F_2D, Q_d)
```
where Q_d is the binary classification query obtained from stage one
Deformable Self-Attention:
```
V̂_s^3d = DSA(Q̂_s^3d, Q̂_s^3d)
```
Knowledge Distillation Module:
```
F_sem^2d = θ_s(F_2D)
```
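
The deformable attention steps above rest on one core operation: each query samples the feature map at a few offset locations around a reference point and combines the samples with normalized weights. The sketch below shows that operation for a single query on a single-channel map. In a real model the offsets and weight logits are predicted from the query by learned layers; here they are plain inputs, and all names are illustrative rather than taken from the CIGOcc code.

```python
import math
from typing import List, Sequence, Tuple


def bilinear(feat: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate a single-channel map at continuous (x, y).

    Neighbours that fall outside the map contribute zero (zero padding).
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the sampling points."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_sample(feat: List[List[float]],
                      ref: Tuple[float, float],
                      offsets: Sequence[Tuple[float, float]],
                      logits: Sequence[float]) -> float:
    """Weighted sum of features sampled at ref + offset_k."""
    weights = softmax(logits)
    return sum(w * bilinear(feat, ref[0] + dx, ref[1] + dy)
               for w, (dx, dy) in zip(weights, offsets))


feat = [[0.0, 1.0],
        [2.0, 3.0]]
out = deformable_sample(feat, ref=(0.5, 0.5),
                        offsets=[(0.0, 0.0), (0.5, 0.5)],
                        logits=[0.0, 0.0])
print(out)
```

Because each query touches only a handful of sampling points instead of every pixel, the cost grows with the number of queries and points rather than with the full image resolution.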

Technical Innovations

Multi-Level Feature Fusion: First systematic fusion of high-level segmentation features, mid-level graphic features, and low-level depth features
Large Model Knowledge Distillation: Effective distillation of Grounded-SAM knowledge into occupancy prediction tasks
Deformable Attention Mechanism: Employs deformable attention to process high-resolution images with reduced computational complexity
Two-Stage Training Strategy: Phased optimization of different-level feature fusion
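
The distillation idea listed above can be illustrated with a feature-matching loss: student features are projected (a stand-in for θ_s) and pulled toward fixed teacher features. The exact loss used by the paper is not given in this summary, so the mean-squared-error form, the linear projection, and every name below are assumptions.

```python
from typing import List

Vector = List[float]


def project(feature: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear projection standing in for theta_s: out = W @ feature + b."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weight, bias)]


def mse(student: Vector, teacher: Vector) -> float:
    """Mean squared error between two feature vectors of equal length."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def distillation_loss(student_feats: List[Vector],
                      teacher_feats: List[Vector],
                      weight: List[Vector],
                      bias: Vector) -> float:
    """Average per-location MSE; the teacher features act as fixed targets."""
    losses = [mse(project(f, weight, bias), t)
              for f, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)


# Two spatial locations with 2-channel features and an identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = distillation_loss(
    student_feats=[[1.0, 2.0], [3.0, 4.0]],
    teacher_feats=[[1.0, 2.0], [3.0, 6.0]],
    weight=identity,
    bias=[0.0, 0.0],
)
print(loss)
```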

Experimental Setup

Dataset

SemanticKITTI Dataset:

Dense semantic occupancy annotations based on KITTI Odometry benchmark
Coverage: 0-51.2 meters forward, ±25.6 meters lateral, -2 to 4.4 meters height
Voxel grid: 256×256×32, resolution 0.2 meters/voxel
20 semantic categories
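
The grid size follows directly from the coverage and resolution listed above: 51.2 / 0.2 = 256 cells forward, 51.2 / 0.2 = 256 lateral, and 6.4 / 0.2 = 32 in height. The sketch below checks that arithmetic and maps a metric point to its voxel index. Naming the forward, lateral, and height axes x, y, and z is an assumption made for illustration.

```python
from typing import Optional, Tuple

VOXEL_SIZE = 0.2  # metres per voxel, as listed above
# (min, max) extent per axis in metres; the axis naming is an assumption.
RANGES = {"x": (0.0, 51.2), "y": (-25.6, 25.6), "z": (-2.0, 4.4)}


def grid_shape() -> Tuple[int, ...]:
    """Number of voxels along each axis."""
    return tuple(round((hi - lo) / VOXEL_SIZE) for lo, hi in RANGES.values())


def voxel_index(x: float, y: float, z: float) -> Optional[Tuple[int, ...]]:
    """Map a metric point to integer voxel indices, or None if out of range."""
    idx = []
    for value, (lo, hi) in zip((x, y, z), RANGES.values()):
        if not (lo <= value < hi):
            return None
        idx.append(int((value - lo) / VOXEL_SIZE))  # value - lo >= 0, so int() floors
    return tuple(idx)


shape = grid_shape()
total = shape[0] * shape[1] * shape[2]
print(shape, total)
print(voxel_index(10.1, 0.1, 0.1))
print(voxel_index(60.0, 0.0, 0.0))
```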

Evaluation Metrics

Primary Metric: Mean Intersection over Union (mIoU)
Auxiliary Metrics: IoU, Precision, Recall
Special Evaluation: Small object performance, long-tail object performance
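
As a reminder of how the primary metric works, the sketch below computes per-class IoU and their mean from flattened predicted and ground-truth labels. It is a generic implementation, not the benchmark's official evaluation script. The `ignore_label` value of 255 for unlabeled voxels is an assumption, and classes that never occur are simply left out of the mean.

```python
from typing import List, Optional, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int],
                  num_classes: int, ignore_label: int = 255) -> List[Optional[float]]:
    """IoU per class = |pred AND target| / |pred OR target| over voxels."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:  # skip voxels without ground truth
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for class p
            union[t] += 1  # false negative for class t
    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(ious: Sequence[Optional[float]]) -> float:
    """Average IoU over the classes that occur in prediction or ground truth."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 255]
ious = per_class_iou(pred, target, num_classes=3)
print(ious, mean_iou(ious))
```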

Comparison Methods

Includes LMSCNet, 3DSketch, AICNet, JS3C-Net, MonoScene, VoxFormer, OccFormer, SurroundOcc, TPVFormer, SparseOcc, MonoOcc, and other mainstream methods

Implementation Details

Hardware: 4×RTX 3090 GPUs
Training Time: 20 epochs per stage, total 4.5+4.5=9 hours
Pretrained Weights: ViT-H HQ-SAM for Grounded-SAM, MSNet3D SFDS for MobileStereoNet
Backbone Network: ResNet50

Experimental Results

Main Results

Performance comparison on SemanticKITTI test set:

Method	mIoU	Improvement over VoxFormer-T
VoxFormer-T	13.41%	-
CIGOcc	14.90%	+1.49%

Key Performance Improvements:

Overall mIoU: 14.90% (SOTA)
Small Object Performance: +19.28% improvement
Long-Tail Object Performance: +35.20% improvement

Performance at Different Distance Ranges

Distance Range	CIGOcc mIoU	VoxFormer-T mIoU	Improvement
12.8m	23.81%	21.55%	+2.26%
25.6m	20.35%	18.42%	+1.93%
51.2m	14.90%	13.35%	+1.55%
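
The improvement column above is an absolute difference in mIoU (percentage points), which is easy to confuse with a relative gain. The short sketch below recomputes both from the table values. The numbers are copied from the table, and the helper name `gains` is illustrative.

```python
from typing import Dict, Tuple

# range in metres -> (CIGOcc mIoU, VoxFormer-T mIoU), copied from the table above
RESULTS: Dict[float, Tuple[float, float]] = {
    12.8: (23.81, 21.55),
    25.6: (20.35, 18.42),
    51.2: (14.90, 13.35),
}


def gains(ours: float, base: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of the baseline)."""
    absolute = round(ours - base, 2)
    relative = round(100 * (ours - base) / base, 1)
    return absolute, relative


for distance, (ours, base) in RESULTS.items():
    absolute, relative = gains(ours, base)
    print(f"{distance} m: +{absolute} points absolute, +{relative}% relative")
```

Read this way, the absolute gain shrinks with distance while the relative gain over the baseline stays at roughly 10 to 12 percent across all three ranges.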

Ablation Study

Component	mIoU	Impact
Complete Model	14.49%	-
Without Semantic Auxiliary Loss	14.10%	-0.39%
Without Fusion Features	13.85%	-0.64%
Without Grounded-SAM	13.63%	-0.86%

Case Analysis

Qualitative results demonstrate CIGOcc's superior performance in:

More precise scene voxel segmentation
Fewer voxel overlaps
More accurate road prediction
Better recognition of small objects and long-tail categories

Related Work

Semantic Scene Completion (SSC)

SSCNet: Processes sparse depth maps using 3D CNN
ESSCNet: Integrates multi-scale features
VoxFormer: Adopts two-stage Transformer architecture

Camera-Based 3D Perception

Monocular Depth Estimation: Monodepth, Monodepth2
Detection Transformers: DETR models
Multi-View Methods: BEVFormer, etc.

3D Occupancy Prediction

Transformer Architectures: VoxFormer, FB-Occ
Feature Fusion: Bidirectional feature processing of LSS+BEVFormer

Conclusions and Discussion

Main Conclusions

Effectiveness of Multi-Level Fusion: Systematic fusion of different-level features significantly improves performance
Large Model Knowledge Transfer: Grounded-SAM knowledge successfully transfers to occupancy prediction tasks
Computational Efficiency: Achieves SOTA performance while maintaining efficiency

Limitations

Training Resources: Two-stage training increases training time (+1 hour)
Memory Consumption: Increases GPU memory usage by 0.4 GB compared to baseline methods
Model Dependency: Relies on pretrained weights from Grounded-SAM and MobileStereoNet

Future Directions

End-to-End Optimization: Explore single-stage training strategies
Multi-Modal Fusion: Incorporate other sensor information
Real-Time Applications: Further optimize inference speed

In-Depth Evaluation

Strengths

Strong Innovation: First systematic approach to occupancy prediction from a multi-level representation fusion perspective
Sound Methodology: Clear theoretical analysis with well-articulated complementarity of different-level features
Comprehensive Experiments: Thorough ablation and comparative experiments validate method effectiveness
Outstanding Performance: Achieves SOTA on multiple metrics, particularly for small objects and long-tail categories

Weaknesses

Computational Complexity: Two-stage training increases training complexity
Strong Dependency: Heavily relies on pretrained large models
Generalization Analysis: Lacks validation on other datasets
Theoretical Analysis: Insufficient theoretical justification for why this fusion strategy is optimal

Impact

Academic Value: Provides new research directions for the occupancy prediction field
Practical Value: Direct application potential in autonomous driving scenarios
Reproducibility: Provides code and detailed implementation details

Applicable Scenarios

Autonomous Driving: Vehicle environmental perception and path planning
Robot Navigation: Indoor and outdoor environment understanding
AR/VR Applications: 3D scene reconstruction and understanding
Urban Planning: Vision-based 3D city modeling

References

This paper cites 46 relevant references, primarily covering:

Foundational semantic scene completion work (SSCNet, LMSCNet, etc.)
Transformer architecture applications (VoxFormer, BEVFormer, etc.)
Large-scale vision models (SAM, Grounded-SAM, etc.)
Depth estimation and 3D perception-related work

Summary: CIGOcc is an important contribution to the occupancy prediction field. Through an innovative multi-level feature fusion strategy and large-model knowledge distillation, it significantly improves performance while maintaining computational efficiency. This work opens new research directions for vision-based 3D perception and has both academic and practical significance.