2025-11-23T02:55:16.956845

Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion

Xu, Lin, Zhou et al.
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies approach the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc

Basic Information

  • Paper ID: 2510.13198
  • Title: Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
  • Authors: Rongtao Xu, Jinzhou Lin, Jialei Zhou, Jiahua Dong, Changwei Wang, Ruisheng Wang, Li Guo, Shibiao Xu, Xiaodan Liang
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13198v1
  • Code Link: https://github.com/VitaLemonTea1/CIGOcc

Abstract

Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantic information from 2D images. Existing methods primarily improve performance through structural modifications (such as lightweight backbone networks and complex cascaded frameworks), with limited effectiveness. Few studies explore representation fusion perspectives, resulting in insufficient utilization of rich feature diversity in 2D images. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphic, and depth features from input images and introduces a deformable multi-level fusion mechanism to fuse these three types of multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark.

Research Background and Motivation

Research Problem

The core problem addressed in this paper is camera-based 3D semantic scene completion (SSC), specifically how to accurately reconstruct occluded regions from 2D images while maintaining cross-camera geometric consistency.

Problem Significance

  1. Autonomous Driving Requirements: SSC is a critical solution for 3D perception in autonomous driving and robotics
  2. Cost-Effectiveness: Camera-based methods offer superior cost-effectiveness compared to sensors like LiDAR
  3. Technical Challenges: Accurate reconstruction of occluded regions and maintenance of geometric consistency remain technical bottlenecks

Limitations of Existing Methods

  1. Structural Optimization Limitations: Existing methods primarily focus on network architecture optimization, overlooking thorough exploration and utilization of image information
  2. Insufficient Feature Utilization: Emphasis on graphic features (position, size, color, shape) provides only partial semantic information
  3. Missing Multi-Level Fusion: Lack of research on enhancing model understanding of 2D images through multi-level representation fusion

Research Motivation

The authors argue that the core of 3D perception lies in understanding three-dimensional spatial relationships, requiring:

  • Depth Features: As low-level features, carrying distortion and depth information to enhance spatial relationship understanding
  • Segmentation Features: Leveraging strong semantic representation capabilities of large foundation models (e.g., SAM)
  • Complementary Fusion: Effective fusion of different-level features to enhance 2D image understanding

Core Contributions

  1. CIGOcc Framework: Proposes a novel two-stage framework utilizing multi-level representation fusion to address low target accuracy, achieving accurate 2D-to-3D reconstruction, particularly in distant scenes
  2. Deformable Multi-Level Fusion Mechanism: Introduces a new fusion mechanism that adaptively and effectively fuses depth and semantic information, ensuring more comprehensive and accurate 3D reconstruction
  3. State-of-the-Art Performance: Achieves state-of-the-art performance on camera-based SSC tasks, demonstrating effectiveness and robustness in complex real-world scenarios

Methodology Details

Task Definition

  • Input: Single RGB image I ∈ R^(C×H×W)
  • Output: Semantic voxel map Y ∈ R^(C×X×Y×Z), where each voxel is classified into one of 20 semantic categories
  • Objective: Infer complete 3D scene geometry and semantic information from 2D images
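
As a small illustration of the output format, the sketch below decodes per-class voxel scores into one label per voxel by taking the highest-scoring class. It assumes the voxel map stores one score per class and voxel in [class][x][y][z] order and that decoding is a plain argmax; the function name `voxel_labels` and the nested-list layout are illustrative choices, not taken from the paper.

```python
def voxel_labels(scores):
    """Pick the best class for every voxel.

    scores: nested list indexed [class][x][y][z] (assumed layout).
    Returns a nested list of class ids indexed [x][y][z].
    """
    num_classes = len(scores)
    nx = len(scores[0])
    ny = len(scores[0][0])
    nz = len(scores[0][0][0])
    labels = []
    for x in range(nx):
        plane = []
        for y in range(ny):
            row = []
            for z in range(nz):
                # argmax over the class axis for this voxel
                best = max(range(num_classes), key=lambda c: scores[c][x][y][z])
                row.append(best)
            plane.append(row)
        labels.append(plane)
    return labels


if __name__ == "__main__":
    # Two classes on a 1 x 1 x 2 toy grid.
    toy = [[[[0.1, 0.9]]], [[[0.8, 0.2]]]]
    print(voxel_labels(toy))
```

On the full SemanticKITTI grid the same decoding would run over 256×256×32 voxels and 20 classes, where a tensor library would replace the nested loops.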

Model Architecture

CIGOcc employs a two-stage architecture:

Stage One: Deformable Multimodal Fusion Network (DMFNet)

  1. Feature Extraction:
    • Generate depth map D_i ∈ R^(C×H×W) using MobileStereoNet
    • Extract semantic features F_i ∈ R^(C×H×W) using Grounded-SAM
  2. Initial Voxel Space Construction:
    F_raw = DMF(F_i^(C×H×W), D_i^(C×H×W))
    

    where DMF is an improved fusion method based on LMSCNet
  3. Segmentation Head Prediction:
    F_seg = SegHead(F_raw)
    

Stage Two: Complementary Information Guided Voxel Generation Network (CIGNet)

  1. Image Feature Extraction: Extract features F_2D ∈ R^(×H×W×D) using ResNet50
  2. Deformable Cross-Attention:
    Q_s^3d = DCA(F_2D, Q_d)
    

    where Q_d is the binary classification query obtained from stage one
  3. Deformable Self-Attention:
    V̂_s^3d = DSA(Q̂_s^3d, Q̂_s^3d)
    
  4. Knowledge Distillation Module:
    F_sem^2d = θ_s(F_2D)
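
The deformable attention steps above follow the general pattern of sampling a feature map at a few locations around a reference point instead of attending to every pixel. The sketch below shows that generic pattern for one query on a single-channel feature map. It is not the authors' DCA or DSA implementation: in a real model the offsets and attention logits are predicted from the query by learned layers, whereas here they are passed in directly, and all function names are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H x W list of floats) at pixel coords (x, y).

    Locations outside the map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    value = 0.0
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yi < h and 0 <= xi < w:
                value += wy * wx * feat[yi][xi]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attend(feat, ref_xy, offsets, logits):
    """Weighted sum of K samples taken at ref_xy + offset_k.

    offsets: list of (dx, dy); logits: list of K unnormalized weights.
    Both are assumed inputs here (normally predicted from the query).
    """
    weights = softmax(logits)
    rx, ry = ref_xy
    return sum(
        w * bilinear_sample(feat, rx + dx, ry + dy)
        for w, (dx, dy) in zip(weights, offsets)
    )


if __name__ == "__main__":
    feat = [[0.0, 1.0], [2.0, 3.0]]
    print(deformable_attend(feat, (0.0, 0.0), [(0.0, 0.0), (1.0, 1.0)], [0.0, 0.0]))
```

The cost per query scales with the number of sampled points K rather than with H×W, which is the usual reason deformable attention is preferred for high-resolution feature maps.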
    

Technical Innovations

  1. Multi-Level Feature Fusion: First systematic fusion of high-level segmentation features, mid-level graphic features, and low-level depth features
  2. Large Model Knowledge Distillation: Effective distillation of Grounded-SAM knowledge into occupancy prediction tasks
  3. Deformable Attention Mechanism: Employs deformable attention to process high-resolution images with reduced computational complexity
  4. Two-Stage Training Strategy: Phased optimization of different-level feature fusion

Experimental Setup

Dataset

SemanticKITTI Dataset:

  • Dense semantic occupancy annotations based on KITTI Odometry benchmark
  • Coverage: 0-51.2 meters forward, ±25.6 meters lateral, -2 to 4.4 meters height
  • Voxel grid: 256×256×32, resolution 0.2 meters/voxel
  • 20 semantic categories
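
The grid size follows directly from the coverage and the voxel resolution listed above. The sketch below checks that arithmetic and maps a metric point to a voxel index. The axis order (forward, lateral, height), the choice of origin at the minimum corner, and the function names are assumptions made for illustration.

```python
import math

VOXEL_SIZE = 0.2                 # meters per voxel
RANGE_MIN = (0.0, -25.6, -2.0)   # forward, lateral, height (assumed axis order)
RANGE_MAX = (51.2, 25.6, 4.4)


def grid_shape():
    """Number of voxels along each axis, derived from range and resolution."""
    return tuple(
        round((hi - lo) / VOXEL_SIZE) for lo, hi in zip(RANGE_MIN, RANGE_MAX)
    )


def point_to_voxel(x, y, z):
    """Return the (i, j, k) voxel index of a point, or None if out of range."""
    index = []
    for value, lo, n in zip((x, y, z), RANGE_MIN, grid_shape()):
        i = math.floor((value - lo) / VOXEL_SIZE)
        if not 0 <= i < n:
            return None
        index.append(i)
    return tuple(index)


if __name__ == "__main__":
    print(grid_shape())
    print(point_to_voxel(10.1, 0.1, 0.1))
```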

Evaluation Metrics

  • Primary Metric: Mean Intersection over Union (mIoU)
  • Auxiliary Metrics: IoU, Precision, Recall
  • Special Evaluation: Small object performance, long-tail object performance
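
For reference, per-class IoU is the intersection of predicted and ground-truth voxels of a class divided by their union, and mIoU is the mean over classes. The sketch below is a minimal version of that computation on flat label lists. The ignore label 255 for unlabeled voxels and the decision to skip classes that never appear are assumptions; which classes enter the mean (for example, whether the empty class is excluded) is left to the caller.

```python
def per_class_iou(pred, target, num_classes, ignore_label=255):
    """IoU per class from flat lists of predicted and ground-truth class ids.

    Voxels whose ground truth equals ignore_label are skipped (assumed
    convention). Classes absent from both lists get None.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # a mismatch enlarges the union of both classes involved
            union[p] += 1
            union[t] += 1
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


def mean_iou(ious):
    """Mean over the classes that have a defined IoU."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


if __name__ == "__main__":
    pred = [0, 1, 1, 2, 2, 2]
    target = [0, 1, 2, 2, 2, 255]
    ious = per_class_iou(pred, target, num_classes=3)
    print(ious, mean_iou(ious))
```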

Comparison Methods

Includes LMSCNet, 3DSketch, AICNet, JS3C-Net, MonoScene, VoxFormer, OccFormer, SurroundOcc, TPVFormer, SparseOcc, MonoOcc, and other mainstream methods

Implementation Details

  • Hardware: 4×RTX 3090 GPUs
  • Training Time: 20 epochs per stage, total 4.5+4.5=9 hours
  • Pretrained Weights: ViT-H HQ-SAM for Grounded-SAM, MSNet3D SFDS for MobileStereoNet
  • Backbone Network: ResNet50

Experimental Results

Main Results

Performance comparison on SemanticKITTI test set:

Method         mIoU      Improvement over VoxFormer-T
VoxFormer-T    13.41%    -
CIGOcc         14.90%    +1.49%

Key Performance Improvements:

  • Overall mIoU: 14.90% (SOTA)
  • Small Object Performance: +19.28% improvement
  • Long-Tail Object Performance: +35.20% improvement

Performance at Different Distance Ranges

Distance Range    CIGOcc mIoU    VoxFormer-T mIoU    Improvement
12.8m             23.81%         21.55%              +2.26%
25.6m             20.35%         18.42%              +1.93%
51.2m             14.90%         13.35%              +1.55%
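
The Improvement column in the table above is an absolute difference in mIoU (percentage points), which is not the same as a relative gain over the baseline. The sketch below recomputes both from the reported values so the two can be told apart; the helper name `gains` is illustrative.

```python
# (range, CIGOcc mIoU %, VoxFormer-T mIoU %) copied from the table above
ROWS = [
    ("12.8m", 23.81, 21.55),
    ("25.6m", 20.35, 18.42),
    ("51.2m", 14.90, 13.35),
]


def gains(ours, baseline):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


if __name__ == "__main__":
    for name, ours, base in ROWS:
        absolute, relative = gains(ours, base)
        print(f"{name}: +{absolute} points, +{relative}% relative")
```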

Ablation Study

Component                          mIoU      Impact
Complete Model                     14.49%    -
Without Semantic Auxiliary Loss    14.10%    -0.39%
Without Fusion Features            13.85%    -0.64%
Without Grounded-SAM               13.63%    -0.86%

Case Analysis

Qualitative results demonstrate CIGOcc's superior performance in:

  • More precise scene voxel segmentation
  • Fewer voxel overlaps
  • More accurate road prediction
  • Better recognition of small objects and long-tail categories

Related Work

Semantic Scene Completion (SSC)

  • SSCNet: Processes sparse depth maps using 3D CNN
  • ESSCNet: Integrates multi-scale features
  • VoxFormer: Adopts two-stage Transformer architecture

Camera-Based 3D Perception

  • Monocular Depth Estimation: Monodepth, Monodepth2
  • Detection Transformers: DETR models
  • Multi-View Methods: BEVFormer, etc.

3D Occupancy Prediction

  • Transformer Architectures: VoxFormer, FB-Occ
  • Feature Fusion: Bidirectional feature processing of LSS+BEVFormer

Conclusions and Discussion

Main Conclusions

  1. Effectiveness of Multi-Level Fusion: Systematic fusion of different-level features significantly improves performance
  2. Large Model Knowledge Transfer: Grounded-SAM knowledge successfully transfers to occupancy prediction tasks
  3. Computational Efficiency: Achieves SOTA performance while maintaining efficiency

Limitations

  1. Training Resources: Two-stage training increases training time (+1 hour)
  2. Memory Consumption: Increases GPU memory by 0.4G compared to baseline methods
  3. Model Dependency: Relies on pretrained weights from Grounded-SAM and MobileStereoNet

Future Directions

  1. End-to-End Optimization: Explore single-stage training strategies
  2. Multi-Modal Fusion: Incorporate other sensor information
  3. Real-Time Applications: Further optimize inference speed

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic approach to occupancy prediction from a multi-level representation fusion perspective
  2. Sound Methodology: Clear theoretical analysis with well-articulated complementarity of different-level features
  3. Comprehensive Experiments: Thorough ablation and comparative experiments validate method effectiveness
  4. Outstanding Performance: Achieves SOTA on multiple metrics, particularly for small objects and long-tail categories

Weaknesses

  1. Computational Complexity: Two-stage training increases training complexity
  2. Strong Dependency: Heavily relies on pretrained large models
  3. Generalization Analysis: Lacks validation on other datasets
  4. Theoretical Analysis: Insufficient theoretical justification for why this fusion strategy is optimal

Impact

  1. Academic Value: Provides new research directions for the occupancy prediction field
  2. Practical Value: Direct application potential in autonomous driving scenarios
  3. Reproducibility: Provides code and detailed implementation information

Applicable Scenarios

  1. Autonomous Driving: Vehicle environmental perception and path planning
  2. Robot Navigation: Indoor and outdoor environment understanding
  3. AR/VR Applications: 3D scene reconstruction and understanding
  4. Urban Planning: Vision-based 3D city modeling

References

This paper cites 46 relevant references, primarily covering:

  • Foundational semantic scene completion work (SSCNet, LMSCNet, etc.)
  • Transformer architecture applications (VoxFormer, BEVFormer, etc.)
  • Large-scale vision models (SAM, Grounded-SAM, etc.)
  • Depth estimation and 3D perception-related work

Summary: CIGOcc is an important contribution to the occupancy prediction field. Through an innovative multi-level feature fusion strategy and knowledge distillation from a large model, it raises SemanticKITTI test mIoU from 13.41% (VoxFormer-T) to 14.90%, at a reported cost of about one extra hour of training and 0.4G of additional GPU memory. This work provides new research directions for vision-based 3D perception and holds both academic and practical significance.