2025-11-12T10:52:10.099968

Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation

Sui, Lichau, Lefèvre et al.

Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging, with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy, to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.


Basic Information

Paper ID: 2405.13571
Title: Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation
Authors: Wenbo Sui, Daniel Lichau, Josselin Lefèvre, Harold Phelippeau
Category: cs.CV
Published Journal: Information Fusion 126 (2026) 103572
Paper Link: https://arxiv.org/abs/2405.13571
Code Link: https://github.com/evenrose/CMDIAD

Abstract

This paper addresses a practical challenge in industrial anomaly detection: in real production lines, inspecting every sample with every modality is infeasible due to cost and time constraints. The authors propose the CMDIAD framework, which implements a multimodal training, few-modal inference (MTFI) pipeline. Through cross-modal knowledge distillation, the model learns from complete multimodal data during training and, at inference, uses only a subset of the modalities while outperforming baselines trained and tested on that single modality alone.

Research Background and Motivation

Problem Definition

In industrial anomaly detection, existing multimodal methods typically require complete modal information during both training and inference. However, in real production environments:

Cost Constraints: High-resolution detection technologies (e.g., industrial CT, electron microscopy) are expensive and time-consuming
Practical Limitations: Only a subset of samples can be inspected with all modalities, while most samples are assessed through only one or two rapid in-line inspection methods
Insufficient Data Utilization: Existing methods cannot fully leverage multimodal information from the training phase to improve single-modal inference performance

Research Significance

This problem is critical in practical industrial scenarios such as lithium battery and composite material production. Solving it enables:

Reduction of quality control costs
Improved detection efficiency
Full utilization of limited multimodal training data

Limitations of Existing Methods

Complete Modal Dependency: Existing multimodal IAD methods require complete modalities during both training and inference
Limited Missing Modal Handling: Few studies address missing modalities, primarily employing simple late fusion strategies
Information Waste: Unable to leverage multimodal information from training to improve single-modal inference performance

Core Contributions

First Incomplete Multimodal IAD: To the authors' knowledge, this is the first work addressing industrial anomaly detection with incomplete multimodal data
CMDIAD Framework: Proposes a novel multimodal IAD framework based on cross-modal distillation, enabling multimodal training with few-modal inference
MTFI Pipeline: Demonstrates the feasibility and effectiveness of the multimodal training, few-modal inference pipeline
Modal Correlation Analysis: Provides in-depth analysis of information transfer mechanisms between different modalities, offering guidance for future dataset construction

Methodology Details

Task Definition

Input: Paired RGB images and 3D point clouds during training; single modality (RGB or point cloud) during inference
Output: Image-level and pixel-level anomaly detection results
Objective: Enable single-modal inference performance to exceed baseline methods using only that modality for training and inference

Model Architecture

1. Feature Extraction Module

RGB Feature Extraction: Uses a pre-trained DINO ViT-B/8 to extract RGB features, producing a feature map in R^(2H_f × 2W_f × d_1)
Point Cloud Feature Extraction: Uses Point-MAE to extract point cloud features, obtaining RGB-aligned feature maps through FPS sampling and IDW interpolation
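
The point cloud branch above yields features only at the sampled group centers, which are then spread back onto an RGB-aligned grid by inverse distance weighting (IDW). The following is a minimal sketch of that interpolation step in plain Python. The function name idw_interpolate and the parameters k, power, and eps are illustrative assumptions, not values taken from the paper.

```python
import math

def idw_interpolate(query, centers, features, k=3, power=1.0, eps=1e-8):
    """Interpolate a feature vector at `query` from its k nearest centers.

    query:    (x, y, z) position that needs a feature
    centers:  list of (x, y, z) positions that already carry features
    features: list of feature vectors, one per center
    k, power, eps: hypothetical settings chosen for illustration
    """
    # Distance from the query point to every feature-carrying center.
    dists = [math.dist(query, c) for c in centers]
    # Indices of the k nearest centers.
    nearest = sorted(range(len(centers)), key=lambda i: dists[i])[:k]
    # Inverse-distance weights; eps avoids division by zero on exact hits.
    weights = [1.0 / (dists[i] ** power + eps) for i in nearest]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * features[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]

centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# A point halfway between the first two centers gets an equal blend of both.
midpoint = idw_interpolate((0.5, 0.0, 0.0), centers, features, k=2)
```

A query that coincides with a center receives essentially that center's feature, because its weight dominates the sum.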

2. Cross-Modal Distillation Network

Proposes three distillation pathways, illustrated here for hallucinating RGB from point clouds:

Feature-to-Feature (F2F):

H^{f,(i,j)}_RGB = F2F(R^{(i,j)}_PC)

Uses a three-layer MLP to learn a direct mapping from one modality's feature space to the other's, applied at each patch position (i,j).

Feature-to-Input (F2I):

H^f_RGB = ℱ_RGB(H^i_RGB), H^i_RGB = F2I(R_PC)

Generates the other modality's input from one modality's features, then passes that hallucinated input through the target modality's feature extractor ℱ_RGB.

Input-to-Feature (I2F):

H^f_RGB = I2F(I_PC)

Generates target-modality features directly from the source modality's raw input.
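
To make the F2F pathway concrete, here is a minimal sketch of a three-layer MLP that maps one point cloud patch feature to a hallucinated RGB patch feature. Everything beyond "three-layer MLP" is an assumption for illustration: the ReLU activation, the toy layer sizes, the random initialization, and the mean squared error used as a distillation signal are not stated in this summary.

```python
import random

def linear(x, weight, bias):
    """Compute W x + b for a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Random weights and zero bias (hypothetical initialization)."""
    scale = 1.0 / n_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def f2f_forward(pc_feature, layers):
    """Map one point cloud patch feature to a hallucinated RGB patch feature."""
    h = pc_feature
    for idx, (weight, bias) in enumerate(layers):
        h = linear(h, weight, bias)
        if idx < len(layers) - 1:  # no activation after the output layer
            h = relu(h)
    return h

def mse(a, b):
    """Assumed distillation signal: mean squared error to the real feature."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
d_pc, d_hidden, d_rgb = 6, 8, 4  # toy sizes, far smaller than real features
layers = [
    init_layer(d_pc, d_hidden, rng),
    init_layer(d_hidden, d_hidden, rng),
    init_layer(d_hidden, d_rgb, rng),
]
pc_patch = [rng.random() for _ in range(d_pc)]    # stands in for R_PC at (i, j)
rgb_patch = [rng.random() for _ in range(d_rgb)]  # stands in for the real RGB feature
hallucinated = f2f_forward(pc_patch, layers)      # stands in for H^f_RGB at (i, j)
loss = mse(hallucinated, rgb_patch)
```

At inference only the point cloud feature is available, so the hallucinated vector takes the place of the missing RGB feature.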

3. Memory Bank Construction

Uses greedy algorithm for coreset selection:

p_{i+1} = arg max_{p_j∈S,i≠j} D_c(p_i, p_j)

Employs sparse random projection for dimensionality reduction to improve computational efficiency.
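
The greedy selection above can be sketched as follows. Note the assumption: the formula in this summary shows a single pairwise distance, whereas this sketch uses the common farthest-point (k-center greedy) form, in which each new point maximizes its distance to the closest already selected point. The sparse random projection step is omitted, and the function name greedy_coreset is hypothetical.

```python
import math

def greedy_coreset(points, n_select, start=0):
    """Pick n_select indices by farthest-point (k-center greedy) selection."""
    n_select = min(n_select, len(points))
    selected = [start]
    # min_dist[j] = distance from point j to its closest selected point.
    min_dist = [math.dist(points[start], p) for p in points]
    while len(selected) < n_select:
        # The next pick is the point farthest from everything selected so far.
        nxt = max(range(len(points)), key=lambda j: min_dist[j])
        selected.append(nxt)
        for j, p in enumerate(points):
            min_dist[j] = min(min_dist[j], math.dist(points[nxt], p))
    return selected

# Two tight clusters plus one isolated patch feature (toy 2D data).
patches = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
core = greedy_coreset(patches, n_select=3)
```

With three slots, the selection covers both clusters and the isolated point instead of spending two slots on near-duplicates.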

4. Decision Layer Fusion

Uses two one-class support vector machines, one for classification (C_c) and one for segmentation (C_s), each taking the weighted point cloud and RGB scores as input:

c = C_c(αψ(F_PC, M_PC), βψ(F_RGB, M_RGB))
s = C_s(αφ(F_PC, M_PC), βφ(F_RGB, M_RGB))
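
A minimal sketch of how the input to the classifier C_c could be assembled is shown below. The exact form of ψ is not given in this summary, so the sketch assumes a nearest-neighbor distance to the memory bank with a maximum over patches. The one-class SVM itself is not implemented here, and the weights alpha and beta are placeholder values.

```python
import math

def nn_distance(feature, memory_bank):
    """Distance from one patch feature to its nearest memory bank entry."""
    return min(math.dist(feature, m) for m in memory_bank)

def image_score(patch_features, memory_bank):
    """Assumed form of psi: the largest patch-to-bank distance in the image."""
    return max(nn_distance(f, memory_bank) for f in patch_features)

def fusion_input(pc_feats, pc_bank, rgb_feats, rgb_bank, alpha=1.0, beta=1.0):
    """Weighted score pair that a one-class SVM such as C_c would consume."""
    return [
        alpha * image_score(pc_feats, pc_bank),
        beta * image_score(rgb_feats, rgb_bank),
    ]

pc_bank = [(0.0, 0.0), (1.0, 0.0)]   # toy point cloud memory bank M_PC
rgb_bank = [(0.0, 0.0)]              # toy RGB memory bank M_RGB
pc_feats = [(0.0, 0.0), (1.0, 3.0)]  # F_PC for one test sample
rgb_feats = [(0.0, 4.0)]             # F_RGB, hallucinated in the MTFI setting
pair = fusion_input(pc_feats, pc_bank, rgb_feats, rgb_bank, alpha=0.5, beta=2.0)
```

The segmentation branch would follow the same pattern with per-patch scores (φ) in place of the per-image maximum.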

Technical Innovations

Cross-Modal Hallucination Generation: Learns cross-modal mappings to generate "hallucinated" features of missing modalities during inference
Multi-Path Distillation Strategy: Provides three distillation methods at different levels, balancing computational complexity and performance
Asymmetric Performance Analysis: Provides in-depth analysis of performance differences across different distillation directions and their underlying causes

Experimental Setup

Datasets

MVTec 3D-AD: Contains 10 object classes, 3-5 defect types per class, with pixel-level binary annotations
Eyecandies: Synthetic RGB+3D anomaly detection dataset

Evaluation Metrics

I-AUROC: Area under ROC curve for image-level anomaly detection
P-AUROC: Area under ROC curve for pixel-level anomaly detection
AUPRO: Average area under per-region-overlap curve, reducing impact of anomaly size on evaluation
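
For reference, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The sketch below is a plain-Python illustration of that definition, not the evaluation code used in the paper. The same function applies to I-AUROC (one score per image) and P-AUROC (one score per pixel).

```python
def auroc(scores, labels):
    """AUROC from pairwise comparisons; labels are 1 (anomalous) or 0 (normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: one anomalous sample scores below one normal sample.
value = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The quadratic loop is fine for small examples; a rank-based formulation is the usual choice for pixel-level evaluation.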

Comparison Methods

DualBanksPCs/RGB: Dual memory bank method using single modality only
Shape-guided: SOTA method specifically designed for point clouds
M3DM: Multimodal memory bank method
AST: Asymmetric student-teacher network

Implementation Details

Optimizer: Adam, batch size 32, 10 warmup epochs
Learning Rates: 0.0005 for F2F and F2I, 0.0003 for I2F
Training Epochs: 100 with early stopping based on validation set
Hardware: NVIDIA RTX A6000, 256GB memory
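
The warmup and early stopping settings above could look like the sketch below. The summary gives only the number of warmup epochs and the base learning rates, so the linear warmup shape, the patience value, and the use of validation loss as the stopping criterion are all assumptions made for illustration.

```python
def warmup_lr(epoch, base_lr=0.0005, warmup_epochs=10):
    """Learning rate for a 0-indexed epoch under an assumed linear warmup."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

def early_stop_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop, or None.

    patience is a hypothetical setting: stop after this many consecutive
    epochs without a new best validation loss.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Learning rate over the first 12 of 100 epochs for the F2F/F2I setting.
schedule = [warmup_lr(e) for e in range(12)]
stop_at = early_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=3)
```

For the I2F network, base_lr would be 0.0003 instead of 0.0005.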

Experimental Results

Main Results

MTFI Pipeline (Point Cloud Inference) Performance:

The F2F method achieves I-AUROC 0.938 and AUPRO 0.934 on MVTec 3D-AD
Compared to the DualBanksPCs baseline, I-AUROC improves by 7.8 percentage points (0.860 to 0.938) and AUPRO by 2.3 points (0.911 to 0.934)
Exceeds the SOTA Shape-guided method by 2.2 percentage points in I-AUROC (0.916 to 0.938)

Performance Comparison Table:

Method	I-AUROC	AUPRO
Shape-guided	0.916	0.931
DualBanksPCs	0.860	0.911
Ours F2F	0.938	0.934
Ours F2I	0.863	0.912
Ours I2F	0.820	0.942
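
The gains quoted for F2F follow directly from the table. The short check below recomputes them as absolute differences in percentage points, using only the values listed above. The helper name gain_points is illustrative.

```python
# (I-AUROC, AUPRO) pairs copied from the comparison table.
results = {
    "Shape-guided": (0.916, 0.931),
    "DualBanksPCs": (0.860, 0.911),
    "Ours F2F": (0.938, 0.934),
}

def gain_points(method, baseline, table):
    """Absolute gain in percentage points for (I-AUROC, AUPRO)."""
    return tuple(
        round((m - b) * 100, 1) for m, b in zip(table[method], table[baseline])
    )

vs_baseline = gain_points("Ours F2F", "DualBanksPCs", results)
vs_shape_guided = gain_points("Ours F2F", "Shape-guided", results)
```

These are absolute differences between scores on a 0 to 1 scale, not relative improvements.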

Asymmetric Performance Phenomenon

MTFI Pipeline (RGB Inference):

Only marginal improvements: with the F2F method, I-AUROC rises from 0.851 to 0.856
This indicates that hallucinating point cloud information from RGB is of limited effectiveness

Ablation Studies

Different Feature Extractors: Validates method generalizability on ViT-S/8, ViT-B/8-in21k, and Point-Bert
Distance Metric Comparison: L2 distance performs best in most cases
Coreset Ratio: 10% coreset selection ratio achieves optimal performance balance

Case Analysis

Through visualization analysis, findings include:

Texture Anomalies: For Cable Gland's "thread" anomaly, the point cloud shows minimal shape change while the RGB image shows obvious texture differences
Shape Anomalies: For "bent" anomalies, spatial information is required, which RGB images struggle to provide
Composite Anomalies: Cookie's "crack" and Foam's "contamination" anomalies require synergistic multimodal information

Related Work

Unsupervised 2D Industrial Anomaly Detection

Feature Embedding Methods: Student-teacher architectures, one-class classification, feature distribution mapping
Reconstruction Methods: Autoencoders, GANs, diffusion models
Memory Bank Methods: PatchCore and similar approaches selecting and storing normal features for comparison

3D and Multimodal RGB-3D Industrial Anomaly Detection

AST: Asymmetric student-teacher network preventing student network from learning anomalies
M3DM: Multimodal memory bank method using pre-trained feature extractors
DADA: Learning joint RGB-3D representations

Cross-Modal Learning in Other Domains

Video Action Recognition: RGB-D cross-modal hallucination networks
Medical Image Segmentation: Learning strategies for handling missing modalities
Salient Object Detection: Cross-modal feature learning

Conclusions and Discussion

Main Conclusions

MTFI Pipeline Feasibility: Demonstrates effectiveness of multimodal training with few-modal inference
Asymmetric Performance: Significant improvements in point cloud inference vs. marginal gains in RGB inference
Information Transfer Mechanisms: Shared texture information can transfer across modalities, but spatial information is difficult to infer from RGB

Limitations

Pre-training Dependency: Relies on pre-trained feature extractors from large-scale datasets
Data Requirements: Requires substantial aligned multimodal training data
Computational Overhead: Two-stage training increases computational complexity
Modal Limitations: Currently validated only on RGB and point cloud modalities

Future Directions

Extension to More Modalities: Ultrasound, infrared, and other industrial detection modalities
Reduced Pre-training Dependency: Explore methods not relying on large-scale pre-training
Practical Deployment: Data collection and validation in real industrial scenarios

In-Depth Evaluation

Strengths

Significant Practical Value: Addresses genuine pain points in the industrial sector
Novel Methodology: First application of cross-modal distillation to incomplete multimodal IAD
Comprehensive Experiments: Validates method effectiveness across multiple datasets and feature extractors
In-Depth Analysis: Provides reasonable explanations for asymmetric performance phenomena
High Engineering Value: F2F method has low computational overhead, suitable for practical deployment

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical analysis of cross-modal information transfer
Dataset Limitations: Primarily validated on synthetic and laboratory data, lacking real industrial environment verification
Modal Extensibility: Method currently limited to RGB and point clouds, extensibility to other modalities unknown
Hyperparameter Sensitivity: Requires learning rate adjustments for different distillation networks

Impact

Academic Contribution: Provides new research direction for incomplete multimodal learning
Practical Value: Offers more cost-effective solutions for industrial quality control
Reproducibility: Provides open-source code facilitating reproduction and extension
Inspirational Value: Provides reference for incomplete multimodal problems in other domains

Applicable Scenarios

Industrial Quality Control: Particularly for high-value products like lithium batteries and composite materials
Medical Diagnosis: Scenarios with multiple imaging modalities but cost constraints
Autonomous Driving: Sensor failure or cost optimization scenarios
Security Surveillance: Multi-modal sensor deployment with maintenance cost considerations

References

This paper cites 67 relevant references, primarily including:

Classical methods in industrial anomaly detection (PatchCore, M3DM, etc.)
Related work on cross-modal knowledge distillation
Foundational methods in 3D point cloud processing and multimodal learning
Original papers of important datasets such as MVTec 3D-AD

Overall Assessment: This is a high-quality paper addressing practical industrial problems. The proposed CMDIAD framework possesses significant theoretical and practical value. While there is room for improvement in theoretical analysis and real-world scenario validation, its innovation and practicality make it an important contribution to the field.