Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation

Sui, Lichau, Lefèvre et al.
Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging, with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy, to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must be able to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD, to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline utilizes incomplete multimodal information more effectively than training and inference on a single modality. Moreover, we investigate the reasons behind the asymmetric performance improvement when point clouds or RGB images are used as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.

Basic Information

  • Paper ID: 2405.13571
  • Title: Incomplete Multimodal Industrial Anomaly Detection via Cross-Modal Distillation
  • Authors: Wenbo Sui, Daniel Lichau, Josselin Lefèvre, Harold Phelippeau
  • Category: cs.CV
  • Published Journal: Information Fusion 126 (2026) 103572
  • Paper Link: https://arxiv.org/abs/2405.13571
  • Code Link: https://github.com/evenrose/CMDIAD

Abstract

This paper addresses a practical challenge in industrial anomaly detection: in real production lines, inspecting every sample with all modalities is infeasible due to cost and time constraints. The authors propose the CMDIAD framework, which implements a multimodal training, few-modal inference (MTFI) pipeline. Through cross-modal knowledge distillation, the model leverages complete multimodal data during training and, at inference, uses only a subset of modalities while outperforming a model trained and tested on that single modality alone.

Research Background and Motivation

Problem Definition

In industrial anomaly detection, existing multimodal methods typically require complete modal information during both training and inference. However, in real production environments:

  1. Cost Constraints: High-resolution detection technologies (e.g., industrial CT, electron microscopy) are expensive and time-consuming
  2. Practical Limitations: Only a subset of samples can undergo full-modal inspection, while most samples can only be assessed through one or two rapid in-line inspection methods
  3. Insufficient Data Utilization: Existing methods cannot fully leverage multimodal information from the training phase to improve single-modal inference performance

Research Significance

This problem is critical in practical industrial scenarios such as lithium battery and composite material production. Solving it enables:

  • Reduction of quality control costs
  • Improved detection efficiency
  • Full utilization of limited multimodal training data

Limitations of Existing Methods

  1. Complete Modal Dependency: Existing multimodal IAD methods require complete modalities during both training and inference
  2. Limited Missing Modal Handling: Few studies address missing modalities, primarily employing simple late fusion strategies
  3. Information Waste: Unable to leverage multimodal information from training to improve single-modal inference performance

Core Contributions

  1. First Incomplete Multimodal IAD: To the authors' knowledge, this is the first work addressing industrial anomaly detection with incomplete multimodal data
  2. CMDIAD Framework: Proposes a novel multimodal IAD framework based on cross-modal distillation, enabling multimodal training with few-modal inference
  3. MTFI Pipeline: Demonstrates the feasibility and effectiveness of the multimodal training, few-modal inference pipeline
  4. Modal Correlation Analysis: Provides in-depth analysis of information transfer mechanisms between different modalities, offering guidance for future dataset construction

Methodology Details

Task Definition

  • Input: Paired RGB images and 3D point clouds during training; single modality (RGB or point cloud) during inference
  • Output: Image-level and pixel-level anomaly detection results
  • Objective: Enable single-modal inference performance to exceed baseline methods using only that modality for training and inference

Model Architecture

1. Feature Extraction Module

  • RGB Feature Extraction: Uses pre-trained DINO ViT-B/8 to extract RGB features, yielding a feature map in R^(2H_f×2W_f×d_1)
  • Point Cloud Feature Extraction: Uses Point-MAE to extract point cloud features, obtaining RGB-aligned feature maps through FPS sampling and IDW interpolation
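
The interpolation step can be illustrated with a minimal sketch. It assumes the group centers have already been chosen by FPS and that each center carries one Point-MAE feature vector; the function name `idw_interpolate`, the exponent `power`, and the toy values are illustrative assumptions, not the authors' code.

```python
import math

def idw_interpolate(query, centers, features, power=2.0, eps=1e-8):
    """Inverse-distance-weighted average of per-center feature vectors at `query`."""
    # Closer centers get larger weights; eps avoids division by zero.
    weights = [1.0 / (math.dist(query, c) ** power + eps) for c in centers]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, features)) / total
        for d in range(dim)
    ]

# Toy data: three centers (assumed to come from FPS) with 2-D features.
centers = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 0.0, 0.0)]
center_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Feature for a 3D point lying midway between the first two centers.
point_feature = idw_interpolate((0.0, 1.0, 0.0), centers, center_features)
```

Applying the same function to every 3D point that projects onto a pixel would give a dense feature map on the RGB grid, which is what makes patch-wise comparison between the two modalities possible.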

2. Cross-Modal Distillation Network

Proposes three distillation pathways:

Feature-to-Feature (F2F):

H^{f,(i,j)}_RGB = F2F(R^(i,j)_PC)

Uses a three-layer MLP to map the point cloud feature R^(i,j)_PC at patch location (i,j) directly to the hallucinated RGB feature at the same location.
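
A per-patch F2F mapping of this kind can be sketched in plain Python. The hidden width, the ReLU activations, the initialization, and the mean-squared-error distillation loss are assumptions made for illustration; the summary only states that a three-layer MLP is used.

```python
import random

def linear(x, weight, bias):
    """y = W x + b for a single feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero bias (illustrative initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def f2f(pc_feature, layers):
    """Three-layer MLP: point cloud patch feature -> hallucinated RGB patch feature."""
    h = pc_feature
    for k, (weight, bias) in enumerate(layers):
        h = linear(h, weight, bias)
        if k < len(layers) - 1:  # no activation on the output layer
            h = relu(h)
    return h

def mse(a, b):
    """Assumed distillation loss between hallucinated and real features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
d_pc, d_hidden, d_rgb = 6, 8, 4  # toy sizes, not the paper's dimensions
layers = [
    init_layer(d_pc, d_hidden, rng),
    init_layer(d_hidden, d_hidden, rng),
    init_layer(d_hidden, d_rgb, rng),
]
pc_patch = [rng.random() for _ in range(d_pc)]    # stands in for R^(i,j)_PC
rgb_patch = [rng.random() for _ in range(d_rgb)]  # real RGB feature, training only
hallucinated = f2f(pc_patch, layers)
loss = mse(hallucinated, rgb_patch)
```

At inference only `pc_patch` would be available, and `hallucinated` would take the place of the missing RGB feature.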

Feature-to-Input (F2I):

H^f_RGB = ℱ_RGB(H^i_RGB), H^i_RGB = F2I(R_PC)

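Generates the missing modality's input (here a hallucinated RGB image H^i_RGB) from the available modality's features R_PC, then passes it through the RGB feature extractor ℱ_RGB to obtain H^f_RGB.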

Input-to-Feature (I2F):

H^f_RGB = I2F(I_PC)

Directly generates the target modality's features from the available modality's raw input I_PC.

3. Memory Bank Construction

Uses greedy algorithm for coreset selection:

p_{i+1} = arg max_{p_j∈S,i≠j} D_c(p_i, p_j)

Employs sparse random projection for dimensionality reduction to improve computational efficiency.
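
The two steps can be sketched together as follows. This is a minimal illustration that assumes the PatchCore-style farthest-first rule (each new point maximizes its distance to the nearest already-selected point) and an Achlioptas-style sparse projection; the function names, the sparsity parameter `s`, and the toy sizes are assumptions, not values from the paper.

```python
import math
import random

def sparse_random_projection(features, out_dim, rng, s=3.0):
    """Project features with a matrix whose entries are +/-sqrt(s/out_dim)
    with probability 1/(2s) each, and 0 otherwise."""
    in_dim = len(features[0])
    scale = math.sqrt(s / out_dim)

    def entry():
        u = rng.random()
        if u < 1.0 / (2.0 * s):
            return scale
        if u < 1.0 / s:
            return -scale
        return 0.0

    matrix = [[entry() for _ in range(in_dim)] for _ in range(out_dim)]
    return [[sum(m * x for m, x in zip(row, f)) for row in matrix] for f in features]

def greedy_coreset_indices(vectors, n_keep):
    """Farthest-first traversal: repeatedly add the vector farthest from the selection."""
    n_keep = min(n_keep, len(vectors))
    selected = [0]  # seed index; implementations often pick it at random
    # Distance from each vector to its nearest selected vector so far.
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < n_keep:
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
    return selected

rng = random.Random(0)
bank = [[rng.random() for _ in range(16)] for _ in range(50)]  # toy patch features
projected = sparse_random_projection(bank, out_dim=4, rng=rng)
keep = greedy_coreset_indices(projected, n_keep=5)  # 10% of 50 patches
memory_bank = [bank[i] for i in keep]  # store the original, unprojected features
```

Distances are computed on the low-dimensional projection only to choose indices; the memory bank keeps the full-dimensional features.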

4. Decision Layer Fusion

Uses two one-class support vector machines for classification and segmentation:

c = C_c(αψ(F_PC, M_PC), βψ(F_RGB, M_RGB))
s = C_s(αφ(F_PC, M_PC), βφ(F_RGB, M_RGB))
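
How the inputs to C_c and C_s might be assembled is sketched below. The summary does not define ψ and φ, so the sketch assumes φ is the per-patch distance to the nearest memory-bank entry and ψ is the maximum of those distances; the classifiers themselves (the one-class SVMs) are left out, and all names and values are illustrative.

```python
import math

def patch_scores(features, memory_bank):
    """phi (assumed): distance from each patch feature to its nearest memory-bank entry."""
    return [min(math.dist(f, m) for m in memory_bank) for f in features]

def image_score(features, memory_bank):
    """psi (assumed): image-level score, taken as the largest patch score."""
    return max(patch_scores(features, memory_bank))

def fuse_inputs(f_pc, m_pc, f_rgb, m_rgb, alpha=1.0, beta=1.0):
    """Build the weighted inputs that C_c and C_s would receive."""
    cls_input = [alpha * image_score(f_pc, m_pc), beta * image_score(f_rgb, m_rgb)]
    seg_input = [
        [alpha * a, beta * b]
        for a, b in zip(patch_scores(f_pc, m_pc), patch_scores(f_rgb, m_rgb))
    ]
    return cls_input, seg_input

# Toy example with two patches per modality and one memory entry each.
m_pc, f_pc = [[0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]]
m_rgb, f_rgb = [[1.0, 1.0]], [[1.0, 1.0], [1.0, 2.0]]
cls_input, seg_input = fuse_inputs(f_pc, m_pc, f_rgb, m_rgb, alpha=0.5, beta=2.0)
```

In the few-modal inference setting, one of the two feature sets (for example `f_rgb`) would be hallucinated by the distillation network rather than extracted from a real input.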

Technical Innovations

  1. Cross-Modal Hallucination Generation: Learns cross-modal mappings to generate "hallucinated" features of missing modalities during inference
  2. Multi-Path Distillation Strategy: Provides three distillation methods at different levels, balancing computational complexity and performance
  3. Asymmetric Performance Analysis: Provides in-depth analysis of performance differences across different distillation directions and their underlying causes

Experimental Setup

Datasets

  • MVTec 3D-AD: Contains 10 object classes, 3-5 defect types per class, with pixel-level binary annotations
  • Eyecandies: Synthetic RGB+3D anomaly detection dataset

Evaluation Metrics

  • I-AUROC: Area under ROC curve for image-level anomaly detection
  • P-AUROC: Area under ROC curve for pixel-level anomaly detection
  • AUPRO: Average area under per-region-overlap curve, reducing impact of anomaly size on evaluation
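
For reference, AUROC can be computed directly from scores and binary labels as the probability that an anomalous sample receives a higher score than a normal one. The sketch below uses this pairwise definition with toy values; it is an illustration rather than the evaluation code used in the paper, and AUPRO is not covered because it additionally requires connected-component analysis of the ground-truth masks.

```python
def auroc(scores, labels):
    """Probability that a random anomalous sample outscores a random normal one.

    Ties count as half a win. labels: 1 = anomalous, 0 = normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# I-AUROC: one score per image.
image_scores = [0.1, 0.4, 0.35, 0.8]
image_labels = [0, 0, 1, 1]
i_auroc = auroc(image_scores, image_labels)

# P-AUROC: same function applied to flattened per-pixel scores and mask values.
pixel_scores = [0.2, 0.9, 0.1, 0.7]
pixel_labels = [0, 1, 0, 1]
p_auroc = auroc(pixel_scores, pixel_labels)
```

The pairwise form is quadratic in the number of samples, so it suits small checks; pixel-level evaluation on full datasets would normally use a sorting-based implementation.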

Comparison Methods

  • DualBanksPCs/RGB: Dual memory bank method using single modality only
  • Shape-guided: SOTA method specifically designed for point clouds
  • M3DM: Multimodal memory bank method
  • AST: Asymmetric student-teacher network

Implementation Details

  • Optimizer: Adam, batch size 32, 10 warmup epochs
  • Learning Rates: 0.0005 for F2F and F2I, 0.0003 for I2F
  • Training Epochs: 100 with early stopping based on validation set
  • Hardware: NVIDIA RTX A6000, 256GB memory

Experimental Results

Main Results

MTFI Pipeline (Point Cloud Inference) Performance:

  • F2F method achieves I-AUROC 0.938, AUPRO 0.934 on MVTec 3D-AD
  • Compared to the DualBanksPCs baseline, I-AUROC improves by 0.078 (7.8 percentage points) and AUPRO by 0.023 (2.3 percentage points)
  • Exceeds the SOTA Shape-guided method by 0.022 I-AUROC (2.2 percentage points)

Performance Comparison Table:

Method          I-AUROC   AUPRO
Shape-guided    0.916     0.931
DualBanksPCs    0.860     0.911
Ours F2F        0.938     0.934
Ours F2I        0.863     0.912
Ours I2F        0.820     0.942
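
The gains quoted for the F2F variant are absolute differences between table entries, which can be checked with a few lines. The dictionary simply restates three rows of the table; the helper name `gain` is illustrative.

```python
# (I-AUROC, AUPRO) on MVTec 3D-AD with point cloud inference, copied from the table.
results = {
    "Shape-guided": (0.916, 0.931),
    "DualBanksPCs": (0.860, 0.911),
    "Ours F2F": (0.938, 0.934),
}

def gain(method, baseline):
    """Absolute difference in (I-AUROC, AUPRO), rounded to three decimals."""
    return tuple(round(a - b, 3) for a, b in zip(results[method], results[baseline]))

vs_single_modality = gain("Ours F2F", "DualBanksPCs")
vs_shape_guided = gain("Ours F2F", "Shape-guided")
```

Here `vs_single_modality` evaluates to (0.078, 0.023) and `vs_shape_guided` to (0.022, 0.003).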

Asymmetric Performance Phenomenon

MTFI Pipeline (RGB Inference):

  • Only marginal improvements: with the F2F method, I-AUROC rises from 0.851 to 0.856
  • Indicates limited effectiveness of generating point cloud hallucinations from RGB

Ablation Studies

  1. Different Feature Extractors: Validates method generalizability on ViT-S/8, ViT-B/8-in21k, and Point-Bert
  2. Distance Metric Comparison: L2 distance performs best in most cases
  3. Coreset Ratio: 10% coreset selection ratio achieves optimal performance balance

Case Analysis

Through visualization analysis, findings include:

  1. Texture Anomalies: For Cable Gland's "thread" anomaly, minimal shape changes in point clouds but obvious texture differences in RGB
  2. Shape Anomalies: For "bent" anomalies, spatial information is required, which RGB images struggle to provide
  3. Composite Anomalies: Cookie's "crack" and Foam's "contamination" anomalies require synergistic multimodal information

Related Work

Unsupervised 2D Industrial Anomaly Detection

  • Feature Embedding Methods: Student-teacher architectures, one-class classification, feature distribution mapping
  • Reconstruction Methods: Autoencoders, GANs, diffusion models
  • Memory Bank Methods: PatchCore and similar approaches selecting and storing normal features for comparison

3D and Multimodal RGB-3D Industrial Anomaly Detection

  • AST: Asymmetric student-teacher network preventing student network from learning anomalies
  • M3DM: Multimodal memory bank method using pre-trained feature extractors
  • DADA: Learning joint RGB-3D representations

Cross-Modal Knowledge Distillation

  • Video Action Recognition: RGB-D cross-modal hallucination networks
  • Medical Image Segmentation: Learning strategies for handling missing modalities
  • Salient Object Detection: Cross-modal feature learning

Conclusions and Discussion

Main Conclusions

  1. MTFI Pipeline Feasibility: Demonstrates effectiveness of multimodal training with few-modal inference
  2. Asymmetric Performance: Significant improvements in point cloud inference vs. marginal gains in RGB inference
  3. Information Transfer Mechanisms: Shared texture information can transfer across modalities, but spatial information is difficult to infer from RGB

Limitations

  1. Pre-training Dependency: Relies on pre-trained feature extractors from large-scale datasets
  2. Data Requirements: Requires substantial aligned multimodal training data
  3. Computational Overhead: Two-stage training increases computational complexity
  4. Modal Limitations: Currently validated only on RGB and point cloud modalities

Future Directions

  1. Extension to More Modalities: Ultrasound, infrared, and other industrial detection modalities
  2. Reduced Pre-training Dependency: Explore methods not relying on large-scale pre-training
  3. Practical Deployment: Data collection and validation in real industrial scenarios

In-Depth Evaluation

Strengths

  1. Significant Practical Value: Addresses genuine pain points in the industrial sector
  2. Novel Methodology: First application of cross-modal distillation to incomplete multimodal IAD
  3. Comprehensive Experiments: Validates method effectiveness across multiple datasets and feature extractors
  4. In-Depth Analysis: Provides reasonable explanations for asymmetric performance phenomena
  5. High Engineering Value: F2F method has low computational overhead, suitable for practical deployment

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical analysis of cross-modal information transfer
  2. Dataset Limitations: Primarily validated on synthetic and laboratory data, lacking real industrial environment verification
  3. Modal Extensibility: Method currently limited to RGB and point clouds, extensibility to other modalities unknown
  4. Hyperparameter Sensitivity: Requires learning rate adjustments for different distillation networks

Impact

  1. Academic Contribution: Provides new research direction for incomplete multimodal learning
  2. Practical Value: Offers more cost-effective solutions for industrial quality control
  3. Reproducibility: Provides open-source code facilitating reproduction and extension
  4. Inspirational Value: Provides reference for incomplete multimodal problems in other domains

Applicable Scenarios

  1. Industrial Quality Control: Particularly for high-value products like lithium batteries and composite materials
  2. Medical Diagnosis: Scenarios with multiple imaging modalities but cost constraints
  3. Autonomous Driving: Sensor failure or cost optimization scenarios
  4. Security Surveillance: Multi-modal sensor deployment with maintenance cost considerations

References

This paper cites 67 relevant references, primarily including:

  • Classical methods in industrial anomaly detection (PatchCore, M3DM, etc.)
  • Related work on cross-modal knowledge distillation
  • Foundational methods in 3D point cloud processing and multimodal learning
  • Original papers of important datasets such as MVTec 3D-AD

Overall Assessment: This is a high-quality paper addressing practical industrial problems. The proposed CMDIAD framework possesses significant theoretical and practical value. While there is room for improvement in theoretical analysis and real-world scenario validation, its innovation and practicality make it an important contribution to the field.