Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
Hou, Xu, Li et al.
Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages the anomaly localization capability of CLIP and the boundary perception ability of SAM. (1) To mitigate SAM's inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Consistent experimental validation across multiple datasets demonstrates that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the VisA dataset, where we outperform the state-of-the-art methods by 10.3% and 7.7% in terms of F1-max and AP, respectively.
This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages CLIP's anomaly localization capability and SAM's boundary awareness. Through the Co-Feature Point Prompt Generation (PPG) module and the Cascaded Prompts for SAM (CPS) module, the method achieves state-of-the-art zero-shot anomaly segmentation results across multiple datasets, with F1-max and AP improving by 10.3% and 7.7%, respectively, on the VisA dataset over the previous best methods.
This paper addresses the zero-shot anomaly segmentation (ZSAS) task, particularly in industrial anomaly detection scenarios, where accurate localization and segmentation of anomalous regions must be performed without training data containing anomalous samples.
The goal is to build on the generalization capabilities of two foundation models, CLIP and SAM, and design a collaboration framework that combines CLIP's anomaly localization ability with SAM's precise segmentation capability to achieve high-quality zero-shot anomaly segmentation.
Novel CLIP-SAM Collaboration Framework: Designs a two-stage zero-shot anomaly segmentation framework that effectively combines CLIP's anomaly localization ability with SAM's boundary-awareness capability
Co-Feature Point Prompt Generation (PPG) Module: Collaboratively utilizes CLIP and SAM features to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than entire objects
Cascaded Prompts for SAM (CPS) Module: Introduces a cascaded hybrid prompting mechanism to further optimize SAM's segmentation results, mitigating coarse boundaries and isolated noise
State-of-the-Art Performance: Achieves significant performance improvements across multiple datasets, particularly on the VisA dataset with F1-max and AP improvements of 10.3% and 7.7% respectively
Zero-shot anomaly segmentation is defined as: given a test image, accurately identify and segment anomalous regions without training data containing anomalous samples, outputting pixel-level anomaly masks.
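Since the predicted pixel-level masks are scored with F1-max, a short sketch of that metric may help. F1-max is generally understood as the highest pixel-level F1 score obtained over all binarization thresholds of the anomaly map. The function name `f1_max`, the uniform threshold sweep, and the `num_thresholds` parameter are illustrative assumptions; the paper's exact evaluation protocol is not given here.

```python
import numpy as np


def f1_max(scores, labels, num_thresholds=200):
    """Best pixel-level F1 over a uniform sweep of thresholds.

    scores: anomaly map, any shape, higher means more anomalous.
    labels: ground-truth mask of the same shape (nonzero = anomalous).
    The threshold sweep is an assumption made for this sketch.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    thresholds = np.linspace(scores.min(), scores.max(), num_thresholds)

    best = 0.0
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)    # anomalous pixels correctly flagged
        fp = np.sum(pred & ~labels)   # normal pixels wrongly flagged
        fn = np.sum(~pred & labels)   # anomalous pixels missed
        denom = 2 * tp + fp + fn
        if denom > 0:
            best = max(best, 2 * tp / denom)  # F1 = 2TP / (2TP + FP + FN)
    return float(best)
```

For example, an anomaly map that ranks every anomalous pixel above every normal pixel reaches an F1-max of 1.0, because some threshold separates the two groups exactly.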
For positive point localization, Sa denotes the extreme anomalous regions, Mapa is the anomaly map generated by CLIP, Ra is their intersection, and Ph denotes the top-k anomalous points, which are selected as positive prompts.
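The positive point selection described here can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Sa is available as a binary mask, that "intersection" means restricting the anomaly map to that mask, and that the top-k points are the k highest-scoring pixels inside it. The function name `positive_points` and the parameter `k` are hypothetical.

```python
import numpy as np


def positive_points(region_mask, anomaly_map, k=3):
    """Pick the k highest-scoring pixels of anomaly_map inside region_mask.

    region_mask: (H, W) binary mask, standing in for Sa (assumed form).
    anomaly_map: (H, W) float map, standing in for Mapa.
    Returns a list of (row, col) tuples, highest score first.
    """
    region_mask = np.asarray(region_mask, dtype=bool)
    anomaly_map = np.asarray(anomaly_map, dtype=float)

    # Ra (assumed): anomaly scores restricted to Sa; other pixels are excluded.
    restricted = np.where(region_mask, anomaly_map, -np.inf)

    # Never return more points than there are pixels in the region.
    k = min(k, int(region_mask.sum()))

    # Ph: indices of the k largest restricted scores.
    flat = np.argsort(restricted.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat, restricted.shape)
    return list(zip(rows.tolist(), cols.tolist()))
```

The returned coordinates are in (row, col) order; a point-prompt interface that expects (x, y) would need them swapped.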
Negative Point Localization:
Na = dilate(Sa) - Sa (3)
F = EncI(img) (4)
Fa = F ⊗ Sa, Fn = F ⊗ Na (5)
Maps = Similarity(Fa, Fn) (6)
Pl = Lowestk(Maps) (7)
The surrounding region Na is obtained by dilating Sa and subtracting Sa itself. Features F are extracted using SAM's image encoder, the cosine similarity between the features of the anomalous region and those of the surrounding region is computed, and the k pixels with the lowest similarity are selected as negative prompts.
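Equations (3) to (7) can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' code: the feature map is assumed to be already resized to the mask resolution with shape (H, W, C), ⊗ is read as selecting the features under a mask, the similarity is taken between each surrounding pixel and the mean feature of the anomalous region, and dilation uses a 3x3 square element. The names `dilate`, `negative_points`, `k`, and `iterations` are hypothetical.

```python
import numpy as np


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square element (assumed structuring element)."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        h, w = out.shape
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def negative_points(region_mask, features, k=3, iterations=1):
    """Select the k surrounding pixels least similar to the anomalous region.

    region_mask: (H, W) binary mask, standing in for Sa.
    features:    (H, W, C) feature map, standing in for F = EncI(img);
                 resizing to mask resolution is assumed to be done already.
    Returns (row, col) tuples, lowest similarity first.
    """
    s_a = np.asarray(region_mask, dtype=bool)
    feats = np.asarray(features, dtype=float)

    n_a = dilate(s_a, iterations) & ~s_a          # Eq. (3): Na = dilate(Sa) - Sa
    f_a = feats[s_a]                              # Eq. (5): features under Sa
    f_n = feats[n_a]                              # Eq. (5): features under Na
    if len(f_a) == 0 or len(f_n) == 0:
        return []

    # Eq. (6), assumed form: cosine similarity to the mean anomalous feature.
    proto = f_a.mean(axis=0)
    eps = 1e-8
    sims = (f_n @ proto) / (np.linalg.norm(f_n, axis=1) * np.linalg.norm(proto) + eps)

    order = np.argsort(sims)[:k]                  # Eq. (7): k lowest similarities
    rows, cols = np.nonzero(n_a)                  # same row-major order as feats[n_a]
    return list(zip(rows[order].tolist(), cols[order].tolist()))
```

Choosing the least similar ring pixels places the negative prompts on material that looks unlike the anomaly, which matches the stated aim of keeping SAM from expanding the mask to the whole object.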
Collaborative Feature Utilization: Unlike existing sequential approaches, the PPG module simultaneously leverages both CLIP and SAM features for point prompt generation
Intelligent Negative Point Selection: Through dilation functions and feature similarity computation, more effective negative prompts are selected, preventing SAM from segmenting entire objects
Progressive Constraint Enhancement: The CPS module progressively strengthens constraints on SAM through three-level cascading for precise segmentation
Lightweight Design: Uses only SAM's lightweight decoder for iterative optimization, adding only 100 milliseconds of computational overhead
Compared to existing work, this paper better leverages the advantages of both foundation models through collaborative feature utilization and cascaded prompting mechanisms.
The authors plan to continue exploring how to integrate the advantages of different models efficiently and with low overhead to enhance anomaly segmentation capability.
The paper cites 40 related references covering important works in foundation models, anomaly detection, and computer vision, providing a comprehensive literature review.
Overall Assessment: The proposed CLIP-SAM collaboration framework demonstrates technical innovation with impressive experimental results. While there remains room for improvement in computational efficiency and certain metrics, the work makes important contributions to zero-shot anomaly detection research with significant academic and practical value.