Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts

Hou, Xu, Li et al.
Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework for zero-shot anomaly segmentation tasks in industrial anomaly detection. This framework leverages the powerful anomaly localization capability of CLIP and the boundary perception ability of SAM. (1) To mitigate SAM's inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Across multiple datasets, consistent experimental validation demonstrates that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the VisA dataset, where we outperform the state-of-the-art methods by 10.3% and 7.7% in terms of F1-max and AP metrics, respectively.

Basic Information

  • Paper ID: 2510.11028
  • Title: Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
  • Authors: Yanning Hou, Ke Xu, Junfa Li, Yanran Ruan, Jianfeng Qiu (School of Artificial Intelligence, Anhui University)
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11028v1

Abstract

This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages CLIP's powerful anomaly localization capabilities and SAM's boundary-aware abilities. Through the Co-Feature Point Prompt Generation (PPG) module and Cascaded Prompts for SAM (CPS) module, the method achieves state-of-the-art zero-shot anomaly segmentation results across multiple datasets, with F1-max and AP metrics improving by 10.3% and 7.7% respectively on the VisA dataset compared to existing best methods.

Research Background and Motivation

1. Problem Statement

This paper addresses the zero-shot anomaly segmentation (ZSAS) task, particularly in industrial anomaly detection scenarios, where accurate localization and segmentation of anomalous regions must be performed without training data containing anomalous samples.

2. Problem Significance

  • Data Scarcity: Anomalous samples are rare in industrial settings, while traditional methods require extensive annotated data
  • Anomaly Type Diversity: Real-world applications involve diverse anomaly types that are difficult to predefine
  • Industrial Demands: Industry processes millions of product categories, making traditional supervised learning approaches impractical

3. Limitations of Existing Methods

  • CLIP-based Methods: While effective at anomaly localization, they lack boundary-awareness capability, resulting in coarse segmentation
  • SAM-based Methods: Possess strong boundary-awareness but limited localization ability, often segmenting entire objects rather than anomalous regions
  • Existing CLIP & SAM Collaboration Methods: Fail to fully exploit the complementary strengths of both models with overly rigid prompting strategies

4. Research Motivation

The goal is to build on the strong generalization capabilities of two foundation models, CLIP and SAM, and design a collaboration framework that combines CLIP's anomaly localization ability with SAM's precise segmentation capability to achieve high-quality zero-shot anomaly segmentation.

Core Contributions

  1. Novel CLIP-SAM Collaboration Framework: Designs a two-stage zero-shot anomaly segmentation framework that effectively combines CLIP's anomaly localization ability with SAM's boundary-awareness capability
  2. Co-Feature Point Prompt Generation (PPG) Module: Collaboratively utilizes CLIP and SAM features to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than entire objects
  3. Cascaded Prompts for SAM (CPS) Module: Innovatively introduces a cascaded hybrid prompting mechanism to further optimize SAM's segmentation results, eliminating coarse boundaries and isolated noise
  4. State-of-the-Art Performance: Achieves significant performance improvements across multiple datasets, particularly on the VisA dataset with F1-max and AP improvements of 10.3% and 7.7% respectively

Methodology Details

Task Definition

Zero-shot anomaly segmentation is defined as: given a test image, accurately identify and segment anomalous regions without training data containing anomalous samples, outputting pixel-level anomaly masks.

Model Architecture

Overall Architecture

The framework employs a two-stage design:

  1. Stage One: PPG module generates initial point prompts
  2. Stage Two: CPS module optimizes segmentation results through cascaded prompts

PPG Module Design Details

Positive Point Localization:

Ra = Sa ⊗ Mapa                    (1)
Ph = Topk(Ra)                     (2)

Where Sa represents extreme anomalous regions, Mapa is the anomaly map generated by CLIP, Ra is their intersection, and Ph represents the top-k anomalous points selected as positive prompts.
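Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `positive_points`, the toy anomaly map, and the 0.5 threshold used to build Sa are all assumptions made for the example.

```python
import numpy as np

def positive_points(anomaly_map, extreme_mask, k):
    """Return (row, col) coordinates of the k highest-scoring pixels of
    R_a = S_a * Map_a, i.e. the anomaly map restricted to the extreme region."""
    restricted = anomaly_map * extreme_mask                    # Eq. (1): element-wise product
    flat_order = np.argsort(restricted, axis=None)[::-1][:k]   # Eq. (2): indices of top-k scores
    rows, cols = np.unravel_index(flat_order, restricted.shape)
    return np.stack([rows, cols], axis=1)

# Toy 4x4 anomaly map; the 0.5 threshold defining S_a is an illustrative assumption.
anomaly_map = np.array([
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.7, 0.1],
    [0.1, 0.8, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.0],
])
extreme_mask = (anomaly_map > 0.5).astype(anomaly_map.dtype)
points = positive_points(anomaly_map, extreme_mask, k=2)
print(points.tolist())
```

With k=2 the two selected points are the pixels scoring 0.9 and 0.8, both inside the thresholded region.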

Negative Point Localization:

Na = dilate(Sa) - Sa              (3)
F = EncI(img)                     (4)
Fa = F ⊗ Sa, Fn = F ⊗ Na         (5)
Maps = Similarity(Fa, Fn)         (6)
Pl = Lowestk(Maps)                (7)

The surrounding region Na is obtained through dilation, features F are extracted using SAM's image encoder, cosine similarity between anomalous and surrounding region features is computed, and the k pixels with lowest similarity are selected as negative prompts.
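The negative-point steps in Eqs. (3) to (7) can be sketched as follows. Several details are assumptions made for illustration: the dilation uses a plain square kernel implemented in NumPy, the anomalous features are pooled into a single mean vector before the cosine similarity is taken, and the names `dilate` and `negative_points` are hypothetical. The feature array stands in for the output of SAM's image encoder.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2*radius+1)-wide square kernel, NumPy only."""
    padded = np.pad(mask.astype(bool), radius)
    out = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def negative_points(features, extreme_mask, k, radius=1):
    """features: (H, W, C) array; extreme_mask: (H, W) binary mask S_a."""
    s_a = extreme_mask.astype(bool)
    n_a = dilate(s_a, radius) & ~s_a          # Eq. (3): ring surrounding S_a
    f_a = features[s_a].mean(axis=0)          # pooled anomalous feature (assumption)
    f_n = features[n_a]                       # Eq. (5): features of the ring pixels
    sims = f_n @ f_a / (np.linalg.norm(f_n, axis=1) * np.linalg.norm(f_a) + 1e-8)  # Eq. (6)
    lowest = np.argsort(sims)[:k]             # Eq. (7): k least similar ring pixels
    return np.argwhere(n_a)[lowest]

# Toy 5x5 feature map with 2 channels; the anomaly sits at the center pixel.
features = np.tile(np.array([1.0, 0.0]), (5, 5, 1))
features[1, 1] = [0.0, 1.0]    # orthogonal to the anomalous feature
features[3, 3] = [-1.0, 0.0]   # opposite to the anomalous feature
extreme_mask = np.zeros((5, 5), dtype=bool)
extreme_mask[2, 2] = True
neg = negative_points(features, extreme_mask, k=2)
print(neg.tolist())
```

In this toy case the two ring pixels whose features differ most from the anomalous feature are returned, which is the behavior that keeps the negative prompts on normal-looking surroundings.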

CPS Module Design Details

Three-Level Cascaded Structure:

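  1. Point Prompts Only:
P = Concat(Ph, Pl)                (8)
M1, logit1 = Decm(F, P)           (9)
  2. Point + Logit Prompts:
M2, logit2 = Decm(F, Concat(P, logit1))     (10)
  3. Point + Bounding Box + Logit Prompts:
box = Flocation(M2)               (11)
M3 = Decm(F, Concat(P, box, logit2))        (12)

Here Decm denotes SAM's lightweight mask decoder, F is the image feature from Eq. (4), and Flocation extracts the bounding box of the mask M2.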
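The control flow of Eqs. (8) to (12) can be sketched with a stand-in decoder. The `decode` callable and its signature are hypothetical and do not correspond to SAM's real API; `toy_decode` is a stub that only mimics the input/output shape of a prompt-conditioned decoder so that the cascade can be run end to end.

```python
import numpy as np

def mask_to_box(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) of a non-empty binary mask."""
    rows, cols = np.nonzero(mask)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

def cascaded_prompts(decode, features, pos_points, neg_points):
    """Three-level cascade. `decode` is a hypothetical callable:
    decode(features, points, labels, box=None, logits=None) -> (mask, logits)."""
    points = np.concatenate([pos_points, neg_points])                  # Eq. (8)
    labels = np.array([1] * len(pos_points) + [0] * len(neg_points))   # 1 = positive, 0 = negative
    m1, logit1 = decode(features, points, labels)                      # Eq. (9)
    m2, logit2 = decode(features, points, labels, logits=logit1)       # Eq. (10)
    box = mask_to_box(m2)                                              # Eq. (11)
    m3, _ = decode(features, points, labels, box=box, logits=logit2)   # Eq. (12)
    return m3

def toy_decode(features, points, labels, box=None, logits=None):
    """Stub decoder: thresholds the scores and optionally clips the mask to the box."""
    scores = features if logits is None else 0.5 * (features + logits)
    mask = scores > 0
    if box is not None:
        x0, y0, x1, y1 = box
        clipped = np.zeros_like(mask)
        clipped[y0:y1 + 1, x0:x1 + 1] = mask[y0:y1 + 1, x0:x1 + 1]
        mask = clipped
    return mask, scores

features = np.full((4, 4), -1.0)
features[1:3, 1:3] = 1.0
final = cascaded_prompts(toy_decode, features, np.array([[1, 1]]), np.array([[0, 0]]))
print(mask_to_box(final), int(final.sum()))
```

The point of the sketch is the data flow: each stage reuses the same image features and point prompts, and adds the previous stage's logits (and, at the last stage, a box derived from the previous mask) as extra prompts.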

Technical Innovations

  1. Collaborative Feature Utilization: Unlike existing sequential approaches, the PPG module simultaneously leverages both CLIP and SAM features for point prompt generation
  2. Intelligent Negative Point Selection: Through dilation functions and feature similarity computation, more effective negative prompts are selected, preventing SAM from segmenting entire objects
  3. Progressive Constraint Enhancement: The CPS module progressively strengthens constraints on SAM through three-level cascading for precise segmentation
  4. Lightweight Design: Utilizes only SAM's lightweight decoder for iterative optimization with additional computational overhead of merely 100 milliseconds

Experimental Setup

Datasets

  • MVTec-AD: Contains high-resolution industrial object images with complete pixel-level annotations
  • VisA: Industrial anomaly detection dataset containing multiple anomaly types

Evaluation Metrics

  • AUROC: Area under the ROC curve; reflects how well anomalous and normal regions are separated across all thresholds
  • F1-max: Maximum F1 score (harmonic mean of precision and recall) over all thresholds
  • AP (Average Precision): Summarizes the precision-recall curve as a single value averaged over recall levels
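As an illustration of the F1-max metric, the sketch below sweeps a fixed grid of thresholds and keeps the best F1. The grid of 101 evenly spaced thresholds and the function name `f1_max` are assumptions; evaluation code for these benchmarks typically sweeps the thresholds induced by the precision-recall curve instead.

```python
import numpy as np

def f1_max(scores, labels, num_thresholds=101):
    """Maximum F1 over evenly spaced thresholds in [0, 1].
    scores: anomaly scores in [0, 1]; labels: binary ground truth of the same shape."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).astype(bool).ravel()
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)

scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
best_f1 = f1_max(scores, labels)
print(round(best_f1, 3))
```

In this toy case the best threshold keeps the four highest scores, giving precision 0.75 and recall 1.0, so the F1-max is 6/7.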

Comparison Methods

  • CLIP-based Methods: WinCLIP, APRIL-GAN, SDP, SDP+, AnomalyCLIP
  • SAM-based Methods: SAA, SAA+
  • CLIP & SAM Collaboration Methods: ClipSAM

Implementation Details

  • CLIP Model: Pre-trained ViT-L-14-336 model
  • SAM Model: ViT-H pre-trained model
  • Optimizer: Adam, learning rate 1e-3
  • Training Settings: 3 epochs for VisA dataset, 15 epochs for MVTec-AD dataset
  • Hardware: NVIDIA GeForce RTX 3090, batch size 16

Experimental Results

Main Results

                              MVTec-AD                 VisA
Category     Method           AUROC   F1-max   AP      AUROC   F1-max   AP
CLIP-based   WinCLIP          85.1    31.7     -       79.6    14.8     -
CLIP-based   APRIL-GAN        87.6    43.3     40.8    94.2    32.3     25.7
CLIP-based   AnomalyCLIP      91.1    39.1     34.5    95.5    28.3     21.3
SAM-based    SAA+             73.2    37.8     28.8    74.0    27.1     22.4
CLIP&SAM     ClipSAM          92.3    47.8     45.9    95.6    33.1     26.0
Proposed     Ours             89.5    48.8     46.4    94.8    36.5     28.0

Key Findings:

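  • Surpasses all compared methods on F1-max and AP on both datasets
  • On VisA, improves F1-max by 10.3% and AP by 7.7% relative to the previous best method, ClipSAM (36.5 vs. 33.1 and 28.0 vs. 26.0)
  • On MVTec-AD, improves F1-max by 2.1% and AP by 1.1% relative to ClipSAM (48.8 vs. 47.8 and 46.4 vs. 45.9)
  • AUROC is slightly below the best methods (89.5 vs. 92.3 on MVTec-AD, 94.8 vs. 95.6 on VisA), which is attributed to the reliance on SAM segmentation results expanding the predicted anomalous region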
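The reported percentages match relative improvements over ClipSAM, the strongest prior method in the results table, rather than absolute differences in points. A short check, using only numbers from that table (the helper `relative_gain` is introduced here for illustration):

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# (ours, ClipSAM) pairs taken from the main results table.
gains = {
    "VisA F1-max": relative_gain(36.5, 33.1),
    "VisA AP": relative_gain(28.0, 26.0),
    "MVTec-AD F1-max": relative_gain(48.8, 47.8),
    "MVTec-AD AP": relative_gain(46.4, 45.9),
}
rounded = {name: round(value, 1) for name, value in gains.items()}
print(rounded)
```

Rounded to one decimal, these come out to 10.3, 7.7, 2.1, and 1.1, which agrees with the figures quoted above.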

Ablation Studies

Impact of Dilation Function Parameters

Tests the impact of different kernel shapes and sizes on performance:

Shape       Size      AUROC   F1-max   AP
Ellipse     (25,25)   89.5    48.8     46.4
Rectangle   (20,20)   89.5    47.7     45.6
Cross       (25,25)   89.2    46.5     44.1

Conclusion: The elliptical (25,25) kernel performs best among the tested configurations.
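To make the three kernel shapes concrete, the sketch below builds boolean structuring elements in NumPy. How the authors construct their kernels is not stated in this summary; OpenCV's `cv2.getStructuringElement` is a common choice, and its elliptical kernel may differ slightly from the simple disk used here. The helper `make_kernel` is hypothetical.

```python
import numpy as np

def make_kernel(shape, size):
    """Boolean structuring element with odd side length `size`."""
    c = size // 2
    yy, xx = np.mgrid[:size, :size]
    if shape == "rectangle":
        return np.ones((size, size), dtype=bool)
    if shape == "cross":
        return (yy == c) | (xx == c)          # center row plus center column
    if shape == "ellipse":
        return (yy - c) ** 2 + (xx - c) ** 2 <= c ** 2   # disk inscribed in the square
    raise ValueError(f"unknown shape: {shape}")

rect = make_kernel("rectangle", 25)
cross = make_kernel("cross", 25)
ellipse = make_kernel("ellipse", 25)
print(int(rect.sum()), int(cross.sum()), int(ellipse.sum()))
```

For the same size, the cross covers the fewest pixels and the rectangle the most, with the ellipse in between. The kernel shape therefore changes how wide and how isotropic the surrounding region Na is, which in turn changes where negative points can be sampled.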

Cascading Stage Effects

Cascade Stage             AUROC   F1-max   AP
Point prompts only        88.7    42.5     39.2
Point + logit1            88.1    46.8     44.8
Point + box + logit2      89.5    48.8     46.4

Key Findings:

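  • The second cascade raises F1-max by 4.3 points and AP by 5.6 points, while AUROC dips slightly from 88.7 to 88.1
  • The third cascade adds a further 2.0 points of F1-max and 1.6 points of AP, and raises AUROC to 89.5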

Case Analysis

Visualization results demonstrate:

  • CLIP-based methods accurately localize anomalies but with blurry boundaries
  • SAM-based methods have clear boundaries but inaccurate localization
  • The proposed method achieves both accurate localization and clear boundaries

Related Work

Foundation Models

  • CLIP: A vision-language model pre-trained on web-scale image-text pairs, with powerful multimodal alignment capability
  • SAM: Demonstrates strong open-world object segmentation capability, enabling high-quality segmentation with various prompts

Zero-Shot Anomaly Segmentation Methods

  1. CLIP-based Methods: Utilize sliding windows and multi-layer features, but with limited boundary-awareness
  2. SAM-based Methods: Possess strong boundary-awareness but limited localization ability
  3. CLIP & SAM Collaboration Methods: Existing methods fail to fully exploit complementary strengths of both models

Advantages of This Work

Compared to existing work, this paper better leverages the advantages of both foundation models through collaborative feature utilization and cascaded prompting mechanisms.

Conclusions and Discussion

Main Conclusions

  1. The proposed CLIP-SAM collaboration framework effectively combines the advantages of both foundation models
  2. PPG and CPS modules significantly improve zero-shot anomaly segmentation performance
  3. Achieves state-of-the-art performance levels across multiple datasets

Limitations

  1. Inference Speed: Using two models results in slower inference time
  2. AUROC Performance: Slightly inferior to some methods on AUROC metric
  3. Computational Resources: Requires substantial computational resources

Future Directions

The authors plan to keep exploring how to combine the strengths of different models in an efficient, lightweight way to further improve anomaly segmentation.

In-Depth Evaluation

Strengths

  1. Strong Method Innovation: PPG and CPS modules are ingeniously designed, effectively addressing limitations of existing methods
  2. Comprehensive Experiments: Thorough comparative and ablation studies across multiple datasets
  3. Significant Performance Gains: Substantial improvements on key metrics
  4. Clear Technical Details: Detailed method description with clear formula derivations

Weaknesses

  1. Computational Efficiency Issues: While authors claim only 100 milliseconds additional overhead, overall inference time remains lengthy
  2. AUROC Performance Decline: Performance decreases on the important AUROC metric, requiring further optimization
  3. Limited Generalization Assessment: Evaluation on only two datasets; generalization capability requires broader verification

Impact

  1. Academic Contribution: Provides new perspectives and methods for zero-shot anomaly detection research
  2. Practical Value: Significant application value in industrial anomaly detection
  3. Reproducibility: Detailed method description and implementation details facilitate reproduction

Applicable Scenarios

  • Industrial quality inspection
  • Medical image anomaly detection
  • Security monitoring anomaly event detection
  • Other applications requiring zero-shot anomaly segmentation

References

The paper cites 40 related references covering important works in foundation models, anomaly detection, and computer vision, providing a comprehensive literature review.


Overall Assessment: The proposed CLIP-SAM collaboration framework demonstrates technical innovation with impressive experimental results. While there remains room for improvement in computational efficiency and certain metrics, the work makes important contributions to zero-shot anomaly detection research with significant academic and practical value.