Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts

Hou, Xu, Li et al.

Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages the anomaly localization capability of CLIP and the boundary perception ability of SAM. (1) To mitigate SAM's inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Experiments on multiple datasets consistently show that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the VisA dataset, where we outperform the state-of-the-art methods by 10.3% and 7.7% in terms of F1-max and AP, respectively.

Basic Information

Paper ID: 2510.11028
Title: Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
Authors: Yanning Hou, Ke Xu, Junfa Li, Yanran Ruan, Jianfeng Qiu (School of Artificial Intelligence, Anhui University)
Category: cs.CV (Computer Vision)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11028v1

Abstract

This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages CLIP's powerful anomaly localization capabilities and SAM's boundary-aware abilities. Through the Co-Feature Point Prompt Generation (PPG) module and Cascaded Prompts for SAM (CPS) module, the method achieves state-of-the-art zero-shot anomaly segmentation results across multiple datasets, with F1-max and AP metrics improving by 10.3% and 7.7% respectively on the VisA dataset compared to existing best methods.

Research Background and Motivation

1. Problem Statement

This paper addresses the zero-shot anomaly segmentation (ZSAS) task, particularly in industrial anomaly detection scenarios, where accurate localization and segmentation of anomalous regions must be performed without training data containing anomalous samples.

2. Problem Significance

Data Scarcity: Anomalous samples are rare in industrial settings, while traditional methods require extensive annotated data
Anomaly Type Diversity: Real-world applications involve diverse anomaly types that are difficult to predefine
Industrial Demands: Industry processes millions of product categories, making traditional supervised learning approaches impractical

3. Limitations of Existing Methods

CLIP-based Methods: While effective at anomaly localization, they lack boundary-awareness capability, resulting in coarse segmentation
SAM-based Methods: Possess strong boundary-awareness but limited localization ability, often segmenting entire objects rather than anomalous regions
Existing CLIP & SAM Collaboration Methods: Fail to fully exploit the complementary strengths of both models with overly rigid prompting strategies

4. Research Motivation

The goal is to build on the powerful generalization capabilities of foundation models (CLIP and SAM) and design a collaboration framework that combines CLIP's anomaly localization ability with SAM's precise segmentation capability to achieve high-quality zero-shot anomaly segmentation.

Core Contributions

Novel CLIP-SAM Collaboration Framework: Designs a two-stage zero-shot anomaly segmentation framework that effectively combines CLIP's anomaly localization ability with SAM's boundary-awareness capability
Co-Feature Point Prompt Generation (PPG) Module: Collaboratively utilizes CLIP and SAM features to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than entire objects
Cascaded Prompts for SAM (CPS) Module: Innovatively introduces a cascaded hybrid prompting mechanism to further optimize SAM's segmentation results, eliminating coarse boundaries and isolated noise
State-of-the-Art Performance: Achieves significant performance improvements across multiple datasets, particularly on the VisA dataset with F1-max and AP improvements of 10.3% and 7.7% respectively

Methodology Details

Task Definition

Zero-shot anomaly segmentation is defined as: given a test image, accurately identify and segment anomalous regions without training data containing anomalous samples, outputting pixel-level anomaly masks.

Model Architecture

Overall Architecture

The framework employs a two-stage design:

Stage One: PPG module generates initial point prompts
Stage Two: CPS module optimizes segmentation results through cascaded prompts

PPG Module Design Details

Positive Point Localization:

Ra = Sa ⊗ Mapa                    (1)
Ph = Topk(Ra)                     (2)

Here Sa represents the extreme anomalous regions, Mapa is the anomaly map generated by CLIP, Ra is their intersection (the anomaly map restricted to Sa), and Ph is the set of top-k anomalous points selected as positive prompts.
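Equations (1) and (2) can be illustrated with a small NumPy sketch. It assumes that Sa is available as a binary mask and that ⊗ acts as element-wise masking; the function name `positive_points`, the toy values, and the choice k=3 are illustrative and not taken from the paper.

```python
import numpy as np

def positive_points(anomaly_map, extreme_mask, k=3):
    """Return the k (row, col) points with the highest anomaly score inside the mask."""
    # Eq. (1): keep CLIP anomaly scores only inside the extreme anomalous region.
    r_a = anomaly_map * extreme_mask
    # Eq. (2): top-k over the flattened map, converted back to 2-D coordinates.
    flat_idx = np.argsort(r_a, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_idx, r_a.shape)
    return np.stack([rows, cols], axis=1)

# Toy 4x4 anomaly map; the mask (assumed given) covers the right half only.
anomaly_map = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.1, 0.1, 0.8, 0.4],
    [0.0, 0.2, 0.7, 0.6],
    [0.0, 0.0, 0.1, 0.5],
])
extreme_mask = np.zeros((4, 4))
extreme_mask[:, 2:] = 1
points = positive_points(anomaly_map, extreme_mask, k=3)
print(points)
```

Note that the high score at position (0, 0) is ignored because it lies outside the mask, which is the purpose of the masking step.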

Negative Point Localization:

Na = dilate(Sa) - Sa              (3)
F = EncI(img)                     (4)
Fa = F ⊗ Sa, Fn = F ⊗ Na         (5)
Maps = Similarity(Fa, Fn)         (6)
Pl = Lowestk(Maps)                (7)

The surrounding region Na is obtained through dilation, features F are extracted using SAM's image encoder, cosine similarity between anomalous and surrounding region features is computed, and the k pixels with lowest similarity are selected as negative prompts.
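The negative-point steps in Equations (3) to (7) might look roughly like the sketch below. Several details are assumptions rather than the paper's procedure: the dilation uses a plain square window written in NumPy, the anomalous region is summarized by its mean feature vector before the cosine similarity is taken, and the feature array is a toy stand-in for the output of SAM's image encoder.

```python
import numpy as np

def dilate(mask, radius=1):
    """Binary dilation with a square (2*radius+1) window, NumPy only."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def negative_points(features, extreme_mask, k=2, radius=1):
    """features: (H, W, C) array; extreme_mask: (H, W) binary mask S_a."""
    s_a = extreme_mask.astype(bool)
    n_a = dilate(s_a, radius) & ~s_a          # Eq. (3): ring around S_a
    f_a = features[s_a].mean(axis=0)          # assumption: mean feature of S_a
    f_n = features[n_a]                       # (N, C) features of ring pixels
    # Eq. (6): cosine similarity of each ring pixel to the anomalous feature.
    sim = f_n @ f_a / (np.linalg.norm(f_n, axis=1) * np.linalg.norm(f_a) + 1e-8)
    ring_coords = np.argwhere(n_a)            # (N, 2) as (row, col)
    return ring_coords[np.argsort(sim)[:k]]   # Eq. (7): lowest-k similarity

# Toy features: the anomaly sits at (2, 2); two ring pixels look very different.
features = np.tile([1.0, 0.1], (5, 5, 1))
features[2, 2] = [1.0, 0.0]
features[1, 1] = [0.0, 1.0]
features[3, 3] = [0.0, 1.0]
mask = np.zeros((5, 5))
mask[2, 2] = 1
neg = negative_points(features, mask, k=2)
print(neg)
```

Picking the least similar ring pixels gives SAM negative prompts that lie close to the anomaly but clearly belong to normal material.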

CPS Module Design Details

Three-Level Cascaded Structure:

Point Prompts Only:

P = Concat(Ph, Pl)                (8)
M1, logit1 = Decm(F, P)           (9)

Point + Logit Prompts:

M2, logit2 = Decm(F, Concat(P, logit1))     (10)

Point + Bounding Box + Logit Prompts:

box = Flocation(M2)               (11)
M3 = Decm(F, Concat(P, box, logit2))        (12)
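The control flow of Equations (8) to (12) can be sketched as follows. The `decoder` argument is a hypothetical callable standing in for Decm, and `toy_decoder` is a trivial stub used only to make the example runnable; neither reflects SAM's actual decoder. The box extraction assumes a non-empty mask.

```python
import numpy as np

def mask_to_box(mask):
    """F_location: tight (x_min, y_min, x_max, y_max) box around a non-empty mask."""
    rows, cols = np.nonzero(mask)
    return np.array([cols.min(), rows.min(), cols.max(), rows.max()])

def cascaded_prompts(decoder, features, pos_pts, neg_pts):
    """Run the three cascade levels with a caller-supplied decoder.

    `decoder(features, points, labels, box=None, logits=None)` is hypothetical
    and must return (mask, logits).
    """
    points = np.concatenate([pos_pts, neg_pts], axis=0)                # Eq. (8)
    labels = np.concatenate([np.ones(len(pos_pts)), np.zeros(len(neg_pts))])
    m1, logit1 = decoder(features, points, labels)                     # Eq. (9)
    m2, logit2 = decoder(features, points, labels, logits=logit1)      # Eq. (10)
    box = mask_to_box(m2)                                              # Eq. (11)
    m3, _ = decoder(features, points, labels, box=box, logits=logit2)  # Eq. (12)
    return m3

def toy_decoder(features, points, labels, box=None, logits=None):
    # Stub only: averages the input with previous logits and clips to the box.
    logits_out = features if logits is None else 0.5 * (features + logits)
    mask = logits_out > 0
    if box is not None:
        x0, y0, x1, y1 = box
        clipped = np.zeros_like(mask)
        clipped[y0:y1 + 1, x0:x1 + 1] = mask[y0:y1 + 1, x0:x1 + 1]
        mask = clipped
    return mask, logits_out

toy_features = -np.ones((4, 4))
toy_features[1:3, 1:3] = 1.0
final_mask = cascaded_prompts(
    toy_decoder, toy_features, np.array([[1, 1]]), np.array([[0, 0]])
)
print(final_mask.astype(int))
```

Because the image features F are computed once and reused, each extra level costs only one more decoder call. In the `segment_anything` package, `SamPredictor.predict` accepts point, box, and `mask_input` prompts, which is one plausible way to realize the stub with a real model.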

Technical Innovations

Collaborative Feature Utilization: Unlike existing sequential approaches, the PPG module simultaneously leverages both CLIP and SAM features for point prompt generation
Intelligent Negative Point Selection: Through dilation functions and feature similarity computation, more effective negative prompts are selected, preventing SAM from segmenting entire objects
Progressive Constraint Enhancement: The CPS module progressively strengthens constraints on SAM through three-level cascading for precise segmentation
Lightweight Design: Utilizes only SAM's lightweight decoder for iterative optimization with additional computational overhead of merely 100 milliseconds

Experimental Setup

Datasets

MVTec-AD: Contains high-resolution industrial object images with complete pixel-level annotations
VisA: Industrial anomaly detection dataset containing multiple anomaly types

Evaluation Metrics

AUROC: Area under the ROC curve; reflects the model's ability to separate normal and anomalous pixels across all thresholds
F1-max: Harmonic mean of precision and recall at the threshold that maximizes it
AP (Average Precision): Area under the precision-recall curve, summarizing precision across recall levels
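As an illustration of the F1-max metric, the sketch below sweeps a fixed grid of thresholds and keeps the best F1 score. The grid of 101 thresholds and the toy scores are assumptions made for the example; evaluation code in practice often sweeps every distinct score instead.

```python
import numpy as np

def f1_max(scores, labels, num_thresholds=101):
    """Best F1 over a threshold sweep; scores in [0, 1], labels in {0, 1}."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Four pixels: two normal (label 0) and two anomalous (label 1).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
best_f1 = f1_max(toy_scores, toy_labels)
print(round(best_f1, 3))
```

In this toy case the best threshold accepts one false positive to recover both anomalous pixels, giving precision 2/3, recall 1, and F1 of 0.8.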

Comparison Methods

CLIP-based Methods: WinCLIP, APRIL-GAN, SDP, SDP+, AnomalyCLIP
SAM-based Methods: SAA, SAA+
CLIP & SAM Collaboration Methods: ClipSAM

Implementation Details

CLIP Model: Pre-trained ViT-L-14-336 model
SAM Model: ViT-H pre-trained model
Optimizer: Adam, learning rate 1e-3
Training Settings: 3 epochs for VisA dataset, 15 epochs for MVTec-AD dataset
Hardware: NVIDIA GeForce RTX 3090, batch size 16

Experimental Results

Main Results

Method Category	Method	MVTec-AD AUROC	MVTec-AD F1-max	MVTec-AD AP	VisA AUROC	VisA F1-max	VisA AP
CLIP-based	WinCLIP	85.1	31.7	-	79.6	14.8	-
CLIP-based	APRIL-GAN	87.6	43.3	40.8	94.2	32.3	25.7
CLIP-based	AnomalyCLIP	91.1	39.1	34.5	95.5	28.3	21.3
SAM-based	SAA+	73.2	37.8	28.8	74.0	27.1	22.4
CLIP&SAM	ClipSAM	92.3	47.8	45.9	95.6	33.1	26.0
Proposed	Ours	89.5	48.8	46.4	94.8	36.5	28.0

Key Findings:

Surpasses all compared methods on F1-max and AP on both datasets
Improves F1-max by 10.3% and AP by 7.7% on VisA, measured relative to the previous best method, ClipSAM (33.1 → 36.5 and 26.0 → 28.0)
Improves F1-max by 2.1% and AP by 1.1% on MVTec-AD, again relative to ClipSAM (47.8 → 48.8 and 45.9 → 46.4)
AUROC is slightly lower than that of the best methods, because relying on SAM's segmentation results expands the predicted anomalous regions
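The improvement percentages quoted above appear to be relative gains over ClipSAM rather than differences in percentage points. That reading is an inference from the table values, and the short check below reproduces the four figures under that assumption.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Values copied from the main results table (Ours vs. ClipSAM).
visa_f1 = relative_gain(36.5, 33.1)
visa_ap = relative_gain(28.0, 26.0)
mvtec_f1 = relative_gain(48.8, 47.8)
mvtec_ap = relative_gain(46.4, 45.9)
print(round(visa_f1, 1), round(visa_ap, 1), round(mvtec_f1, 1), round(mvtec_ap, 1))
# prints: 10.3 7.7 2.1 1.1
```

In absolute terms the same gains are 3.4 and 2.0 points on VisA, and 1.0 and 0.5 points on MVTec-AD.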

Ablation Studies

Impact of Dilation Function Parameters

Tests the impact of different kernel shapes and sizes on performance:

Shape	Size	AUROC	F1-max	AP
Ellipse	(25,25)	89.5	48.8	46.4
Rectangle	(20,20)	89.5	47.7	45.6
Cross	(25,25)	89.2	46.5	44.1

Conclusion: Among the configurations listed, the elliptical kernel of size (25,25) performs best on F1-max and AP.
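To make the three kernel shapes concrete, the sketch below builds binary structuring elements in plain NumPy. The helper `make_kernel` is hypothetical, and its ellipse is a simple geometric approximation that may differ pixel by pixel from the kernels produced by an image-processing library.

```python
import numpy as np

def make_kernel(shape, size):
    """Build a binary structuring element of the given shape and (width, height)."""
    w, h = size
    if shape == "rectangle":
        return np.ones((h, w), dtype=np.uint8)
    if shape == "cross":
        k = np.zeros((h, w), dtype=np.uint8)
        k[h // 2, :] = 1   # horizontal bar
        k[:, w // 2] = 1   # vertical bar
        return k
    if shape == "ellipse":
        ys, xs = np.mgrid[:h, :w]
        cy, cx = (h - 1) / 2, (w - 1) / 2
        inside = ((ys - cy) / (h / 2)) ** 2 + ((xs - cx) / (w / 2)) ** 2 <= 1
        return inside.astype(np.uint8)
    raise ValueError(f"unknown shape: {shape}")

# Number of active pixels per shape at the same (25, 25) size.
areas = {s: int(make_kernel(s, (25, 25)).sum()) for s in ("rectangle", "ellipse", "cross")}
print(areas)
```

At equal size the three shapes cover very different areas, so they grow the anomalous region differently and yield different surrounding rings from which negative points are drawn.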

Cascading Stage Effects

Cascade Stage	AUROC	F1-max	AP
Point prompts only	88.7	42.5	39.2
Point + logit1	88.1	46.8	44.8
Point + box + logit2	89.5	48.8	46.4

Key Findings:

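Second cascade raises F1-max by 4.3 points (42.5 → 46.8) and AP by 5.6 points (39.2 → 44.8), while AUROC dips slightly (88.7 → 88.1)
Third cascade adds a further 2.0 points of F1-max and 1.6 points of AP, and AUROC recovers to 89.5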

Case Analysis

Visualization results demonstrate:

CLIP-based methods accurately localize anomalies but with blurry boundaries
SAM-based methods have clear boundaries but inaccurate localization
The proposed method achieves both accurate localization and clear boundaries

Related Work

Foundation Models

CLIP: An early vision-language model pre-trained on web-scale image-text pairs, with powerful multimodal alignment capability
SAM: Demonstrates strong open-world object segmentation capability, enabling high-quality segmentation with various prompts

Zero-Shot Anomaly Segmentation Methods

CLIP-based Methods: Utilize sliding windows and multi-layer features, but with limited boundary-awareness
SAM-based Methods: Possess strong boundary-awareness but limited localization ability
CLIP & SAM Collaboration Methods: Existing methods fail to fully exploit complementary strengths of both models

Advantages of This Work

Compared to existing work, this paper better leverages the advantages of both foundation models through collaborative feature utilization and cascaded prompting mechanisms.

Conclusions and Discussion

Main Conclusions

The proposed CLIP-SAM collaboration framework effectively combines the advantages of both foundation models
PPG and CPS modules significantly improve zero-shot anomaly segmentation performance
Achieves state-of-the-art performance levels across multiple datasets

Limitations

Inference Speed: Running two foundation models makes inference slower
AUROC Performance: Slightly inferior to some methods on AUROC metric
Computational Resources: Requires substantial computational resources

Future Directions

The authors plan to keep exploring how to combine the strengths of different models in an efficient, lightweight way to improve anomaly segmentation.

In-Depth Evaluation

Strengths

Strong Method Innovation: PPG and CPS modules are ingeniously designed, effectively addressing limitations of existing methods
Comprehensive Experiments: Thorough comparative and ablation studies across multiple datasets
Significant Performance Gains: Substantial improvements on key metrics
Clear Technical Details: Detailed method description with clear formula derivations

Weaknesses

Computational Efficiency Issues: While authors claim only 100 milliseconds additional overhead, overall inference time remains lengthy
AUROC Performance Decline: Performance decreases on the important AUROC metric, requiring further optimization
Limited Generalization Assessment: Evaluation on only two datasets; generalization capability requires broader verification

Impact

Academic Contribution: Provides new perspectives and methods for zero-shot anomaly detection research
Practical Value: Significant application value in industrial anomaly detection
Reproducibility: Detailed method description and implementation details facilitate reproduction

Applicable Scenarios

Industrial quality inspection
Medical image anomaly detection
Security monitoring anomaly event detection
Other applications requiring zero-shot anomaly segmentation

References

The paper cites 40 related references covering important works in foundation models, anomaly detection, and computer vision, providing a comprehensive literature review.

Overall Assessment: The proposed CLIP-SAM collaboration framework demonstrates technical innovation with impressive experimental results. While there remains room for improvement in computational efficiency and certain metrics, the work makes important contributions to zero-shot anomaly detection research with significant academic and practical value.