2025-11-12T14:13:10.569513

Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts

Hou, Xu, Li et al.

Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero-shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework leverages the anomaly localization capability of CLIP and the boundary perception ability of SAM. (1) To mitigate SAM's inclination towards object segmentation, we propose the Co-Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Experiments across multiple datasets consistently show that our approach achieves state-of-the-art zero-shot anomaly segmentation results. Particularly noteworthy is our performance on the VisA dataset, where we outperform the state-of-the-art methods by 10.3% and 7.7% in terms of F1-max and AP, respectively.



Basic Information

Paper ID: 2510.11028
Title: Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
Authors: Yanning Hou, Ke Xu, Junfa Li, Yanran Ruan, Jianfeng Qiu (School of Artificial Intelligence, Anhui University)
Category: cs.CV (Computer Vision)
Published: October 13, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.11028v1

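Summary

This paper proposes a novel two-stage framework for zero-shot anomaly segmentation in industrial anomaly detection. The framework exploits CLIP's strong anomaly localization and SAM's boundary perception. With the Co-Feature Point Prompt Generation (PPG) module and the Cascaded Prompts for SAM (CPS) module, the method achieves state-of-the-art zero-shot anomaly segmentation results on multiple datasets. On VisA in particular, it improves F1-max and AP over the previous best method by 10.3% and 7.7%, respectively.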

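Background and Motivation

1. Problem Addressed

The paper targets zero-shot anomaly segmentation (ZSAS), particularly in industrial anomaly detection, where anomalous regions in an image must be accurately localized and segmented without any anomalous samples as training data.

2. Why the Problem Matters

Data scarcity: anomalous samples are rare in industrial settings, while conventional methods need large amounts of annotated data
Diversity of anomaly types: anomalies in real applications vary widely and are hard to define in advance
Industrial demand: industry handles millions of product categories, which makes conventional supervised learning impractical

3. Limitations of Existing Methods

CLIP-based methods: localize anomalies effectively but have weak boundary perception, so the segmentation is coarse
SAM-based methods: have strong boundary perception but limited localization ability, and tend to segment the whole object rather than the anomalous region
Existing CLIP & SAM collaborative methods: do not fully exploit the respective strengths of the two models, and their prompting strategies are too rigid

4. Research Motivation

Building on the strong generalization of foundation models (CLIP and SAM), the goal is an effective collaborative framework that fully exploits CLIP's anomaly localization and SAM's precise segmentation to achieve high-quality zero-shot anomaly segmentation.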

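Core Contributions

A novel CLIP-SAM collaborative framework: a two-stage zero-shot anomaly segmentation framework that combines CLIP's anomaly localization with SAM's boundary perception
Co-Feature Point Prompt Generation (PPG) module: uses CLIP and SAM collaboratively to generate positive and negative point prompts, guiding SAM to segment anomalous regions rather than the whole object
Cascaded Prompts for SAM (CPS) module: introduces a cascaded hybrid-prompt mechanism that further refines SAM's segmentation, mitigating rough boundaries and isolated noise
State-of-the-art performance: notable gains on multiple datasets; on VisA, F1-max and AP improve by 10.3% and 7.7%, respectively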

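Method Details

Task Definition

Zero-shot anomaly segmentation is defined as follows: given a test image, and without any anomalous samples as training data, accurately identify and segment the anomalous regions in the image and output a pixel-level anomaly mask.

Model Architecture

Overall Architecture

The framework uses a two-stage design:

Stage 1: the PPG module generates the initial point prompts
Stage 2: the CPS module refines the segmentation through cascaded prompts

PPG Module Design

Positive point localization:

Ra = Sa ⊗ Mapa                    (1)
Ph = Topk(Ra)                     (2)

Here Sa is the extreme anomaly region, Mapa is the anomaly map produced by CLIP, Ra is their intersection, and Ph is the set of top-k anomaly points selected as positive point prompts.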
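
Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes ⊗ is element-wise multiplication with a binary mask, and it obtains Sa by thresholding the anomaly map, which is only a placeholder since the summary does not say how Sa is derived. The function name positive_points and the parameter k are hypothetical.

```python
import numpy as np

def positive_points(anomaly_map, extreme_mask, k=3):
    """Return k (row, col) positive point prompts following Eq. (1)-(2)."""
    # Eq. (1): keep CLIP anomaly scores only inside the extreme anomaly region Sa.
    ra = anomaly_map * extreme_mask
    # Eq. (2): flat indices of the k largest scores, highest first.
    flat = np.argsort(ra, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, ra.shape)
    return np.stack([rows, cols], axis=1)

anomaly_map = np.array([[0.1, 0.9, 0.2],
                        [0.8, 0.7, 0.1],
                        [0.0, 0.3, 0.95]])
# Assumption: Sa is a simple threshold of the anomaly map (placeholder only).
extreme_mask = (anomaly_map > 0.6).astype(float)
points = positive_points(anomaly_map, extreme_mask, k=2)
print(points)  # the two highest-scoring pixels inside Sa
```

Restricting the top-k search to Sa keeps the positive prompts inside the most confident part of the CLIP map, which matches the stated goal of steering SAM towards the anomaly rather than the object.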

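Negative point localization:

Na = dilate(Sa) - Sa              (3)
F = EncI(img)                     (4)
Fa = F ⊗ Sa, Fn = F ⊗ Na         (5)
Maps = Similarity(Fa, Fn)         (6)
Pl = Lowestk(Maps)                (7)

A dilation function yields the region Na surrounding the anomaly, and the SAM image encoder extracts the features F. The cosine similarity between the features of the anomalous region and those of the surrounding region is then computed, and the k pixels with the lowest similarity are selected as negative point prompts.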
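
The negative point steps in Equations (3) to (7) might look like the sketch below. Several details are assumptions rather than the paper's choices: the 3x3 structuring element, the use of the mean anomaly feature as the reference vector for cosine similarity, and the helper names dilate and negative_points. The feature array stands in for the output of the SAM image encoder.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element (pure NumPy)."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def negative_points(features, sa, k=2, iterations=1):
    """features: (H, W, C) array, sa: (H, W) binary mask. Returns k (row, col) points."""
    sa = sa.astype(bool)
    na = dilate(sa, iterations) & ~sa      # Eq. (3): ring around the anomaly
    fa = features[sa]                       # Eq. (5): features inside Sa, shape (n_a, C)
    fn = features[na]                       # Eq. (5): features inside Na, shape (n_n, C)
    # Assumption: each ring pixel is compared with the mean anomaly feature.
    proto = fa.mean(axis=0)
    sim = fn @ proto / (np.linalg.norm(fn, axis=1) * np.linalg.norm(proto) + 1e-8)  # Eq. (6)
    lowest = np.argsort(sim)[:k]            # Eq. (7): least similar ring pixels
    return np.argwhere(na)[lowest]

# Toy feature map standing in for the SAM encoder output F = EncI(img).
features = np.tile(np.array([1.0, 0.0]), (4, 4, 1))
features[0, 0] = [0.0, 1.0]    # orthogonal to the anomaly feature
features[2, 2] = [-1.0, 0.0]   # opposite to the anomaly feature
sa = np.zeros((4, 4), dtype=bool)
sa[1, 1] = True
neg = negative_points(features, sa, k=2)
print(neg)
```

Picking the ring pixels that look least like the anomaly gives SAM explicit "not this" evidence right at the anomaly border, which is where it would otherwise expand to the whole object.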

CPS Module Design

Three-level cascade:

Point prompts only:

P = Concat(Ph, Pl)                (8)
M1, logit1 = Decm(F, P)           (9)

Point + logit prompts:

M2, logit2 = Decm(F, Concat(P, logit1))    (10)

Point + bounding box + logit prompts:

box = Flocation(M2)               (11)
M3 = Decm(F, Concat(P, box, logit2))       (12)
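
The control flow of the three-level cascade in Equations (8) to (12) can be sketched as follows. The decoder here is a hypothetical callable passed in by the caller, and toy_decoder is a stand-in written only to make the sketch runnable; neither is SAM's real interface. Treating Flocation as the tight bounding box of M2 is also an assumption.

```python
import numpy as np

def mask_to_box(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) of a binary mask, assumed for Eq. (11)."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

def cascaded_prompts(decoder, features, pos_points, neg_points):
    """Run three decoder passes with progressively richer prompts."""
    # Eq. (8): merge positive (label 1) and negative (label 0) points.
    points = np.concatenate([pos_points, neg_points], axis=0)
    labels = np.concatenate([np.ones(len(pos_points), dtype=int),
                             np.zeros(len(neg_points), dtype=int)])
    # Eq. (9): points only.
    m1, logit1 = decoder(features, points, labels, box=None, prev_logits=None)
    # Eq. (10): points plus the previous logits.
    m2, logit2 = decoder(features, points, labels, box=None, prev_logits=logit1)
    # Eq. (11)-(12): add a box derived from the second mask.
    box = mask_to_box(m2)
    m3, _ = decoder(features, points, labels, box=box, prev_logits=logit2)
    return m3

def toy_decoder(features, points, labels, box=None, prev_logits=None):
    """Hypothetical stand-in for a promptable mask decoder; it ignores the points."""
    logits = features.copy()
    if prev_logits is not None:
        logits = 0.5 * (logits + prev_logits)
    if box is not None:
        x0, y0, x1, y1 = box
        outside = np.ones_like(logits, dtype=bool)
        outside[y0:y1 + 1, x0:x1 + 1] = False
        logits[outside] = -10.0   # suppress everything outside the box
    return logits > 0, logits

features = np.array([[-1.0, 2.0, -1.0],
                     [-1.0, 3.0, -1.0],
                     [-1.0, -1.0, -1.0]])
m3 = cascaded_prompts(toy_decoder, features,
                      pos_points=np.array([[1, 1]]), neg_points=np.array([[2, 2]]))
print(m3.astype(int))
```

Because the image features F are computed once and only the decoder is re-run, each extra cascade level is cheap compared with a full forward pass.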

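Technical Innovations

Collaborative feature use: unlike the serial processing of existing methods, the PPG module uses CLIP and SAM features together to generate point prompts
Informed negative point selection: dilation combined with feature similarity selects more effective negative point prompts and keeps SAM from segmenting the whole object
Progressive constraint strengthening: the CPS module tightens the constraints on SAM step by step through a three-level cascade, achieving precise segmentation
Lightweight design: only SAM's lightweight decoder is run iteratively, adding just 100 ms of extra computation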

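Experimental Setup

Datasets

MVTec-AD: high-resolution images of industrial objects with complete pixel-level annotations
VisA: an industrial anomaly detection dataset covering multiple anomaly types

Evaluation Metrics

AUROC: reflects the model's ability to separate classes across threshold levels
F1-max: the harmonic mean of precision and recall at the optimal threshold
AP (Average Precision): precision across different recall levels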
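
To make the F1-max definition concrete, the sketch below sweeps every distinct score as a threshold and keeps the best F1. It is a generic illustration of the metric, not the evaluation code used in the paper, and the function name f1_max is hypothetical.

```python
import numpy as np

def f1_max(scores, labels):
    """Best F1 over all thresholds taken from the distinct score values."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        if tp == 0:
            continue  # F1 is zero when nothing is correctly detected
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)

# Five pixels: anomaly scores and ground-truth labels (1 = anomalous).
score = f1_max([0.9, 0.8, 0.4, 0.3, 0.1], [1, 0, 1, 0, 0])
print(round(score, 3))  # 0.8, reached at threshold 0.4
```

For full-resolution anomaly maps a vectorized precision-recall curve is usually preferable to this loop, but the loop shows the definition directly.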

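Compared Methods

CLIP-based methods: WinCLIP, APRIL-GAN, SDP, SDP+, AnomalyCLIP
SAM-based methods: SAA, SAA+
CLIP & SAM collaborative methods: ClipSAM

Implementation Details

CLIP model: pretrained ViT-L-14-336
SAM model: pretrained ViT-H
Optimizer: Adam, learning rate 1e-3
Training: 3 epochs on VisA, 15 epochs on MVTec-AD
Hardware: NVIDIA GeForce RTX 3090, batch size 16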

Experimental Results

Main Results

Category	Method	MVTec-AD			VisA
		AUROC	F1-max	AP	AUROC	F1-max	AP
CLIP-based	WinCLIP	85.1	31.7	-	79.6	14.8	-
CLIP-based	APRIL-GAN	87.6	43.3	40.8	94.2	32.3	25.7
CLIP-based	AnomalyCLIP	91.1	39.1	34.5	95.5	28.3	21.3
SAM-based	SAA+	73.2	37.8	28.8	74.0	27.1	22.4
CLIP&SAM	ClipSAM	92.3	47.8	45.9	95.6	33.1	26.0
This paper	Ours	89.5	48.8	46.4	94.8	36.5	28.0
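
The 10.3% and 7.7% figures quoted for VisA appear to be relative gains over ClipSAM, the strongest baseline in the table, rather than absolute point differences. That reading is an inference from the table values, and the short check below shows the arithmetic. The helper name relative_gain is hypothetical.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# VisA columns from the table: Ours vs. ClipSAM.
visa_f1 = relative_gain(36.5, 33.1)   # F1-max
visa_ap = relative_gain(28.0, 26.0)   # AP
print(f"F1-max: +{visa_f1:.1f}%  AP: +{visa_ap:.1f}%")  # F1-max: +10.3%  AP: +7.7%
```

In absolute terms the same gains are 3.4 points of F1-max and 2.0 points of AP.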