Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation
Park, Lee, Seong et al.
We propose Foreground-Covering Prototype Generation and Matching to address Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we exploit the relationship between support and query prototypes. To achieve this, we use two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet features into the learnable tokens to generate class-consistent query prototypes. The support prototypes are generated symmetrically to the query prototypes, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare the query prototypes with the support prototypes to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for FSS. Our official code is available at https://github.com/SuhoPark0706/FCP
This paper proposes the Foreground-Covering Prototype (FCP) generation and matching method to address the few-shot segmentation (FSS) problem. Unlike previous research that typically estimates target regions using support prototypes and query pixels, this work leverages the relationship between support and query prototypes. The method combines two complementary features: SAM image encoder features for pixel aggregation and ResNet features for category consistency. By constructing support and query prototypes and distinguishing target regions in query prototypes based on ResNet features, object masks are ultimately generated through the SAM mask decoder, achieving state-of-the-art performance on multiple datasets.
Few-Shot Segmentation (FSS) aims to segment target regions in unlabeled query images based on a limited number of labeled support images. This is an important task in computer vision because traditional semantic segmentation methods require large amounts of labeled data, while FSS can significantly reduce the burden of manual annotation.
Limitations of SAM: Although the Segment Anything Model (SAM) demonstrates excellent performance in segmentation tasks, it lacks cross-image category consistency and cannot classify foreground regions in query images based on support images.
Limitations of VRP-SAM:
Suboptimal prototype-to-pixel matching can yield visual reference prompts that lack sufficient foreground information or contain background elements.
Pseudo-masks based on simple pixel-to-pixel similarity are of low quality.
Query foreground pixels are difficult to enhance selectively, which can blur the distinction between foreground and background pixels.
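To make the pseudo-mask limitation concrete, the sketch below shows one common way a pixel-to-pixel pseudo-mask is built: each query pixel is scored by its highest cosine similarity to any support foreground pixel, and the scores are min-max normalized. This is an illustrative assumption, not necessarily the exact procedure used by VRP-SAM; the function name `similarity_pseudo_mask` and the flattened (N, C) feature layout are hypothetical.

```python
import numpy as np

def similarity_pseudo_mask(query_feat, support_feat, support_mask, eps=1e-8):
    """Hypothetical sketch of a pixel-to-pixel pseudo-mask.

    query_feat:   (Nq, C) flattened query pixel features
    support_feat: (Ns, C) flattened support pixel features
    support_mask: (Ns,)   binary support mask (needs at least one foreground pixel)
    Returns a (Nq,) array of foreground scores in [0, 1].
    """
    fg = support_feat[np.asarray(support_mask, dtype=bool)]  # (Nfg, C)
    q = query_feat / (np.linalg.norm(query_feat, axis=1, keepdims=True) + eps)
    s = fg / (np.linalg.norm(fg, axis=1, keepdims=True) + eps)
    sim = q @ s.T                # cosine similarity, shape (Nq, Nfg)
    best = sim.max(axis=1)       # best-matching support foreground pixel
    # Min-max normalize so the scores can be read as a soft mask.
    return (best - best.min()) / (best.max() - best.min() + eps)
```

Because every query pixel is scored independently against individual support pixels, a single spurious match is enough to raise a background pixel's score, which is one way to read the low-quality criticism above.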
This work observes that SAM image encoder features excel at pixel-level aggregation, while ResNet features are stronger in category consistency. Based on this observation, a prototype-to-prototype matching strategy is proposed to generate more reliable visual reference prompts.
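A minimal sketch of prototype-to-prototype matching follows. The source states only that query prototypes are compared with support prototypes to produce prompts; the cosine-similarity scoring, the `threshold` parameter, and the name `match_prototypes` are assumptions made for illustration.

```python
import numpy as np

def match_prototypes(query_protos, support_protos, threshold=0.5, eps=1e-8):
    """Hypothetical sketch: keep query prototypes that resemble a support prototype.

    query_protos:   (Nq, C) query prototypes
    support_protos: (Ns, C) support prototypes
    Returns (keep, score): a boolean (Nq,) selection and the (Nq,) best similarity.
    """
    q = query_protos / (np.linalg.norm(query_protos, axis=1, keepdims=True) + eps)
    s = support_protos / (np.linalg.norm(support_protos, axis=1, keepdims=True) + eps)
    sim = q @ s.T              # (Nq, Ns) cosine similarities
    score = sim.max(axis=1)    # best support match per query prototype
    keep = score >= threshold  # assumed selection rule, not from the paper
    return keep, score

# Toy usage: the first and third query prototypes align with the support one.
query = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
support = np.array([[2.0, 0.0]])
keep, score = match_prototypes(query, support)
```

In this sketch the retained query prototypes would play the role of visual reference prompts passed to the mask decoder. Matching at the prototype level compares a handful of aggregated vectors instead of every query pixel.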
Proposes the Foreground-Covering Prototype generation and matching method: Constructs prototypes for support and query images, generates visual reference prompts through prototype comparison, and produces object masks for query images through the SAM mask decoder.
Dual-feature fusion strategy: Effectively leverages the superior aggregation capability of SAM image encoder features and the category consistency of ResNet features to generate foreground-centered prototypes.
Attention-guided pseudo-masks: Proposes attention-based pseudo-masks, derived from the cross-attention weights between learnable tokens and SAM image encoder features, that effectively replace conventional pseudo-masks.
Achieves state-of-the-art performance: Validates the effectiveness of prototype-to-prototype matching on multiple datasets, achieving new state-of-the-art results.
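The attention-guided pseudo-mask among the contributions above can be sketched as follows. This is a simplified illustration, assuming a single cross-attention step without learned query/key/value projections. Taking the maximum over tokens and min-max normalizing to form the mask is an assumption, since the source says only that the cross-attention weights can replace the conventional pseudo-mask. The function names are hypothetical.

```python
import numpy as np

def cross_attention(tokens, pixels):
    """One simplified cross-attention step (no learned projections).

    tokens: (T, C) learnable tokens, pixels: (N, C) flattened pixel features.
    Returns the updated tokens (T, C) and the attention weights (T, N).
    """
    d = tokens.shape[1]
    logits = tokens @ pixels.T / np.sqrt(d)               # (T, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over pixels
    return weights @ pixels, weights

def attention_pseudo_mask(weights, height, width, eps=1e-8):
    """Assumed conversion of attention weights (T, N) into an (H, W) soft mask."""
    per_pixel = weights.max(axis=0)  # strongest token response per pixel
    per_pixel = (per_pixel - per_pixel.min()) / (per_pixel.max() - per_pixel.min() + eps)
    return per_pixel.reshape(height, width)
```

The method is described as iterative, so in practice `cross_attention` would be applied several times, feeding the updated tokens back in. The mask here is read from the weights of whichever step is chosen.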
FSS follows an episodic meta-learning protocol with two datasets, a training set D_train and a test set D_test, whose class sets C_base and C_novel, respectively, do not overlap. Each episode contains:
Support set: K labeled images S = {(I_Si, M_Si)}, i = 1, ..., K
Query set: One query image with its ground-truth mask, Q = (I_Q, M_Q); M_Q is hidden from the model and used only for training supervision and evaluation
The objective is to predict the query mask M_pred based on the support set and query image.
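The episode structure described above can be sketched as a small data container and sampler. The names `Episode` and `sample_episode`, and the dictionary mapping each class to its (image, mask) pairs, are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Episode:
    class_id: int
    support: list  # K (image, mask) pairs, all labeled
    query: tuple   # (image, mask); the mask is used only for loss or evaluation

def sample_episode(images_by_class, classes, k, rng):
    """Draw one K-shot episode from the given class set.

    images_by_class: dict mapping class id -> list of (image, mask) pairs
    classes:         C_base during training, C_novel during testing
    """
    cls = rng.choice(sorted(classes))
    # Sample K + 1 distinct items so the query never appears in the support set.
    picks = rng.sample(images_by_class[cls], k + 1)
    return Episode(class_id=cls, support=picks[:k], query=picks[k])
```

Training draws episodes with `classes` set to C_base and testing uses C_novel, so the model is always evaluated on classes it never saw during training.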
Experimental results on PASCAL-5i and COCO-20i datasets demonstrate that the proposed method achieves state-of-the-art performance across all settings:
PASCAL-5i Dataset (ResNet-50):
1-shot: 73.2% mIoU (+1.4 points over VRP-SAM's 71.8%)
5-shot: 74.0% mIoU (+2.6 points over VRP-SAM's 71.4%)
COCO-20i Dataset (ResNet-50):
1-shot: 52.5% mIoU (+2.3 points over VRP-SAM's 50.2%)
5-shot: 58.0% mIoU (+2.5 points over VRP-SAM's 55.5%)
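For readers unfamiliar with the metric behind these numbers, the sketch below computes mIoU under a convention commonly used in FSS benchmarks: intersection and union are accumulated per class over all test episodes, and the per-class IoUs are then averaged. The source does not spell out the exact protocol, so this is an assumption, and `mean_iou` is a hypothetical name.

```python
from collections import defaultdict

import numpy as np

def mean_iou(episodes):
    """Class-averaged IoU over test episodes.

    episodes: iterable of (class_id, pred_mask, gt_mask) with binary masks.
    Assumes at least one class has a non-empty union.
    """
    inter = defaultdict(int)
    union = defaultdict(int)
    for cls, pred, gt in episodes:
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter[cls] += int(np.logical_and(pred, gt).sum())
        union[cls] += int(np.logical_or(pred, gt).sum())
    # Per-class IoU from the accumulated counts, then the mean over classes.
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    return sum(ious) / len(ious)

# Toy usage: class 0 has IoU 1/2 and class 1 has IoU 2/2.
score = mean_iou([
    (0, [1, 1, 0, 0], [1, 0, 0, 0]),
    (1, [1, 1], [1, 1]),
])
# score == 0.75
```

The gains listed above are differences between two such scores, which is why they are best read as percentage points, not relative percentages.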
SAM serves as a foundation model in the segmentation domain with prompt-based design and strong zero-shot capabilities, but lacks cross-image category consistency.
The paper builds on prior work in few-shot segmentation and vision foundation models, including PFENet, CyCTR, SAM, and VRP-SAM.