Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Park, Lee, Seong et al.
We propose Foreground-Covering Prototype Generation and Matching to address Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for FSS. Our official code is available at https://github.com/SuhoPark0706/FCP

Basic Information

  • Paper ID: 2501.00752
  • Title: Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation
  • Authors: Suho Park*, SuBeen Lee*, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo† (Sungkyunkwan University)
  • Category: cs.CV (Computer Vision)
  • Submission Date: January 1, 2025 (arXiv)
  • Paper Link: https://arxiv.org/abs/2501.00752
  • Code Link: https://github.com/SuhoPark0706/FCP

Abstract

This paper proposes the Foreground-Covering Prototype (FCP) generation and matching method to address the few-shot segmentation (FSS) problem. Unlike previous research that typically estimates target regions using support prototypes and query pixels, this work leverages the relationship between support and query prototypes. The method combines two complementary features: SAM image encoder features for pixel aggregation and ResNet features for category consistency. By constructing support and query prototypes and distinguishing target regions in query prototypes based on ResNet features, object masks are ultimately generated through the SAM mask decoder, achieving state-of-the-art performance on multiple datasets.

Research Background and Motivation

Problem Definition

Few-Shot Segmentation (FSS) aims to segment target regions in unlabeled query images based on a limited number of labeled support images. This is an important task in computer vision because traditional semantic segmentation methods require large amounts of labeled data, while FSS can significantly reduce the burden of manual annotation.

Limitations of Existing Methods

  1. Limitations of SAM: Although the Segment Anything Model (SAM) demonstrates excellent performance in segmentation tasks, it lacks cross-image category consistency and cannot classify foreground regions in query images based on support images.
  2. Insufficiencies of VRP-SAM:
    • Suboptimal prototype-pixel matching relationships may result in visual reference prompts lacking sufficient foreground information or containing background elements
    • Low quality of pseudo-masks based on simple pixel-to-pixel similarity
    • Difficulty in selectively enhancing query foreground pixels, potentially blurring the distinction between foreground and background pixels

Research Motivation

This work observes that SAM image encoder features excel at pixel-level aggregation, while ResNet features are stronger in category consistency. Based on this observation, a prototype-to-prototype matching strategy is proposed to generate more reliable visual reference prompts.

Core Contributions

  1. Proposes the Foreground-Covering Prototype generation and matching method: Constructs prototypes for support and query images, generates visual reference prompts through prototype comparison, and produces object masks for query images through the SAM mask decoder.
  2. Dual-feature fusion strategy: Effectively leverages the superior aggregation capability of SAM image encoder features and the category consistency of ResNet features to generate foreground-centered prototypes.
  3. Attention-guided pseudo-masks: Proposes attention-based pseudo-masks that effectively replace traditional pseudo-masks by utilizing SAM image encoder features.
  4. Achieves state-of-the-art performance: Validates the effectiveness of prototype-to-prototype matching on multiple datasets, achieving new state-of-the-art results.

Methodology Details

Task Definition

FSS employs a meta-learning approach using two separate datasets: training set D_train and test set D_test, containing non-overlapping classes C_base and C_novel. Each episode contains:

  • Support set: K labeled images S = {(I_S^i, M_S^i)}_{i=1}^K
  • Query: One image I_Q whose ground-truth mask M_Q is hidden from the model and used only for the training loss and for evaluation

The objective is to predict the query mask M_pred based on the support set and query image.
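
The episode structure above can be sketched as a small data type. This is only an illustration under stated assumptions: the `Episode` class, the `sample_episode` helper, and the use of image ids in place of image and mask arrays are hypothetical and not taken from the official code.

```python
import random
from dataclasses import dataclass

@dataclass
class Episode:
    class_id: int   # target class shared by support and query
    support: list   # K image ids; each comes with its ground-truth mask
    query: str      # one image id; its mask is hidden from the model

def sample_episode(images_by_class, k, rng):
    """images_by_class: dict mapping class id -> list of image ids (hypothetical layout)."""
    class_id = rng.choice(sorted(images_by_class))
    picked = rng.sample(images_by_class[class_id], k + 1)  # K support images + 1 query, no repeats
    return Episode(class_id=class_id, support=picked[:k], query=picked[k])
```

During training the class ids would come from C_base and during testing from C_novel, so a model never sees test classes before evaluation.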

Model Architecture

1. Support Prototype Generation

The support prototype generation process includes two main steps:

Foreground Feature Aggregation:

Ḡ_S = ConvG(Concat(G_S, M_S, MP(G_S, M_S)))  (1)

The ground-truth mask M_S guides the SAM features G_S (Eq. 1). Learnable tokens then aggregate foreground information over T-1 steps of iterative masked cross-attention, for t = 1, ..., T-1:

P_S^t = MaskedCrossAttn(P_S^{t-1}, Ḡ_S, Ḡ_S; M_S)  (2)

Category Consistency Injection:

F̄_S = ConvF(Concat(F_S, M_S, MP(F_S, M_S)))  (3)
P_S^T = MaskedCrossAttn(P_S^{T-1}, Ḡ_S, F̄_S; M_S)  (4)
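
The two operations used in Eqs. (1)-(4) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it assumes MP is masked average pooling, uses single-head attention, and omits the learned projections, residual connections, and normalization layers that a real attention block would have. Both function names are hypothetical.

```python
import numpy as np

def masked_average_pool(feat, mask, eps=1e-6):
    """feat: (C, H, W); mask: (H, W) with values in {0, 1}. Returns the (C,) foreground mean."""
    return (feat * mask).sum(axis=(1, 2)) / (mask.sum() + eps)

def masked_cross_attention(tokens, keys, values, mask):
    """tokens: (N, C); keys: (HW, C); values: (HW, Cv); mask: (HW,) in {0, 1}.
    Assumes at least one foreground position.
    Returns the updated tokens (N, Cv) and the attention weights (N, HW)."""
    scores = tokens @ keys.T / np.sqrt(tokens.shape[1])
    scores = np.where(mask[None, :] > 0, scores, -np.inf)  # exclude background positions
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

Passing the SAM features as both keys and values corresponds to Eq. (2); passing SAM features as keys and ResNet features as values corresponds to Eq. (4).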

2. Query Prototype Generation

The query image has no ground-truth mask at inference time, so query prototypes are generated with the following strategy:

Traditional Pseudo-mask Computation:

M^pseudo_{h,w} = max_{1≤h'≤H,1≤w'≤W} M_S^{h',w'}(F_Q^{h,w} · F_S^{h',w'})  (5)
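
Eq. (5) can be read as "for each query pixel, take the best similarity to any support foreground pixel". A minimal NumPy sketch follows. It assumes the features are L2-normalized so that the dot product is a cosine similarity, which the equation itself does not state; the function name is hypothetical.

```python
import numpy as np

def conventional_pseudo_mask(f_q, f_s, m_s, eps=1e-6):
    """f_q, f_s: (C, H, W) ResNet features; m_s: (H, W) binary support mask.
    Returns an (H, W) map of the best similarity to any support foreground pixel."""
    c = f_q.shape[0]
    q = f_q.reshape(c, -1)
    s = f_s.reshape(c, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)  # assumption: cosine similarity
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    sim = q.T @ s                      # (HW_query, HW_support)
    sim = sim * m_s.reshape(1, -1)     # the M_S factor zeroes out support background
    return sim.max(axis=1).reshape(f_q.shape[1:])
```

Because every query pixel is scored independently against individual support pixels, a single spurious match is enough to raise its score, which is consistent with the low pseudo-mask quality reported for this approach.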

SAM Feature Aggregation:

Ḡ_Q = ConvG(Concat(G_Q, M^pseudo, MP(G_S, M_S)))  (6)
P_Q^t = CrossAttn(P_Q^{t-1}, Ḡ_Q, Ḡ_Q)  (7)

Attention-guided Pseudo-mask:

M^attn_{t,h,w} = max_{1≤n≤N} A_Q^{t,n,h,w}  (8)
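
Eq. (8) reduces the attention weights of all N tokens to one map by taking the maximum over tokens at each position. A short sketch, assuming the weights arrive flattened as an (N, H*W) array; how the weights are normalized is not specified here, so the function takes them as given. The function name is hypothetical.

```python
import numpy as np

def attention_pseudo_mask(attn, height, width):
    """attn: (N, H*W) attention weights of N tokens at one step t.
    Keeps, for each position, the largest weight that any token assigns to it."""
    return attn.max(axis=0).reshape(height, width)
```

A position is therefore treated as foreground if at least one token attends to it strongly, so tokens that specialize in different object parts jointly cover the whole foreground.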

Guidance Loss:

L_guide = 1/(T-1) ∑_{t=1}^{T-1} [L_BCE(M^attn_t, M_Q) + L_DL(M^attn_t, M_Q)]  (9)
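
The guidance loss averages a per-step mask loss over the T-1 aggregation steps. The sketch below assumes L_BCE is binary cross-entropy and L_DL is a standard Dice loss with a smoothing constant of 1; the exact Dice formulation used by the authors is not given in this summary, so treat these definitions as placeholders.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels; pred values in [0, 1]."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def dice_loss(pred, target, smooth=1.0):
    """One common Dice loss form (assumption): 1 - (2|A∩B| + s) / (|A| + |B| + s)."""
    inter = (pred * target).sum()
    return float(1 - (2 * inter + smooth) / (pred.sum() + target.sum() + smooth))

def guidance_loss(attn_masks, m_q):
    """attn_masks: list of T-1 arrays (H, W) in [0, 1]; m_q: (H, W) binary query mask."""
    per_step = [bce_loss(m, m_q) + dice_loss(m, m_q) for m in attn_masks]
    return sum(per_step) / len(per_step)
```

The query mask M_Q is available only during training, so this loss shapes the attention maps without being needed at inference.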

ResNet Feature Fusion:

F̄_Q = ConvF(Concat(F_Q, M^attn_{T-1}, MP(F_S, M_S)))  (10)
P_Q^T = CrossAttn(P_Q^{T-1}, Ḡ_Q, F̄_Q)  (11)
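
The notable detail in Eq. (11) is that keys and values come from different encoders: attention weights are computed against the SAM features, while the content that is mixed into the tokens comes from the ResNet features. A minimal single-head sketch, assuming both feature maps share the same spatial resolution and omitting learned projections; the function name is hypothetical.

```python
import numpy as np

def inject_class_features(tokens, sam_keys, resnet_values):
    """tokens: (N, C); sam_keys: (HW, C); resnet_values: (HW, Cv).
    Where to look is decided by SAM features; what is read out is ResNet features."""
    scores = tokens @ sam_keys.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ resnet_values
```

This keeps the region grouping learned in the earlier steps while replacing the token content with features that are more consistent across images of the same class.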

3. Prototype-to-Prototype Matching

Visual reference prompts are generated through cross-attention:

V = CrossAttn(P_S^T, P_Q^T, P_Q^T)  (12)
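
In Eq. (12) the support prototypes act as attention queries and the query-image prototypes act as keys and values, so each prompt is a weighted mix of query-image prototypes that resemble a support prototype. A simplified single-head sketch without learned projections; the function name is hypothetical.

```python
import numpy as np

def match_prototypes(p_s, p_q):
    """p_s: (Ns, C) support prototypes; p_q: (Nq, C) query prototypes.
    Returns prompts V of shape (Ns, C) and the matching weights (Ns, Nq)."""
    scores = p_s @ p_q.T / np.sqrt(p_s.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ p_q, weights
```

Since the output is built only from query prototypes, the prompts describe regions that actually exist in the query image, with the support prototypes deciding which of them to select.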

Loss Functions

The total loss comprises three components:

L_total = L_prompt + λ_ortho L_ortho + λ_guide L_guide  (15)
  • Prompt Loss: L_prompt = L_BCE(M_pred, M_Q) + L_DL(M_pred, M_Q)
  • Orthogonal Loss: Ensures different prototypes encode different information
  • Guidance Loss: Guides attention to focus on foreground regions
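
The weighted sum can be sketched as below. The form of the orthogonal loss is not given in this summary, so the version here (mean squared off-diagonal cosine similarity between prototypes) is an assumption chosen only to illustrate the idea of pushing prototypes apart; the default coefficients are the values reported under Implementation Details.

```python
import numpy as np

def orthogonal_loss(protos, eps=1e-6):
    """protos: (N, C). Assumed form: mean squared cosine similarity between distinct prototypes."""
    p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + eps)
    gram = p @ p.T
    off_diag = gram - np.diag(np.diag(gram))  # ignore each prototype's similarity to itself
    n = protos.shape[0]
    return float((off_diag ** 2).sum() / (n * (n - 1)))

def total_loss(l_prompt, l_ortho, l_guide, lam_ortho=0.05, lam_guide=0.5):
    """Weighted sum of the three loss terms."""
    return l_prompt + lam_ortho * l_ortho + lam_guide * l_guide
```

Under this assumed form the loss is zero when all prototypes are mutually orthogonal and close to one when they are all identical.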

Experimental Setup

Datasets

  • PASCAL-5i: 20 classes from PASCAL VOC 2012 and SDS, divided into 4 folds, each containing 15 base classes and 5 novel classes
  • COCO-20i: 80 classes from the COCO dataset, divided into 4 folds, each containing 60 base classes and 20 novel classes
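
The base/novel split per fold can be illustrated with a contiguous partition of class ids, which matches the usual PASCAL-5i convention (fold i holds out classes 5i+1 to 5i+5). The helper name is hypothetical, and for COCO-20i only the class counts carry over: the actual class-to-fold assignment depends on the benchmark protocol and may not be contiguous.

```python
def contiguous_fold_split(fold, num_classes=20, num_folds=4):
    """Returns (base, novel) class id lists for one fold, with class ids starting at 1."""
    per_fold = num_classes // num_folds
    novel = list(range(fold * per_fold + 1, (fold + 1) * per_fold + 1))
    base = [c for c in range(1, num_classes + 1) if c not in novel]
    return base, novel
```

For example, `contiguous_fold_split(1)` holds out classes 6 to 10 as novel and keeps the remaining 15 classes for training.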

Evaluation Metrics

Mean Intersection over Union (mIoU) is used to evaluate performance, with 1000 randomly sampled support-query pairs tested on novel classes.
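
One common way to compute mIoU in few-shot segmentation is to accumulate intersection and union per class over all test episodes and then average the per-class IoUs. The sketch below follows that convention as an assumption; the summary does not say which accumulation scheme the paper uses, and the function name is hypothetical.

```python
from collections import defaultdict

def mean_iou(episodes):
    """episodes: iterable of (class_id, pred, gt), where pred and gt are flat lists of 0/1."""
    inter = defaultdict(int)
    union = defaultdict(int)
    for class_id, pred, gt in episodes:
        for p, g in zip(pred, gt):
            inter[class_id] += int(bool(p) and bool(g))
            union[class_id] += int(bool(p) or bool(g))
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    return sum(ious) / len(ious)
```

Accumulating per class before averaging prevents classes that happen to be sampled more often from dominating the score.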

Implementation Details

  • Optimizer: AdamW with cosine annealing schedule
  • PASCAL-5i: 100 epochs, learning rate 2e-4
  • COCO-20i: 50 epochs, learning rate 1e-4
  • Batch size: 8
  • Number of learnable tokens: 50
  • Number of aggregation layers: T=3
  • Loss coefficients: λ_ortho=0.05, λ_guide=0.5
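
The settings above can be collected in a configuration, together with the standard cosine annealing formula for the learning rate. The dictionary keys and the helper are hypothetical, and the schedule assumes no warmup and a minimum learning rate of 0, neither of which is stated in this summary.

```python
import math

# Values taken from the list above; the key names are illustrative only.
CONFIG = {
    "pascal": {"epochs": 100, "lr": 2e-4},
    "coco": {"epochs": 50, "lr": 1e-4},
    "batch_size": 8,
    "num_tokens": 50,
    "num_steps_T": 3,
    "lambda_ortho": 0.05,
    "lambda_guide": 0.5,
}

def cosine_lr(base_lr, step, total_steps, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    cos = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cos)
```

With these assumptions the PASCAL-5i learning rate starts at 2e-4, reaches half of that at the midpoint of training, and decays to 0 at the end.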

Experimental Results

Main Results

Experimental results on PASCAL-5i and COCO-20i show that the proposed method outperforms VRP-SAM in both the 1-shot and 5-shot settings with a ResNet-50 backbone:

PASCAL-5i Dataset (ResNet-50):

  • 1-shot: 73.2% mIoU (+1.4 points over VRP-SAM's 71.8%)
  • 5-shot: 74.0% mIoU (+2.6 points over VRP-SAM's 71.4%)

COCO-20i Dataset (ResNet-50):

  • 1-shot: 52.5% mIoU (+2.3 points over VRP-SAM's 50.2%)
  • 5-shot: 58.0% mIoU (+2.5 points over VRP-SAM's 55.5%)

Ablation Study

Main Component Analysis:

  • ResNet features only (baseline): 71.8% mIoU
  • Adding prototype-to-prototype matching: 72.6% mIoU (+0.8 points over baseline)
  • Adding attention-guided pseudo-masks: 73.2% mIoU (+1.4 points over baseline)

Impact of Aggregation Steps T:

  • Optimal performance achieved at T=3
  • More steps degrade performance because the tokens over-focus on small regions

Loss Function Effectiveness:

  • Prompt loss only: 72.3% mIoU
  • Adding guidance loss: 72.7% mIoU (+0.4 points over prompt loss only)
  • Adding orthogonal loss: 72.4% mIoU (+0.1 points over prompt loss only)
  • All losses: 73.2% mIoU (+0.9 points over prompt loss only)

Pseudo-mask Quality Analysis

Attention-guided pseudo-masks are substantially more accurate than traditional pseudo-masks (values listed as attention-guided vs traditional):

  • mIoU: 60.9% vs 32.4%
  • Precision: 69.1% vs 46.5%
  • Recall: 79.4% vs 53.6%
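
The three metrics can be computed from pixel counts of true positives, false positives, and false negatives. The sketch below works on already binarized masks; how the pseudo-masks are thresholded before scoring is not stated in this summary, and the function name is hypothetical.

```python
def mask_metrics(pred, gt):
    """pred, gt: flat sequences of 0/1. Returns (iou, precision, recall)."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall
```

Precision measures how much of the pseudo-mask is true foreground, while recall measures how much of the true foreground the pseudo-mask covers, which is the property that the "foreground-covering" design targets.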

Related Work

Vision Foundation Models

SAM serves as a foundation model in the segmentation domain with prompt-based design and strong zero-shot capabilities, but lacks cross-image category consistency.

Few-Shot Segmentation Methods

Primarily divided into two categories:

  1. Prototype-based methods: Represent support foreground as prototypes for prediction
  2. Affinity learning methods: Utilize dense pixel-level correlations between support and query images

VRP-SAM generates prompts for the SAM mask decoder from a visual reference, but its prototype-to-pixel comparison can yield prompts that miss foreground information or include background elements.

Conclusion and Discussion

Main Conclusions

  1. Prototype-to-prototype matching is more effective than prototype-to-pixel matching
  2. SAM feature aggregation capability and ResNet feature category consistency are complementary
  3. Attention-guided pseudo-masks significantly outperform traditional pseudo-masks
  4. State-of-the-art performance is achieved on multiple datasets

Limitations

  1. Dependency on two pretrained models (SAM and ResNet) increases computational complexity
  2. Method effectiveness is primarily validated on natural images; generalization to other domains requires further investigation
  3. Hyperparameters (such as T and λ values) require adjustment for different datasets

Future Directions

  1. Explore more lightweight feature fusion strategies
  2. Investigate applications in specific domains such as medical imaging
  3. Further improve the efficiency and accuracy of attention mechanisms

In-Depth Evaluation

Strengths

  1. Strong technical innovation: Proposes a novel prototype-to-prototype matching paradigm that effectively leverages complementary features
  2. Comprehensive experiments: Conducts thorough experimental validation across multiple datasets and settings
  3. In-depth analysis: Clearly demonstrates method effectiveness through visualization and quantitative analysis
  4. Clear writing: Well-structured paper with accurate technical descriptions

Weaknesses

  1. Computational complexity: Simultaneous use of SAM and ResNet features may increase inference time
  2. Parameter sensitivity: Multiple hyperparameter settings may affect method stability
  3. Generalization capability: Primarily validated on natural image datasets; effectiveness on other domains remains unknown

Impact

  1. Academic contribution: Provides a new technical pathway for few-shot segmentation that may inspire subsequent research
  2. Practical value: Can reduce annotation costs in applications where labeled data is scarce
  3. Reproducibility: Provides detailed implementation details and open-source code for easy reproduction and improvement

Applicable Scenarios

  1. Segmentation tasks requiring rapid adaptation to new classes
  2. Application scenarios with scarce annotated data
  3. Computer vision applications with high segmentation accuracy requirements

References

The paper cites important works in related fields such as few-shot segmentation and vision foundation models, including classical methods such as SAM, VRP-SAM, PFENet, and CyCTR, providing a solid theoretical foundation for this research.