In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.
SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
- Paper ID: 2501.07960
- Title: SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
- Authors: Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart
- Affiliation: University of Augsburg, Germany
- Category: cs.CV (Computer Vision)
- Publication Date: January 2025
- Paper Link: https://arxiv.org/abs/2501.07960
This paper proposes SkipClick, a novel interactive segmentation architecture specifically designed for winter sports scenes. Interactive segmentation predicts high-quality segmentation masks guided by user input, utilizing click prompts as the guidance mechanism. The authors first present a baseline architecture optimized for rapid response after clicks, then describe multiple architectural improvements to enhance performance on the WSESeg dataset for segmenting winter sports equipment. On the average NoC@85 metric for WSESeg categories, the method reduces clicks by 2.336 and 7.946 compared to SAM and HQ-SAM respectively. On the HQSeg-44k dataset, the system achieves state-of-the-art results with NoC@90 of 6.00 and NoC@95 of 9.89. Additionally, the model is evaluated on a newly proposed ski person segmentation dataset.
- Core Problem: Precise localization of athletes and related equipment in winter sports scenes is required, with equipment segmentation becoming increasingly important
- Annotation Challenges: Segmentation mask annotation is time-consuming and difficult, particularly for fine-grained structures
- Domain Specificity: Winter sports equipment appears less frequently in generic datasets, presenting domain adaptation challenges
- Growing demand for precise equipment localization in sports analysis
- Interactive segmentation can significantly reduce manual annotation time
- Winter sports scenes exhibit unique visual characteristics (snowy environments, fine equipment structures)
- SAM's Limitations: Despite training on the SA-1B dataset (1.1 billion masks), SAM generalizes insufficiently to the winter sports equipment domain
- Response Time: Early fusion methods require rerunning the entire network, resulting in slow responses
- Detail Handling: Existing methods struggle with fine structures in winter sports equipment
- Real-time Interactive Segmentation Model: Proposes a real-time model capable of segmentation in specialized domains such as winter sports, with particular focus on handling fine-grained structures in images
- Architectural Innovation: Validates model performance on the WSESeg dataset through ablation studies, even surpassing SAM trained on larger datasets
- Generalization Capability: Demonstrates that the model does not overfit to the winter sports domain, showing competitive performance on generic consumer image datasets
- New Dataset: Introduces SHSeg (Ski Human Segmentation) dataset containing 534 segmentation masks and 496 images
Interactive segmentation is defined as follows: given an image $x_{img} \in \mathbb{R}^{H \times W \times 3}$, the goal is to create a high-quality segmentation mask $m \in \{0,1\}^{H \times W}$, where 1 marks the target object and 0 marks the background.
Users provide guidance through iterative interactions:
- The user examines the current mask $m_\tau$
- The user places a click $p_\tau = (i_\tau, j_\tau, l_\tau)$, where $(i_\tau, j_\tau)$ are the coordinates and $l_\tau \in \{+, -\}$ is the foreground/background label
- The network generates an improved mask $m_{\tau+1}$ based on $x_{img}$, $m_\tau$, and the accumulated clicks $p_{0:\tau}$
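The interaction steps above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `run_session`, `predict`, `place_click`, `accept`, and `toy_predict` are hypothetical names, the image is assumed to be captured inside `predict`, and the toy "network" only paints or clears a small square around the latest click.

```python
from typing import Callable, List, Tuple

Click = Tuple[int, int, str]   # (i, j, l): row, column, label "+" or "-"
Mask = List[List[int]]         # H x W grid of 0 (background) / 1 (object)


def run_session(
    predict: Callable[[Mask, List[Click]], Mask],
    place_click: Callable[[Mask], Click],
    accept: Callable[[Mask], bool],
    height: int,
    width: int,
    max_clicks: int = 20,
) -> Tuple[Mask, List[Click]]:
    """Alternate between a user click and a mask update until accepted."""
    mask: Mask = [[0] * width for _ in range(height)]  # m_0: empty mask
    clicks: List[Click] = []                           # accumulated p_{0:tau}
    while len(clicks) < max_clicks and not accept(mask):
        clicks.append(place_click(mask))   # user inspects m_tau, places p_tau
        mask = predict(mask, clicks)       # network returns m_{tau+1}
    return mask, clicks


def toy_predict(mask: Mask, clicks: List[Click]) -> Mask:
    """Hypothetical stand-in for the network: edit a 3x3 square at the last click."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    i, j, label = clicks[-1]
    for r in range(max(0, i - 1), min(h, i + 2)):
        for c in range(max(0, j - 1), min(w, j + 2)):
            out[r][c] = 1 if label == "+" else 0
    return out
```

In a real system `place_click` would be the annotator and `accept` their judgment of the mask; during evaluation both are usually simulated from the ground truth.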
- Backbone Network: Uses DINOv2 pre-trained ViT-B to avoid annotation data bias
- Image Feature Extraction:
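  $f_{img} = \mathrm{Linear}(\mathrm{ViTBackbone}(x_{img})) \in \mathbb{R}^{\frac{H}{14} \times \frac{W}{14} \times d_{model}}$
- Prompt Encoding: Encodes positive and negative clicks as disks with a radius of 5 pixels, generating click maps $m_+$ and $m_-$:
  $f_{prompt} = \mathrm{PatchEmbedding}(\mathrm{Concat}(m_+, m_-, m_\tau))$
- Feature Fusion:
  $f_{mix} = f_{img} + f_{prompt}$, $\hat{f}_{mix} = \mathrm{ViTBlocks}(f_{mix})$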
- Mask Decoding: Uses FPN and SegFormer decoder to generate final mask
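The disk-based prompt encoding described above can be illustrated with a small rasterizer. This is a sketch under assumptions: the function name `click_maps` is hypothetical, the disk boundary is treated as inclusive, and disks are simply clipped at the image border. None of these details are specified in this summary.

```python
from typing import List, Tuple

Click = Tuple[int, int, str]   # (row, column, label) with label "+" or "-"
Grid = List[List[int]]


def click_maps(
    clicks: List[Click], height: int, width: int, radius: int = 5
) -> Tuple[Grid, Grid]:
    """Rasterize clicks into binary maps m_plus and m_minus of size H x W."""
    m_plus: Grid = [[0] * width for _ in range(height)]
    m_minus: Grid = [[0] * width for _ in range(height)]
    for i, j, label in clicks:
        target = m_plus if label == "+" else m_minus
        # Only visit the bounding box of the disk, clipped to the image.
        for r in range(max(0, i - radius), min(height, i + radius + 1)):
            for c in range(max(0, j - radius), min(width, j + radius + 1)):
                if (r - i) ** 2 + (c - j) ** 2 <= radius ** 2:
                    target[r][c] = 1
    return m_plus, m_minus
```

The two maps would then be stacked with the previous mask $m_\tau$ as a three-channel input to the patch embedding.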
- Frozen Backbone Network: Prevents overfitting and maintains generalization capability
- Multi-layer Feature Fusion: Utilizes features from layers 3, 6, 9, 12 of ViT
  $f_1, f_2, f_3, f_4 = \mathrm{ViTBackbone}(x_{img})$, $f_{img} = \mathrm{Linear}(\mathrm{Concat}(f_1, f_2, f_3, f_4))$
- Skip Connections: U-Net-like design
  $\hat{f}_i = \mathrm{Concat}(\hat{f}_{mix}, f_i)$ for $i = 1, 2, 3, 4$
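To make the tensor shapes in these formulas concrete, the following sketch does the bookkeeping in plain Python. It assumes the patch size of 14 used by DINOv2 and the ViT-B hidden size of 768, and it assumes that both concatenations act along the channel axis. The helper names and the value `d_model = 256` in the usage note are hypothetical, since this summary does not state $d_{model}$.

```python
from typing import Tuple

PATCH_SIZE = 14              # DINOv2 ViT patch size
BACKBONE_DIM = 768           # ViT-B hidden size
TAPPED_LAYERS = (3, 6, 9, 12)


def token_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Spatial size of the token map for an input of height x width pixels."""
    if height % patch or width % patch:
        raise ValueError("input size must be a multiple of the patch size")
    return height // patch, width // patch


def concat_channels(n_layers: int = len(TAPPED_LAYERS), dim: int = BACKBONE_DIM) -> int:
    """Channels of Concat(f1, ..., f4) before the linear projection (assumed channel-wise)."""
    return n_layers * dim


def skip_channels(d_model: int, dim: int = BACKBONE_DIM) -> int:
    """Channels of Concat(f_mix_hat, f_i) for one skip connection (assumed channel-wise)."""
    return d_model + dim
```

For example, `token_grid(896, 896)` gives a 64 by 64 token map, and with the illustrative `d_model = 256` each skip connection would carry `skip_channels(256)` channels into the decoder.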
- Late Fusion Strategy: Image encoding executes only once; after interaction, only lightweight mask predictor runs
- Multi-scale Feature Integration: Combines features from different levels to preserve fine-grained information
- Skip Connection Design: Maintains access to intermediate features after prompt integration for handling fine structures
- Freezing Strategy: Preserves pre-trained model generalization by freezing backbone network
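The late fusion idea (encode the image once, then run only a lightweight predictor per click) can be sketched as a small session object. This is an illustrative structure, not the authors' code: `LateFusionSession`, `encode_image`, and `predict_mask` are hypothetical names, and the two callables stand in for the frozen backbone and the mask predictor.

```python
from typing import Any, Callable, List, Optional, Tuple

Click = Tuple[int, int, str]   # (row, column, label) with label "+" or "-"


class LateFusionSession:
    """Caches image features so that each click only triggers the light predictor."""

    def __init__(
        self,
        encode_image: Callable[[Any], Any],
        predict_mask: Callable[[Any, List[Click], Optional[Any]], Any],
    ) -> None:
        self._encode_image = encode_image
        self._predict_mask = predict_mask
        self._features: Optional[Any] = None
        self.encoder_calls = 0          # counts expensive backbone passes
        self.clicks: List[Click] = []
        self.mask: Optional[Any] = None

    def set_image(self, image: Any) -> None:
        """Run the expensive encoder once and reset the interaction state."""
        self._features = self._encode_image(image)
        self.encoder_calls += 1
        self.clicks = []
        self.mask = None

    def click(self, i: int, j: int, label: str) -> Any:
        """Add a click and update the mask from the cached features."""
        if self._features is None:
            raise RuntimeError("call set_image() before click()")
        self.clicks.append((i, j, label))
        self.mask = self._predict_mask(self._features, self.clicks, self.mask)
        return self.mask
```

An early fusion model would instead have to call the full network inside `click`, which is the source of the slower response described above.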
- Training Data: Combined COCO+LVIS dataset (99k images, 1.5 million masks)
- Evaluation Datasets:
- WSESeg: 7,452 masks, 10 winter sports equipment categories
- SHSeg: 534 skier masks, 496 images (newly proposed)
- HQSeg-44k: High-quality annotated dataset
- Generic Datasets: GrabCut, Berkeley, DAVIS, SBD
- NoC@θ: Number of clicks required to achieve IoU threshold θ
- Primary Metrics: NoC@85, NoC@90, NoC@95
- Upper Limit: Maximum 20 clicks
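The NoC@θ metric can be sketched as follows. The sketch assumes the common convention that a sample which never reaches the threshold is counted with the maximum of 20 clicks, and that the threshold comparison is inclusive. The function names are illustrative.

```python
from typing import List, Sequence


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0   # two empty masks count as a match


def noc(ious_per_click: Sequence[float], threshold: float, max_clicks: int = 20) -> int:
    """Number of clicks until the IoU first reaches the threshold, capped at max_clicks."""
    for n, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return n
    return max_clicks


def mean_noc(all_ious: List[Sequence[float]], threshold: float, max_clicks: int = 20) -> float:
    """Average NoC over a dataset, one IoU trajectory per object."""
    return sum(noc(t, threshold, max_clicks) for t in all_ious) / len(all_ious)
```

Here `ious_per_click[k]` is the IoU after click `k + 1`, so NoC@85 corresponds to `threshold=0.85`.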
- Optimizer: Adam (lr=5×10⁻⁵, β₁=0.9, β₂=0.999)
- Loss Function: Focal Loss
- Training: 55 epochs, 30,000 images per epoch
- Resolution: 896×896 for WSESeg/SHSeg/HQSeg-44k, 672×672 for DAVIS
- Random Sampling: Up to 24 initial random points, 3 training iterations
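For reference, a standard binary focal loss can be written in a few lines. This is a generic sketch of the focal loss formula $-(1 - p_t)^{\gamma} \log p_t$, not necessarily the exact variant or hyperparameters used in the paper: the focusing parameter `gamma = 2.0` and the clamping constant `eps` are assumptions, and no class weighting is applied.

```python
import math
from typing import Sequence


def binary_focal_loss(
    probs: Sequence[float],
    targets: Sequence[int],
    gamma: float = 2.0,
    eps: float = 1e-7,
) -> float:
    """Mean focal loss over pixels; probs are predicted foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)
```

With `gamma = 0` the expression reduces to ordinary binary cross-entropy. Larger values down-weight pixels that are already classified confidently, so training focuses on hard pixels.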
| Method | NoC@85 | NoC@90 |
|---|---|---|
| SAM | 8.83 | 11.86 |
| HQ-SAM | 14.44 | 16.31 |
| SkipClick | 6.49 | 9.16 |
- Reduces clicks by 2.336 compared to SAM (NoC@85)
- Reduces clicks by 7.946 compared to HQ-SAM (NoC@85)
| Method | NoC@90 | NoC@95 |
|---|---|---|
| HQ-SAM | 6.49 | 10.79 |
| SkipClick | 6.00 | 9.89 |
- SkipClick: 6.61ms (fastest)
- SAM: 15.01ms
- HQ-SAM: 18.83ms
- SAM + Schön et al.: 41.38ms
| Configuration | WSESeg Avg NoC@85 | WSESeg Avg NoC@90 |
|---|---|---|
| Baseline | 9.463 | 12.031 |
| +Frozen Backbone | 9.416 | 11.951 |
| +Intermediate Features | 7.285 | 10.344 |
| +Skip Connections | 6.494 | 9.163 |
Key Findings:
- Frozen Backbone Network: Marginal improvement (9.463→9.416)
- Intermediate Feature Fusion: Significant improvement (9.416→7.285)
- Skip Connections: Further improvement (7.285→6.494)
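The per-step gains can be recomputed directly from the ablation table. The short script below is only an arithmetic check on the reported NoC@85 values; it also shows that the headline reductions versus SAM and HQ-SAM are consistent with the three-decimal SkipClick value of 6.494 from the ablation table. The variable and function names are illustrative.

```python
from typing import List, Tuple

# WSESeg average NoC@85 values as reported in the ablation table.
ablation_noc85: List[Tuple[str, float]] = [
    ("Baseline", 9.463),
    ("+Frozen Backbone", 9.416),
    ("+Intermediate Features", 7.285),
    ("+Skip Connections", 6.494),
]


def step_gains(rows: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Clicks saved by each modification relative to the previous row."""
    return [
        (name, round(prev - cur, 3))
        for (_, prev), (name, cur) in zip(rows, rows[1:])
    ]


gains = step_gains(ablation_noc85)
final_noc85 = ablation_noc85[-1][1]
total_gain = round(ablation_noc85[0][1] - final_noc85, 3)

# Reported NoC@85 of the baselines in the WSESeg comparison table.
reduction_vs_sam = round(8.83 - final_noc85, 3)
reduction_vs_hq_sam = round(14.44 - final_noc85, 3)

for name, gain in gains:
    print(f"{name}: {gain} clicks saved")
print(f"total: {total_gain}, vs SAM: {reduction_vs_sam}, vs HQ-SAM: {reduction_vs_hq_sam}")
```

Most of the total improvement over the baseline comes from the intermediate features, with the skip connections contributing a smaller but still clear gain.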
Performance on generic datasets demonstrates the model does not overfit to winter sports domain:
| Dataset | Complete SkipClick NoC@90 |
|---|---|
| GrabCut | 1.44 |
| Berkeley | 2.45 |
| DAVIS | 4.94 |
| SBD | 6.18 |
- Soccer and basketball player segmentation [3, 9]
- Fencing blade tip tracking and segmentation [40]
- Ski equipment keypoint detection [31, 32]
- Early Fusion Methods: RITM [44], FocalClick [2], SimpleClick [28]: high quality but slow response
- Late Fusion Methods: SAM [20], InterFormer [15]: fast response but potentially sacrificing quality
- Domain Adaptation: Online adaptation methods [22, 23, 41, 42]
- SkipClick significantly outperforms SAM and HQ-SAM on winter sports equipment segmentation tasks
- Multi-layer feature fusion and skip connections are crucial for handling fine structures
- Freezing pre-trained backbone network helps maintain generalization capability
- Competitive performance on generic datasets demonstrates good generalization
- Dataset Scale: The training data is smaller than SAM's SA-1B dataset
- Domain Specificity: Although generalization is demonstrated, the design is optimized primarily for winter sports scenes
- Computational Resources: Requires a ViT-B backbone, which carries a non-trivial computational cost
- Extend to segmentation tasks in more sports domains
- Explore more lightweight architecture designs
- Investigate more efficient user interaction methods
- High Practical Value: Addresses the balance between response speed and segmentation quality in real applications
- Technical Innovation: Cleverly combines multi-layer features and skip connections to effectively handle fine structures
- Comprehensive Experiments: Includes detailed ablation studies and multi-dataset validation
- Dataset Contribution: SHSeg dataset fills the gap in skier segmentation
- Generalization Verification: Validates method universality across multiple generic datasets
- Theoretical Analysis: Lacks in-depth theoretical analysis of why multi-layer feature fusion is effective
- User Studies: Lacks evaluation of real user experience
- Edge Cases: Insufficient analysis of performance under extreme weather or lighting conditions
- Limited Comparisons: Primarily compares with SAM family, lacks comparison with other late fusion methods
- Academic Value: Provides effective solutions for interactive segmentation in specific domains
- Practical Value: Direct applicability in sports analysis, video annotation, and other applications
- Reproducibility: Provides detailed implementation information and a commitment to make the code available
- Sports Video Analysis: Particularly suitable for precise segmentation of winter sports equipment and personnel
- Video Annotation Tools: Can be integrated into video annotation systems to improve efficiency
- Fine-grained Structure Segmentation: Applicable to segmentation tasks requiring complex boundary handling
- Real-time Applications: Fast response characteristics make it suitable for interactive applications
The paper cites 46 related references, primarily including:
- [20] SAM: Segment Anything Model
- [18] HQ-SAM: Segment Anything in High Quality
- [28] SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
- [41] WSESeg dataset related work
- [44] RITM: Reviving Iterative Training with Mask Guidance
Overall Assessment: This is a high-quality computer vision paper that proposes an effective interactive segmentation solution for the specific yet important application scenario of winter sports. The technical approach is sound, experimental validation is comprehensive, and it demonstrates good practical value and academic contribution.