
SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts

Schön, Lorenz, Kienzle et al.

In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.


Basic Information

Paper ID: 2501.07960
Title: SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
Authors: Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart
Affiliation: University of Augsburg, Germany
Category: cs.CV (Computer Vision)
Publication Date: January 2025
Paper Link: https://arxiv.org/abs/2501.07960

Abstract

This paper proposes SkipClick, a novel interactive segmentation architecture specifically designed for winter sports scenes. Interactive segmentation predicts high-quality segmentation masks guided by user input, utilizing click prompts as the guidance mechanism. The authors first present a baseline architecture optimized for rapid response after clicks, then describe multiple architectural improvements to enhance performance on the WSESeg dataset for segmenting winter sports equipment. On the average NoC@85 metric for WSESeg categories, the method reduces clicks by 2.336 and 7.946 compared to SAM and HQ-SAM respectively. On the HQSeg-44k dataset, the system achieves state-of-the-art results with NoC@90 of 6.00 and NoC@95 of 9.89. Additionally, the model is evaluated on a newly proposed ski person segmentation dataset.

Research Background and Motivation

Problem Definition

Core Problem: Precise localization of athletes and related equipment in winter sports scenes is required, with equipment segmentation becoming increasingly important
Annotation Challenges: Segmentation mask annotation is time-consuming and difficult, particularly for fine-grained structures
Domain Specificity: Winter sports equipment appears less frequently in generic datasets, presenting domain adaptation challenges

Significance

Growing demand for precise equipment localization in sports analysis
Interactive segmentation can significantly reduce manual annotation time
Winter sports scenes exhibit unique visual characteristics (snowy environments, fine equipment structures)

Limitations of Existing Methods

SAM's Limitations: Despite training on the SA-1B dataset (1.1 billion masks), SAM generalizes insufficiently to the winter sports equipment domain
Response Time: Early fusion methods require rerunning the entire network, resulting in slow responses
Detail Handling: Existing methods struggle with fine structures in winter sports equipment

Core Contributions

Real-time Interactive Segmentation Model: Proposes a real-time model capable of segmentation in specialized domains such as winter sports, with particular focus on handling fine-grained structures in images
Architectural Innovation: Validates the architectural modifications through ablation studies on the WSESeg dataset, where the model surpasses SAM even though SAM was trained on a much larger dataset
Generalization Capability: Demonstrates that the model does not overfit to the winter sports domain, showing competitive performance on generic consumer image datasets
New Dataset: Introduces SHSeg (Ski Human Segmentation) dataset containing 534 segmentation masks and 496 images

Methodology

Task Definition

Interactive segmentation is defined as: given an image $x_{img} \in \mathbb{R}^{H \times W \times 3}$, the goal is to create a high-quality segmentation mask $m \in \{0,1\}^{H \times W}$, where 1 represents target objects and 0 represents background.

Users provide guidance through iterative interactions:

The user examines the current mask $m_\tau$
The user places a click $p_\tau = (i_\tau, j_\tau, l_\tau)$, where $(i_\tau, j_\tau)$ are the pixel coordinates and $l_\tau \in \{+,-\}$ is the foreground/background label
The network generates an improved mask $m_{\tau+1}$ based on $x_{img}$, $m_\tau$, and the accumulated clicks $p_{0:\tau}$
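
The interaction loop above can be sketched as follows. This is a toy illustration, not the paper's evaluation protocol: `simulated_user` and `toy_predict` are hypothetical stand-ins (the "user" clicks the first wrongly labelled pixel, and the "network" adds or removes every pixel that shares the clicked pixel's colour), and masks are represented as sets of foreground coordinates for brevity.

```python
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Click = Tuple[int, int, str]   # (i, j, label) with label '+' or '-'
Mask = Set[Pixel]              # coordinates of foreground pixels
Image = List[List[int]]        # toy image: one integer "colour" per pixel


def simulated_user(mask: Mask, target: Mask) -> Optional[Click]:
    """Hypothetical user: click the first wrongly labelled pixel, if any."""
    missed = sorted(target - mask)      # foreground the mask still lacks
    if missed:
        return (missed[0][0], missed[0][1], "+")
    extra = sorted(mask - target)       # background wrongly included
    if extra:
        return (extra[0][0], extra[0][1], "-")
    return None                         # mask already matches the target


def toy_predict(image: Image, mask: Mask, clicks: List[Click]) -> Mask:
    """Stand-in for the network: act on the colour region of the latest click."""
    i, j, label = clicks[-1]
    region = {
        (a, b)
        for a, row in enumerate(image)
        for b, value in enumerate(row)
        if value == image[i][j]
    }
    return mask | region if label == "+" else mask - region


def interact(
    image: Image,
    target: Mask,
    predict: Callable[[Image, Mask, List[Click]], Mask],
    max_clicks: int = 20,
) -> Tuple[Mask, List[Click]]:
    """Iterate: inspect m_tau, place click p_tau, predict m_{tau+1}."""
    mask: Mask = set()
    clicks: List[Click] = []
    while len(clicks) < max_clicks:
        click = simulated_user(mask, target)
        if click is None:
            break
        clicks.append(click)
        mask = predict(image, mask, clicks)
    return mask, clicks
```

With a real model, `predict` would receive the image (or its cached features), the previous mask, and all accumulated clicks, exactly as in the task definition.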

Model Architecture

Baseline Architecture

Backbone Network: Uses a DINOv2-pretrained ViT-B; because DINOv2 is trained with self-supervision, the backbone does not inherit biases from annotated data
Image Feature Extraction: $f_{img} = \text{Linear}(\text{ViTBackbone}(x_{img})) \in \mathbb{R}^{\frac{H}{14} \times \frac{W}{14} \times d_{model}}$
Prompt Encoding: Encodes positive and negative clicks as disks with a radius of 5 pixels, generating click maps $m^+, m^-$, which are embedded together with the previous mask: $f_{prompt} = \text{PatchEmbedding}(\text{Concat}(m^+, m^-, m_\tau))$
Feature Fusion: $f_{mix} = f_{img} + f_{prompt}$, followed by $\hat{f}_{mix} = \text{ViTBlocks}(f_{mix})$
Mask Decoding: Uses an FPN and a SegFormer decoder to generate the final mask
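
The prompt encoding step can be illustrated with a small sketch. It rasterizes clicks into the two binary maps $m^+$ and $m^-$ and stacks them with the previous mask. The function names, the nested-list representation, and the choice to include pixels at exactly distance 5 are assumptions made for illustration; the patch embedding itself is not shown.

```python
from typing import List, Sequence, Tuple

Click = Tuple[int, int, str]   # (i, j, label) with label '+' or '-'
Grid = List[List[int]]         # H x W binary map


def click_maps(
    clicks: Sequence[Click], height: int, width: int, radius: int = 5
) -> Tuple[Grid, Grid]:
    """Rasterize clicks as filled disks into a positive and a negative map."""
    pos = [[0] * width for _ in range(height)]
    neg = [[0] * width for _ in range(height)]
    for i, j, label in clicks:
        grid = pos if label == "+" else neg
        # Only visit the bounding box of the disk, clipped to the image.
        for a in range(max(0, i - radius), min(height, i + radius + 1)):
            for b in range(max(0, j - radius), min(width, j + radius + 1)):
                if (a - i) ** 2 + (b - j) ** 2 <= radius ** 2:
                    grid[a][b] = 1
    return pos, neg


def prompt_input(pos: Grid, neg: Grid, prev_mask: Grid) -> List[Grid]:
    """Channel-wise stack Concat(m+, m-, m_tau), indexed as [channel][i][j]."""
    return [pos, neg, prev_mask]
```

Because $f_{mix}$ is an element-wise sum, the patch embedding of this 3-channel input has to produce the same $\frac{H}{14} \times \frac{W}{14} \times d_{model}$ token grid as the image features.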

Complete SkipClick Architecture

Frozen Backbone Network: Prevents overfitting and maintains generalization capability
Multi-layer Feature Fusion: Utilizes features from layers 3, 6, 9, and 12 of the ViT: $f_1, f_2, f_3, f_4 = \text{ViTBackbone}(x_{img})$, then $f_{img} = \text{Linear}(\text{Concat}(f_1, f_2, f_3, f_4))$
Skip Connections: U-Net-like design: $\hat{f}_i = \text{Concat}(\hat{f}_{mix}, f_i) \text{ for } i = 1,2,3,4$
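
To make the tensor shapes in these formulas concrete, here is a small bookkeeping sketch. It assumes the patch size of 14 implied by the $\frac{H}{14} \times \frac{W}{14}$ grid and the standard ViT-B hidden size of 768. The value `D_MODEL = 256` is a hypothetical placeholder, and the sketch assumes the skip features $f_i$ are concatenated without an extra projection, which the summary does not specify.

```python
from typing import Dict, Tuple

PATCH = 14        # patch size implied by the H/14 x W/14 token grid
VIT_B_DIM = 768   # hidden size of a ViT-B backbone
D_MODEL = 256     # hypothetical value; not stated in the summary

Shape = Tuple[int, int, int]


def token_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Number of patch tokens along each axis."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def skipclick_shapes(
    height: int, width: int, d_model: int = D_MODEL, num_layers: int = 4
) -> Dict[str, Shape]:
    """Shapes of the intermediate tensors named in the formulas above."""
    gh, gw = token_grid(height, width)
    return {
        "f_i": (gh, gw, VIT_B_DIM),                    # one tapped ViT layer
        "concat": (gh, gw, num_layers * VIT_B_DIM),    # Concat(f_1..f_4)
        "f_img": (gh, gw, d_model),                    # after the Linear layer
        "f_mix_hat": (gh, gw, d_model),                # after the ViT blocks
        "f_i_hat": (gh, gw, d_model + VIT_B_DIM),      # Concat(f_mix_hat, f_i)
    }
```

For an 896×896 input this gives a 64×64 token grid, so the four skip tensors each have 64×64 spatial positions regardless of the assumed `d_model`.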

Technical Innovations

Late Fusion Strategy: The image is encoded only once; after each interaction, only the lightweight mask predictor runs
Multi-scale Feature Integration: Combines features from different levels to preserve fine-grained information
Skip Connection Design: Maintains access to intermediate features after prompt integration for handling fine structures
Freezing Strategy: Preserves pre-trained model generalization by freezing backbone network
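
The late fusion strategy amounts to caching the image features across clicks. A minimal sketch of that control flow follows; the class name `LateFusionSession` and the two callables are hypothetical and stand in for the frozen backbone and the lightweight mask predictor.

```python
from typing import Any, Callable, List, Optional


class LateFusionSession:
    """Encode the image once, then reuse the features for every click."""

    def __init__(
        self,
        encode_image: Callable[[Any], Any],
        predict_mask: Callable[[Any, List[Any], Optional[Any]], Any],
    ) -> None:
        self._encode_image = encode_image    # expensive: the ViT backbone
        self._predict_mask = predict_mask    # cheap: prompt fusion + decoder
        self._features: Optional[Any] = None
        self.encoder_calls = 0

    def set_image(self, image: Any) -> None:
        """Run the backbone a single time and cache its output."""
        self._features = self._encode_image(image)
        self.encoder_calls += 1

    def click(self, clicks: List[Any], prev_mask: Optional[Any]) -> Any:
        """Per-click update that never touches the backbone."""
        if self._features is None:
            raise RuntimeError("call set_image() before click()")
        return self._predict_mask(self._features, clicks, prev_mask)
```

An early fusion model would instead have to call the full network inside `click`, which is why its per-click latency is higher.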

Experimental Setup

Datasets

Training Data: Combined COCO+LVIS dataset (99k images, 1.5 million masks)
Evaluation Datasets:
- WSESeg: 7,452 masks, 10 winter sports equipment categories
- SHSeg: 534 skier masks, 496 images (newly proposed)
- HQSeg-44k: High-quality annotated dataset
- Generic Datasets: GrabCut, Berkeley, DAVIS, SBD

Evaluation Metrics

NoC@θ: Number of clicks required to achieve IoU threshold θ
Primary Metrics: NoC@85, NoC@90, NoC@95
Upper Limit: Maximum 20 clicks
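
A minimal sketch of how NoC@θ can be computed from per-click IoU values. It assumes two conventions that the summary does not spell out: the threshold counts as reached when IoU ≥ θ, and instances that never reach it are scored with the 20-click limit.

```python
from typing import Sequence, Set, Tuple

Pixel = Tuple[int, int]


def iou(pred: Set[Pixel], target: Set[Pixel]) -> float:
    """Intersection over union of two masks given as sets of pixels."""
    union = pred | target
    if not union:
        return 1.0  # both masks empty: treat as a perfect match
    return len(pred & target) / len(union)


def noc(ious_per_click: Sequence[float], threshold: float, max_clicks: int = 20) -> int:
    """Number of clicks until the IoU first reaches the threshold."""
    for k, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks  # threshold never reached within the click budget


def mean_noc(
    instances: Sequence[Sequence[float]], threshold: float, max_clicks: int = 20
) -> float:
    """Average NoC over all instances of a dataset."""
    return sum(noc(seq, threshold, max_clicks) for seq in instances) / len(instances)
```

For example, an instance whose IoU after successive clicks is 0.50, 0.80, 0.86 has NoC@85 = 3.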

Implementation Details

Optimizer: Adam (lr=5×10⁻⁵, β₁=0.9, β₂=0.999)
Loss Function: Focal Loss
Training: 55 epochs, 30,000 images per epoch
Resolution: 896×896 for WSESeg/SHSeg/HQSeg-44k, 672×672 for DAVIS
Random Sampling: Up to 24 initial random points, 3 training iterations
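
For reference, a sketch of the binary focal loss in its standard form, $-(1 - p_t)^{\gamma} \log p_t$ averaged over pixels. The value `gamma=2.0` is a common default and an assumption here; the summary names the loss but does not give its hyperparameters or say whether a normalized variant is used.

```python
import math
from typing import Sequence


def binary_focal_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    gamma: float = 2.0,   # assumed default; not given in the summary
    eps: float = 1e-7,
) -> float:
    """Mean focal loss over pixels; probs are foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        p_t = min(max(p_t, eps), 1.0 - eps)     # avoid log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` the expression reduces to ordinary binary cross-entropy; larger values down-weight pixels the model already classifies confidently.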

Experimental Results

Main Results

WSESeg Dataset Performance

Method	NoC@85	NoC@90
SAM	8.83	11.86
HQ-SAM	14.44	16.31
SkipClick	6.49	9.16

Reduces clicks by 2.336 compared to SAM (NoC@85)
Reduces clicks by 7.946 compared to HQ-SAM (NoC@85)
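
The reductions are quoted with three decimals while the table is rounded to two. A quick consistency check, assuming the unrounded SkipClick value is the 6.494 reported in the ablation table:

```python
# Values taken from this summary; 6.494 is the three-decimal SkipClick NoC@85.
skipclick_noc85 = 6.494
reductions = {"SAM": 2.336, "HQ-SAM": 7.946}

# Baseline NoC@85 implied by each reported reduction.
implied = {name: round(skipclick_noc85 + saved, 3) for name, saved in reductions.items()}

# Values listed in the WSESeg table above.
table = {"SAM": 8.83, "HQ-SAM": 14.44}

for name in reductions:
    print(f"{name}: implied {implied[name]:.3f}, table {table[name]:.2f}")
```

The implied baselines match the table entries, so the reported reductions and the table are consistent.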

State-of-the-Art on HQSeg-44k

Method	NoC@90	NoC@95
HQ-SAM	6.49	10.79
SkipClick	6.00	9.89

Response Time Comparison

SkipClick: 6.61 ms (fastest)
SAM: 15.01 ms
HQ-SAM: 18.83 ms
SAM + Schön et al.: 41.38 ms

Ablation Studies

Configuration	WSESeg Avg NoC@85	WSESeg Avg NoC@90
Baseline	9.463	12.031
+Frozen Backbone	9.416	11.951
+Intermediate Features	7.285	10.344
+Skip Connections	6.494	9.163

Key Findings:

Frozen Backbone Network: Marginal improvement (9.463→9.416)
Intermediate Feature Fusion: Significant improvement (9.416→7.285)
Skip Connections: Further improvement (7.285→6.494)
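
The size of each step can be read off the ablation table directly. The short sketch below computes the clicks saved per modification, both in absolute terms and relative to the previous configuration, using only the NoC@85 values listed above:

```python
from typing import List, Tuple

# WSESeg average NoC@85 per configuration, copied from the ablation table.
ablation: List[Tuple[str, float]] = [
    ("Baseline", 9.463),
    ("+Frozen Backbone", 9.416),
    ("+Intermediate Features", 7.285),
    ("+Skip Connections", 6.494),
]


def step_gains(rows: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """(name, clicks saved, percent saved) relative to the previous row."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        saved = round(prev - cur, 3)
        gains.append((name, saved, round(100 * saved / prev, 1)))
    return gains


for name, saved, percent in step_gains(ablation):
    print(f"{name}: {saved:.3f} clicks saved ({percent:.1f}%)")
```

Intermediate features account for the largest share of the total saving of 2.969 clicks, with skip connections second and the frozen backbone contributing very little on this metric.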

Generalization Capability Verification

Performance on generic datasets demonstrates that the model does not overfit to the winter sports domain:

Dataset	Complete SkipClick NoC@90
GrabCut	1.44
Berkeley	2.45
DAVIS	4.94
SBD	6.18

Related Work

Sports Segmentation Applications

Soccer and basketball player segmentation [3, 9]
Fencing blade tip tracking and segmentation [40]
Ski equipment keypoint detection [31, 32]

Interactive Segmentation Development

Early Fusion Methods: RITM [44], FocalClick [2], SimpleClick [28] - High quality but slow response
Late Fusion Methods: SAM [20], InterFormer [15] - Fast response but potentially sacrificing quality
Domain Adaptation: Online adaptation methods [22, 23, 41, 42]

Conclusions and Discussion

Main Conclusions

SkipClick significantly outperforms SAM and HQ-SAM on winter sports equipment segmentation tasks
Multi-layer feature fusion and skip connections are crucial for handling fine structures
Freezing the pre-trained backbone network helps maintain generalization capability
Competitive performance on generic datasets demonstrates good generalization

Limitations

Dataset Scale: The training data (COCO+LVIS, 1.5 million masks) is far smaller than SAM's SA-1B dataset (1.1 billion masks)
Domain Specificity: Although generalization is demonstrated, the design choices primarily target winter sports scenes
Computational Resources: Relies on a ViT-B backbone, so the one-time image encoding remains computationally demanding

Future Directions

Extend to segmentation tasks in more sports domains
Explore more lightweight architecture designs
Investigate more efficient user interaction methods

In-Depth Evaluation

Strengths

High Practical Value: Addresses the balance between response speed and segmentation quality in real applications
Technical Innovation: Cleverly combines multi-layer features and skip connections to effectively handle fine structures
Comprehensive Experiments: Includes detailed ablation studies and multi-dataset validation
Dataset Contribution: SHSeg dataset fills the gap in skier segmentation
Generalization Verification: Validates method universality across multiple generic datasets

Weaknesses

Theoretical Analysis: Lacks in-depth theoretical analysis of why multi-layer feature fusion is effective
User Studies: Lacks evaluation of real user experience
Edge Cases: Insufficient analysis of performance under extreme weather or lighting conditions
Limited Comparisons: Primarily compares with SAM family, lacks comparison with other late fusion methods

Impact

Academic Value: Provides effective solutions for interactive segmentation in specific domains
Practical Value: Direct applicability in sports analysis, video annotation, and other applications
Reproducibility: Provides detailed implementation details and commits to releasing the code

Applicable Scenarios

Sports Video Analysis: Particularly suitable for precise segmentation of winter sports equipment and personnel
Video Annotation Tools: Can be integrated into video annotation systems to improve efficiency
Fine-grained Structure Segmentation: Applicable to segmentation tasks requiring complex boundary handling
Real-time Applications: Fast response characteristics make it suitable for interactive applications

References

The paper cites 46 related references, primarily including:

[18] HQ-SAM: Segment Anything in High Quality
[20] SAM: Segment Anything Model
[28] SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
[41] WSESeg dataset related work
[44] RITM: Reviving Iterative Training with Mask Guidance

Overall Assessment: This is a high-quality computer vision paper that proposes an effective interactive segmentation solution for the specific yet important application scenario of winter sports. The technical approach is sound, experimental validation is comprehensive, and it demonstrates good practical value and academic contribution.