SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts

Schön, Lorenz, Kienzle et al.
In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.

Basic Information

  • Paper ID: 2501.07960
  • Title: SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
  • Authors: Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart
  • Affiliation: University of Augsburg, Germany
  • Category: cs.CV (Computer Vision)
  • Publication Date: January 2025
  • Paper Link: https://arxiv.org/abs/2501.07960

Abstract

This paper proposes SkipClick, a novel interactive segmentation architecture specifically designed for winter sports scenes. Interactive segmentation predicts high-quality segmentation masks guided by user input, utilizing click prompts as the guidance mechanism. The authors first present a baseline architecture optimized for rapid response after clicks, then describe multiple architectural improvements to enhance performance on the WSESeg dataset for segmenting winter sports equipment. On the average NoC@85 metric for WSESeg categories, the method reduces clicks by 2.336 and 7.946 compared to SAM and HQ-SAM respectively. On the HQSeg-44k dataset, the system achieves state-of-the-art results with NoC@90 of 6.00 and NoC@95 of 9.89. Additionally, the model is evaluated on a newly proposed ski person segmentation dataset.

Research Background and Motivation

Problem Definition

  1. Core Problem: Precise localization of athletes and related equipment in winter sports scenes is required, with equipment segmentation becoming increasingly important
  2. Annotation Challenges: Segmentation mask annotation is time-consuming and difficult, particularly for fine-grained structures
  3. Domain Specificity: Winter sports equipment appears less frequently in generic datasets, presenting domain adaptation challenges

Significance

  • Growing demand for precise equipment localization in sports analysis
  • Interactive segmentation can significantly reduce manual annotation time
  • Winter sports scenes exhibit unique visual characteristics (snowy environments, fine equipment structures)

Limitations of Existing Methods

  1. SAM's Limitations: Despite training on the SA-1B dataset (1.1 billion masks), SAM generalizes insufficiently to the winter sports equipment domain
  2. Response Time: Early fusion methods must rerun the entire network after every click, resulting in slow responses
  3. Detail Handling: Existing methods struggle with fine structures in winter sports equipment

Core Contributions

  1. Real-time Interactive Segmentation Model: Proposes a real-time model capable of segmentation in specialized domains such as winter sports, with particular focus on handling fine-grained structures in images
  2. Architectural Innovation: Validates model performance on the WSESeg dataset through ablation studies, even surpassing SAM trained on larger datasets
  3. Generalization Capability: Demonstrates that the model does not overfit to the winter sports domain, showing competitive performance on generic consumer image datasets
  4. New Dataset: Introduces the SHSeg (Ski Human Segmentation) dataset, containing 534 segmentation masks across 496 images

Methodology

Task Definition

Interactive segmentation is defined as follows: given an image $x_{img} \in \mathbb{R}^{H \times W \times 3}$, the goal is to create a high-quality segmentation mask $m \in \{0,1\}^{H \times W}$, where 1 marks the target object and 0 marks the background.

Users provide guidance through iterative interactions:

  1. User examines the current mask $m_\tau$
  2. User places a click $p_\tau = (i_\tau, j_\tau, l_\tau)$, where $(i_\tau, j_\tau)$ are the coordinates and $l_\tau \in \{+,-\}$ is the foreground/background label
  3. Network generates an improved mask $m_{\tau+1}$ based on $x_{img}$, $m_\tau$, and the accumulated clicks $p_{0:\tau}$
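
The three steps above can be sketched as a simple loop. This is a minimal illustration, not the authors' code: `Click`, `first_error_click`, `toy_predict`, and `interact` are hypothetical names, the simulated user simply clicks the first mislabelled pixel, and the toy predictor only flips the clicked pixel instead of running a network.

```python
from typing import Callable, List, NamedTuple, Optional

Mask = List[List[int]]  # H x W grid of 0 (background) / 1 (object)


class Click(NamedTuple):
    i: int          # row coordinate
    j: int          # column coordinate
    positive: bool  # True for a foreground (+) click, False for background (-)


def first_error_click(mask: Mask, target: Mask) -> Optional[Click]:
    """Simulated user (assumption): click the first pixel where mask != target."""
    for i, (row, target_row) in enumerate(zip(mask, target)):
        for j, (value, wanted) in enumerate(zip(row, target_row)):
            if value != wanted:
                return Click(i, j, positive=(wanted == 1))
    return None  # the mask already matches the target


def toy_predict(image, mask: Mask, clicks: List[Click]) -> Mask:
    """Stand-in for the network: copy the mask and apply the latest click."""
    new_mask = [row[:] for row in mask]
    last = clicks[-1]
    new_mask[last.i][last.j] = 1 if last.positive else 0
    return new_mask


def interact(image, target: Mask, predict: Callable, max_clicks: int = 20):
    """Run the examine / click / re-predict loop; return final mask and clicks."""
    mask = [[0] * len(row) for row in target]  # start from an empty mask
    clicks: List[Click] = []
    for _ in range(max_clicks):
        click = first_error_click(mask, target)  # steps 1 and 2
        if click is None:
            break
        clicks.append(click)
        mask = predict(image, mask, clicks)      # step 3: next mask
    return mask, clicks


target = [[0, 1, 1], [0, 1, 0]]
final_mask, used_clicks = interact(image=None, target=target, predict=toy_predict)
```

In a real system, `predict` would be the segmentation network and the simulated user would be replaced by a human annotator or a more realistic click sampler.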

Model Architecture

Baseline Architecture

  1. Backbone Network: Uses DINOv2 pre-trained ViT-B to avoid annotation data bias
  2. Image Feature Extraction: $f_{img} = \text{Linear}(\text{ViTBackbone}(x_{img})) \in \mathbb{R}^{\frac{H}{14} \times \frac{W}{14} \times d_{model}}$
  3. Prompt Encoding: Encodes positive and negative clicks as disks with a radius of 5 pixels, generating click maps $m^+, m^-$: $f_{prompt} = \text{PatchEmbedding}(\text{Concat}(m^+, m^-, m_\tau))$
  4. Feature Fusion: $f_{mix} = f_{img} + f_{prompt}$, followed by $\hat{f}_{mix} = \text{ViTBlocks}(f_{mix})$
  5. Mask Decoding: Uses FPN and SegFormer decoder to generate final mask
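
The prompt encoding step can be illustrated with a small pure-Python sketch. It assumes binary disks with a hard edge and a channel-first layout; `click_disk_map` and `prompt_input` are hypothetical helper names, and the patch embedding and the addition to $f_{img}$ are not shown.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def click_disk_map(height: int, width: int,
                   clicks: Sequence[Tuple[int, int]], radius: int = 5) -> Grid:
    """Binary map with a filled disk of `radius` pixels around each click."""
    disk_map = [[0.0] * width for _ in range(height)]
    for ci, cj in clicks:
        # Only visit the bounding box of the disk, clipped to the image.
        for i in range(max(0, ci - radius), min(height, ci + radius + 1)):
            for j in range(max(0, cj - radius), min(width, cj + radius + 1)):
                if (i - ci) ** 2 + (j - cj) ** 2 <= radius ** 2:
                    disk_map[i][j] = 1.0
    return disk_map


def prompt_input(height: int, width: int,
                 pos_clicks: Sequence[Tuple[int, int]],
                 neg_clicks: Sequence[Tuple[int, int]],
                 prev_mask: Sequence[Sequence[int]]) -> List[Grid]:
    """Stack m+, m- and the previous mask as a 3-channel input (C, H, W)."""
    return [
        click_disk_map(height, width, pos_clicks),   # m+
        click_disk_map(height, width, neg_clicks),   # m-
        [[float(v) for v in row] for row in prev_mask],  # previous mask
    ]


empty_mask = [[0] * 28 for _ in range(28)]
prompt = prompt_input(28, 28, pos_clicks=[(14, 14)], neg_clicks=[(0, 0)],
                      prev_mask=empty_mask)
```

The resulting three-channel input is what a patch embedding layer would turn into prompt tokens on the same $\frac{H}{14} \times \frac{W}{14}$ grid as the image features.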

Complete SkipClick Architecture

  1. Frozen Backbone Network: Prevents overfitting and maintains generalization capability
  2. Multi-layer Feature Fusion: Utilizes features from layers 3, 6, 9, and 12 of the ViT: $f_1, f_2, f_3, f_4 = \text{ViTBackbone}(x_{img})$, $f_{img} = \text{Linear}(\text{Concat}(f_1, f_2, f_3, f_4))$
  3. Skip Connections: U-Net-like design, $\hat{f}_i = \text{Concat}(\hat{f}_{mix}, f_i)$ for $i = 1, 2, 3, 4$
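
To make the tensor shapes in this design concrete, the following sketch tracks them for one input resolution. The patch size of 14 and the ViT-B width of 768 are standard for DINOv2 ViT-B; `D_MODEL = 256` is a purely hypothetical value, since the summary does not state $d_{model}$, and `feature_shapes` is an illustrative helper, not part of the paper.

```python
PATCH_SIZE = 14   # DINOv2 patch size, matching the H/14 x W/14 grid
VIT_B_DIM = 768   # hidden width of ViT-B
D_MODEL = 256     # hypothetical d_model; not stated in the summary


def feature_shapes(height: int, width: int, num_taps: int = 4,
                   d_model: int = D_MODEL) -> dict:
    """Return (rows, cols, channels) for each feature map in the sketch."""
    grid = (height // PATCH_SIZE, width // PATCH_SIZE)
    return {
        # each tapped backbone layer (3, 6, 9, 12)
        "f_i": grid + (VIT_B_DIM,),
        # channel-wise concatenation of the four taps
        "concat(f_1..f_4)": grid + (num_taps * VIT_B_DIM,),
        # after the linear projection
        "f_img": grid + (d_model,),
        # skip connection: fused features concatenated with one tap
        "f_hat_i": grid + (d_model + VIT_B_DIM,),
    }


shapes = feature_shapes(896, 896)
```

At the 896×896 evaluation resolution this gives a 64×64 token grid, so every skip connection operates at the same spatial resolution as the fused features.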

Technical Innovations

  1. Late Fusion Strategy: Image encoding executes only once per image; after each click, only the lightweight mask predictor runs
  2. Multi-scale Feature Integration: Combines features from different backbone layers to preserve fine-grained information
  3. Skip Connection Design: Maintains access to intermediate features after prompt integration for handling fine structures
  4. Freezing Strategy: Preserves the generalization of the pre-trained model by freezing the backbone network
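
The late fusion strategy amounts to caching the image features and reusing them for every click. A minimal sketch of that control flow is shown below; `LateFusionSession`, `fake_encoder`, and `fake_predictor` are hypothetical stand-ins used only to count calls, not the authors' implementation.

```python
class LateFusionSession:
    """Caches image features so each click only triggers the light predictor."""

    def __init__(self, encode_image, predict_mask):
        self.encode_image = encode_image  # heavy backbone, run once per image
        self.predict_mask = predict_mask  # lightweight prompt fusion + decoder
        self.features = None
        self.mask = None
        self.clicks = []

    def set_image(self, image):
        """Encode the image once and reset the interaction state."""
        self.features = self.encode_image(image)
        self.mask = None
        self.clicks = []

    def click(self, i, j, positive):
        """Add a click and re-run only the lightweight predictor."""
        if self.features is None:
            raise RuntimeError("call set_image() before click()")
        self.clicks.append((i, j, positive))
        self.mask = self.predict_mask(self.features, self.mask, self.clicks)
        return self.mask


calls = {"encoder": 0, "predictor": 0}


def fake_encoder(image):
    calls["encoder"] += 1
    return ("features-of", image)


def fake_predictor(features, mask, clicks):
    calls["predictor"] += 1
    return len(clicks)  # placeholder "mask"


session = LateFusionSession(fake_encoder, fake_predictor)
session.set_image("frame-0")
for click in [(10, 12, True), (40, 7, False), (22, 30, True)]:
    session.click(*click)
```

After three clicks the encoder has run once and the predictor three times, which is the property that keeps per-click latency low. An early fusion model would instead call the full network on every click.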

Experimental Setup

Datasets

  1. Training Data: Combined COCO+LVIS dataset (99k images, 1.5 million masks)
  2. Evaluation Datasets:
    • WSESeg: 7,452 masks, 10 winter sports equipment categories
    • SHSeg: 534 skier masks, 496 images (newly proposed)
    • HQSeg-44k: High-quality annotated dataset
    • Generic Datasets: GrabCut, Berkeley, DAVIS, SBD

Evaluation Metrics

  • NoC@θ: Number of clicks required to achieve IoU threshold θ
  • Primary Metrics: NoC@85, NoC@90, NoC@95
  • Upper Limit: Maximum 20 clicks
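
The NoC@θ metric can be computed from the IoU reached after each click. The sketch below assumes the common convention that an instance which never reaches the threshold is counted with the maximum of 20 clicks (the summary does not state this explicitly); `iou`, `noc`, and `mean_noc` are illustrative helper names.

```python
from typing import Sequence


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks given as nested lists."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0


def noc(ious_per_click: Sequence[float], threshold: float,
        max_clicks: int = 20) -> int:
    """Clicks needed to reach `threshold`; ious_per_click[k] is IoU after k+1 clicks."""
    for k, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks  # assumption: failures count as the click limit


def mean_noc(curves: Sequence[Sequence[float]], threshold: float,
             max_clicks: int = 20) -> float:
    """Average NoC over all instances of a dataset."""
    return sum(noc(c, threshold, max_clicks) for c in curves) / len(curves)


# Two toy instances: the first reaches 0.85 IoU after 2 clicks, the second never does.
curves = [[0.60, 0.86, 0.92], [0.50, 0.70, 0.80]]
noc85 = mean_noc(curves, 0.85)
noc90 = mean_noc(curves, 0.90)
```

Lower values are better, and because failures saturate at the click limit, a few hard instances can dominate the average.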

Implementation Details

  • Optimizer: Adam (lr=5×10⁻⁵, β₁=0.9, β₂=0.999)
  • Loss Function: Focal Loss
  • Training: 55 epochs, 30,000 images per epoch
  • Resolution: 896×896 for WSESeg/SHSeg/HQSeg-44k, 672×672 for DAVIS
  • Random Sampling: Up to 24 initial random points, 3 training iterations
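
For reference, a per-pixel binary focal loss in its standard form is sketched below. The summary only says "Focal Loss", so the exact variant and hyperparameters used in the paper are not known here; `gamma=2.0` is the common default from the original focal loss formulation and is an assumption, as is the helper name `binary_focal_loss`.

```python
import math


def binary_focal_loss(prob: float, label: int, gamma: float = 2.0,
                      eps: float = 1e-7) -> float:
    """Focal loss for one pixel: -(1 - p_t)^gamma * log(p_t).

    `prob` is the predicted foreground probability and `label` is 0 or 1.
    """
    p_t = prob if label == 1 else 1.0 - prob
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = binary_focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = binary_focal_loss(0.6, 1)   # less confident: contributes more
plain_ce = binary_focal_loss(0.9, 1, gamma=0.0)  # gamma=0 reduces to cross-entropy
```

The modulating factor $(1 - p_t)^\gamma$ shrinks the contribution of pixels that are already classified well, which keeps the many easy background pixels from dominating training.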

Experimental Results

Main Results

WSESeg Dataset Performance

| Method    | NoC@85 | NoC@90 |
|-----------|--------|--------|
| SAM       | 8.83   | 11.86  |
| HQ-SAM    | 14.44  | 16.31  |
| SkipClick | 6.49   | 9.16   |

  • Reduces clicks by 2.336 compared to SAM (NoC@85)
  • Reduces clicks by 7.946 compared to HQ-SAM (NoC@85)

State-of-the-Art on HQSeg-44k

| Method    | NoC@90 | NoC@95 |
|-----------|--------|--------|
| HQ-SAM    | 6.49   | 10.79  |
| SkipClick | 6.00   | 9.89   |

Response Time Comparison

  • SkipClick: 6.61 ms (fastest)
  • SAM: 15.01 ms
  • HQ-SAM: 18.83 ms
  • SAM + Schön et al.: 41.38 ms
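
The relative speed-ups implied by these response times can be checked with a few lines of arithmetic. The numbers are taken from the list above; the variable names are illustrative only.

```python
# Per-click response times in milliseconds, as reported above.
response_ms = {
    "SkipClick": 6.61,
    "SAM": 15.01,
    "HQ-SAM": 18.83,
    "SAM + Schön et al.": 41.38,
}

# How many times slower each method is than SkipClick, rounded to 2 decimals.
speedup = {
    name: round(ms / response_ms["SkipClick"], 2)
    for name, ms in response_ms.items()
    if name != "SkipClick"
}
```

This puts SkipClick at roughly 2.3× faster than SAM, 2.8× faster than HQ-SAM, and 6.3× faster than SAM combined with the method of Schön et al.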

Ablation Studies

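| Configuration          | WSESeg Avg NoC@85 | WSESeg Avg NoC@90 |
|------------------------|-------------------|-------------------|
| Baseline               | 9.463             | 12.031            |
| +Frozen Backbone       | 9.416             | 11.951            |
| +Intermediate Features | 7.285             | 10.344            |
| +Skip Connections      | 6.494             | 9.163             |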

Key Findings:

  1. Frozen Backbone Network: Marginal improvement (9.463→9.416)
  2. Intermediate Feature Fusion: Significant improvement (9.416→7.285)
  3. Skip Connections: Further improvement (7.285→6.494)
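
The per-step gains behind these findings follow directly from the NoC@85 column of the ablation table. A short sketch of the arithmetic, with illustrative variable names:

```python
# Average WSESeg NoC@85 after each cumulative modification (from the ablation table).
ablation_noc85 = [
    ("Baseline", 9.463),
    ("+Frozen Backbone", 9.416),
    ("+Intermediate Features", 7.285),
    ("+Skip Connections", 6.494),
]

# Clicks saved by each step relative to the previous configuration.
gains = [
    (name, round(prev - cur, 3))
    for (_, prev), (name, cur) in zip(ablation_noc85, ablation_noc85[1:])
]

baseline = ablation_noc85[0][1]
final = ablation_noc85[-1][1]
total_gain = round(baseline - final, 3)
relative_gain_pct = round(100 * (baseline - final) / baseline, 1)
```

The steps save 0.047, 2.131, and 0.791 clicks respectively, for a total of 2.969 clicks, or about 31.4% fewer clicks than the baseline. Intermediate features account for most of the improvement.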

Generalization Capability Verification

Performance on generic datasets indicates that the model does not overfit to the winter sports domain:

| Dataset  | Complete SkipClick NoC@90 |
|----------|---------------------------|
| GrabCut  | 1.44                      |
| Berkeley | 2.45                      |
| DAVIS    | 4.94                      |
| SBD      | 6.18                      |

Related Work

Sports Segmentation Applications

  • Soccer and basketball player segmentation [3, 9]
  • Fencing blade tip tracking and segmentation [40]
  • Ski equipment keypoint detection [31, 32]

Interactive Segmentation Development

  1. Early Fusion Methods: RITM [44], FocalClick [2], SimpleClick [28] - High quality but slow response
  2. Late Fusion Methods: SAM [20], InterFormer [15] - Fast response but potentially sacrificing quality
  3. Domain Adaptation: Online adaptation methods [22, 23, 41, 42]

Conclusions and Discussion

Main Conclusions

  1. SkipClick significantly outperforms SAM and HQ-SAM on winter sports equipment segmentation tasks
  2. Multi-layer feature fusion and skip connections are crucial for handling fine structures
  3. Freezing pre-trained backbone network helps maintain generalization capability
  4. Competitive performance on generic datasets demonstrates good generalization

Limitations

  1. Dataset Scale: The training data (COCO+LVIS) is far smaller than SAM's SA-1B dataset
  2. Domain Specificity: Although generalization is demonstrated, the architectural choices primarily target winter sports scenes
  3. Computational Resources: Requires a ViT-B backbone, which carries a non-trivial computational cost

Future Directions

  1. Extend to segmentation tasks in more sports domains
  2. Explore more lightweight architecture designs
  3. Investigate more efficient user interaction methods

In-Depth Evaluation

Strengths

  1. High Practical Value: Addresses the balance between response speed and segmentation quality in real applications
  2. Technical Innovation: Cleverly combines multi-layer features and skip connections to effectively handle fine structures
  3. Comprehensive Experiments: Includes detailed ablation studies and multi-dataset validation
  4. Dataset Contribution: SHSeg dataset fills the gap in skier segmentation
  5. Generalization Verification: Validates method universality across multiple generic datasets

Weaknesses

  1. Theoretical Analysis: Lacks in-depth theoretical analysis of why multi-layer feature fusion is effective
  2. User Studies: Lacks evaluation of real user experience
  3. Edge Cases: Insufficient analysis of performance under extreme weather or lighting conditions
  4. Limited Comparisons: Primarily compares with SAM family, lacks comparison with other late fusion methods

Impact

  1. Academic Value: Provides effective solutions for interactive segmentation in specific domains
  2. Practical Value: Direct applicability in sports analysis, video annotation, and other applications
  3. Reproducibility: Provides detailed implementation details and commits to making the code available

Applicable Scenarios

  1. Sports Video Analysis: Particularly suitable for precise segmentation of winter sports equipment and personnel
  2. Video Annotation Tools: Can be integrated into video annotation systems to improve efficiency
  3. Fine-grained Structure Segmentation: Applicable to segmentation tasks requiring complex boundary handling
  4. Real-time Applications: Fast response characteristics make it suitable for interactive applications

References

The paper cites 46 related references, primarily including:

  • [18] HQ-SAM: Segment Anything in High Quality
  • [20] SAM: Segment Anything Model
  • [28] SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
  • [41] WSESeg dataset related work
  • [44] RITM: Reviving Iterative Training with Mask Guidance

Overall Assessment: This is a high-quality computer vision paper that proposes an effective interactive segmentation solution for the specific yet important application scenario of winter sports. The technical approach is sound, experimental validation is comprehensive, and it demonstrates good practical value and academic contribution.