
Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Glazner, Tsfaty, Shalev et al.

We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.

Basic Information

Paper ID: 2511.13944
Title: Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Authors: Noam Glazner (Bar-Ilan University), Noam Tsfaty (Afeka College of Engineering), Sharon Shalev (Independent Researcher), Avishai Weizman (Ben-Gurion University of the Negev)
Category: cs.CV (Computer Vision)
Submission Date: November 17, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.13944v1

Abstract

This paper proposes a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before partitioning training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.

Research Background and Motivation

Core Problem

In deep learning research, extracting frames from video to build datasets is common practice. Random frame-level partitioning, however, causes severe information leakage: consecutive frames are highly correlated in space and time (identical backgrounds, the same objects in slightly different positions), so when they are scattered across training, validation, and test sets, a model can "memorize" scene features from the training set and report inflated performance on the validation and test sets.
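
The effect of frame-level random splitting can be illustrated with a small sketch. The toy dataset (20 videos of 30 frames each), the 70/15/15 ratios, and the helper names below are assumptions made for illustration; none of them come from the paper.

```python
import random

def random_frame_split(frames, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle individual frames and cut them into train/val/test."""
    rng = random.Random(seed)
    shuffled = frames[:]
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

def leaked_videos(split):
    """Return the video ids whose frames appear in more than one partition."""
    seen = {}
    for part, part_frames in split.items():
        for video_id, _frame_idx in part_frames:
            seen.setdefault(video_id, set()).add(part)
    return {v for v, parts in seen.items() if len(parts) > 1}

# Hypothetical toy dataset: 20 videos, 30 frames each, as (video_id, frame_index).
frames = [(v, i) for v in range(20) for i in range(30)]
split = random_frame_split(frames)
leak_fraction = len(leaked_videos(split)) / 20
print(f"videos straddling partitions: {leak_fraction:.0%}")
```

With this many frames per video, almost every video ends up with frames on both sides of the train/test boundary, which is the situation the paper sets out to avoid.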

Problem Significance

Model Evaluation Distortion: With leakage, test-set performance no longer reflects true generalization ability
Overfitting Risk: Models may overfit to specific scenes rather than learning generalizable features
Research Reliability: Affects credibility of conclusions in computer vision tasks such as object detection
Practical Application Gap: Significant discrepancy between laboratory performance and real-world deployment performance

Limitations of Existing Methods

Random Partitioning: Completely ignores spatiotemporal correlation between frames
Video-Level Partitioning: Too coarse-grained, potentially causing unbalanced data distribution
Manual Partitioning: Labor-intensive and difficult to scale to large-scale datasets

Research Motivation

This paper aims to provide a simple, scalable solution that integrates into existing dataset preparation workflows. By grouping visually similar frames, it keeps related images in the same data partition, which improves the fairness of dataset partitioning and the robustness of model evaluation.

Core Contributions

Proposes Cluster-Driven Dataset Partitioning Method: Systematically applies clustering to video-derived dataset partitioning, grouping visually similar frames into the same partition to prevent information leakage
Comprehensive Feature Extractor Evaluation: Systematically compares seven different feature extraction methods (from traditional SIFT, HOG to modern CLIP, DINO-V3), providing practitioners with method selection guidance
Plug-and-Play Solution: Provides a dataset preprocessing pipeline requiring no modification to training processes, with good scalability and practical utility
Empirical Validation: Validates the method on two benchmark datasets (ImageNet-VID and UCF101); DINO-V3 reaches a V-measure and AMI of 0.96 on ImageNet-VID, and 0.87 and 0.80 respectively on UCF101

Method Details

Task Definition

Input: A collection of unlabeled videos $V = \{V_1, V_2, \ldots, V_K\}$, where $K$ is the total number of videos

Output: An assignment of all extracted frames to training, validation, and test sets such that visually similar frames (particularly frames from the same video) land in the same partition

Constraints:

Minimize information leakage between partitions
Maintain balanced data distribution across partitions
Ensure clustering results are highly consistent with video sources

Model Architecture

The overall pipeline consists of three main stages (as shown in Figure 1):

1. Feature Extraction Stage

Each video $V_k$ is decomposed into a frame sequence $\{I_{k,1}, I_{k,2}, \ldots, I_{k,N_k}\}$, where $N_k$ is the number of frames extracted from video $V_k$.

A feature vector is extracted for each frame $I_{k,i}$ : $f_{k,i} = \Phi_{feat}(I_{k,i})$

where $f_{k,i} \in \mathbb{R}^d$ is a $d$-dimensional feature vector, and $\Phi_{feat}(\cdot)$ is the feature extraction function.

Supported Feature Extraction Methods:

Traditional Descriptors:
- SIFT [8, 9]: Scale-Invariant Feature Transform, captures local texture information
- HOG [4]: Histogram of Oriented Gradients, encodes gradient orientation patterns
Lightweight Learning Features:
- XFeat [5]: Provides efficient keypoint detection and description through a lightweight convolutional architecture
Deep Pretrained Models:
- CLIP [3]: Contrastive Language-Image Pretraining, provides semantic image representations
- SigLIP [10]: Language-Image Pretraining with Sigmoid loss
- DINO-V3 [11]: Self-supervised Vision Transformer
Aggregation Methods:
- VLAD [12]: Vector of Locally Aggregated Descriptors, applied to SIFT and XFeat, combines local keypoint descriptors into fixed-length compact feature vectors (1024-dimensional)
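
As a rough illustration of the VLAD aggregation step, the sketch below follows the standard recipe: hard-assign each local descriptor to its nearest visual word, sum the residuals per word, concatenate, and L2-normalize. The function name, the toy centroids, and the assumption that the visual words are precomputed are all illustrative; the reduction to 1024 dimensions mentioned above is not shown.

```python
import math

def vlad(descriptors, centroids):
    """Aggregate local descriptors into one fixed-length VLAD vector.

    descriptors: list of d-dimensional local descriptors for one image.
    centroids:   list of K d-dimensional visual words (assumed precomputed,
                 e.g. by k-means over training descriptors).
    Returns a K*d vector with unit L2 norm (all zeros if no residual).
    """
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for desc in descriptors:
        # Hard-assign the descriptor to its nearest visual word.
        nearest = min(range(k), key=lambda j: math.dist(desc, centroids[j]))
        for t in range(d):
            residuals[nearest][t] += desc[t] - centroids[nearest][t]
    flat = [x for row in residuals for x in row]
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm > 0 else flat

# Toy example: two 2-D visual words, three local descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vec = vlad(descriptors, centroids)
```

The output length depends only on the number of visual words and the descriptor dimension, not on how many keypoints an image has, which is what makes per-frame vectors comparable.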

2. Dimensionality Reduction and Clustering Stage

Dimensionality Reduction: Uses PaCMAP (Pairwise Controlled Manifold Approximation Projection) [6] to project high-dimensional features into a low-dimensional embedding space: $z_{k,i} = P_{PaCMAP}(f_{k,i})$

where $z_{k,i} \in \mathbb{R}^m$ is an $m$-dimensional embedding representation ($m = 256$ in this paper), and $P_{PaCMAP}(\cdot)$ is the PaCMAP projection operator.

Clustering: Applies the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [7] algorithm to cluster the embedding representations.

Rationale for HDBSCAN Selection:

Can discover clusters of arbitrary shapes
Adapts to data distributions with varying densities
Automatically determines the number of clusters
Can identify noise points
More suitable for continuous and non-uniform characteristics of video data compared to center-based methods like K-Means
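
One way to see why a density-based method copes with varying densities and isolated points is the mutual reachability distance that HDBSCAN-style algorithms build on: the distance between two points is inflated to at least the core distance of either point. The sketch below is a minimal illustration under stated assumptions; the toy 1-D points and `k=2` are made up, and the core distance here excludes the point itself, a convention that differs between implementations.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores

def mutual_reachability(points, k):
    """Pairwise mutual reachability distances: max(core_i, core_j, d(i, j))."""
    core = core_distances(points, k)
    n = len(points)
    return [
        [0.0 if i == j else max(core[i], core[j], math.dist(points[i], points[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Toy 1-D embeddings: a tight group, a looser group, and one isolated point.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.5,), (6.0,), (20.0,)]
mr = mutual_reachability(points, k=2)
```

The isolated point has a large core distance, so it stays far from everything in this metric and can be flagged as noise, while the tight and loose groups each remain internally close at their own scale.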

3. Cluster-Based Dataset Partitioning

Uses the clusters $C_j$ (each containing the features $z_{k,i}$ corresponding to frames $I_{k,i}$) as the basic unit for partitioning. Each cluster $C_j$ represents visually related frames, and the entire cluster is assigned to the same data partition (training/validation/test), thereby preventing data leakage.
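
A minimal sketch of assigning whole clusters to partitions is shown below. The summary does not state how clusters are distributed or how noise frames are handled, so the greedy largest-first rule, the 70/15/15 target ratios, and treating each noise frame (label `-1`) as its own group are all assumptions for illustration.

```python
from collections import defaultdict

def split_by_cluster(labels, ratios=(0.7, 0.15, 0.15)):
    """Assign whole clusters to train/val/test, largest cluster first.

    labels: one cluster label per frame; -1 marks a noise frame.
    Each noise frame is treated as its own singleton group (an assumption).
    Returns a list with one partition name per frame.
    """
    names = ("train", "val", "test")
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        key = ("noise", idx) if lab == -1 else ("cluster", lab)
        groups[key].append(idx)

    total = len(labels)
    targets = [r * total for r in ratios]
    sizes = [0, 0, 0]
    assignment = [None] * total
    for members in sorted(groups.values(), key=len, reverse=True):
        # Place the group in the partition furthest below its target size.
        p = max(range(3), key=lambda i: targets[i] - sizes[i])
        sizes[p] += len(members)
        for idx in members:
            assignment[idx] = names[p]
    return assignment

# Toy labels: three clusters plus two noise frames.
labels = [0] * 6 + [1] * 3 + [2] * 3 + [-1, -1]
parts = split_by_cluster(labels)
```

Because a cluster is never divided, partition sizes only approximate the target ratios; a few very large clusters can make the split noticeably unbalanced.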

Technical Innovations

Application of Density Clustering: Compared to traditional video-level or random partitioning, density-based clustering can more finely capture visual similarity between frames while avoiding forced assumptions of spherical clusters
Systematic Evaluation of Feature Extraction: Rather than relying on a single feature extraction method, provides comprehensive comparison from traditional to modern approaches, making the method more adaptable
Two-Stage Dimensionality Reduction Strategy: First extracts high-dimensional features using specific methods, then uniformly reduces to 256 dimensions using PaCMAP, preserving semantic information while improving clustering efficiency
Plug-and-Play Design: As a data preprocessing step, requires no modification to model training processes, with good engineering practicality

Experimental Setup

Datasets

ImageNet-VID (ILSVRC2015)

Source: ImageNet Large Scale Visual Recognition Challenge 2015 [14]
Usage: Validation set
Characteristics: Provides annotated images classified by object synset, suitable for evaluating information leakage in object detection
Annotation Type: Image-level object category annotations

UCF101

Source: 101-class human action video dataset [15]
Usage: All partitions
Characteristics: Contains trimmed video clips with video-level labels
Preprocessing: Extracts one frame per second to reduce visual redundancy, ensuring consecutive frames are not nearly identical
Challenge: Increased temporal variability makes clustering more difficult
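
The one-frame-per-second preprocessing can be sketched as a simple index computation. The exact sampling rule is not given in the summary, so the rounding of the frame rate and the function name below are assumptions.

```python
def one_frame_per_second(n_frames, fps):
    """Indices of the frames to keep when sampling roughly one frame per second."""
    step = max(1, round(fps))  # number of source frames per sampled frame
    return list(range(0, n_frames, step))

# A hypothetical 4-second clip at 25 frames per second.
indices = one_frame_per_second(n_frames=100, fps=25.0)
print(indices)  # [0, 25, 50, 75]
```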

Evaluation Metrics

Adjusted Mutual Information (AMI) [16]

Definition: Measures consistency between predicted clusters and ground truth labels while correcting for chance
Range: [0, 1] in practice, where 1 indicates a perfect match and values near 0 indicate chance-level agreement (AMI can be slightly negative)
Advantage: Accounts for baseline performance of random clustering
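
For reference, the general form of AMI can be written as follows, with $Y$ the ground-truth labels and $\hat{Y}$ the predicted clusters. The symbols are introduced here for illustration, and the normalizer $\operatorname{avg}$ (arithmetic mean, maximum, and so on) varies between definitions; which variant the paper uses is not stated in this summary.

$$
\mathrm{AMI}(Y, \hat{Y}) = \frac{I(Y; \hat{Y}) - \mathbb{E}\big[I(Y; \hat{Y})\big]}{\operatorname{avg}\big(H(Y), H(\hat{Y})\big) - \mathbb{E}\big[I(Y; \hat{Y})\big]}
$$

Here $I$ is mutual information, $H$ is entropy, and the expectation is taken over random clusterings with the same cluster sizes, which is what corrects the score for chance.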

V-measure [17]

Definition: Evaluates the trade-off between homogeneity and completeness of clustering
- Homogeneity: Degree to which samples in each cluster come from a single class
- Completeness: Degree to which samples from the same class share the same cluster
Range: [0, 1], where 1 indicates optimal performance
Calculation: Harmonic mean of homogeneity and completeness
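
The definitions above translate into a short entropy-based computation. The sketch below is a plain-Python illustration with equal weighting of homogeneity and completeness; the function names and the convention of returning 1.0 when an entropy is zero are assumptions chosen to match common practice, not details taken from the paper.

```python
import math
from collections import Counter

def _entropy(labels):
    """Shannon entropy of a label list (natural log)."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def _conditional_entropy(a, b):
    """H(a | b) for two equal-length label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())

def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) with equal weighting."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Cluster ids need not match class ids; only the grouping matters.
perfect = v_measure([0, 0, 1, 1], [5, 5, 7, 7])
merged = v_measure([0, 0, 1, 1], [0, 0, 0, 0])
```

Putting every frame in one cluster is perfectly complete but not homogeneous at all, so the harmonic mean drives the V-measure to zero in the `merged` case.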

Comparison Methods

This paper compares clustering performance across seven feature extraction methods:

SIFT + VLAD
HOG (224×224)
HOG (128×128)
XFeat + VLAD
CLIP (ViT-B/32)
SigLIP (ViT-B/16)
DINO-V3 (ViT-B/16)

Implementation Details

Image Preprocessing:

XFeat, CLIP, DINO-V3, SigLIP: Resized to 224×224
HOG: 128×128 or 224×224 (128×128 performs slightly better with lower dimensionality)

Feature Dimensions:

VLAD vectors: Reduced to 1024 dimensions for unified representation
PaCMAP embeddings: Projected to 256-dimensional space (m=256)

Clustering Algorithm: HDBSCAN (specific hyperparameters not detailed in the paper)

Experimental Results

Main Results

Table I presents clustering performance for each feature extraction method on ImageNet-VID (validation set) and UCF101 (all partitions):

Feature Extraction Method	Dataset	V-measure	AMI
SIFT + VLAD	ImageNet-VID	0.81	0.80
SIFT + VLAD	UCF101	0.57	0.38
HOG (224×224)	ImageNet-VID	0.82	0.81
HOG (224×224)	UCF101	0.61	0.48
HOG (128×128)	ImageNet-VID	0.87	0.86
HOG (128×128)	UCF101	0.67	0.54
XFeat + VLAD	ImageNet-VID	0.90	0.89
XFeat + VLAD	UCF101	0.72	0.58
CLIP (ViT-B/32)	ImageNet-VID	0.92	0.91
CLIP (ViT-B/32)	UCF101	0.75	0.66
SigLIP (ViT-B/16)	ImageNet-VID	0.93	0.92
SigLIP (ViT-B/16)	UCF101	0.75	0.67
DINO-V3 (ViT-B/16)	ImageNet-VID	0.96	0.96
DINO-V3 (ViT-B/16)	UCF101	0.87	0.80

Key Findings

Deep Pretrained Models Significantly Outperform Traditional Methods:
- DINO-V3 achieves highest scores on both datasets
- On ImageNet-VID, DINO-V3 improves V-measure by 18.5% relative to SIFT + VLAD (0.81 to 0.96)
- On UCF101, the relative V-measure improvement is larger, at 52.6% (0.57 to 0.87)
Dataset Difficulty Differences:
- All methods perform lower on UCF101 than ImageNet-VID
- Temporal variability in UCF101 increases clustering difficulty
- SIFT+VLAD performs weakest on UCF101 (AMI only 0.38)
Feature Extraction Method Performance Hierarchy:
- First Tier: DINO-V3 > SigLIP ≈ CLIP
- Second Tier: XFeat + VLAD
- Third Tier: HOG (128×128) > HOG (224×224)
- Fourth Tier: SIFT + VLAD
Potential of Lightweight Methods:
- XFeat + VLAD shows clear improvement over traditional descriptors
- Achieves 0.90 V-measure on ImageNet-VID
- Provides viable option for computationally constrained scenarios
Impact of Image Resolution:
- HOG performs better at 128×128 resolution than 224×224
- Lower resolution produces lower-dimensional descriptors while maintaining better performance
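
The percentage figures quoted in the first finding are relative improvements in V-measure, which can be reproduced from the Table I values. The helper name below is illustrative.

```python
# V-measure values copied from Table I.
v_measure_scores = {
    "ImageNet-VID": {"SIFT + VLAD": 0.81, "DINO-V3": 0.96},
    "UCF101": {"SIFT + VLAD": 0.57, "DINO-V3": 0.87},
}

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

gains = {
    dataset: round(relative_gain(s["DINO-V3"], s["SIFT + VLAD"]), 1)
    for dataset, s in v_measure_scores.items()
}
print(gains)  # {'ImageNet-VID': 18.5, 'UCF101': 52.6}
```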

Experimental Findings

Advantages of Semantic Representations: Deep pretrained models (especially DINO-V3) capture high-level semantic information better, more effectively identifying visual similarity, which is crucial for information leakage detection
Effectiveness of Self-Supervised Learning: DINO-V3 as a self-supervised method performs best, demonstrating that representations suitable for clustering tasks can be learned without explicit supervision
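Role of Feature Aggregation: VLAD turns the variable-size sets of local descriptors from SIFT and XFeat into fixed-length vectors that can be clustered; Table I has no non-VLAD baseline, so the size of VLAD's contribution is not isolated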
Method Generalizability: The framework performs well on two datasets with different characteristics, demonstrating good generalization ability

Related Work

Information Leakage Research

Botache et al. [1]: Investigates the complexity of partitioning sequential data and explores challenges in video and time series analysis
Figueiredo & Mendes [2]: Analyzes information leakage in video object detection datasets and addresses it by partitioning images into clusters with high spatiotemporal correlation

Feature Extraction Techniques

Traditional Methods: Hand-crafted features such as SIFT [8, 9] and HOG [4]
Deep Learning Methods: Pretrained models such as CLIP [3], SigLIP [10], and DINO-V3 [11]
Lightweight Methods: XFeat [5] balances efficiency and performance

Clustering Algorithms

Density Clustering: HDBSCAN [7] can discover clusters of arbitrary shapes
Dimensionality Reduction Techniques: PaCMAP [6] provides better global structure preservation compared to t-SNE and UMAP

Advantages of This Work

Compared to existing work, this paper:

Provides more systematic comparison of feature extraction methods
Employs density clustering more suitable for video data characteristics
Proposes complete end-to-end solution
Validates on multiple benchmark datasets

Conclusions and Discussion

Main Conclusions

Method Effectiveness: Cluster-based frame selection strategy effectively identifies and groups visually similar frames, preventing information leakage
Best Practices: DINO-V3 embeddings achieve best clustering performance on both datasets, making it the preferred choice in practice
Practical Value: The method is simple, scalable, and seamlessly integrates into existing dataset preparation workflows
Improvement Effects: By grouping frames before dataset partitioning, the method increases diversity and provides fair evaluation environment, mitigating overfitting in object detection models trained on video datasets

Limitations

Hyperparameter Dependency: Method depends on HDBSCAN hyperparameter selection, with different settings potentially affecting clustering results
Computational Cost: Feature extraction using deep pretrained models (e.g., DINO-V3) requires substantial computational resources
Lack of Downstream Task Validation: Paper does not provide performance comparison on actual object detection tasks (with vs. without the method)
Clustering Quality Assessment: Only uses AMI and V-measure for evaluation, lacking quantitative analysis of actual information leakage degree
Dataset Scale: Method's scalability not verified on ultra-large-scale datasets

Future Directions

Authors explicitly propose the following research directions:

Adaptive Clustering Strategies: Explore clustering methods that automatically adjust hyperparameters, reducing dependency on HDBSCAN hyperparameters
Performance Gap Quantification: Train image object detection models with/without the method to quantify actual impact of information leakage on model performance
Cross-Dataset Evaluation: Validate method effectiveness on more datasets with diverse characteristics
End-to-End Optimization: Potentially explore joint optimization of clustering and model training

In-Depth Evaluation

Strengths

1. Method Innovation

Problem-Specific: Directly addresses core pain point of video-derived datasets—information leakage
Elegant Solution: Cleverly applies clustering to dataset partitioning with clear and sound reasoning
Plug-and-Play Design: Requires no modification to training process, strong engineering practicality

2. Experimental Sufficiency

Comprehensive Feature Methods: Covers seven approaches from traditional to modern deep methods
Reasonable Dataset Selection: ImageNet-VID and UCF101 represent different types of video data
Appropriate Evaluation Metrics: AMI and V-measure are standard clustering quality assessment metrics

3. Result Convincingness

Significant Performance Gains: DINO-V3 achieves high scores of 0.80+ on both datasets
Strong Consistency: Deep methods outperform traditional methods on both datasets, robust conclusions
Detailed Numerical Results: Provides complete comparison data for all methods

4. Writing Quality

Clear Structure: Problem-Method-Experiment organization with strong logical flow
Accurate Expression: Technical descriptions are precise, mathematical notation used correctly
Effective Visualization: Figure 1 clearly presents overall pipeline

Weaknesses

1. Method Limitations

Lack of Theoretical Analysis: No theoretical explanation for why DINO-V3 performs best
Hyperparameter Sensitivity Unexplored: How HDBSCAN hyperparameters affect results not investigated
Cluster Number Control: How to control cluster numbers to balance partition sizes not discussed

2. Experimental Setup Defects

Missing Ablation Studies:
- Is PaCMAP dimensionality reduction necessary? How does direct high-dimensional clustering perform?
- Is 256-dimensional reduction optimal?
- Comparison with other clustering algorithms (K-Means, DBSCAN)?
Lack of Downstream Task Validation: Most critical issue—whether method truly improves model generalization—not verified
Missing Statistical Significance Tests: No error bars or significance testing provided

3. Insufficient Analysis Depth

Missing Failure Case Analysis: Which frame types are difficult to cluster correctly?
Insufficient Visualization: No t-SNE/UMAP visualization of clustering results
Missing Computational Cost Analysis: Runtime and memory consumption of each method not reported
Lack of Quantitative Information Leakage Analysis: Leakage degree caused by traditional methods not quantified

4. Limited Experimental Coverage

Limited Datasets: Only two datasets, lacking diverse validation
Single Task: Only addresses object detection, not exploring effects on other tasks (action recognition, segmentation)
Insufficient Scale Verification: Not tested on million-scale large datasets

Impact

Contribution to the Field

Improved Research Reliability: Provides standardized preprocessing method for video-derived dataset usage
Methodological Contribution: Emphasizes importance of dataset partitioning for model evaluation
Practical Guidance: Provides practitioners with feature extraction method selection recommendations

Practical Value

High: Method is simple and easily implementable, immediately applicable to real projects
Strong Generalizability: Applicable to all scenarios extracting frames from videos
Controllable Cost: One-time preprocessing cost, no additional training overhead

Reproducibility

Strengths:
- Clear method description
- Uses publicly available tools and models
- Explicit hyperparameter settings (image size, reduction dimension, etc.)
Weaknesses:
- No code or implementation details provided
- Specific HDBSCAN hyperparameters not specified
- Specific dataset partitioning strategy (e.g., 70/15/15) not clarified

Potential Impact

Short-term: Likely cited and adopted by papers related to dataset construction
Medium-term: May become standard preprocessing step for video dataset releases
Long-term: Promotes stricter dataset quality control standards

Applicable Scenarios

Most Suitable Scenarios

Video Object Detection: Paper's primary target scenario
Action Recognition: Frame extraction from videos for classification
Video Instance Segmentation: Tasks requiring frame-level annotations
Surveillance Video Analysis: Typically contains many similar frames

Scenarios Requiring Caution

Video Understanding Tasks: Tasks requiring temporal information preservation may not be suitable
Small-Scale Datasets: Clustering may be unstable
Highly Diverse Videos: If video content differs greatly, clustering may be overly fine-grained

Unsuitable Scenarios

Native Image Datasets: Images not extracted from video lack the frame-to-frame correlation that causes this form of leakage
Tasks Requiring Temporal Modeling: Such as video prediction, optical flow estimation
Real-Time Applications: Deep feature extraction may be too slow

References

Key Citations

[1] Botache et al., 2023 - Complexity of sequential data partitioning research
[2] Figueiredo & Mendes, 2024 - Information leakage analysis in video object detection datasets (IEEE Access)
[3] Radford et al., 2021 - CLIP: Learning Transferable Visual Models From Natural Language Supervision (ICML)
[7] McInnes et al., 2017 - HDBSCAN: Hierarchical Density-Based Spatial Clustering Algorithm
[11] Siméoni et al., 2025 - DINO-V3: Self-Supervised Vision Transformer (arXiv preprint)
[14] Russakovsky et al., 2015 - ImageNet Large Scale Visual Recognition Challenge (IJCV)

Summary

This paper proposes a practical solution to information leakage in video-derived datasets. Core strengths lie in method simplicity and practicality—ensuring visually similar frames are assigned to the same data partition through clustering is intuitive and effective. Experimental results demonstrate that modern deep pretrained models (particularly DINO-V3) significantly outperform traditional methods in identifying frame similarity.

However, the paper's main deficiency is the lack of downstream task validation. While clustering quality is high (AMI and V-measure reach 0.96 on ImageNet-VID with DINO-V3), whether this truly translates to better model generalization remains unverified. This is a critical gap, as clustering quality is merely a means; improving model evaluation is the ultimate goal.

Despite this, the work provides important methodological contributions to video dataset construction with high practical value. Recommended future work:

Highest Priority: Validate method effectiveness on actual object detection tasks
Explore adaptive hyperparameter selection strategies
Extend to larger-scale and more diverse datasets
Provide open-source implementation to promote community adoption

Recommendation Score: ★★★★☆ (4/5)

Important and practical problem ✓
Simple and effective method ✓
Reasonably comprehensive experiments ✓
Lacks downstream validation ✗
Analysis depth could be improved ✗