We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
- Paper ID: 2511.13944
- Title: Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
- Authors: Noam Glazner (Bar-Ilan University), Noam Tsfaty (Afeka College of Engineering), Sharon Shalev (Independent Researcher), Avishai Weizman (Ben-Gurion University of the Negev)
- Category: cs.CV (Computer Vision)
- Submission Date: November 17, 2025 to arXiv
- Paper Link: https://arxiv.org/abs/2511.13944v1
In deep learning research, extracting frames from video to build datasets is common practice. However, randomly partitioning those frames causes severe information leakage. Consecutive frames are highly correlated in space and time (e.g., identical backgrounds, the same objects in slightly different positions), so when correlated frames are scattered across the training, validation, and test sets, a model can "memorize" scene features from the training set, which inflates performance estimates on the validation and test sets.
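The leakage mechanism can be illustrated with a small sketch. The toy sizes (20 videos, 50 frames each, a 20% test fraction) and all function names are made up for illustration and do not come from the paper; the grouped split here keeps whole videos together, which is the simplest stand-in for keeping correlated frames in one partition.

```python
import random


def count_leaky_videos(frame_video_ids, split_of_frame):
    """Count videos whose frames appear in more than one partition."""
    splits_per_video = {}
    for vid, split in zip(frame_video_ids, split_of_frame):
        splits_per_video.setdefault(vid, set()).add(split)
    return sum(1 for s in splits_per_video.values() if len(s) > 1)


def random_frame_split(n_frames, test_fraction, rng):
    """Frame-level random split that ignores which video a frame came from."""
    order = list(range(n_frames))
    rng.shuffle(order)
    test_idx = set(order[: int(round(test_fraction * n_frames))])
    return ["test" if i in test_idx else "train" for i in range(n_frames)]


def grouped_split(frame_video_ids, test_fraction, rng):
    """Split by group: every frame of a video lands in the same partition."""
    videos = sorted(set(frame_video_ids))
    rng.shuffle(videos)
    test_videos = set(videos[: int(round(test_fraction * len(videos)))])
    return ["test" if v in test_videos else "train" for v in frame_video_ids]


# Toy data (hypothetical sizes): 20 videos with 50 frames each.
video_ids = [v for v in range(20) for _ in range(50)]
rng = random.Random(0)

leaky_random = count_leaky_videos(
    video_ids, random_frame_split(len(video_ids), 0.2, rng)
)
leaky_grouped = count_leaky_videos(video_ids, grouped_split(video_ids, 0.2, rng))
print("videos straddling train/test, random split:", leaky_random)
print("videos straddling train/test, grouped split:", leaky_grouped)
```

With a frame-level random split, almost every video ends up contributing frames to both sides, whereas the grouped split has none by construction.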
- Model Evaluation Distortion: Information leakage causes model performance on test sets to fail to reflect true generalization ability
- Overfitting Risk: Models may overfit to specific scenes rather than learning generalizable features
- Research Reliability: Affects credibility of conclusions in computer vision tasks such as object detection
- Practical Application Gap: Significant discrepancy between laboratory performance and real-world deployment performance
- Random Partitioning: Completely ignores spatiotemporal correlation between frames
- Video-Level Partitioning: Too coarse-grained, potentially causing unbalanced data distribution
- Manual Partitioning: Labor-intensive and difficult to scale to large-scale datasets
This paper aims to provide a simple, scalable solution that integrates into existing dataset preparation workflows. By grouping visually similar frames, it keeps related images in the same data partition, which improves the fairness of the split and the robustness of model evaluation.
- Proposes Cluster-Driven Dataset Partitioning Method: Systematically applies clustering to video-derived dataset partitioning, grouping visually similar frames into the same partition to prevent information leakage
- Comprehensive Feature Extractor Evaluation: Systematically compares seven different feature extraction methods (from traditional SIFT, HOG to modern CLIP, DINO-V3), providing practitioners with method selection guidance
- Plug-and-Play Solution: Provides a dataset preprocessing pipeline requiring no modification to training processes, with good scalability and practical utility
- Empirical Validation: Validates method effectiveness on two benchmark datasets (ImageNet-VID and UCF101), with DINO-V3 achieving V-measure and AMI scores of 0.96 on ImageNet-VID (0.87 and 0.80 on UCF101)
Input: A collection of unlabeled videos $V = \{V_1, V_2, \dots, V_K\}$, where $K$ is the total number of videos
Output: An assignment of every extracted frame to the training, validation, or test set such that visually similar frames (particularly frames from the same video) land in the same partition
Constraints:
- Minimize information leakage between partitions
- Maintain balanced data distribution across partitions
- Ensure clustering results are highly consistent with video sources
The overall pipeline consists of three main stages (as shown in Figure 1):
Each video $V_k$ is decomposed into a frame sequence $\{I_{k,1}, I_{k,2}, \dots, I_{k,N_k}\}$, where $N_k$ is the number of frames extracted from video $V_k$.
A feature vector is extracted for each frame $I_{k,i}$:
$$f_{k,i} = \Phi_{\mathrm{feat}}(I_{k,i})$$
where $f_{k,i} \in \mathbb{R}^d$ is a $d$-dimensional feature vector, and $\Phi_{\mathrm{feat}}(\cdot)$ is the feature extraction function.
Supported Feature Extraction Methods:
- Traditional Descriptors:
  - SIFT [8, 9]: Scale-Invariant Feature Transform, captures local texture information
  - HOG [4]: Histogram of Oriented Gradients, encodes gradient orientation patterns
- Lightweight Learning Features:
  - XFeat [5]: Provides efficient keypoint detection and description through a lightweight convolutional architecture
- Deep Pretrained Models:
  - CLIP [3]: Contrastive Language-Image Pretraining, provides semantic image representations
  - SigLIP [10]: Language-Image Pretraining with Sigmoid loss
  - DINO-V3 [11]: Self-supervised Vision Transformer
- Aggregation Methods:
  - VLAD [12]: Vector of Locally Aggregated Descriptors, applied to SIFT and XFeat, combines local keypoint descriptors into fixed-length compact feature vectors (1024-dimensional)
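To make the aggregation step concrete, here is a minimal pure-Python sketch of the standard VLAD recipe (nearest-centroid assignment, residual accumulation, signed square-root and L2 normalization). The codebook, the toy descriptors, and the function name `vlad` are hypothetical, and the sketch does not show how the paper brings the result to 1024 dimensions.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate local descriptors into one fixed-length VLAD vector.

    descriptors: list of D-dimensional lists, one per keypoint.
    centroids:   list of K D-dimensional lists (a visual codebook,
                 typically learned beforehand with k-means).
    Returns a flat list of length K * D.
    """
    k, d = len(centroids), len(centroids[0])
    residual_sums = [[0.0] * d for _ in range(k)]
    for desc in descriptors:
        # Assign the descriptor to its nearest centroid (squared Euclidean).
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(desc, centroids[j])),
        )
        # Accumulate the residual (descriptor minus centroid).
        for t in range(d):
            residual_sums[nearest][t] += desc[t] - centroids[nearest][t]
    flat = [x for row in residual_sums for x in row]
    # Signed square-root (power) normalization, then L2 normalization.
    flat = [math.copysign(math.sqrt(abs(x)), x) for x in flat]
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm > 0 else flat


# Toy example: a 2-word codebook over 2-D descriptors (made-up values).
codebook = [[0.0, 0.0], [10.0, 10.0]]
keypoint_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vector = vlad(keypoint_descriptors, codebook)
print(len(vector), vector)
```

The output length depends only on the codebook size and descriptor dimension, not on how many keypoints a frame has, which is what makes frames comparable for clustering.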
Dimensionality Reduction: Uses PaCMAP (Pairwise Controlled Manifold Approximation Projection) [6] to project the high-dimensional features into a low-dimensional embedding space:
$$z_{k,i} = P_{\mathrm{PaCMAP}}(f_{k,i})$$
where $z_{k,i} \in \mathbb{R}^m$ is an $m$-dimensional embedding ($m = 256$ in this paper), and $P_{\mathrm{PaCMAP}}(\cdot)$ is the PaCMAP projection operator.
Clustering: Applies the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [7] algorithm to cluster the embeddings.
Rationale for HDBSCAN Selection:
- Can discover clusters of arbitrary shapes
- Adapts to data distributions with varying densities
- Automatically determines the number of clusters
- Can identify noise points
- More suitable for continuous and non-uniform characteristics of video data compared to center-based methods like K-Means
Uses the clusters $C_j$ (each containing the embeddings $z_{k,i}$ of its frames $I_{k,i}$) as the basic unit for partitioning. Each cluster $C_j$ represents visually related frames, and the entire cluster is assigned to a single data partition (training/validation/test), thereby preventing data leakage.
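One way the cluster-to-partition step could look is sketched below. The paper (as summarized here) does not specify split ratios or how noise points are handled, so the 70/15/15 targets, the greedy largest-first assignment, and the choice to send noise frames (label `-1`) to training are all assumptions made for illustration.

```python
from collections import Counter


def assign_clusters_to_splits(cluster_labels, ratios=None):
    """Assign whole clusters to splits, greedily tracking target ratios.

    cluster_labels: one integer label per frame; -1 marks noise points.
    ratios: hypothetical target fractions per split (must include "train").
    """
    if ratios is None:
        ratios = {"train": 0.70, "val": 0.15, "test": 0.15}
    n_total = len(cluster_labels)
    sizes = Counter(cluster_labels)
    n_noise = sizes.pop(-1, 0)
    counts = {name: 0 for name in ratios}
    # Assumption: noise frames go to training so they never reach val/test.
    counts["train"] += n_noise
    split_of_cluster = {-1: "train"} if n_noise else {}
    # Largest clusters first; each goes to the split furthest below its target.
    for label, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = max(ratios, key=lambda name: ratios[name] * n_total - counts[name])
        split_of_cluster[label] = target
        counts[target] += size
    return [split_of_cluster[label] for label in cluster_labels]


# Toy cluster labels (made-up sizes): four clusters plus eight noise frames.
labels = [0] * 40 + [1] * 25 + [2] * 15 + [3] * 12 + [-1] * 8
splits = assign_clusters_to_splits(labels)
print(Counter(splits))
```

Because clusters are indivisible, the achieved proportions only approximate the targets; a few very large clusters can make an exact ratio unreachable.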
- Application of Density Clustering: Compared to traditional video-level or random partitioning, density-based clustering can more finely capture visual similarity between frames while avoiding forced assumptions of spherical clusters
- Systematic Evaluation of Feature Extraction: Rather than relying on a single feature extraction method, provides comprehensive comparison from traditional to modern approaches, making the method more adaptable
- Two-Stage Dimensionality Reduction Strategy: First extracts high-dimensional features using specific methods, then uniformly reduces to 256 dimensions using PaCMAP, preserving semantic information while improving clustering efficiency
- Plug-and-Play Design: As a data preprocessing step, requires no modification to model training processes, with good engineering practicality
- Source: ImageNet Large Scale Visual Recognition Challenge 2015 [14]
- Usage: Validation set
- Characteristics: Provides annotated images classified by object synset, suitable for evaluating information leakage in object detection
- Annotation Type: Image-level object category annotations
- Source: 101-class human action video dataset [15]
- Usage: All partitions
- Characteristics: Contains trimmed video clips with video-level labels
- Preprocessing: Extracts one frame per second to reduce visual redundancy, ensuring consecutive frames are not nearly identical
- Challenge: Increased temporal variability makes clustering more difficult
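The one-frame-per-second sampling can be expressed as a simple index computation. The helper below is a sketch with a hypothetical name and a hypothetical 25 fps, 250-frame clip; actual decoding (for example with a video library) is left out.

```python
def one_frame_per_interval(n_frames, fps, interval_s=1.0):
    """Indices of the frames to keep when sampling one frame per interval."""
    if n_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    # Number of source frames between two kept frames (at least 1).
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))


# Hypothetical clip: 250 frames at 25 fps, i.e. 10 seconds of video.
kept = one_frame_per_interval(n_frames=250, fps=25)
print(kept)
```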
- Definition: Measures consistency between predicted clusters and ground truth labels while correcting for chance
- Range: [0, 1], where 1 indicates a perfect match and values near 0 indicate chance-level agreement
- Advantage: Accounts for baseline performance of random clustering
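For reference, the chance correction is commonly written as below. Here $Y$ denotes the ground-truth labels and $C$ the predicted cluster assignment (symbols introduced for this note only), $I$ is mutual information, and $H$ is entropy. The arithmetic-mean normalizer shown is the default in scikit-learn's `adjusted_mutual_info_score`; which normalizer the paper uses is not stated, so that choice is an assumption.

```latex
\mathrm{AMI}(Y, C) =
  \frac{I(Y; C) - \mathbb{E}\left[I(Y; C)\right]}
       {\tfrac{1}{2}\left[H(Y) + H(C)\right] - \mathbb{E}\left[I(Y; C)\right]}
```

The expectation is taken over random clusterings with the same cluster sizes, which is what pushes the score toward 0 for chance-level agreement.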
- Definition: Evaluates the trade-off between homogeneity and completeness of clustering
- Homogeneity: Degree to which samples in each cluster come from a single class
- Completeness: Degree to which samples from the same class share the same cluster
- Range: [0, 1], where 1 indicates optimal performance
- Calculation: Harmonic mean of homogeneity and completeness
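The definitions above can be checked with a small worked example. This is a plain-Python sketch of the standard entropy-based formulas, not the paper's evaluation code; the toy labels treat source video IDs as ground truth, and in practice a library routine such as scikit-learn's `v_measure_score` would normally be used instead.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) computed from paired label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure)."""
    h_true = _entropy(true_labels)
    h_cluster = _entropy(cluster_labels)
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - _conditional_entropy(true_labels, cluster_labels) / h_true
    )
    completeness = (
        1.0 if h_cluster == 0
        else 1.0 - _conditional_entropy(cluster_labels, true_labels) / h_cluster
    )
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Toy case: two videos of two frames each; the second video is split
# across two clusters, so clusters are pure but not complete.
video_of_frame = [0, 0, 1, 1]
cluster_of_frame = [0, 0, 1, 2]
h, c, v = v_measure(video_of_frame, cluster_of_frame)
print(round(h, 3), round(c, 3), round(v, 3))  # 1.0 0.667 0.8
```

Over-segmenting a video leaves homogeneity at 1 but lowers completeness, which is why the harmonic mean drops to 0.8 here.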
This paper compares clustering performance across seven feature extraction methods:
- SIFT + VLAD
- HOG (224×224)
- HOG (128×128)
- XFeat + VLAD
- CLIP (ViT-B/32)
- SigLIP (ViT-B/16)
- DINO-V3 (ViT-B/16)
Image Preprocessing:
- XFeat, CLIP, DINO, SigLIP: Resized to 224×224
- HOG: 128×128 or 224×224 (128×128 performs slightly better with lower dimensionality)
Feature Dimensions:
- VLAD vectors: Reduced to 1024 dimensions for unified representation
- PaCMAP embeddings: Projected to 256-dimensional space (m=256)
Clustering Algorithm: HDBSCAN (specific hyperparameters not detailed in the paper)
Table I presents clustering performance using different feature extraction methods on ImageNet-VID and UCF101 validation sets:
| Feature Extraction Method | Dataset | V-measure | AMI |
|---|---|---|---|
| SIFT + VLAD | ImageNet-VID | 0.81 | 0.80 |
| | UCF101 | 0.57 | 0.38 |
| HOG (224×224) | ImageNet-VID | 0.82 | 0.81 |
| | UCF101 | 0.61 | 0.48 |
| HOG (128×128) | ImageNet-VID | 0.87 | 0.86 |
| | UCF101 | 0.67 | 0.54 |
| XFeat + VLAD | ImageNet-VID | 0.90 | 0.89 |
| | UCF101 | 0.72 | 0.58 |
| CLIP (ViT-B/32) | ImageNet-VID | 0.92 | 0.91 |
| | UCF101 | 0.75 | 0.66 |
| SigLIP (ViT-B/16) | ImageNet-VID | 0.93 | 0.92 |
| | UCF101 | 0.75 | 0.67 |
| DINO-V3 (ViT-B/16) | ImageNet-VID | 0.96 | 0.96 |
| | UCF101 | 0.87 | 0.80 |
- Deep Pretrained Models Significantly Outperform Traditional Methods:
  - DINO-V3 achieves the highest scores on both datasets
  - On ImageNet-VID, DINO-V3 improves on SIFT+VLAD by 18.5% (V-measure, relative)
  - On UCF101, the relative improvement is more substantial at 52.6%
- Dataset Difficulty Differences:
  - All methods score lower on UCF101 than on ImageNet-VID
  - Temporal variability in UCF101 increases clustering difficulty
  - SIFT+VLAD performs weakest on UCF101 (AMI only 0.38)
- Feature Extraction Method Performance Hierarchy:
  - First Tier: DINO-V3 > SigLIP ≈ CLIP
  - Second Tier: XFeat + VLAD
  - Third Tier: HOG (128×128) > HOG (224×224)
  - Fourth Tier: SIFT + VLAD
- Potential of Lightweight Methods:
  - XFeat + VLAD shows clear improvement over traditional descriptors
  - Achieves 0.90 V-measure on ImageNet-VID
  - Provides a viable option for computationally constrained scenarios
- Impact of Image Resolution:
  - HOG performs better at 128×128 resolution than at 224×224
  - Lower resolution produces lower-dimensional descriptors while maintaining better performance
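The relative improvements quoted in the findings follow directly from the V-measure values in Table I. The short check below reproduces the arithmetic; the helper name `relative_gain` is ours.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# V-measure values copied from Table I.
v_measure_scores = {
    "ImageNet-VID": {"SIFT + VLAD": 0.81, "DINO-V3": 0.96},
    "UCF101": {"SIFT + VLAD": 0.57, "DINO-V3": 0.87},
}

gains = {
    dataset: round(relative_gain(s["DINO-V3"], s["SIFT + VLAD"]), 1)
    for dataset, s in v_measure_scores.items()
}
print(gains)  # {'ImageNet-VID': 18.5, 'UCF101': 52.6}
```

In absolute terms the gaps are 0.15 and 0.30 V-measure points, so the percentages describe relative rather than absolute change.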
- Advantages of Semantic Representations: Deep pretrained models (especially DINO-V3) capture high-level semantic information better, more effectively identifying visual similarity, which is crucial for information leakage detection
- Effectiveness of Self-Supervised Learning: DINO-V3 as a self-supervised method performs best, demonstrating that representations suitable for clustering tasks can be learned without explicit supervision
- Importance of Feature Aggregation: VLAD aggregation lets local descriptors (SIFT, XFeat) be clustered as fixed-length vectors, with XFeat + VLAD reaching a V-measure of 0.90 on ImageNet-VID
- Method Generalizability: The framework performs well on two datasets with different characteristics, demonstrating good generalization ability
- Botache et al. [1]: Investigates complexity of partitioning sequential data, explores challenges in video and time series analysis
- Figueiredo & Mendes [2]: Analyzes information leakage in video object detection datasets, addresses it by partitioning images into clusters with high spatiotemporal correlation
- Traditional Methods: Hand-crafted features like SIFT [8, 9], HOG [4]
- Deep Learning Methods: Pretrained models like CLIP [3], SigLIP [10], DINO-V3 [11]
- Lightweight Methods: XFeat [5] provides balance between efficiency and performance
- Density Clustering: HDBSCAN [7] can discover clusters of arbitrary shapes
- Dimensionality Reduction Techniques: PaCMAP [6] provides better global structure preservation compared to t-SNE and UMAP
Compared to existing work, this paper:
- Provides more systematic comparison of feature extraction methods
- Employs density clustering more suitable for video data characteristics
- Proposes complete end-to-end solution
- Validates on multiple benchmark datasets
- Method Effectiveness: Cluster-based frame selection strategy effectively identifies and groups visually similar frames, preventing information leakage
- Best Practices: DINO-V3 embeddings achieve best clustering performance on both datasets, making it the preferred choice in practice
- Practical Value: The method is simple, scalable, and seamlessly integrates into existing dataset preparation workflows
- Improvement Effects: By grouping frames before dataset partitioning, the method increases diversity and provides fair evaluation environment, mitigating overfitting in object detection models trained on video datasets
- Hyperparameter Dependency: Method depends on HDBSCAN hyperparameter selection, with different settings potentially affecting clustering results
- Computational Cost: Feature extraction using deep pretrained models (e.g., DINO-V3) requires substantial computational resources
- Lack of Downstream Task Validation: Paper does not provide performance comparison on actual object detection tasks (with vs. without the method)
- Clustering Quality Assessment: Only uses AMI and V-measure for evaluation, lacking quantitative analysis of actual information leakage degree
- Dataset Scale: Method's scalability not verified on ultra-large-scale datasets
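As an illustration of what a direct leakage measurement might look like, the sketch below computes the share of test frames that have a near-duplicate in the training set, judged by cosine similarity of their embeddings. This proxy, the 0.95 threshold, and the function names are hypothetical and are not taken from the paper.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def near_duplicate_rate(train_vecs, test_vecs, threshold=0.95):
    """Fraction of test vectors whose closest train vector is >= threshold."""
    if not train_vecs or not test_vecs:
        return 0.0
    hits = sum(
        1
        for t in test_vecs
        if max(_cosine(t, r) for r in train_vecs) >= threshold
    )
    return hits / len(test_vecs)


# Toy embeddings (made-up values): the first test frame nearly matches a
# training frame, the second does not.
train_embeddings = [[1.0, 0.0], [0.0, 1.0]]
test_embeddings = [[0.99, 0.05], [1.0, 1.0]]
rate = near_duplicate_rate(train_embeddings, test_embeddings)
print(rate)  # 0.5
```

Comparing this rate for a random split against a cluster-based split would give one concrete number for how much overlap the method removes, under the assumption that embedding similarity is a fair stand-in for leakage.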
Authors explicitly propose the following research directions:
- Adaptive Clustering Strategies: Explore clustering methods that automatically adjust hyperparameters, reducing dependency on HDBSCAN hyperparameters
- Performance Gap Quantification: Train image object detection models with/without the method to quantify actual impact of information leakage on model performance
- Cross-Dataset Evaluation: Validate method effectiveness on more datasets with diverse characteristics
- End-to-End Optimization: Potentially explore joint optimization of clustering and model training
- Problem-Specific: Directly addresses information leakage, the core pain point of video-derived datasets
- Elegant Solution: Cleverly applies clustering to dataset partitioning with clear and sound reasoning
- Plug-and-Play Design: Requires no modification to training process, strong engineering practicality
- Comprehensive Feature Methods: Covers seven approaches from traditional to modern deep methods
- Reasonable Dataset Selection: ImageNet-VID and UCF101 represent different types of video data
- Appropriate Evaluation Metrics: AMI and V-measure are standard clustering quality assessment metrics
- Significant Performance Gains: DINO-V3 achieves high scores of 0.80+ on both datasets
- Strong Consistency: Deep methods outperform traditional methods on both datasets, robust conclusions
- Detailed Numerical Results: Provides complete comparison data for all methods
- Clear Structure: Problem-Method-Experiment organization with strong logical flow
- Accurate Expression: Technical descriptions are precise, mathematical notation used correctly
- Effective Visualization: Figure 1 clearly presents overall pipeline
- Lack of Theoretical Analysis: No theoretical explanation for why DINO-V3 performs best
- Hyperparameter Sensitivity Unexplored: How HDBSCAN hyperparameters affect results not investigated
- Cluster Number Control: How to control cluster numbers to balance partition sizes not discussed
- Missing Ablation Studies:
  - Is PaCMAP dimensionality reduction necessary? How does direct high-dimensional clustering perform?
  - Is 256-dimensional reduction optimal?
  - Comparison with other clustering algorithms (K-Means, DBSCAN)?
- Lack of Downstream Task Validation: The most critical question, whether the method truly improves model generalization, is not verified
- Missing Statistical Significance Tests: No error bars or significance testing provided
- Missing Failure Case Analysis: Which frame types are difficult to cluster correctly?
- Insufficient Visualization: No t-SNE/UMAP visualization of clustering results
- Missing Computational Cost Analysis: Runtime and memory consumption of each method not reported
- Lack of Quantitative Information Leakage Analysis: Leakage degree caused by traditional methods not quantified
- Limited Datasets: Only two datasets, lacking diverse validation
- Single Task: Only addresses object detection, not exploring effects on other tasks (action recognition, segmentation)
- Insufficient Scale Verification: Not tested on million-scale large datasets
- Improved Research Reliability: Provides standardized preprocessing method for video-derived dataset usage
- Methodological Contribution: Emphasizes importance of dataset partitioning for model evaluation
- Practical Guidance: Provides practitioners with feature extraction method selection recommendations
- High Practicality: Method is simple and easily implementable, immediately applicable to real projects
- Strong Generalizability: Applicable to all scenarios extracting frames from videos
- Controllable Cost: One-time preprocessing cost, no additional training overhead
- Strengths:
  - Clear method description
  - Uses publicly available tools and models
  - Explicit hyperparameter settings (image size, reduction dimension, etc.)
- Weaknesses:
  - No code or implementation details provided
  - Specific HDBSCAN hyperparameters not specified
  - Specific dataset partitioning strategy (e.g., 70/15/15) not clarified
- Short-term: Likely cited and adopted by papers related to dataset construction
- Medium-term: May become standard preprocessing step for video dataset releases
- Long-term: Promotes stricter dataset quality control standards
- Video Object Detection: Paper's primary target scenario
- Action Recognition: Frame extraction from videos for classification
- Video Instance Segmentation: Tasks requiring frame-level annotations
- Surveillance Video Analysis: Typically contains many similar frames
- Video Understanding Tasks: Tasks requiring temporal information preservation may not be suitable
- Small-Scale Datasets: Clustering may be unstable
- Highly Diverse Videos: If video content differs greatly, clustering may be overly fine-grained
- Native Image Datasets: Images are not drawn from video, so this form of frame-correlation leakage does not arise
- Tasks Requiring Temporal Modeling: Such as video prediction, optical flow estimation
- Real-Time Applications: Deep feature extraction may be too slow
- [1] Botache et al., 2023 - Complexity of sequential data partitioning research
- [2] Figueiredo & Mendes, 2024 - Information leakage analysis in video object detection datasets (IEEE Access)
- [3] Radford et al., 2021 - CLIP: Learning Transferable Visual Models From Natural Language Supervision (ICML)
- [7] McInnes et al., 2017 - HDBSCAN: Hierarchical Density-Based Spatial Clustering Algorithm
- [11] Siméoni et al., 2025 - DINO-V3: Self-Supervised Vision Transformer (arXiv preprint)
- [14] Russakovsky et al., 2015 - ImageNet Large Scale Visual Recognition Challenge (IJCV)
This paper proposes a practical solution to information leakage in video-derived datasets. Its core strengths are simplicity and practicality: using clustering to keep visually similar frames in the same data partition is intuitive and effective. Experimental results show that modern deep pretrained models (particularly DINO-V3) significantly outperform traditional methods at identifying frame similarity.
However, the paper's main deficiency is the lack of downstream task validation. While clustering quality is high (AMI and V-measure reaching 0.96 on ImageNet-VID), whether this translates to better model generalization remains unverified. This is a critical gap: clustering quality is merely a means, and improving model evaluation is the ultimate goal.
Despite this, the work provides important methodological contributions to video dataset construction with high practical value. Recommended future work:
- Highest Priority: Validate method effectiveness on actual object detection tasks
- Explore adaptive hyperparameter selection strategies
- Extend to larger-scale and more diverse datasets
- Provide open-source implementation to promote community adoption
Recommendation Score: ★★★★☆ (4/5)
- Important and practical problem ✓
- Simple and effective method ✓
- Reasonably comprehensive experiments ✓
- Lacks downstream validation ✗
- Analysis depth could be improved ✗