This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.
- Paper ID: 2505.14117
- Title: Beyond Model-Centric: Collaborative Data Optimization for Reusing and Sharing
- Authors: Xinyi Shang (UCL), Peng Sun (Zhejiang University & Westlake University), Fengyuan Liu (USTC), Tao Lin (Westlake University)
- Classification: cs.LG cs.AI
- Publication Date/Venue: Preprint (arXiv:2505.14117v2)
- Paper Link: https://arxiv.org/abs/2505.14117v2
This paper introduces a data-centric paradigm for maximizing the utility of unlabeled data, addressing the question of how to improve the sustainability and efficiency of deep learning training by optimizing the data itself. The authors first identify two key limitations of existing model-centric approaches, both stemming from a common bottleneck: knowledge extracted from data is locked within model parameters, which hinders its reusability and scalability. To address this, they propose COOPT, an efficient, parallelized framework for collaborative unlabeled data optimization. By processing unlabeled data in a distributed manner and leveraging publicly available task-agnostic prior models, COOPT transforms raw unlabeled data into knowledge-rich training sets that are effective, efficient, reusable, and easily shareable. On ImageNet-1K, the method reaches 69.8% accuracy versus 61.9% for BYOL, a 7.9-point improvement.
Although data is abundant, most of it remains unlabeled. The mainstream paradigm for leveraging unlabeled data is self-supervised learning (SSL), a model-centric approach that encodes data information into model parameters through carefully designed proxy tasks and loss functions.
Existing model-centric approaches face two key challenges:
- Architecture Coupling: Training protocols are tightly coupled with specific network architectures, severely hindering the transferability and reusability of trained models across other architectures
- Computational Efficiency Issues: Despite progress in acceleration, training on large-scale unlabeled datasets remains computationally prohibitive
The core of these challenges is a common bottleneck: knowledge extracted from data is locked within model parameters, limiting its adaptability and preventing efficient reuse across different tasks or architectures.
To transcend the model-centric paradigm, the authors propose a data-centric paradigm that effectively encodes knowledge into the data itself rather than model parameters by directly optimizing unlabeled data.
- Proposes COOPT Framework: The first data-centric framework for collaborative optimization of unlabeled data, which transforms raw unlabeled samples into optimized data by leveraging task-agnostic prior models, achieving high performance, efficiency, strong generalization, and reusability
- Identifies and Resolves Target Distribution Inconsistency: Identifies the critical issue of Target Distribution Inconsistency within the COOPT framework and introduces a lightweight target alignment strategy to address it
- Comprehensive Experimental Validation: Conducts extensive experiments across multiple datasets and models, validating COOPT's advantages and demonstrating that even with weak prior models, COOPT can effectively accelerate the early training stages
Data Optimization Definition: Given a large-scale unlabeled dataset $D = D_X = \{x_i\}_{i=1}^{N}$, data optimization aims to assign targets $D_Y = \{y_i\}_{i=1}^{N}$ to construct an optimal labeled dataset $D' = \{(x_i, y_i)\}_{i=1}^{N}$, such that models trained on $D'$ achieve significantly higher performance with substantially reduced training cost compared to models trained on $D$.
Objective Function:
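$$\mathbb{E}_{(x,y)\sim P_T}\left[\ell\left(\phi_{\theta_D}(x), y\right)\right] > \mathbb{E}_{(x,y)\sim P_T}\left[\ell\left(\phi_{\theta_{D'}}(x), y\right)\right]$$
where $P_T$ is the test distribution, $\ell$ is the loss function, and $\theta_D$ and $\theta_{D'}$ are the network parameters trained on $D$ and $D'$, respectively.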
COOPT is a collaborative parallelized framework comprising an open data platform and K participants, each equipped with different prior models.
Step 1: Data Distribution
- The open data platform randomly partitions the unlabeled data $D$ into $K$ non-overlapping subsets
- Each participant downloads one subset $D^{(k)}$
Step 2: Data Optimization
- Each participant optimizes their respective dataset $D^{(k)}$ using prior model $\psi_k$
- Target assignment according to Definition 1: $D' = \{(x_i, y_i) \mid y_i = W\psi(x_i), \forall x_i \in D_X\}$
Step 3: Data Alignment
- Resolves target distribution inconsistency
- Uses a learnable transformation matrix $T^{(k)}$ to align target distributions to the optimal prior model
Step 4: Data Upload
- Participants upload optimized datasets back to the platform
Step 5: Data Aggregation
- Platform aggregates all optimized datasets to form a unified dataset
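The data flow of these steps can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the helper names (`partition`, `optimize_subset`, `aggregate`) are hypothetical, the prior models are stand-in random linear encoders rather than pretrained networks, the alignment step is left out, and the upload step is a no-op because everything runs in one process.

```python
import numpy as np

def partition(num_samples, num_participants, rng):
    """Step 1: randomly split sample indices into K non-overlapping subsets."""
    return np.array_split(rng.permutation(num_samples), num_participants)

def optimize_subset(x_subset, prior_model):
    """Step 2: assign a target to every sample using the participant's prior model."""
    return prior_model(x_subset)

def aggregate(index_subsets, target_subsets, num_samples):
    """Step 5: merge per-participant targets back into one dataset-wide array."""
    dim = target_subsets[0].shape[1]
    merged = np.empty((num_samples, dim))
    for idx, y in zip(index_subsets, target_subsets):
        merged[idx] = y
    return merged

rng = np.random.default_rng(0)
N, d, n, K = 12, 8, 4, 3           # samples, input dim, target dim, participants
X = rng.normal(size=(N, d))        # stand-in for the unlabeled data

# Hypothetical prior models: one fixed random linear encoder per participant.
weights = [rng.normal(size=(d, n)) for _ in range(K)]
prior_models = [lambda x, w=w: x @ w for w in weights]

subsets = partition(N, K, rng)
targets = [optimize_subset(X[idx], psi) for idx, psi in zip(subsets, prior_models)]
# Step 3 (alignment) is omitted in this sketch; Step 4 (upload) is a no-op here.
Y = aggregate(subsets, targets, N)
print(Y.shape)
```

Because each participant only touches its own subset, Step 2 is embarrassingly parallel; the sketch runs it in a loop purely for brevity.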
In the collaborative framework, participants use different prior models, which leads to target distribution inconsistency and harms model generalization.
Assesses prior model quality using Uniform Value Loss:
$$V_{\text{uniform}}(\psi; S) = \log \mathbb{E}_{x_i, x_j \sim S}\left[e^{\tau \|\psi(x_i) - \psi(x_j)\|_2^2}\right]$$
where lower uniform values indicate higher quality prior models.
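As a rough illustration, the uniform value can be estimated from a batch of features as sketched below. Several details are assumptions rather than statements from the paper: the function name `uniform_value` is hypothetical, the expectation is taken over distinct pairs only, and `tau` is passed as a negative number, which matches the uniformity loss of Wang & Isola (2020), where the squared distance carries a negative coefficient.

```python
import numpy as np

def uniform_value(features, tau):
    """Estimate log E[exp(tau * ||f_i - f_j||^2)] over all distinct pairs i < j."""
    sq_norms = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    i, j = np.triu_indices(len(features), k=1)
    pair_dists = np.maximum(sq_dists[i, j], 0.0)  # clip tiny negative round-off
    return float(np.log(np.mean(np.exp(tau * pair_dists))))

# Collapsed features (all identical) versus two antipodal unit vectors.
collapsed = np.ones((4, 3))
antipodal = np.array([[1.0, 0.0], [-1.0, 0.0]])
v_collapsed = uniform_value(collapsed, tau=-2.0)
v_antipodal = uniform_value(antipodal, tau=-2.0)
print(v_collapsed, v_antipodal)
```

With a negative `tau`, collapsed features give the largest possible value (zero) and well-spread features give a lower one, which agrees with the statement that lower uniform values indicate a better prior model.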
Achieves target alignment through transformation matrix optimization:
$$T^{(k)} = \arg\min_{T \in \mathbb{R}^{n \times n}} \left\{ \left\| T \cdot \psi^{(k)}(S_X) - S_Y^{*} \right\|_2^2 \right\}$$
where $S_Y^{*}$ is the target of the optimal prior model on the shared dataset.
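The alignment objective is an ordinary linear least-squares problem, so one way to sketch it is with a closed-form solve. This is a swapped-in technique for illustration: the summary describes $T^{(k)}$ as learnable, which suggests the authors optimize it iteratively, and the function names `fit_alignment` and `apply_alignment` as well as the synthetic data below are hypothetical.

```python
import numpy as np

def fit_alignment(source_targets, reference_targets):
    """Return T minimizing ||source_targets @ T.T - reference_targets||^2.

    Both inputs have shape (m, n): one row per shared sample, where the source
    rows come from participant k's prior model and the reference rows come
    from the best prior model.
    """
    solution, *_ = np.linalg.lstsq(source_targets, reference_targets, rcond=None)
    return solution.T  # so that T @ psi_k(x) approximates the reference target

def apply_alignment(T, targets):
    """Map every row of participant k's targets into the reference space."""
    return targets @ T.T

rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 4))                 # targets of the best prior model
mixing = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # synthetic distortion
source = reference @ mixing.T                        # participant k's targets

T = fit_alignment(source, reference)
aligned = apply_alignment(T, source)
print(np.abs(aligned - reference).max())
```

In this synthetic setup the distortion is exactly linear, so the fitted `T` recovers the inverse of `mixing`; with real prior models the fit would only be approximate.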
- ImageNet-1K (224×224)
- Tiny-ImageNet (64×64)
- CIFAR-100 (32×32)
- CIFAR-10 (32×32)
- Accuracy: Representation quality assessed using offline linear probing strategy
- Computational Efficiency: Quantified through time cost (seconds)
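For readers unfamiliar with linear probing, the idea is to freeze the encoder, extract features once, and fit only a linear classifier on top of them. The sketch below is an illustration under stated assumptions, not the paper's evaluation code: it fits the linear layer with ridge-regularized least squares on one-hot labels instead of gradient-based training, and the function name and toy features are hypothetical.

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels,
                          num_classes, ridge=1e-3):
    """Fit a linear classifier on frozen features and report test accuracy."""
    def with_bias(feats):
        return np.hstack([feats, np.ones((len(feats), 1))])

    A = with_bias(train_feats)
    onehot = np.eye(num_classes)[train_labels]
    # Ridge-regularized least squares: (A^T A + ridge * I) W = A^T Y
    W = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ onehot)
    preds = np.argmax(with_bias(test_feats) @ W, axis=1)
    return float(np.mean(preds == test_labels))

rng = np.random.default_rng(0)

def make_split(n_per_class):
    """Toy 'frozen features': two tight, well-separated clusters."""
    feats = np.vstack([rng.normal(-5.0, 0.1, size=(n_per_class, 2)),
                       rng.normal(5.0, 0.1, size=(n_per_class, 2))])
    labels = np.repeat([0, 1], n_per_class)
    return feats, labels

train_x, train_y = make_split(20)
test_x, test_y = make_split(10)
accuracy = linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes=2)
print(accuracy)
```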
Compared against state-of-the-art self-supervised learning methods:
- SimCLR, BYOL, DINO, MoCo, SimSiam, SwAV, DCL
- Hardware: 4 NVIDIA RTX 4090 GPUs
- Prior models: Multiple pretrained CLIP models
- Optimizer: AdamW
- Batch size: 128 (256 for ImageNet-1K)
- Results reported as mean and variance over 3 random seeds
Comparison with Self-Supervised Learning Methods (Table 1):
- CIFAR-10: 89.5% vs BYOL 82.8% (↑6.7%), training speedup 1.87×
- CIFAR-100: 67.3% vs DCL 58.2% (↑9.1%), training speedup 1.95×
- Tiny-ImageNet: 60.3% vs DCL 44.6% (↑15.7%), training speedup 1.94×
- ImageNet-1K: 69.8% vs BYOL 61.9% (↑7.9%), training speedup 1.20×
Comparison with Centralized Optimization (Table 2):
- COOPT on CIFAR-100: 65.8% vs centralized 62.1%
- Training time: 16.31s vs 23.71s
Cross-Architecture Generalization (Table 3):
COOPT significantly outperforms BYOL across multiple network architectures:
- ResNet-50: 63.8% vs 60.4%
- ResNet-101: 65.7% vs 61.5%
- MobileNet-v2: 58.1% vs 24.0%
- EfficientNet-b0: 70.7% vs 2.3%
- ViT: 57.8% vs 38.5%
Necessity of Target Alignment:
- Without alignment: significant performance degradation
- Alignment to optimal model: 16.9% performance improvement
- Alignment strategy effectiveness verified through t-SNE visualization
Impact of Shared Data Size:
- A shared dataset of only 0.05% of the data is enough for good results
- On ImageNet-1K, sharing 0.001% of the data is sufficient
Computational Overhead:
- Uniform value estimation: 139.16s
- Alignment process: 36.97s
- Negligible compared to BYOL's 133,766.19s
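The "negligible" claim can be checked with simple arithmetic on the timings listed above. The variable names are illustrative only.

```python
uniform_value_seconds = 139.16
alignment_seconds = 36.97
byol_training_seconds = 133_766.19

overhead_seconds = uniform_value_seconds + alignment_seconds
overhead_fraction = overhead_seconds / byol_training_seconds
print(f"{overhead_seconds:.2f} s, {overhead_fraction:.2%} of BYOL training time")
```

The combined overhead is about 176 seconds, or roughly 0.13% of the reported BYOL training time.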
- Weak Prior Models Remain Effective: Even with weak prior models, COOPT significantly accelerates early training stages
- Continuous Optimization Potential: Data quality continuously improves as prior models evolve, achieving 4.6% performance improvement after 10 rounds
- Prior Dataset Impact: Prior models trained on ImageNet-1K achieve significant improvements across all datasets
Model-centric approaches learning representations through proxy tasks:
- InstDisc: Instance discrimination
- MoCo: Momentum contrast
- SimCLR: Simple contrastive learning framework
- BYOL: Bootstrap your own latent
Knowledge distillation leverages soft labels from teacher models to improve student training, but the knowledge remains locked in model parameters.
Dataset distillation learns compact distilled datasets, primarily focusing on the optimization of labeled data.
- COOPT successfully transcends model-centric paradigm limitations, achieving data-centric collaborative optimization
- Optimized data exhibits architecture-agnostic properties, reusability, and efficiency
- Effective training acceleration is possible even with weak prior models
- Overall performance inevitably declines when all prior models are extremely weak
- Privacy protection mechanisms require further enhancement
- Currently focuses primarily on optimization of open-source unlabeled data
- Develop more advanced strategies to effectively utilize data optimized by extremely weak prior models
- Enhance privacy protection mechanisms
- Extend to more data types and tasks
- Paradigm Innovation: Shift from model-centric to data-centric perspective carries significant theoretical importance
- Practical Value: Addresses real-world problems of knowledge reusability and training efficiency
- Systematic Approach: Provides comprehensive collaborative optimization framework including problem identification and solutions
- Sufficient Experiments: Comprehensive validation across multiple datasets and architectures
- Insufficient Theoretical Analysis: Lacks deep theoretical analysis of why data optimization is effective
- Limited Privacy Considerations: While privacy issues are mentioned, solutions are insufficient
- Prior Model Dependency: Method effectiveness heavily depends on prior model quality
- Scalability Verification: Requires validation on larger-scale datasets
- Academic Contribution: Provides new perspectives for unlabeled data utilization, potentially triggering paradigm shifts
- Practical Value: Significant application value for resource-constrained scenarios
- Reproducibility: Authors commit to releasing code, facilitating result reproduction
- Distributed Resource Scenarios: Multi-party collaboration with dispersed resources
- Frequent Model Changes: Scenarios requiring cross-architecture knowledge reuse
- Large-Scale Unlabeled Data: Cases where traditional self-supervised learning costs are prohibitive
This paper cites important works in self-supervised learning, knowledge distillation, and dataset distillation, including:
- Chen et al. (2020): SimCLR
- Grill et al. (2020): BYOL
- He et al. (2020): MoCo
- Wang & Isola (2020): Theoretical foundations of contrastive representation learning
- Sun et al. (2024): Theoretical validation of RELA method