Collaborative Unlabeled Data Optimization

Shang, Sun, Liu et al.

This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying two key limitations in existing model-centric approaches, both rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94\times$ and $1.2\times$.

Basic Information

Paper ID: 2505.14117
Title: Collaborative Unlabeled Data Optimization
Authors: Xinyi Shang (UCL), Peng Sun (Zhejiang University & Westlake University), Fengyuan Liu (USTC), Tao Lin (Westlake University)
Classification: cs.LG cs.AI
Publication Date/Venue: Preprint (arXiv:2505.14117v2)
Paper Link: https://arxiv.org/abs/2505.14117v2

Abstract

This paper pioneers a novel data-centric paradigm aimed at maximizing the utility of unlabeled data, addressing a critical question: how can we enhance the sustainability and efficiency of deep learning training by optimizing the data itself? The authors first identify two key limitations of existing model-centric approaches, both stemming from a common bottleneck: knowledge extracted from data is locked within model parameters, hindering its reusability and scalability. To address this, they propose COOPT, an efficient, parallelized framework for collaborative unlabeled data optimization. By processing unlabeled data in a distributed manner and leveraging publicly available task-agnostic prior models, COOPT transforms raw unlabeled data into knowledge-rich training sets that are effective, efficient, reusable, and easily shareable. The method achieves a 7.9% improvement over BYOL on ImageNet-1K.

Research Background and Motivation

Problem Background

In the era of big data, despite data abundance, the majority of data remains unlabeled. The mainstream paradigm for leveraging unlabeled data is self-supervised learning (SSL), a model-centric approach that encodes data information into model parameters through carefully designed proxy tasks and loss functions.

Core Problems

Existing model-centric approaches face two key challenges:

Architecture Coupling: Training protocols are tightly coupled with specific network architectures, severely hindering the transferability and reusability of trained models across other architectures
Computational Efficiency Issues: Despite progress in acceleration, training on large-scale unlabeled datasets remains computationally prohibitive

Fundamental Bottleneck

The core of these challenges is a common bottleneck: knowledge extracted from data is locked within model parameters, limiting its adaptability and preventing efficient reuse across different tasks or architectures.

Research Motivation

To transcend the model-centric paradigm, the authors propose a data-centric paradigm that effectively encodes knowledge into the data itself rather than model parameters by directly optimizing unlabeled data.

Core Contributions

Proposes COOPT Framework: The first data-centric framework for collaborative optimization of unlabeled data, which transforms raw unlabeled samples into optimized data by leveraging task-agnostic prior models, achieving high performance, efficiency, strong generalization, and reusability
Identifies and Resolves Target Distribution Inconsistency: Identifies the critical issue of Target Distribution Inconsistency within the COOPT framework and introduces a lightweight target alignment strategy to address it
Comprehensive Experimental Validation: Conducts extensive experiments across multiple datasets and models, validating COOPT's advantages and demonstrating that even with weak prior models, COOPT can effectively accelerate the early training stages

Methodology Details

Task Definition

Data Optimization Definition: Given a large-scale unlabeled dataset $D = D_X = \{x_i\}_{i=1}^N$, data optimization aims to assign targets $D_Y = \{y_i\}_{i=1}^N$ to construct an optimal labeled dataset $D' = \{(x_i, y_i)\}_{i=1}^N$, such that models trained on $D'$ achieve significantly higher performance with substantially reduced training cost compared to models trained on $D$.

Objective Function: $E_{(x,y)\sim P_T}[\ell(\phi_{\theta_D}(x), y)] > E_{(x,y)\sim P_T}[\ell(\phi_{\theta_{D'}}(x), y)]$

where $P_T$ is the test distribution, $\ell$ is the loss function, and $\theta_D$ and $\theta_{D'}$ are network parameters trained on $D$ and $D'$ respectively.

Model Architecture

COOPT is a collaborative parallelized framework comprising an open data platform and K participants, each equipped with different prior models.

Five-Step Operation Pipeline:

Step 1: Data Distribution

The open data platform randomly partitions unlabeled data $D$ into K non-overlapping subsets
Each participant downloads one subset $D^{(k)}$

Step 2: Data Optimization

Each participant optimizes its own subset $D^{(k)}$ using its prior model $\psi^{(k)}$
Targets are assigned following the data optimization definition above: $D' = \{(x_i, y_i) \mid y_i = W\psi(x_i), \forall x_i \in D_X\}$

Step 3: Data Alignment

Resolves target distribution inconsistency
Uses a learnable transformation matrix $T^{(k)}$ to align each participant's target distribution with that of the optimal prior model

Step 4: Data Upload

Participants upload optimized datasets back to the platform

Step 5: Data Aggregation

Platform aggregates all optimized datasets to form a unified dataset
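The five steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`partition`, `optimize`, `align`, `coopt_round`) are hypothetical, samples are plain floats, and each prior model and alignment transform is an ordinary callable standing in for a pretrained network and a learned matrix.

```python
import random

def partition(data, k, seed=0):
    """Step 1: randomly split `data` into k non-overlapping subsets."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    return [[data[i] for i in idx[j::k]] for j in range(k)]

def optimize(subset, prior_model):
    """Step 2: pair each sample with the target its prior model produces."""
    return [(x, prior_model(x)) for x in subset]

def align(pairs, transform):
    """Step 3: map a participant's targets into the reference target space."""
    return [(x, transform(y)) for x, y in pairs]

def coopt_round(data, prior_models, transforms, seed=0):
    """Steps 1-5 for K = len(prior_models) participants (hypothetical helper)."""
    subsets = partition(data, len(prior_models), seed)
    unified = []
    for subset, psi, t in zip(subsets, prior_models, transforms):
        # Steps 4-5: each participant "uploads" its aligned pairs,
        # and the platform concatenates them into one dataset.
        unified.extend(align(optimize(subset, psi), t))
    return unified

# Toy usage: the second prior model uses a different target scale,
# and its transform undoes that scale so all targets agree.
data = [float(i) for i in range(10)]
priors = [lambda x: x, lambda x: 2.0 * x]
transforms = [lambda y: y, lambda y: y / 2.0]
unified = coopt_round(data, priors, transforms)
```

In this toy setting the alignment step is what makes the aggregated targets consistent; without it, half of the samples would carry targets on a different scale.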

Technical Innovations

1. Target Distribution Inconsistency Identification

In the collaborative framework, participants use different prior models, so the targets they produce follow inconsistent distributions, which degrades the generalization of models trained on the aggregated data.

2. Prior Model Quality Assessment

Assesses prior model quality using the Uniform Value Loss: $V_{uniform}(\psi; S) = \log E_{x_i, x_j \sim S}[e^{-\tau \|\psi(x_i) - \psi(x_j)\|_2^2}]$

where $\tau > 0$ is a temperature and lower uniform values indicate higher-quality prior models, since their embeddings are spread more evenly.
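As a rough illustration of how the uniform value ranks prior models, the sketch below evaluates it on two hand-made sets of 2-D embeddings. It assumes the negative-exponent form of the uniformity loss from Wang & Isola (2020), averages over distinct pairs only, and uses an arbitrary `tau=2.0`; the function name `uniform_value` and the example data are hypothetical.

```python
import math
from itertools import combinations

def uniform_value(embeddings, tau=2.0):
    """Log of the mean Gaussian potential over all distinct embedding pairs."""
    pairs = list(combinations(embeddings, 2))
    total = 0.0
    for u, v in pairs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
        total += math.exp(-tau * sq_dist)
    return math.log(total / len(pairs))

def on_circle(angles):
    """Unit-norm 2-D embeddings at the given angles (radians)."""
    return [(math.cos(t), math.sin(t)) for t in angles]

# Embeddings spread evenly over the circle vs. bunched in one small arc.
spread = on_circle([0.0, math.pi / 2, math.pi, 3 * math.pi / 2])
clustered = on_circle([0.0, 0.1, 0.2, 0.3])

v_spread = uniform_value(spread)
v_clustered = uniform_value(clustered)
```

Under these assumptions the evenly spread embeddings receive the lower value, so a prior model producing them would be preferred as the alignment reference.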

3. Target Alignment Strategy

Achieves target alignment through transformation matrix optimization: $T^{(k)} = \arg\min_{T \in \mathbb{R}^{n \times n}} \{\|T \cdot \psi^{(k)}(S_X) - S_Y^*\|_2^2\}$

where $S_X$ is the shared dataset and $S_Y^*$ denotes the targets that the optimal prior model assigns to it.
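The alignment objective is an ordinary least-squares problem, so one way to obtain $T^{(k)}$ is through the normal equations $(F F^\top)\,T^\top = F\,Y^\top$, with $F = \psi^{(k)}(S_X)$ and $Y = S_Y^*$. The sketch below assumes both are $n \times m$ matrices with one column per shared sample and that $F F^\top$ is invertible. It uses only the standard library for self-containment, and the helper names are hypothetical. The source describes $T^{(k)}$ as learnable, so the authors may well fit it iteratively instead.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def fit_alignment(feats, ref_targets):
    """Least-squares T minimizing ||T @ feats - ref_targets||^2."""
    gram = matmul(feats, transpose(feats))        # F F^T, shape n x n
    rhs = matmul(feats, transpose(ref_targets))   # F Y^T, shape n x n
    return transpose(solve(gram, rhs))            # solve for T^T, then transpose

# Toy check: the reference targets are a 90-degree rotation of the features.
feats = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]
t_true = [[0.0, -1.0],
          [1.0, 0.0]]
ref_targets = matmul(t_true, feats)
t_hat = fit_alignment(feats, ref_targets)
```

Because the toy targets are an exact linear map of the features, the fit recovers the rotation. With real prior models the residual would generally be nonzero.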

Experimental Setup

Datasets

ImageNet-1K (224×224)
Tiny-ImageNet (64×64)
CIFAR-100 (32×32)
CIFAR-10 (32×32)

Evaluation Metrics

Accuracy: Representation quality is assessed with an offline linear probing strategy
Computational Efficiency: Quantified through time cost (seconds)

Comparison Methods

Compared against state-of-the-art self-supervised learning methods:

SimCLR, BYOL, DINO, MoCo, SimSiam, SwAV, DCL

Implementation Details

Hardware: 4 NVIDIA RTX 4090 GPUs
Prior models: Multiple pretrained CLIP models
Optimizer: AdamW
Batch size: 128 (256 for ImageNet-1K)
Results reported as mean and variance over 3 random seeds

Experimental Results

Main Results

Comparison with Self-Supervised Learning Methods (Table 1):

CIFAR-10: 89.5% vs BYOL 82.8% (↑6.7%), training speedup 1.87×
CIFAR-100: 67.3% vs DCL 58.2% (↑9.1%), training speedup 1.95×
Tiny-ImageNet: 60.3% vs DCL 44.6% (↑15.7%), training speedup 1.94×
ImageNet-1K: 69.8% vs BYOL 61.9% (↑7.9%), training speedup 1.20×

Comparison with Centralized Optimization (Table 2):

COOPT on CIFAR-100: 65.8% vs centralized 62.1%
Training time: 16.31s (COOPT) vs 23.71s (centralized)

Generalization and Reusability Experiments

Cross-Architecture Generalization (Table 3): COOPT significantly outperforms BYOL across multiple network architectures:

ResNet-50: 63.8% vs 60.4%
ResNet-101: 65.7% vs 61.5%
MobileNet-v2: 58.1% vs 24.0%
EfficientNet-b0: 70.7% vs 2.3%
ViT: 57.8% vs 38.5%

Ablation Studies

Necessity of Target Alignment:

Without alignment: significant performance degradation
Alignment to optimal model: 16.9% performance improvement
Alignment strategy effectiveness verified through t-SNE visualization

Impact of Shared Data Size:

A shared set of only 0.05% of the data is enough for good results
On ImageNet-1K, 0.001% of the data is sufficient

Computational Overhead:

Uniform value estimation: 139.16s
Alignment process: 36.97s
Both are negligible compared to BYOL's time cost of 133,766.19s
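As a rough check on the figures above, and assuming the two overhead costs simply add, the combined overhead relative to BYOL's reported time is:

```latex
$$
\frac{139.16\,\mathrm{s} + 36.97\,\mathrm{s}}{133{,}766.19\,\mathrm{s}}
= \frac{176.13}{133{,}766.19}
\approx 1.3 \times 10^{-3}
\approx 0.13\%
$$
```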

Experimental Findings

Weak Prior Models Remain Effective: Even with weak prior models, COOPT significantly accelerates early training stages
Continuous Optimization Potential: Data quality continuously improves as prior models evolve, achieving 4.6% performance improvement after 10 rounds
Prior Dataset Impact: Prior models trained on ImageNet-1K achieve significant improvements across all datasets

Related Work

Self-Supervised Learning

Model-centric approaches that learn representations through proxy tasks:

InstDisc: Instance discrimination
MoCo: Momentum contrast
SimCLR: Simple contrastive learning framework
BYOL: Bootstrap your own latent

Conclusions

COOPT successfully transcends model-centric paradigm limitations, achieving data-centric collaborative optimization
Optimized data exhibits architecture-agnostic properties, reusability, and efficiency
Effective training acceleration is possible even with weak prior models

Limitations

Overall performance inevitably declines when all prior models are extremely weak
Privacy protection mechanisms require further enhancement
Currently focuses primarily on optimization of open-source unlabeled data

Future Directions

Develop more advanced strategies to effectively utilize data optimized by extremely weak prior models
Enhance privacy protection mechanisms
Extend to more data types and tasks

In-Depth Evaluation

Strengths

Paradigm Innovation: Shift from model-centric to data-centric perspective carries significant theoretical importance
Practical Value: Addresses real-world problems of knowledge reusability and training efficiency
Systematic Approach: Provides comprehensive collaborative optimization framework including problem identification and solutions
Sufficient Experiments: Comprehensive validation across multiple datasets and architectures

Weaknesses

Insufficient Theoretical Analysis: Lacks deep theoretical analysis of why data optimization is effective
Limited Privacy Considerations: While privacy issues are mentioned, solutions are insufficient
Prior Model Dependency: Method effectiveness heavily depends on prior model quality
Scalability Verification: Requires validation on larger-scale datasets

Impact

Academic Contribution: Provides new perspectives for unlabeled data utilization, potentially triggering paradigm shifts
Practical Value: Significant application value for resource-constrained scenarios
Reproducibility: Authors commit to releasing code, facilitating result reproduction

Applicable Scenarios

Distributed Resource Scenarios: Multi-party collaboration with dispersed resources
Frequent Model Changes: Scenarios requiring cross-architecture knowledge reuse
Large-Scale Unlabeled Data: Cases where traditional self-supervised learning costs are prohibitive

References

This paper cites important works in self-supervised learning, knowledge distillation, and dataset distillation, including:

Chen et al. (2020): SimCLR
Grill et al. (2020): BYOL
He et al. (2020): MoCo
Wang & Isola (2020): Theoretical foundations of contrastive representation learning
Sun et al. (2024): Theoretical validation of RELA method