Collaborative Unlabeled Data Optimization

Shang, Sun, Liu et al.
This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94\times$ and $1.2\times$.

Basic Information

  • Paper ID: 2505.14117
  • Title: Beyond Model-Centric: Collaborative Data Optimization for Reusing and Sharing
  • Authors: Xinyi Shang (UCL), Peng Sun (Zhejiang University & Westlake University), Fengyuan Liu (USTC), Tao Lin (Westlake University)
  • Classification: cs.LG cs.AI
  • Publication Date/Venue: Preprint (arXiv:2505.14117v2)
  • Paper Link: https://arxiv.org/abs/2505.14117v2

Abstract

This paper pioneers a novel data-centric paradigm aimed at maximizing the utility of unlabeled data, addressing a critical question: how can we enhance the sustainability and efficiency of deep learning training by optimizing the data itself? The authors first identify two key limitations of existing model-centric approaches, both stemming from a common bottleneck: knowledge extracted from data is locked within model parameters, hindering its reusability and scalability. To address this, they propose COOPT, an efficient, parallelized framework for collaborative unlabeled data optimization. By processing unlabeled data in a distributed manner and leveraging publicly available task-agnostic prior models, COOPT transforms raw unlabeled data into knowledge-rich training sets that are effective, efficient, reusable, and easily shareable. The method achieves a 7.9% improvement over BYOL on ImageNet-1K.

Research Background and Motivation

Problem Background

In the era of big data, data is abundant, yet most of it remains unlabeled. The mainstream paradigm for leveraging unlabeled data is self-supervised learning (SSL), a model-centric approach that encodes data information into model parameters through carefully designed proxy tasks and loss functions.

Core Problems

Existing model-centric approaches face two key challenges:

  1. Architecture Coupling: Training protocols are tightly coupled with specific network architectures, severely hindering the transferability and reusability of trained models across other architectures
  2. Computational Efficiency Issues: Despite progress in acceleration, training on large-scale unlabeled datasets remains computationally prohibitive

Fundamental Bottleneck

At the core of both challenges lies a common bottleneck: knowledge extracted from data is locked within model parameters, limiting its adaptability and preventing efficient reuse across different tasks or architectures.

Research Motivation

To transcend the model-centric paradigm, the authors propose a data-centric paradigm that directly optimizes unlabeled data, encoding knowledge into the data itself rather than into model parameters.

Core Contributions

  1. Proposes COOPT Framework: The first data-centric framework for collaborative optimization of unlabeled data, which transforms raw unlabeled samples into optimized data by leveraging task-agnostic prior models, achieving high performance, efficiency, strong generalization, and reusability
  2. Identifies and Resolves Target Distribution Inconsistency: Identifies the critical issue of Target Distribution Inconsistency within the COOPT framework and introduces a lightweight target alignment strategy to address it
  3. Comprehensive Experimental Validation: Conducts extensive experiments across multiple datasets and models, validating COOPT's advantages and demonstrating that even with weak prior models, COOPT can effectively accelerate the early training stages

Methodology Details

Task Definition

Data Optimization Definition: Given a large-scale unlabeled dataset $D = D_X = \{x_i\}_{i=1}^N$, data optimization aims to assign targets $D_Y = \{y_i\}_{i=1}^N$ to construct an optimal labeled dataset $D' = \{(x_i, y_i)\}_{i=1}^N$, such that models trained on $D'$ achieve significantly higher performance with substantially reduced training cost compared to models trained on $D$.

Objective Function: $E_{(x,y)\sim P_T}[\ell(\phi_{\theta_D}(x), y)] > E_{(x,y)\sim P_T}[\ell(\phi_{\theta_{D'}}(x), y)]$

where $P_T$ is the test distribution, $\ell$ is the loss function, and $\theta_D$ and $\theta_{D'}$ are the network parameters trained on $D$ and $D'$, respectively.

Model Architecture

COOPT is a collaborative parallelized framework comprising an open data platform and K participants, each equipped with different prior models.

Five-Step Operation Pipeline:

Step 1: Data Distribution

  • The open data platform randomly partitions unlabeled data $D$ into K non-overlapping subsets
  • Each participant downloads one subset $D^{(k)}$
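
Step 1 can be sketched as a seeded shuffle followed by a round-robin split. This is a minimal illustration, not the authors' implementation: the function name `partition_indices`, the `seed` parameter, and the choice to split sample indices rather than raw samples are all assumptions.

```python
import random


def partition_indices(num_samples, num_participants, seed=0):
    """Randomly split sample indices into disjoint, near-equal subsets.

    Hypothetical helper: the name, arguments, and round-robin split are
    illustrative assumptions, not taken from the paper.
    """
    if num_participants < 1:
        raise ValueError("need at least one participant")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    # Participant k takes every num_participants-th index starting at k,
    # so subsets never overlap and their sizes differ by at most one.
    return [indices[k::num_participants] for k in range(num_participants)]


subsets = partition_indices(num_samples=10, num_participants=3)
```

Splitting indices keeps the platform free to ship the actual samples however it likes; each participant only needs to know which indices it owns.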

Step 2: Data Optimization

  • Each participant optimizes their respective dataset $D^{(k)}$ using prior model $\psi_k$
  • Target assignment according to Definition 1: $D' = \{(x_i, y_i) \mid y_i = W\psi(x_i), \forall x_i \in D_X\}$
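
The target assignment in Step 2 amounts to one forward pass through the prior model followed by a linear map. The sketch below is a toy illustration under stated assumptions: `toy_prior` stands in for a pretrained encoder, and the matrix `W` is an arbitrary example, since the summary does not say how $W$ is obtained.

```python
def assign_targets(samples, prior_model, W):
    """Pair each sample x with the target y = W @ prior_model(x)."""
    optimized = []
    for x in samples:
        features = prior_model(x)
        y = [sum(w * f for w, f in zip(row, features)) for row in W]
        optimized.append((x, y))
    return optimized


def toy_prior(x):
    # Hypothetical stand-in for a pretrained encoder such as CLIP.
    return [x[0] + x[1], x[0] - x[1]]


W = [[1.0, 0.0], [0.0, 2.0]]  # illustrative matrix, not from the paper
optimized = assign_targets([[1.0, 2.0], [0.5, 0.5]], toy_prior, W)
```

The prior model is only queried, never updated, so each participant can run this step independently and in parallel.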

Step 3: Data Alignment

  • Resolves target distribution inconsistency
  • Uses a learnable transformation matrix $T^{(k)}$ to align each participant's target distribution with that of the optimal prior model

Step 4: Data Upload

  • Participants upload optimized datasets back to the platform

Step 5: Data Aggregation

  • Platform aggregates all optimized datasets to form a unified dataset

Technical Innovations

1. Target Distribution Inconsistency Identification

In the collaborative framework, participants use different prior models, so the targets they produce follow inconsistent distributions, which harms model generalization.

2. Prior Model Quality Assessment

Assesses prior model quality using Uniform Value Loss: $V_{uniform}(\psi; S) = \log E_{x_i, x_j \sim S}[e^{\tau \|\psi(x_i) - \psi(x_j)\|_2^2}]$

where lower uniform values indicate higher quality prior models.
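
A small sketch of how the uniform value could be estimated from a batch of features. Several choices here are assumptions rather than details from the paper: features are taken to be already L2-normalized, the expectation is estimated over all distinct pairs, and the coefficient defaults to `tau=-2.0`. The negative sign follows the convention of the uniformity loss in Wang & Isola (2020), under which more spread-out features score lower, in line with "lower is better" above.

```python
import math
from itertools import combinations


def uniform_value(features, tau=-2.0):
    """Estimate log E[exp(tau * ||f_i - f_j||^2)] over distinct pairs.

    Assumptions (not from the paper): inputs are L2-normalized feature
    vectors, all distinct pairs are used, and tau is negative.
    """
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    total = 0.0
    for a, b in pairs:
        sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
        total += math.exp(tau * sq_dist)
    return math.log(total / len(pairs))


# Collapsed features (all identical) versus two antipodal unit vectors.
collapsed = uniform_value([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
spread = uniform_value([[1.0, 0.0], [-1.0, 0.0]])
```

With a negative `tau`, a collapsed prior model gives the largest possible value (zero), while well-spread features push the value down.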

3. Target Alignment Strategy

Achieves target alignment through transformation matrix optimization: $T^{(k)} = \arg\min_{T \in \mathbb{R}^{n \times n}} \{\|T \cdot \psi^{(k)}(S_X) - S_Y^*\|_2^2\}$

where $S_Y^*$ is the target of the optimal prior model on the shared dataset.
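
The alignment objective is an ordinary least-squares problem, so one way to illustrate it is a closed-form solve. Note the swap: the summary calls the matrix learnable, which may mean gradient-based fitting, whereas this sketch uses the normal equations instead. It assumes the norm is a sum of squared errors over the shared samples and that the shared features are full rank; `fit_alignment` and `apply_alignment` are hypothetical names.

```python
def fit_alignment(features, targets):
    """Return the n x n matrix T minimizing sum_i ||T f_i - g_i||^2.

    features: outputs of participant k's prior model on the shared samples.
    targets:  outputs of the reference (best) prior model on the same samples.
    Hypothetical helper; solves the normal equations T A = B by Gauss-Jordan.
    """
    n = len(features[0])
    # A = sum_i f_i f_i^T and B = sum_i g_i f_i^T, both n x n.
    A = [[sum(f[r] * f[c] for f in features) for c in range(n)] for r in range(n)]
    B = [[sum(g[r] * f[c] for f, g in zip(features, targets)) for c in range(n)]
         for r in range(n)]
    # A is symmetric, so T A = B is equivalent to A T^T = B^T.
    aug = [A[r] + [B[c][r] for c in range(n)] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("shared features are rank-deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    t_transposed = [row[n:] for row in aug]
    return [[t_transposed[c][r] for c in range(n)] for r in range(n)]


def apply_alignment(T, feature):
    """Map one feature vector into the reference target space."""
    return [sum(t * f for t, f in zip(row, feature)) for row in T]


# Toy shared set: the reference targets equal [[2, 1], [0, 3]] @ f.
shared_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_targets = [[2.0, 0.0], [1.0, 3.0], [3.0, 3.0]]
T = fit_alignment(shared_features, reference_targets)
aligned = apply_alignment(T, [2.0, 1.0])
```

Once fitted on the small shared set, the same matrix can be applied to every target the participant produced, which is consistent with the low alignment cost reported in the ablations.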

Experimental Setup

Datasets

  • ImageNet-1K (224×224)
  • Tiny-ImageNet (64×64)
  • CIFAR-100 (32×32)
  • CIFAR-10 (32×32)

Evaluation Metrics

  • Accuracy: Representation quality assessed using offline linear probing strategy
  • Computational Efficiency: Quantified through time cost (seconds)

Comparison Methods

Compared against state-of-the-art self-supervised learning methods:

  • SimCLR, BYOL, DINO, MoCo, SimSiam, SwAV, DCL

Implementation Details

  • Hardware: 4 NVIDIA RTX 4090 GPUs
  • Prior models: Multiple pretrained CLIP models
  • Optimizer: AdamW
  • Batch size: 128 (256 for ImageNet-1K)
  • Results reported as mean and variance over 3 random seeds

Experimental Results

Main Results

Comparison with Self-Supervised Learning Methods (Table 1):

  • CIFAR-10: 89.5% vs BYOL 82.8% (↑5.6%), training speedup 1.87×
  • CIFAR-100: 67.3% vs DCL 58.2% (↑9.1%), training speedup 1.95×
  • Tiny-ImageNet: 60.3% vs DCL 44.6% (↑15.7%), training speedup 1.94×
  • ImageNet-1K: 69.8% vs BYOL 61.9% (↑7.9%), training speedup 1.20×

Comparison with Centralized Optimization (Table 2):

  • COOPT on CIFAR-100: 65.8% vs centralized 62.1%
  • Training time: 16.31s vs 23.71s

Generalization and Reusability Experiments

Cross-Architecture Generalization (Table 3): COOPT significantly outperforms BYOL across multiple network architectures:

  • ResNet-50: 63.8% vs 60.4%
  • ResNet-101: 65.7% vs 61.5%
  • MobileNet-v2: 58.1% vs 24.0%
  • EfficientNet-b0: 70.7% vs 2.3%
  • ViT: 57.8% vs 38.5%

Ablation Studies

Necessity of Target Alignment:

  • Without alignment: significant performance degradation
  • Alignment to optimal model: 16.9% performance improvement
  • Alignment strategy effectiveness verified through t-SNE visualization

Impact of Shared Data Size:

  • Sharing only 0.05% of the data is enough for good results
  • On ImageNet-1K, 0.001% of data is sufficient

Computational Overhead:

  • Uniform value estimation: 139.16s
  • Alignment process: 36.97s
  • Negligible compared to BYOL's 133,766.19s
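
As a rough check on the "negligible" claim, the two overhead figures above can be summed and divided by the BYOL figure (values rounded):

```latex
\frac{139.16\,\text{s} + 36.97\,\text{s}}{133{,}766.19\,\text{s}}
  = \frac{176.13}{133{,}766.19}
  \approx 1.3 \times 10^{-3}
  \approx 0.13\%
```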

Experimental Findings

  1. Weak Prior Models Remain Effective: Even with weak prior models, COOPT significantly accelerates early training stages
  2. Continuous Optimization Potential: Data quality continuously improves as prior models evolve, achieving 4.6% performance improvement after 10 rounds
  3. Prior Dataset Impact: Prior models trained on ImageNet-1K achieve significant improvements across all datasets

Related Work

Self-Supervised Learning

Model-centric approaches learning representations through proxy tasks:

  • InstDisc: Instance discrimination
  • MoCo: Momentum contrast
  • SimCLR: Simple contrastive learning framework
  • BYOL: Bootstrap your own latent

Knowledge Distillation

Leverages soft labels from teacher models to improve student training, but knowledge remains locked in model parameters.

Dataset Distillation

Learns compact distilled datasets, primarily focusing on optimization of labeled data.

Conclusions and Discussion

Main Conclusions

  1. COOPT successfully transcends model-centric paradigm limitations, achieving data-centric collaborative optimization
  2. Optimized data exhibits architecture-agnostic properties, reusability, and efficiency
  3. Effective training acceleration is possible even with weak prior models

Limitations

  1. Overall performance inevitably declines when all prior models are extremely weak
  2. Privacy protection mechanisms require further enhancement
  3. Currently focuses primarily on optimization of open-source unlabeled data

Future Directions

  1. Develop more advanced strategies to effectively utilize data optimized by extremely weak prior models
  2. Enhance privacy protection mechanisms
  3. Extend to more data types and tasks

In-Depth Evaluation

Strengths

  1. Paradigm Innovation: Shift from model-centric to data-centric perspective carries significant theoretical importance
  2. Practical Value: Addresses real-world problems of knowledge reusability and training efficiency
  3. Systematic Approach: Provides comprehensive collaborative optimization framework including problem identification and solutions
  4. Thorough Experiments: Comprehensive validation across multiple datasets and architectures

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks deep theoretical analysis of why data optimization is effective
  2. Limited Privacy Considerations: While privacy issues are mentioned, solutions are insufficient
  3. Prior Model Dependency: Method effectiveness heavily depends on prior model quality
  4. Scalability Verification: Requires validation on larger-scale datasets

Impact

  1. Academic Contribution: Provides new perspectives for unlabeled data utilization, potentially triggering paradigm shifts
  2. Practical Value: Significant application value for resource-constrained scenarios
  3. Reproducibility: Authors commit to releasing code, facilitating result reproduction

Applicable Scenarios

  1. Distributed Resource Scenarios: Multi-party collaboration with dispersed resources
  2. Frequent Model Changes: Scenarios requiring cross-architecture knowledge reuse
  3. Large-Scale Unlabeled Data: Cases where traditional self-supervised learning costs are prohibitive

References

This paper cites important works in self-supervised learning, knowledge distillation, and dataset distillation, including:

  • Chen et al. (2020): SimCLR
  • Grill et al. (2020): BYOL
  • He et al. (2020): MoCo
  • Wang & Isola (2020): Theoretical foundations of contrastive representation learning
  • Sun et al. (2024): Theoretical validation of RELA method