Understanding Self-supervised Contrastive Learning through Supervised Objectives

Lee
Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.
Basic Information

  • Paper ID: 2510.10572
  • Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives
  • Author: Byeongchan Lee (KAIST)
  • Classification: cs.LG (Machine Learning)
  • Publication Venue: Transactions on Machine Learning Research (10/2025)
  • Paper Link: https://arxiv.org/abs/2510.10572

Abstract

Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. This paper provides a theoretical perspective by formulating self-supervised representation learning as an approximation of supervised representation learning objectives. Based on this formulation, the authors derive loss functions closely related to popular contrastive losses such as InfoNCE, providing insights into their underlying principles. The derivation naturally introduces the concepts of prototype representation bias and balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms.

Research Background and Motivation

Core Problems

  1. Missing Theoretical Understanding: Although self-supervised learning has achieved empirical success, its theoretical foundations remain incomplete, lacking deep understanding of why these methods work.
  2. Empirical Method Design: Existing self-supervised learning methods advance primarily through architectural innovations rather than from formalized objectives, lacking theoretical guidance.
  3. Unclear Relationship Between Supervised and Self-supervised Learning: The intrinsic connections between supervised and self-supervised learning have not been adequately elucidated.

Research Motivation

  • Building Theoretical Foundations: Provide solid theoretical foundations for self-supervised learning and explain the fundamental reasons for its effectiveness
  • Guiding Method Improvement: Provide principled guidance for algorithm design through theoretical analysis
  • Bridging Supervised and Self-supervised Learning: Establish theoretical connections between the two learning paradigms

Core Contributions

  1. Theoretical Framework Construction: Proposes a theoretical framework that formulates self-supervised representation learning as an approximation of supervised representation learning, deriving contrastive loss functions closely related to InfoNCE
  2. Theoretical Insights: Provides theoretical explanations for common practices in contrastive learning, such as representation normalization and balanced dataset usage
  3. Concept Introduction: Introduces the concept of prototype representation bias and observes its correlation with downstream performance
  4. Method Improvement: Proposes a balanced contrastive loss as a natural extension of the InfoNCE loss, improving performance by rebalancing positive and negative pair interactions

Methodology Details

Task Definition

Defines the representation learning task as learning an encoder $f_\theta: \mathcal{X} \to \mathbb{R}^d \setminus \{0\}$ such that:

  • Representations of images with the same visual concept cluster together
  • Representations of images with different visual concepts separate from each other

Theoretical Framework

Supervised Representation Learning Problem

First formulates supervised learning as a prototype optimization problem:

$$\min_\theta \; -s(f_\theta(t(x)), \mu_y) + \lambda \max_{y' \neq y} s(f_\theta(t(x)), \mu_{y'})$$

where:

  • $s(\cdot,\cdot)$ is a similarity measure (cosine similarity)
  • $\mu_y$ is the prototype representation for label $y$
  • $\lambda > 0$ is a balancing parameter
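The per-sample objective above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names, the dictionary of prototypes, and the toy 2-D vectors are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity s(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prototype_loss(z, prototypes, y, lam=1.0):
    """-s(z, mu_y) + lam * max_{y' != y} s(z, mu_{y'}) for one embedding z."""
    attraction = -cosine(z, prototypes[y])
    repulsion = max(cosine(z, mu) for label, mu in prototypes.items() if label != y)
    return attraction + lam * repulsion

# Hypothetical prototypes and embedding, chosen only for illustration.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [-1.0, 0.0]}
z = [0.6, 0.8]

loss_if_cat = prototype_loss(z, prototypes, "cat")  # z is closer to "dog", so the loss is higher
loss_if_dog = prototype_loss(z, prototypes, "dog")  # z is closest to its own prototype
```

The loss is lower when the embedding is more similar to its own prototype than to the most similar competing prototype, which is the behavior the two terms are meant to encourage.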

Prototype Representation Construction

Defines the prototype representation as the expectation of representations for images with the same label:

$$\hat{\mu}_y := \mathbb{E}_{T,X|y} f_\theta(T(X))$$

Self-supervised Approximation

In the self-supervised setting, uses a surrogate prototype representation:

$$\tilde{\mu} := \mathbb{E}_T f_\theta(T(x))$$
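To make the difference between the two prototypes concrete, the following sketch replaces both expectations with empirical means over a handful of view embeddings. The helper names and the toy embeddings are assumptions for illustration; the sketch also assumes every image contributes the same number of views, so that pooling all views gives a plain average over the class.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def class_prototype(views_per_image):
    """Empirical class prototype: mean over all views of all images with one label."""
    pooled = [v for image_views in views_per_image for v in image_views]
    return mean_vector(pooled)

def surrogate_prototype(image_views):
    """Empirical surrogate prototype: mean over the views of a single image."""
    return mean_vector(image_views)

# Hypothetical embeddings f(T(X)) for two "cat" images with two views each.
cat_views = [
    [[1.0, 0.0], [0.8, 0.2]],  # views of image 0
    [[0.6, 0.4], [0.8, 0.2]],  # views of image 1
]

mu_hat_cat = class_prototype(cat_views)            # uses labels: pools both images
mu_tilde_img0 = surrogate_prototype(cat_views[0])  # label-free: image 0 only
```

The surrogate needs no labels because it only averages over transformations of one image; the gap between the two vectors is what the self-supervised approximation gives up.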

Theoretical Derivation

Attraction Term Upper Bound (Theorem 4.4)

Under cosine similarity and L2 normalization assumptions:

$$-s(f_\theta(t(x)), \mathbb{E}_T f_\theta(T(x))) \leq -\mathbb{E}_T s(f_\theta(t(x)), f_\theta(T(x)))$$
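The attraction bound can be checked numerically on toy data. The sketch below is an assumption-laden illustration, not the paper's proof: the "views" are random unit vectors clustered around one direction, and the expectation over $T$ is replaced by an average over 64 samples. For unit-norm vectors the two sides differ only by the factor $1/\|m\|$, where $m$ is the mean view, so in this sketch the inequality holds because the anchor has a positive dot product with $m$; the precise conditions of the theorem are given in the paper's appendix.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Hypothetical L2-normalized view embeddings, clustered around one direction.
rng = random.Random(0)
base = [1.0] + [0.0] * 7
views = [normalize([b + rng.gauss(0.0, 0.1) for b in base]) for _ in range(64)]

anchor = views[0]                                            # plays the role of f(t(x))
mean_view = [sum(col) / len(views) for col in zip(*views)]   # empirical E_T f(T(x))

lhs = -cosine(anchor, mean_view)                             # -s(anchor, mean of views)
rhs = -sum(cosine(anchor, v) for v in views) / len(views)    # -mean of s(anchor, view)
gap = rhs - lhs                                              # non-negative when the bound holds
```

The right-hand side only involves pairwise similarities between views, which is why it can be optimized without ever forming the prototype explicitly.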

Repulsion Term Upper Bound (Theorem 4.6)

Under balanced dataset assumptions:

$$\max_{y' \neq y} s(f_\theta(t(x)), \mathbb{E}_{T',X'|y'} f_\theta(T'(X'))) \leq \mathbb{E}_{T'}\left[\frac{1}{\nu\alpha}\log\mathbb{E}_{X'}\exp\left(\alpha s(f_\theta(t(x)), f_\theta(T'(X')))\right)\right] + \frac{1}{\nu\alpha}\log n$$

Total Loss Function

Combining the above upper bounds yields:

$$\tilde{l}(\theta) = \frac{1}{\alpha|\hat{T}|}\sum_{t' \in \hat{T}}\left[-\log\frac{\exp\left(\alpha s(f_\theta(t(x)), f_\theta(t'(x)))\right)}{\left(\sum_{x' \in \hat{X}}\exp\left(\alpha s(f_\theta(t(x)), f_\theta(t'(x')))\right)\right)^{\lambda/\nu}}\right]$$
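A direct transcription of this loss for a single anchor might look like the sketch below. It is not the authors' implementation: the function name, the `ratio` argument (standing in for $\lambda/\nu$), and the convention that the batch $\hat{X}$ includes $x$ itself are assumptions made for illustration, and embeddings are plain Python lists rather than tensors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def balanced_contrastive_loss(anchor, positives, batch_views, alpha=1.0, ratio=1.0):
    """Loss for one anchor f(t(x)).

    positives[k]   : f(t'_k(x)), the anchor image under transformation t'_k
    batch_views[k] : [f(t'_k(x')) for x' in the batch], assumed to include x
    ratio          : the exponent lambda / nu on the denominator
    """
    total = 0.0
    for pos, others in zip(positives, batch_views):
        pos_term = alpha * cosine(anchor, pos)
        log_sum = math.log(sum(math.exp(alpha * cosine(anchor, o)) for o in others))
        # -log( exp(pos_term) / (sum exp)^ratio ) = -pos_term + ratio * log_sum
        total += -pos_term + ratio * log_sum
    return total / (alpha * len(positives))

# Toy example: one transformation, a batch of two images.
anchor = [1.0, 0.0]
positives = [[1.0, 0.0]]
batch_views = [[[1.0, 0.0], [0.0, 1.0]]]

loss_ratio_1 = balanced_contrastive_loss(anchor, positives, batch_views, alpha=1.0, ratio=1.0)
loss_ratio_2 = balanced_contrastive_loss(anchor, positives, batch_views, alpha=1.0, ratio=2.0)
```

With `ratio=1.0` each summand is the familiar softmax cross-entropy form with temperature $1/\alpha$, which is how the expression connects to InfoNCE-style losses; other values of `ratio` change the relative weight of the repulsion term.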

Technical Innovations

  1. Theoretical Bridge: First establishes formal theoretical connections between supervised and self-supervised learning
  2. Upper Bound Derivation: Obtains tractable upper bounds through rigorous mathematical derivation
  3. Prototype Bias Analysis: Quantifies the bias introduced by self-supervised approximation and analyzes its impact
  4. Balanced Loss Design: Proposes improved loss functions based on theoretical analysis

Experimental Setup

Datasets

  • Primary Dataset: ImageNet (1,281,167 training images, 50,000 validation images, 1,000 classes)
  • Supplementary Dataset: CIFAR-10 (50,000 training images, 10,000 test images, 10 classes)
  • Imbalanced Dataset: ImageNet-LT (115,846 images, with class sizes following a Pareto distribution)

Evaluation Metrics

  • Linear Evaluation: Top-1 accuracy of a linear classifier trained on the frozen pretrained backbone
  • k-NN Evaluation: k-NN classification accuracy based on representation similarity
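As a rough illustration of the k-NN protocol, the sketch below classifies a query embedding by majority vote among its k most cosine-similar training embeddings. The unweighted vote, the value of k, and the toy feature bank are assumptions; the paper's exact k-NN configuration is not stated in this summary.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_predict(query, bank, labels, k=3):
    """Majority label among the k bank embeddings most similar to the query."""
    scored = sorted(
        ((cosine(query, z), y) for z, y in zip(bank, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(y for _, y in scored[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(queries, query_labels, bank, labels, k=3):
    hits = sum(knn_predict(q, bank, labels, k) == y for q, y in zip(queries, query_labels))
    return hits / len(queries)

# Hypothetical frozen-encoder embeddings for a tiny train/test split.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
queries = [[0.8, 0.2], [0.2, 0.8]]
query_labels = ["a", "b"]

accuracy = knn_accuracy(queries, query_labels, bank, labels, k=3)
```

Because no classifier is trained, this evaluation depends only on the geometry of the learned representations, which complements the linear evaluation above.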

Comparison Methods

  • Baseline Methods: SimCLR and its variants
  • Loss Function Variants:
    • Balanced contrastive loss
    • Generalized NT-Xent loss
    • Decoupled contrastive loss

Implementation Details

  • Network Architecture: ResNet-50 backbone + 3-layer MLP projector
  • Training Configuration: Batch size 512, 100 epochs, SGD optimizer
  • Data Augmentation: Random cropping, color distortion, grayscale conversion, Gaussian blur, horizontal flip

Experimental Results

Main Results

Theoretical Verification Experiments

  1. Relationship Between Prototype Representation Bias and Performance:
    • Baseline SimCLR: 65.98% accuracy, 36.72 bias
    • Removing Gaussian blur: 64.57% accuracy, 37.43 bias
    • Adding random rotation: 63.30% accuracy, 38.11 bias
    • Finding: Lower prototype representation bias corresponds to higher accuracy
  2. Impact of Similarity Measures:
    • Cosine similarity + normalization: 65.98%
    • Dot product (no normalization): 0.43%
    • Negative Euclidean distance (no normalization): 10.63%
  3. Impact of Data Balance:
    • Uniform distribution: 20.82%
    • Long-tail distribution: 13.65%

Balanced Parameter Experiments

ImageNet Results

  • Balanced Contrastive Loss: Best performance at (α=4, λ=2) reaches 67.40%
  • Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 66.85%
  • Performance Improvement: Balanced contrastive loss improves by approximately 1.5 percentage points over standard NT-Xent

CIFAR-10 Results

  • Balanced Contrastive Loss: Best performance at (α=1, λ=4) reaches 86.08%
  • Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 85.85%

Ablation Studies

Impact of Data Augmentation Strategies

Verifies theoretical predictions by adding/removing different transformations:

  • Removing color distortion: Performance drops to 62.56%
  • Adding random cutout: Performance is 65.76%, slightly below the baseline
  • Baseline configuration: 65.98%

Tightness Analysis of Upper Bounds

  • Attraction Term Upper Bound: Gap gradually decreases and stabilizes during training
  • Repulsion Term Upper Bound: The gap is larger than that of the attraction term but remains controlled

Related Work

Contrastive Learning Losses

  • Historical Development: From Chopra et al. (2005) contrastive loss to triplet loss and InfoNCE loss
  • This Paper's Contribution: Provides new theoretical perspective based on supervised learning approximation

Self-supervised Learning Theory

  • Existing Perspectives:
    • Mutual information maximization perspective
    • Covariance learning unified perspective
    • Spectral embedding learning perspective
  • This Paper's Innovation: First establishes explicit theoretical connections with supervised learning

Contrastive Learning Practice

  • Architecture Design: Siamese networks, momentum encoders, stop-gradient operations
  • Theoretical Explanation: This paper provides theoretical foundations for these practices

Conclusions and Discussion

Main Conclusions

  1. Theoretical Unification: Successfully establishes theoretical bridges between supervised and self-supervised learning
  2. Practical Guidance: Provides theoretical explanations for common practices in contrastive learning
  3. Method Improvement: Balanced contrastive loss proposed based on theoretical analysis achieves performance improvements

Limitations

  1. Assumption Constraints: Theoretical analysis relies on assumptions such as cosine similarity, L2 normalization, and balanced datasets
  2. Approximation Error: The bias introduced by self-supervised approximation requires further investigation
  3. Experimental Scope: Primarily validated on image classification tasks; applicability to other domains remains to be explored

Future Directions

  1. Theoretical Extensions: Relax existing assumptions and construct more general theoretical frameworks
  2. Method Improvement: Design more effective self-supervised algorithms based on bias analysis
  3. Application Extensions: Extend the theoretical framework to other modalities and tasks

In-Depth Evaluation

Strengths

Theoretical Contributions

  1. Strong Novelty: First provides formal theoretical connections between supervised and self-supervised learning
  2. Rigorous Derivation: Complete mathematical derivation with all proofs provided in appendices
  3. Deep Insights: The prototype representation bias concept provides new perspectives for understanding self-supervised learning

Experimental Validation

  1. Reasonable Design: Experiments are tightly aligned with theoretical predictions and comprehensively validated
  2. Convincing Results: Theoretical predictions are highly consistent with experimental results
  3. Comprehensive Analysis: Validates the theoretical framework from multiple perspectives

Practical Value

  1. Method Improvement: Balanced contrastive loss achieves actual performance improvements
  2. Guiding Significance: Provides theoretical guidance for self-supervised learning algorithm design
  3. Reproducibility: Provides complete code and implementation details

Weaknesses

Theoretical Limitations

  1. Strong Assumptions: Theoretical analysis relies on multiple restrictive assumptions that may limit applicability
  2. Rough Approximations: Some theoretical derivations may introduce significant approximation errors
  3. Generalization Verification Pending: Applicability of the theoretical framework in other domains remains insufficiently verified

Experimental Limitations

  1. Limited Datasets: Primarily validated on ImageNet and CIFAR-10, lacking more diverse evaluation
  2. Single Task: Focuses mainly on image classification; validation on other vision tasks is insufficient
  3. Limited Comparison Methods: Primarily compares with SimCLR family methods, lacking comparisons with other self-supervised methods

Impact

Academic Contributions

  1. Theoretical Foundations: Provides important theoretical foundations for the self-supervised learning field
  2. Research Inspiration: May inspire more theoretical analysis work
  3. Method Guidance: Provides theoretical guidance for subsequent algorithm design

Practical Value

  1. Performance Improvement: Balanced contrastive loss achieves actual performance improvements
  2. Design Principles: Provides algorithm design principles for practitioners
  3. Hyperparameter Guidance: Provides theoretical basis for hyperparameter selection

Applicable Scenarios

  1. Research Scenarios: Suitable for self-supervised learning algorithm research requiring theoretical guidance
  2. Industrial Applications: Suitable for computer vision applications requiring high-quality representations
  3. Educational Use: Suitable as teaching material for understanding self-supervised learning principles

References

This paper cites important works in self-supervised learning, contrastive learning, and representation learning, including:

  • Chen et al. (2020a): SimCLR framework
  • He et al. (2020): MoCo method
  • Oord et al. (2018): InfoNCE loss
  • Wang & Isola (2020): Alignment and uniformity analysis of contrastive learning

Overall Assessment: This is a high-quality theoretical analysis paper that successfully establishes theoretical bridges between supervised and self-supervised learning, providing important insights into understanding the effectiveness of contrastive learning. Although there are limitations in theoretical assumptions, its contributions are significant for advancing the theoretical development of self-supervised learning.