Understanding Self-supervised Contrastive Learning through Supervised Objectives

Lee

Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.

Basic Information

Paper ID: 2510.10572
Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives
Author: Byeongchan Lee (KAIST)
Classification: cs.LG (Machine Learning)
Publication Venue: Transactions on Machine Learning Research (10/2025)
Paper Link: https://arxiv.org/abs/2510.10572

Abstract

Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. This paper provides a theoretical perspective by formulating self-supervised representation learning as an approximation of supervised representation learning objectives. Based on this formulation, the author derives a loss function closely related to popular contrastive losses such as InfoNCE, providing insight into their underlying principles. The derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms.

Research Background and Motivation

Core Problems

Missing Theoretical Understanding: Although self-supervised learning has achieved empirical success, its theoretical foundations remain incomplete, and there is little understanding of why these methods work.
Empirical Method Design: Existing self-supervised learning methods have advanced mainly through architectural innovations rather than being derived from formalized objectives, so their design lacks theoretical guidance.
Unclear Relationship Between Supervised and Self-supervised Learning: The intrinsic connections between supervised and self-supervised learning have not been adequately elucidated.

Research Motivation

Building Theoretical Foundations: Provide solid theoretical foundations for self-supervised learning and explain the fundamental reasons for its effectiveness
Guiding Method Improvement: Provide principled guidance for algorithm design through theoretical analysis
Bridging Supervised and Self-supervised Learning: Establish theoretical connections between the two learning paradigms

Core Contributions

Theoretical Framework Construction: Proposes a theoretical framework that formulates self-supervised representation learning as an approximation of supervised representation learning, deriving contrastive loss functions closely related to InfoNCE
Theoretical Insights: Provides theoretical explanations for common practices in contrastive learning, such as representation normalization and balanced dataset usage
Concept Introduction: Introduces the concept of prototype representation bias and observes its correlation with downstream performance
Method Improvement: Proposes balanced contrastive loss as a natural extension of InfoNCE loss, improving performance by balancing positive and negative pair interactions

Methodology Details

Task Definition

Defines the representation learning task as learning an encoder $f_θ: \mathcal{X} → \mathbb{R}^d \setminus \{0\}$ such that:

Representations of images with the same visual concept cluster together
Representations of images with different visual concepts separate from each other

Theoretical Framework

Supervised Representation Learning Problem

First formulates supervised learning as a prototype optimization problem: $\min_θ -s(f_θ(t(x)), μ_y) + λ \max_{y' ≠ y} s(f_θ(t(x)), μ_{y'})$

where:

$s(·,·)$ is a similarity measure (cosine similarity)
$μ_y$ is the prototype representation for label $y$
$λ > 0$ is a balancing parameter
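
As a concrete reading of this objective, the following minimal Python sketch evaluates it for a single representation against a fixed set of prototypes. The function names, the toy 2-D prototypes, and the choice λ = 1 are illustrative assumptions, not the paper's code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity s(u, v) between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def prototype_objective(z, y, prototypes, lam=1.0):
    """Evaluate -s(z, mu_y) + lam * max_{y' != y} s(z, mu_{y'}).

    z          : representation f_theta(t(x)), shape (d,)
    y          : integer label of x
    prototypes : array of shape (num_classes, d); row k plays the role of mu_k
    lam        : balancing parameter lambda > 0
    """
    sims = np.array([cosine_similarity(z, mu) for mu in prototypes])
    attraction = -sims[y]                      # pull toward the own-class prototype
    repulsion = np.max(np.delete(sims, y))     # closest prototype of another class
    return attraction + lam * repulsion

# Toy prototypes (hypothetical values chosen for easy arithmetic).
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
z = np.array([2.0, 0.0])                       # unnormalized on purpose; cosine ignores scale

loss_correct = prototype_objective(z, y=0, prototypes=prototypes)   # -1.0
loss_wrong = prototype_objective(z, y=1, prototypes=prototypes)     # 1.0
```

With label 0 the representation coincides in direction with its own prototype (similarity 1) and the closest other prototype has similarity 0, giving -1. With label 1 the roles are reversed and the value is 1.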

Prototype Representation Construction

Defines prototype representation as the expectation of representations for images with the same label: $\hat{μ}_y := \mathbb{E}_{T,X|y}f_θ(T(X))$

Self-supervised Approximation

In the self-supervised setting, uses a surrogate prototype representation: $\tilde{μ} := \mathbb{E}_T f_θ(T(x))$

Theoretical Derivation

Attraction Term Upper Bound (Theorem 4.4)

Under cosine similarity and L2 normalization assumptions: $-s(f_θ(t(x)), \mathbb{E}_T f_θ(T(x))) ≤ -\mathbb{E}_T s(f_θ(t(x)), f_θ(T(x)))$
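
One way to see why a bound of this shape is plausible: for unit-norm representations, the average cosine similarity to the views equals the dot product with their mean vector, and that mean has norm at most 1, so renormalizing it can only increase a non-negative similarity. The sketch below checks this numerically. The synthetic "views" (noisy unit vectors around a common direction) are a hypothetical stand-in for $f_θ(T(x))$, chosen so that the average similarity is positive; this is an illustration, not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

d, num_views = 16, 32
center = normalize(rng.normal(size=d))

# Hypothetical stand-in for f_theta(T(x)): unit vectors scattered around one direction.
views = normalize(center + 0.1 * rng.normal(size=(num_views, d)))

anchor = views[0]                 # plays the role of f_theta(t(x))
surrogate = views.mean(axis=0)    # Monte Carlo estimate of E_T f_theta(T(x))

# Left side: -s(anchor, surrogate prototype), with s = cosine similarity.
lhs = -float(anchor @ surrogate / np.linalg.norm(surrogate))
# Right side: -E_T s(anchor, view); views and anchor are already unit-norm.
rhs = -float(np.mean(views @ anchor))

print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}, bound holds: {lhs <= rhs}")
```

The right-hand side only involves pairwise similarities between views of the same image, which is what makes it computable without labels.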

Repulsion Term Upper Bound (Theorem 4.6)

Under balanced dataset assumptions: $\max_{y' ≠ y} s(f_θ(t(x)), \mathbb{E}_{T',X'|y'}f_θ(T'(X'))) ≤ \mathbb{E}_{T'}\left[\frac{1}{να}\log\mathbb{E}_{X'}\exp(αs(f_θ(t(x)), f_θ(T'(X'))))\right] + \frac{1}{να}\log n$
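
The additive logarithmic term has the same shape as the standard log-sum-exp bound on a maximum. As a reminder of that generic inequality (a well-known fact stated here with $m$ arbitrary real numbers $a_1, \dots, a_m$; how it maps onto $n$ and $ν$ in Theorem 4.6 is left to the paper's appendix):

```latex
\max_{1 \le i \le m} a_i
\;\le\; \frac{1}{\alpha}\log\sum_{i=1}^{m} e^{\alpha a_i}
\;=\; \frac{1}{\alpha}\log\Bigl(\frac{1}{m}\sum_{i=1}^{m} e^{\alpha a_i}\Bigr) + \frac{1}{\alpha}\log m,
\qquad \alpha > 0 .
```

The first step holds because $e^{\alpha \max_i a_i}$ is one of the summands, and the second rewrites the sum as an average, which is the form that becomes an expectation.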

Total Loss Function

Combining the above upper bounds yields: $\tilde{l}(θ) = \frac{1}{α|\hat{T}|}\sum_{t' ∈ \hat{T}}\left[-\log\frac{\exp(αs(f_θ(t(x)), f_θ(t'(x))))}{\left(\sum_{x' ∈ \hat{X}}\exp(αs(f_θ(t(x)), f_θ(t'(x'))))\right)^{λ/ν}}\right]$
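
The formula above can be sketched in NumPy for a single anchor. This is a minimal illustration under stated assumptions, not the author's implementation: the batch $\hat{X}$ is assumed to include the anchor's own image at index 0, $ν$ is treated as a plain positive scalar, all representations are assumed L2-normalized (so dot products are cosine similarities), and the function name and default values are hypothetical.

```python
import numpy as np

def balanced_contrastive_loss(anchor, views, alpha=2.0, lam=1.0, nu=1.0):
    """Per-anchor loss following the displayed formula.

    anchor : unit vector of shape (d,), plays the role of f_theta(t(x))
    views  : unit vectors of shape (num_transforms, batch, d), where
             views[k, j] stands for f_theta(t'_k(x'_j)) and j = 0 is x itself
             (an indexing convention assumed for this sketch)
    """
    scaled = alpha * (views @ anchor)          # (num_transforms, batch)
    positive = scaled[:, 0]                    # alpha * s(anchor, t'_k(x))
    # Numerically stable log of sum_j exp(alpha * s(anchor, t'_k(x'_j))).
    peak = scaled.max(axis=1, keepdims=True)
    log_sum = peak[:, 0] + np.log(np.exp(scaled - peak).sum(axis=1))
    # -log( exp(positive) / sum^(lam/nu) ) = -positive + (lam/nu) * log_sum
    per_transform = -positive + (lam / nu) * log_sum
    return float(per_transform.mean() / alpha)

rng = np.random.default_rng(0)
views = rng.normal(size=(2, 8, 16))
views /= np.linalg.norm(views, axis=-1, keepdims=True)
anchor = rng.normal(size=16)
anchor /= np.linalg.norm(anchor)

# With lam / nu = 1 the bracket is an InfoNCE-style softmax cross-entropy.
loss_infonce_like = balanced_contrastive_loss(anchor, views, alpha=2.0, lam=1.0, nu=1.0)
# A larger lam / nu puts more weight on the repulsion (denominator) term.
loss_more_repulsion = balanced_contrastive_loss(anchor, views, alpha=2.0, lam=2.0, nu=1.0)
```

Writing the loss as `-positive + (lam / nu) * log_sum` makes the role of the exponent $λ/ν$ explicit: it rescales only the repulsion term, while the attraction term is unchanged.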

Technical Innovations

Theoretical Bridge: First establishes formal theoretical connections between supervised and self-supervised learning
Upper Bound Derivation: Obtains tractable upper bounds through rigorous mathematical derivation
Prototype Bias Analysis: Quantifies the bias introduced by self-supervised approximation and analyzes its impact
Balanced Loss Design: Proposes improved loss functions based on theoretical analysis

Experimental Setup

Datasets

Primary Dataset: ImageNet (1,281,167 training images, 50,000 validation images, 1,000 classes)
Supplementary Dataset: CIFAR-10 (50,000 training images, 10,000 test images, 10 classes)
Imbalanced Dataset: ImageNet-LT (115,846 images, following a Pareto distribution)

Evaluation Metrics

Linear Evaluation: Top-1 accuracy of a linear classifier trained on the frozen pretrained backbone
k-NN Evaluation: k-NN classification accuracy based on representation similarity
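
For readers unfamiliar with k-NN evaluation, here is a minimal NumPy sketch of the idea. The value of k, the unweighted majority vote, and the toy features are assumptions made for illustration; this summary does not specify the paper's exact protocol.

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=3):
    """Classify test features by majority vote among the k most
    cosine-similar training features (frozen representations)."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                          # (num_test, num_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]     # indices of the top-k neighbors
    neighbor_labels = train_labels[nearest]        # (num_test, k)
    num_classes = int(train_labels.max()) + 1
    votes = np.stack(
        [np.bincount(row, minlength=num_classes) for row in neighbor_labels]
    )
    return votes.argmax(axis=1)

# Toy features standing in for encoder outputs (hypothetical values).
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
test_feats = np.array([[1.0, 0.05], [0.05, 1.0]])
test_labels = np.array([0, 1])

preds = knn_predict(train_feats, train_labels, test_feats, k=3)
accuracy = float((preds == test_labels).mean())
```

Unlike linear evaluation, this requires no additional training, so it measures how well the representation space itself clusters by class.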

Comparison Methods

Baseline Methods: SimCLR and its variants
Loss Function Variants:
- Balanced contrastive loss
- Generalized NT-Xent loss
- Decoupled contrastive loss

Implementation Details

Network Architecture: ResNet-50 backbone + 3-layer MLP projector
Training Configuration: Batch size 512, 100 epochs, SGD optimizer
Data Augmentation: Random cropping, color distortion, grayscale conversion, Gaussian blur, horizontal flip

Experimental Results

Main Results

Theoretical Verification Experiments

Relationship Between Prototype Representation Bias and Performance:
- Baseline SimCLR: 65.98% accuracy, 36.72 bias
- Removing Gaussian blur: 64.57% accuracy, 37.43 bias
- Adding random rotation: 63.30% accuracy, 38.11 bias
- Finding: Lower prototype representation bias corresponds to higher accuracy
Impact of Similarity Measures:
- Cosine similarity + normalization: 65.98%
- Dot product (no normalization): 0.43%
- Negative Euclidean distance (no normalization): 10.63%
Impact of Data Balance:
- Uniform distribution: 20.82%
- Long-tail distribution: 13.65%

Balancing Parameter Experiments

ImageNet Results

Balanced Contrastive Loss: Best performance at (α=4, λ=2) reaches 67.40%
Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 66.85%
Performance Improvement: Balanced contrastive loss improves top-1 accuracy by approximately 1.5 percentage points over standard NT-Xent

CIFAR-10 Results

Balanced Contrastive Loss: Best performance at (α=1, λ=4) reaches 86.08%
Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 85.85%

Ablation Studies

Impact of Data Augmentation Strategies

Verifies theoretical predictions by adding/removing different transformations:

Removing color distortion: Performance drops to 62.56%
Adding random cutout: Performance drops slightly to 65.76%
Baseline configuration: 65.98%

Tightness Analysis of Upper Bounds

Attraction Term Upper Bound: Gap gradually decreases and stabilizes during training
Repulsion Term Upper Bound: Maintains larger but controllable gap compared to attraction term

Related Work

Contrastive Learning Losses

Historical Development: From the contrastive loss of Chopra et al. (2005) to the triplet loss and the InfoNCE loss
This Paper's Contribution: Provides new theoretical perspective based on supervised learning approximation

Self-supervised Learning Theory

Existing Perspectives:
- Mutual information maximization perspective
- Covariance learning unified perspective
- Spectral embedding learning perspective
This Paper's Innovation: First establishes explicit theoretical connections with supervised learning

Contrastive Learning Practice

Architecture Design: Siamese networks, momentum encoders, stop-gradient operations
Theoretical Explanation: This paper shows how components of its framework correspond to such established practices

Conclusions and Discussion

Main Conclusions

Theoretical Unification: Successfully establishes theoretical bridges between supervised and self-supervised learning
Practical Guidance: Provides theoretical explanations for common practices in contrastive learning
Method Improvement: Balanced contrastive loss proposed based on theoretical analysis achieves performance improvements

Limitations

Assumption Constraints: Theoretical analysis relies on assumptions such as cosine similarity, L2 normalization, and balanced datasets
Approximation Error: The bias introduced by self-supervised approximation requires further investigation
Experimental Scope: Primarily validated on image classification tasks; applicability to other domains remains to be explored

Future Directions

Theoretical Extensions: Relax existing assumptions and construct more general theoretical frameworks
Method Improvement: Design more effective self-supervised algorithms based on bias analysis
Application Extensions: Extend the theoretical framework to other modalities and tasks

In-Depth Evaluation

Strengths

Theoretical Contributions

Strong Novelty: First provides formal theoretical connections between supervised and self-supervised learning
Rigorous Derivation: Complete mathematical derivation with all proofs provided in appendices
Deep Insights: The prototype representation bias concept provides new perspectives for understanding self-supervised learning

Experimental Validation

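Reasonable Design: Each experiment targets a specific theoretical prediction (prototype representation bias, similarity measure, data balance, balancing parameters)
Convincing Results: Experimental results are consistent with the predictions; for example, lower prototype representation bias coincides with higher accuracy across the augmentation settings tested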
Comprehensive Analysis: Validates the theoretical framework from multiple perspectives

Practical Value

Method Improvement: Balanced contrastive loss yields measurable gains, reaching 67.40% on ImageNet and 86.08% on CIFAR-10 at its best settings
Guiding Significance: Provides theoretical guidance for self-supervised learning algorithm design
Reproducibility: Provides complete code and implementation details

Weaknesses

Theoretical Limitations

Strong Assumptions: Theoretical analysis relies on multiple restrictive assumptions that may limit applicability
Loose Approximations: The upper bounds are not tight; in particular, the repulsion-term bound retains a larger gap than the attraction-term bound
Generalization Verification Pending: Applicability of the theoretical framework in other domains remains insufficiently verified

Experimental Limitations

Limited Datasets: Primarily validated on ImageNet and CIFAR-10, lacking more diverse evaluation
Single Task: Focuses mainly on image classification; validation on other vision tasks is insufficient
Limited Comparison Methods: Primarily compares with SimCLR family methods, lacking comparisons with other self-supervised methods

Impact

Academic Contributions

Theoretical Foundations: Provides important theoretical foundations for the self-supervised learning field
Research Inspiration: May inspire more theoretical analysis work
Method Guidance: Provides theoretical guidance for subsequent algorithm design

Practical Value

Performance Improvement: Balanced contrastive loss outperforms the generalized NT-Xent loss at the best reported settings on both ImageNet and CIFAR-10
Design Principles: Provides algorithm design principles for practitioners
Hyperparameter Guidance: Provides theoretical basis for hyperparameter selection

Applicable Scenarios

Research Scenarios: Suitable for self-supervised learning algorithm research requiring theoretical guidance
Industrial Applications: Suitable for computer vision applications requiring high-quality representations
Educational Use: Suitable as teaching material for understanding self-supervised learning principles

References

This paper cites important works in self-supervised learning, contrastive learning, and representation learning, including:

Chen et al. (2020a): SimCLR framework
He et al. (2020): MoCo method
van den Oord et al. (2018): InfoNCE loss
Wang & Isola (2020): Alignment and uniformity analysis of contrastive learning

Overall Assessment: This is a high-quality theoretical analysis paper that successfully establishes theoretical bridges between supervised and self-supervised learning, providing important insights into understanding the effectiveness of contrastive learning. Although there are limitations in theoretical assumptions, its contributions are significant for advancing the theoretical development of self-supervised learning.