Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.
Understanding Self-supervised Contrastive Learning through Supervised Objectives
- Paper ID: 2510.10572
- Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives
- Author: Byeongchan Lee (KAIST)
- Classification: cs.LG (Machine Learning)
- Publication Venue: Transactions on Machine Learning Research (10/2025)
- Paper Link: https://arxiv.org/abs/2510.10572
Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. This paper provides a theoretical perspective by formulating self-supervised representation learning as an approximation of supervised representation learning objectives. Based on this formulation, the authors derive loss functions closely related to popular contrastive losses such as InfoNCE, providing insights into their underlying principles. The derivation naturally introduces the concepts of prototype representation bias and balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms.
- Missing Theoretical Understanding: Self-supervised learning has succeeded empirically, but its theoretical foundations remain incomplete, and there is no thorough account of why these methods work.
- Empirical Method Design: Existing self-supervised learning methods have advanced primarily through architectural innovations rather than formalized objectives, so their design has had little theoretical guidance.
- Unclear Relationship Between Supervised and Self-supervised Learning: The intrinsic connections between supervised and self-supervised learning have not been adequately elucidated.
- Building Theoretical Foundations: Provide solid theoretical foundations for self-supervised learning and explain the fundamental reasons for its effectiveness
- Guiding Method Improvement: Provide principled guidance for algorithm design through theoretical analysis
- Bridging Supervised and Self-supervised Learning: Establish theoretical connections between the two learning paradigms
- Theoretical Framework Construction: Proposes a theoretical framework that formulates self-supervised representation learning as an approximation of supervised representation learning, deriving contrastive loss functions closely related to InfoNCE
- Theoretical Insights: Provides theoretical explanations for common practices in contrastive learning, such as representation normalization and balanced dataset usage
- Concept Introduction: Introduces the concept of prototype representation bias and observes its correlation with downstream performance
- Method Improvement: Proposes balanced contrastive loss as a natural extension of InfoNCE loss, achieving better performance through improved balance
Defines the representation learning task as learning an encoder $f_\theta: \mathcal{X} \to \mathbb{R}^d \setminus \{0\}$ such that:
- Representations of images with the same visual concept cluster together
- Representations of images with different visual concepts separate from each other
First formulates supervised learning as a prototype optimization problem:
$$\min_\theta \; -s\big(f_\theta(t(x)),\, \mu_y\big) + \lambda \max_{y' \neq y} s\big(f_\theta(t(x)),\, \mu_{y'}\big)$$
where:
- $s(\cdot, \cdot)$ is a similarity measure (cosine similarity)
- $\mu_y$ is the prototype representation for label $y$
- $\lambda > 0$ is a balancing parameter
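The prototype objective above can be sketched numerically for a single representation. This is a minimal illustration, not the paper's implementation: the helper names `cosine_similarity` and `prototype_loss`, the toy prototypes, and the value of `lam` are all assumptions made for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between vector a (d,) and each row of b (k, d)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def prototype_loss(z, prototypes, y, lam=1.0):
    """-s(z, mu_y) + lam * max_{y' != y} s(z, mu_{y'}) for one representation z."""
    sims = cosine_similarity(z, prototypes)   # shape (num_classes,)
    attraction = -sims[y]                     # pull toward the correct prototype
    others = np.delete(sims, y)               # similarities to the other prototypes
    repulsion = lam * others.max()            # push away from the closest wrong one
    return attraction + repulsion

# Toy example (hypothetical values): three orthogonal prototypes in R^3.
prototypes = np.eye(3)
z = np.array([2.0, 0.0, 0.0])                 # aligned with prototype 0
loss_correct = prototype_loss(z, prototypes, y=0, lam=1.0)
loss_wrong = prototype_loss(z, prototypes, y=1, lam=1.0)
```

In this toy setting the loss is lowest when the representation aligns with its own prototype and is orthogonal to the others, which matches the clustering and separation goals stated earlier.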
Defines prototype representation as the expectation of representations for images with the same label:
$$\hat{\mu}_y := \mathbb{E}_{T,\, X \mid y}\, f_\theta(T(X))$$
In the self-supervised setting, uses a surrogate prototype representation:
$$\tilde{\mu} := \mathbb{E}_{T}\, f_\theta(T(x))$$
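Both prototypes are expectations, so in practice they would be estimated by averaging encoder outputs over sampled transformations. The sketch below shows one way to do that with Monte Carlo averages. The linear `encoder`, the noise-based `random_transform`, and all sizes are hypothetical stand-ins for the real network and augmentation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                   # hypothetical encoder weights

def encoder(x):
    """Toy stand-in for f_theta: linear map followed by L2 normalization."""
    z = x @ W.T
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def random_transform(x, rng, scale=0.1):
    """Toy stand-in for a sampled augmentation T: additive Gaussian noise."""
    return x + scale * rng.normal(size=x.shape)

def surrogate_prototype(x, rng, num_views=64):
    """Monte Carlo estimate of the surrogate prototype E_T f_theta(T(x)) for one sample."""
    views = np.stack([random_transform(x, rng) for _ in range(num_views)])
    return encoder(views).mean(axis=0)

def class_prototype(xs, rng, num_views=64):
    """Monte Carlo estimate of E_{T, X|y} f_theta(T(X)) over samples xs sharing a label."""
    return np.mean([surrogate_prototype(x, rng, num_views) for x in xs], axis=0)

xs_same_label = rng.normal(size=(5, 8)) + 3.0  # hypothetical samples with one label
mu_tilde = surrogate_prototype(xs_same_label[0], rng)
mu_hat = class_prototype(xs_same_label, rng)
```

The surrogate prototype needs only a single unlabeled sample and its augmented views, whereas the class prototype needs every sample that shares the label, which is the information the self-supervised setting lacks.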
Under cosine similarity and L2 normalization assumptions:
$$-s\big(f_\theta(t(x)),\, \mathbb{E}_{T} f_\theta(T(x))\big) \le -\mathbb{E}_{T}\, s\big(f_\theta(t(x)),\, f_\theta(T(x))\big)$$
Under balanced dataset assumptions:
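$$\max_{y' \neq y} s\big(f_\theta(t(x)),\, \mathbb{E}_{T',\, X' \mid y'} f_\theta(T'(X'))\big) \le \mathbb{E}_{T'}\left[\frac{1}{\nu\alpha} \log \mathbb{E}_{X'} \exp\big(\alpha\, s(f_\theta(t(x)),\, f_\theta(T'(X')))\big)\right] + \frac{1}{\nu\alpha} \log n$$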
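The right-hand side of this bound has the shape of a log-sum-exp smooth maximum. The sketch below checks only that standard ingredient: for any finite set of similarities, the maximum is at most $\frac{1}{\alpha}\log\big(\text{mean of } \exp(\alpha s_i)\big) + \frac{1}{\alpha}\log n$. It assumes $\nu = 1$ and takes $n$ to be the number of terms being compared. It does not reproduce the paper's full argument, and the function name is hypothetical.

```python
import numpy as np

def smooth_max_upper_bound(sims, alpha):
    """(1/alpha) * log(mean(exp(alpha * sims))) + (1/alpha) * log(n), assuming nu = 1."""
    n = sims.shape[0]
    scaled = alpha * sims
    shift = scaled.max()                      # subtract the max for numerical stability
    log_mean_exp = np.log(np.mean(np.exp(scaled - shift))) + shift
    return (log_mean_exp + np.log(n)) / alpha

rng = np.random.default_rng(0)
sims = rng.uniform(-1.0, 1.0, size=16)        # hypothetical cosine similarities
bounds = {alpha: smooth_max_upper_bound(sims, alpha) for alpha in (1.0, 4.0, 16.0)}
```

Larger values of `alpha` tighten the bound toward the true maximum, since the gap is at most `log(n) / alpha`.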
Combining the above upper bounds yields:
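$$\tilde{l}(\theta) = \frac{1}{\alpha |\hat{T}|} \sum_{t' \in \hat{T}} \left[ -\log \frac{\exp\big(\alpha\, s(f_\theta(t(x)),\, f_\theta(t'(x)))\big)}{\Big(\sum_{x' \in \hat{X}} \exp\big(\alpha\, s(f_\theta(t(x)),\, f_\theta(t'(x')))\big)\Big)^{\lambda/\nu}} \right]$$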
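The combined loss can be written as a short batch computation. The following is a NumPy sketch under stated assumptions, not the authors' code: the function name, the array layout, and the default `nu=1.0` are illustrative choices, and the sum over the batch is taken to include the positive sample itself. The defaults for `alpha` and `lam` only echo one of the settings reported in the experiments.

```python
import numpy as np

def l2_normalize(z):
    """Scale vectors along the last axis to unit L2 norm."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def balanced_contrastive_loss(anchors, views, alpha=4.0, lam=2.0, nu=1.0):
    """
    anchors: (B, d) array, row i is f(t(x_i)).
    views:   (V, B, d) array, views[v, j] is f(t'_v(x_j)) for V sampled transformations.
    Returns the per-anchor loss, shape (B,).
    """
    anchors = l2_normalize(anchors)
    views = l2_normalize(views)
    # sims[v, i, j]: cosine similarity between anchor i and view v of sample j
    sims = np.einsum("id,vjd->vij", anchors, views)
    logits = alpha * sims
    positives = np.diagonal(logits, axis1=1, axis2=2)          # (V, B), same-sample pairs
    shift = logits.max(axis=2, keepdims=True)                  # stabilize the exponentials
    log_sum_exp = np.log(np.exp(logits - shift).sum(axis=2)) + shift[..., 0]
    per_view = -positives + (lam / nu) * log_sum_exp           # attraction + weighted repulsion
    return per_view.mean(axis=0) / alpha                       # average over t', scale by 1/alpha

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))                             # hypothetical batch of 8
views = rng.normal(size=(2, 8, 16))                            # two views per sample
losses = balanced_contrastive_loss(anchors, views, alpha=4.0, lam=2.0, nu=1.0)
infonce_like = balanced_contrastive_loss(anchors, views, alpha=1.0, lam=1.0, nu=1.0)
```

When `lam / nu` equals 1, each term reduces to a softmax cross-entropy on the positive pair, which is the InfoNCE form with `alpha` acting as an inverse temperature. Other values of the ratio reweight repulsion relative to attraction.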
- Theoretical Bridge: Is the first to establish formal theoretical connections between supervised and self-supervised learning
- Upper Bound Derivation: Obtains tractable upper bounds through rigorous mathematical derivation
- Prototype Bias Analysis: Quantifies the bias introduced by self-supervised approximation and analyzes its impact
- Balanced Loss Design: Proposes improved loss functions based on theoretical analysis
- Primary Dataset: ImageNet (1,281,167 training images, 50,000 validation images, 1,000 classes)
- Supplementary Dataset: CIFAR-10 (50,000 training images, 10,000 test images, 10 classes)
- Imbalanced Dataset: ImageNet-LT (115,846 images, following a Pareto distribution)
- Linear Evaluation: Top-1 accuracy of a linear classifier trained on the frozen pretrained backbone
- k-NN Evaluation: k-NN classification accuracy based on representation similarity
- Baseline Methods: SimCLR and its variants
- Loss Function Variants:
- Balanced contrastive loss
- Generalized NT-Xent loss
- Decoupled contrastive loss
- Network Architecture: ResNet-50 backbone + 3-layer MLP projector
- Training Configuration: Batch size 512, 100 epochs, SGD optimizer
- Data Augmentation: Random cropping, color distortion, grayscale conversion, Gaussian blur, horizontal flip
- Relationship Between Prototype Representation Bias and Performance:
- Baseline SimCLR: 65.98% accuracy, 36.72 bias
- Removing Gaussian blur: 64.57% accuracy, 37.43 bias
- Adding random rotation: 63.30% accuracy, 38.11 bias
- Finding: Lower prototype representation bias corresponds to higher accuracy
- Impact of Similarity Measures:
- Cosine similarity + normalization: 65.98%
- Dot product (no normalization): 0.43%
- Negative Euclidean distance (no normalization): 10.63%
- Impact of Data Balance:
- Uniform distribution: 20.82%
- Long-tail distribution: 13.65%
- Balanced Contrastive Loss: Best performance at (α=4, λ=2) reaches 67.40%
- Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 66.85%
- Performance Improvement: Balanced contrastive loss improves by approximately 1.5 percentage points over standard NT-Xent
- Balanced Contrastive Loss: Best performance at (α=1, λ=4) reaches 86.08%
- Generalized NT-Xent Loss: Best performance at (α=2, λ=2) reaches 85.85%
Verifies theoretical predictions by adding/removing different transformations:
- Removing color distortion: Performance drops to 62.56%
- Adding random cutout: Performance drops slightly to 65.76%
- Baseline configuration: 65.98%
- Attraction Term Upper Bound: Gap gradually decreases and stabilizes during training
- Repulsion Term Upper Bound: Maintains larger but controllable gap compared to attraction term
- Historical Development: From the contrastive loss of Chopra et al. (2005) to the triplet loss and the InfoNCE loss
- This Paper's Contribution: Provides new theoretical perspective based on supervised learning approximation
- Existing Perspectives:
- Mutual information maximization perspective
- Covariance learning unified perspective
- Spectral embedding learning perspective
- This Paper's Innovation: Is the first to establish explicit theoretical connections with supervised learning
- Architecture Design: Siamese networks, momentum encoders, stop-gradient operations
- Theoretical Explanation: This paper provides theoretical foundations for these practices
- Theoretical Unification: Successfully establishes theoretical bridges between supervised and self-supervised learning
- Practical Guidance: Provides theoretical explanations for common practices in contrastive learning
- Method Improvement: Balanced contrastive loss proposed based on theoretical analysis achieves performance improvements
- Assumption Constraints: Theoretical analysis relies on assumptions such as cosine similarity, L2 normalization, and balanced datasets
- Approximation Error: The bias introduced by self-supervised approximation requires further investigation
- Experimental Scope: Primarily validated on image classification tasks; applicability to other domains remains to be explored
- Theoretical Extensions: Relax existing assumptions and construct more general theoretical frameworks
- Method Improvement: Design more effective self-supervised algorithms based on bias analysis
- Application Extensions: Extend the theoretical framework to other modalities and tasks
- Strong Novelty: Is the first to provide formal theoretical connections between supervised and self-supervised learning
- Rigorous Derivation: Complete mathematical derivation with all proofs provided in appendices
- Deep Insights: The prototype representation bias concept provides new perspectives for understanding self-supervised learning
- Reasonable Design: Experiments are designed around the theoretical predictions and test them directly
- Convincing Results: Theoretical predictions are highly consistent with experimental results
- Comprehensive Analysis: Validates the theoretical framework from multiple perspectives
- Method Improvement: Balanced contrastive loss achieves actual performance improvements
- Guiding Significance: Provides theoretical guidance for self-supervised learning algorithm design
- Reproducibility: Provides complete code and implementation details
- Strong Assumptions: Theoretical analysis relies on multiple restrictive assumptions that may limit applicability
- Rough Approximations: Some theoretical derivations may introduce significant approximation errors
- Generalization Verification Pending: Applicability of the theoretical framework in other domains remains insufficiently verified
- Limited Datasets: Primarily validated on ImageNet and CIFAR-10, lacking more diverse evaluation
- Single Task: Focuses mainly on image classification; validation on other vision tasks is insufficient
- Limited Comparison Methods: Primarily compares with SimCLR family methods, lacking comparisons with other self-supervised methods
- Theoretical Foundations: Provides important theoretical foundations for the self-supervised learning field
- Research Inspiration: May inspire more theoretical analysis work
- Method Guidance: Provides theoretical guidance for subsequent algorithm design
- Performance Improvement: Balanced contrastive loss achieves actual performance improvements
- Design Principles: Provides algorithm design principles for practitioners
- Hyperparameter Guidance: Provides theoretical basis for hyperparameter selection
- Research Scenarios: Suitable for self-supervised learning algorithm research requiring theoretical guidance
- Industrial Applications: Suitable for computer vision applications requiring high-quality representations
- Educational Use: Suitable as teaching material for understanding self-supervised learning principles
This paper cites important works in self-supervised learning, contrastive learning, and representation learning, including:
- Chen et al. (2020a): SimCLR framework
- He et al. (2020): MoCo method
- Oord et al. (2018): InfoNCE loss
- Wang & Isola (2020): Alignment and uniformity analysis of contrastive learning
Overall Assessment: This is a high-quality theoretical analysis paper that successfully establishes theoretical bridges between supervised and self-supervised learning, providing important insights into understanding the effectiveness of contrastive learning. Although there are limitations in theoretical assumptions, its contributions are significant for advancing the theoretical development of self-supervised learning.