
On the Alignment Between Supervised and Self-Supervised Contrastive Learning

Luthra, Mishra, Galanti

Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives? We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at https://github.com/DLFundamentals/understanding_ssl_v2 (code) and https://dlfundamentals.github.io/cl-nscl-representation-alignment/ (project page).


Basic Information

Paper ID: 2510.08852
Title: On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Authors: Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti (Texas A&M University)
Category: cs.LG
Publication Date: October 9, 2025 (Preprint)
Paper Link: https://arxiv.org/abs/2510.08852v1

Abstract

Self-supervised contrastive learning (CL) has achieved remarkable empirical success, typically producing representations comparable to those of supervised pretraining. Recent theoretical work explains this by showing that, as the number of classes grows, the CL loss closely approximates a supervised surrogate, the Negatives-Only Supervised Contrastive Learning (NSCL) loss. However, this loss-level similarity leaves an open question: do CL and NSCL remain aligned at the representation level throughout training, not merely in their objectives?

This paper addresses the question by analyzing representation alignment between CL and NSCL models trained under shared randomness (identical initialization, batches, and data augmentations). It shows that their induced representations remain similar: specifically, it proves that under realistic conditions the similarity matrices of CL and NSCL stay close. The bounds provide high-probability guarantees for alignment metrics such as Centered Kernel Alignment (CKA) and Representational Similarity Analysis (RSA), and they clarify how alignment improves with more classes and higher temperatures and how it depends on batch size.

Research Background and Motivation

Core Problem

The paper asks: do self-supervised contrastive learning (CL) and Negatives-Only Supervised Contrastive Learning (NSCL) remain aligned at the representation level throughout training?

Research Motivation

Gap Between Empirical Success and Theoretical Explanation: CL performs well in practice, but it remains unclear why it learns features aligned with semantic class boundaries
Insufficiency of Loss-Level Similarity: Prior work (Luthra et al., 2025) proved only that the CL and NSCL losses are close, which does not guarantee that the two optimization trajectories stay close
Importance of Representation Alignment: Loss-level similarity cannot ensure that parameters and representations remain coupled during training, as they may diverge due to differences in curvature, gradient noise, or learning rate scheduling

Limitations of Existing Approaches

Mutual Information Maximization Perspective: Early theory linked CL to maximizing mutual information between views, but enforcing tight mutual-information bounds can reduce downstream performance
Alignment and Uniformity: While geometrically intuitive, these criteria cannot fully explain how different semantic classes are organized under CL training
Cluster Recovery Theory: Most results rely on restrictive assumptions, such as conditional independence of augmentations given cluster identity

Core Contributions

Theoretical Contributions:
- Proves that under shared randomness, similarity matrices of CL and NSCL remain close during training
- Provides high-probability lower bounds for CKA and RSA alignment metrics
- Reveals how alignment varies with number of classes, temperature parameter, and batch size
Methodological Innovation:
- Shifts analysis from parameter space to representation space, avoiding inherent instability of parameter space coupling
- Establishes "similarity descent" proxy dynamics that faithfully track similarity evolution induced by parameter space SGD
Experimental Validation:
- Validates theoretical predictions on multiple datasets
- Demonstrates that NSCL is closer to CL than other supervised methods
- Confirms alignment enhancement with scale and temperature

Detailed Methodology

Task Definition

Consider a class-balanced dataset $S = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times [C]$ in which each class has $n$ samples (so $N = Cn$). An encoder $f_w: \mathcal{X} \to \mathbb{R}^d$ maps inputs to embeddings.

Core Method: Similarity Space Analysis

1. Similarity Matrix Dynamics

Let $\Sigma_t \in [-1,1]^{N \times N}$ be the pairwise similarity matrix of a fixed reference set at step $t$. The analysis tracks the coupled evolution of the CL and NSCL similarity matrices, $\Sigma^{CL}_t, \Sigma^{NSCL}_t \in [-1,1]^{N \times N}$.
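To make the object $\Sigma_t$ concrete, here is a minimal NumPy sketch. It assumes the similarity is cosine similarity between embeddings of the reference set, which is consistent with entries in $[-1,1]$ but is an assumption of this sketch; the function names are hypothetical.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise cosine similarities of an (N, d) array of non-zero embeddings."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.clip(z @ z.T, -1.0, 1.0)

def similarity_gap(emb_cl, emb_nscl):
    """Frobenius distance between the two models' similarity matrices."""
    return float(np.linalg.norm(similarity_matrix(emb_cl) - similarity_matrix(emb_nscl)))

# Hypothetical embeddings of the same 6 reference points under two encoders.
rng = np.random.default_rng(0)
emb_cl = rng.normal(size=(6, 4))
emb_nscl = emb_cl + 0.01 * rng.normal(size=(6, 4))
gap = similarity_gap(emb_cl, emb_nscl)
```

The gap computed this way is the quantity that the coupling result bounds, evaluated on one fixed reference set.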

2. Proxy Similarity Descent

For the realized minibatch $B_t = \{(x_j, x'_j, y_j)\}_{j=1}^B$, define the batch gradient maps: $G^{CL}_t := \nabla_\Sigma \bar{\ell}^{CL}_{B_t}(\Sigma^{CL}_t), \quad G^{NSCL}_t := \nabla_\Sigma \bar{\ell}^{NSCL}_{B_t}(\Sigma^{NSCL}_t)$

Proxy updates: $\Sigma^{CL}_{t+1} = \Sigma^{CL}_t - \eta_t G^{CL}_t, \quad \Sigma^{NSCL}_{t+1} = \Sigma^{NSCL}_t - \eta_t G^{NSCL}_t$
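The coupled proxy updates can be sketched as a loop that applies the same step sizes to two similarity matrices starting from the same initialization. In the sketch below the gradient maps are toy quadratic stand-ins (gradients of $\frac{1}{2}\|\Sigma - M\|_F^2$ for two nearby hypothetical targets), not the CL and NSCL batch losses; only the update rule itself comes from the text above.

```python
import numpy as np

def similarity_descent(sigma0, grad_fn, step_sizes):
    """Iterate Sigma_{t+1} = Sigma_t - eta_t * G_t with G_t = grad_fn(Sigma_t)."""
    sigma = sigma0.copy()
    for eta in step_sizes:
        sigma = sigma - eta * grad_fn(sigma)
    return sigma

# Toy stand-ins for the two batch gradient maps (hypothetical targets).
rng = np.random.default_rng(0)
target_cl = rng.uniform(-1, 1, size=(5, 5))
target_nscl = target_cl + 0.05 * rng.uniform(-1, 1, size=(5, 5))
grad_cl = lambda s: s - target_cl
grad_nscl = lambda s: s - target_nscl

# Shared randomness: same initialization and same step-size sequence.
sigma0 = np.eye(5)
etas = [0.1] * 50
sigma_cl = similarity_descent(sigma0, grad_cl, etas)
sigma_nscl = similarity_descent(sigma0, grad_nscl, etas)
gap = np.linalg.norm(sigma_cl - sigma_nscl)
```

In this toy case the gap between the two trajectories is driven entirely by the difference between the two gradient maps, which mirrors the role of the per-step gradient discrepancy in the coupling argument.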

Main Theoretical Results

Theorem 1: Similarity Space Coupling

With probability at least $1-\delta$, for any step-size sequence $(\eta_t)_{t=0}^{T-1}$: $\|\Sigma^{CL}_T - \Sigma^{NSCL}_T\|_F \leq \exp\left(\frac{1}{2\tau^2 B}\sum_{t=0}^{T-1}\eta_t\right) \frac{1}{\tau\sqrt{B}}\left(\sum_{t=0}^{T-1}\eta_t\right)\Delta_{C,\delta}(B;\tau)$

where $\Delta_{C,\delta}(B;\tau) = \frac{2e^{2/\tau}(\frac{1}{C}+\epsilon_{B,\delta})}{1-\frac{1}{C}-\epsilon_{B,\delta}}$ and $\epsilon_{B,\delta} = \sqrt{\frac{1}{2B}\log(\frac{TB}{\delta})}$.
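To get a feel for how the right-hand side of Theorem 1 depends on $C$, $\tau$, and $B$, the bound can be evaluated numerically. The sketch below transcribes the formula as stated above; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math

def epsilon(B, T, delta):
    """Concentration term eps_{B,delta} = sqrt(log(T*B/delta) / (2B))."""
    return math.sqrt(math.log(T * B / delta) / (2 * B))

def delta_term(C, B, T, delta, tau):
    """Delta_{C,delta}(B; tau) as written in Theorem 1."""
    eps = epsilon(B, T, delta)
    denom = 1 - 1 / C - eps
    if denom <= 0:
        raise ValueError("bound is vacuous: need 1/C + eps < 1")
    return 2 * math.exp(2 / tau) * (1 / C + eps) / denom

def coupling_bound(etas, C, B, delta, tau):
    """Right-hand side of the Frobenius-norm bound for step sizes `etas`."""
    T = len(etas)
    total = sum(etas)
    growth = math.exp(total / (2 * tau**2 * B))
    return growth * total / (tau * math.sqrt(B)) * delta_term(C, B, T, delta, tau)

# Illustrative values only (not the paper's experimental settings).
etas = [0.1] * 100
few_classes = coupling_bound(etas, C=10, B=1024, delta=0.05, tau=0.5)
many_classes = coupling_bound(etas, C=1000, B=1024, delta=0.05, tau=0.5)
low_temp = coupling_bound(etas, C=100, B=1024, delta=0.05, tau=0.1)
high_temp = coupling_bound(etas, C=100, B=1024, delta=0.05, tau=1.0)
```

Comparing `few_classes` with `many_classes`, and `low_temp` with `high_temp`, shows the bound tightening with more classes and higher temperature.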

CKA and RSA Lower Bounds

Corollary 1 (CKA Lower Bound): Under the setting of Theorem 1, with probability at least $1-\delta$ : $CKA_T \geq \frac{1-\rho_T}{1+\rho_T}$

Corollary 2 (RSA Lower Bound): Similarly: $RSA_T \geq \frac{1-r_T}{1+r_T}$
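As a purely numerical illustration of the shape of these lower bounds (the values of $\rho_T$ below are hypothetical, not taken from the paper):

$$\rho_T = 0.05 \;\Rightarrow\; CKA_T \geq \frac{0.95}{1.05} \approx 0.905, \qquad \rho_T = 0.1 \;\Rightarrow\; CKA_T \geq \frac{0.9}{1.1} \approx 0.818, \qquad \rho_T = 0.5 \;\Rightarrow\; CKA_T \geq \frac{0.5}{1.5} \approx 0.333$$

The map $\rho \mapsto \frac{1-\rho}{1+\rho}$ equals 1 at $\rho = 0$ and decreases monotonically to 0 at $\rho = 1$, so the guarantee is informative only when $\rho_T$ is small. The same reading applies to $r_T$ in the RSA bound.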

Technical Innovations

From Parameter Space to Representation Space: Avoids exponential divergence issues in parameter space
Block Orthogonality Exploitation: Leverages orthogonality of gradients from different anchors to simplify analysis
Temperature-Modulated Stability: The $\frac{1}{\tau^2 B}$ term in the exponential factor makes similarity space more stable than parameter space

Experimental Setup

Datasets

CIFAR-10/100: 50,000 training images, 10,000 validation images
Mini-ImageNet: 100-class subset of ImageNet-1K
Tiny-ImageNet: 100,000 64×64 images, 200 classes
ImageNet-1K: Full ImageNet dataset

Evaluation Metrics

Linear CKA (Centered Kernel Alignment): Normalized Frobenius inner product of centered similarity matrices
RSA (Representational Similarity Analysis): Pearson correlation of off-diagonal elements of representation dissimilarity matrices
Nearest Class Center Classifier (NCCC) and Linear Probe (LP) accuracy
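The two alignment metrics can be sketched directly from the descriptions above. This is a minimal NumPy version under stated assumptions: both functions take two $N \times N$ similarity matrices computed on the same reference set, and the dissimilarity matrix for RSA is taken to be $1 - \Sigma$, which is a choice made for this sketch rather than something specified here.

```python
import numpy as np

def linear_cka(K, L):
    """Normalized Frobenius inner product of the centered matrices HKH and HLH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    return float(np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))

def rsa(K, L):
    """Pearson correlation of off-diagonal dissimilarities (upper triangle)."""
    iu = np.triu_indices(K.shape[0], k=1)
    d1, d2 = 1.0 - K[iu], 1.0 - L[iu]             # assumed dissimilarity: 1 - similarity
    return float(np.corrcoef(d1, d2)[0, 1])

# Hypothetical embeddings of 10 reference points from two models.
rng = np.random.default_rng(0)
Za = rng.normal(size=(10, 4))
Zb = Za + 0.1 * rng.normal(size=(10, 4))
Ka, Kb = Za @ Za.T, Zb @ Zb.T
cka_value, rsa_value = linear_cka(Ka, Kb), rsa(Ka, Kb)
```

Both metrics equal 1 when the two matrices coincide, which is why a small Frobenius gap between $\Sigma^{CL}_T$ and $\Sigma^{NSCL}_T$ translates into values close to 1.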

Comparison Methods

NSCL: Negatives-Only Supervised Contrastive Learning
SCL: Supervised contrastive learning (Khosla et al., 2020)
CE: Cross-entropy loss
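The difference between CL and NSCL can be illustrated on a single batch. The sketch below assumes an InfoNCE-style loss with two views per image and cosine similarity, and treats NSCL as the same loss with same-class samples removed from the negatives; the exact normalization (for example whether the positive appears in the denominator) is an assumption of this sketch and may differ from the paper's definition.

```python
import numpy as np

def contrastive_losses(z1, z2, labels, tau=0.5):
    """Return (CL loss, NSCL loss) for one batch of two augmented views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    B = z1.shape[0]
    z = np.concatenate([z1, z2])                 # (2B, d)
    y = np.concatenate([labels, labels])
    logits = z @ z.T / tau                       # temperature-scaled cosine similarities
    idx = np.arange(2 * B)
    pos = (idx + B) % (2 * B)                    # index of the other view of the same image
    pos_logit = logits[idx, pos]

    # CL: every other embedding in the batch appears in the denominator.
    cl_mask = ~np.eye(2 * B, dtype=bool)
    # NSCL: same-class embeddings are dropped, except the positive itself.
    nscl_mask = y[:, None] != y[None, :]
    nscl_mask[idx, pos] = True

    def loss(mask):
        masked = np.where(mask, logits, -np.inf)
        m = masked.max(axis=1, keepdims=True)    # stabilize the log-sum-exp
        lse = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
        return float(np.mean(lse - pos_logit))

    return loss(cl_mask), loss(nscl_mask)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
cl, nscl = contrastive_losses(z1, z2, np.array([0, 0, 1, 1, 2, 2]))
```

Because the NSCL denominator is a subset of the CL denominator, the NSCL value never exceeds the CL value, and the two coincide when no two images in the batch share a label. With many classes, same-class collisions in a batch become rare, which is the intuition behind the loss-level closeness.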

Implementation Details

Architecture: ResNet-50 encoder + two-layer MLP projection head
Optimizer: LARS optimizer, momentum 0.9, weight decay 1e-6
Batch Size: 1024
Learning Rate: Base learning rate 0.3, scaled with batch size
Training Strategy: 10 epochs warmup + cosine learning rate schedule
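A schedule of this kind can be sketched as follows. The linear scaling rule `base_lr * batch_size / 256` and the per-epoch (rather than per-step) granularity are assumptions of this sketch, borrowed from common SimCLR-style setups; the text above only states that the base rate 0.3 is scaled with batch size.

```python
import math

def learning_rate(epoch, total_epochs, batch_size=1024, base_lr=0.3, warmup_epochs=10):
    """Linear warmup to the peak rate, then cosine decay toward zero."""
    peak = base_lr * batch_size / 256            # assumed linear scaling rule
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))

# Rates for a hypothetical 100-epoch run.
schedule = [learning_rate(e, total_epochs=100) for e in range(100)]
```

In practice the rate would typically be updated every optimizer step; the per-epoch version keeps the shape of the schedule easy to inspect.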

Experimental Results

Main Results

1. Alignment Comparison Across Supervision Methods

NSCL consistently achieves the highest alignment with CL across all datasets:

Tiny-ImageNet: CL-NSCL CKA reaches 0.87 after 1000 epochs, while CL-SCL CKA reaches only 0.043
Alignment Ranking: NSCL > CE > SCL

2. Impact of Number of Classes on Alignment

Validates theoretical predictions: more classes lead to stronger CL-NSCL alignment

Across all datasets, RSA and CKA values monotonically increase with training class count $C'$
Complete validation from 2 to 1000 classes on ImageNet-1K

3. Impact of Temperature Parameter

Higher temperature improves alignment, in line with the theoretical analysis:

Highest alignment at $\tau = 1.0$
Progressively decreases at $\tau = 0.5$ and $\tau = 0.1$
Consistent trend across all datasets

4. Impact of Batch Size

How alignment changes with batch size depends on the learning-rate scaling rule:

O(B) Scaling: Alignment decreases with batch size
O(√B), O(∜B), O(1) Scaling: Alignment increases with batch size
Results consistent with theoretical bound dependencies
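One way to read these trends off Theorem 1, as a rough sketch: assume a constant step size $\eta_t = \eta_0 B^{\alpha}$ and a fixed number of steps $T$ (both are simplifying assumptions made here, not settings taken from the experiments). The two step-size-dependent factors of the bound then become

$$\exp\left(\frac{1}{2\tau^2 B}\sum_{t=0}^{T-1}\eta_t\right) = \exp\left(\frac{T\eta_0}{2\tau^2} B^{\alpha-1}\right), \qquad \frac{1}{\tau\sqrt{B}}\sum_{t=0}^{T-1}\eta_t = \frac{T\eta_0}{\tau} B^{\alpha-\frac{1}{2}}$$

With $O(B)$ scaling ($\alpha = 1$) the exponential factor is constant in $B$ while the second factor grows like $\sqrt{B}$. With $O(\sqrt{B})$ scaling ($\alpha = \frac{1}{2}$) the second factor is constant and the exponent shrinks, and with $\alpha = \frac{1}{4}$ or $\alpha = 0$ both factors shrink. In every case $\epsilon_{B,\delta}$ decreases with $B$, so $\Delta_{C,\delta}(B;\tau)$ decreases toward a floor set by $\frac{1}{C}$. Under these assumptions the bound can therefore loosen with batch size only for $O(B)$ scaling, which matches the direction of the reported trends.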

Parameter Space vs Representation Space

Weight Space: Parameters of CL and supervised methods diverge rapidly
Representation Space: CKA and RSA maintain high alignment (>0.8)
Demonstrates stability of representation alignment contrasted with parameter divergence
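A toy example, offered as an illustration rather than the paper's argument, shows why weight distance and representation similarity can decouple: rotating the output of a linear encoder by an orthogonal matrix changes the weights but leaves every pairwise inner product, and hence the similarity matrix, unchanged. All names and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                      # 8 inputs with 5 features
W = rng.normal(size=(5, 3))                      # linear encoder weights

# A fixed 90-degree rotation in the first two embedding coordinates.
Q = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
W_rot = W @ Q                                    # different weights, rotated embeddings

weight_gap = np.linalg.norm(W - W_rot)           # distance in parameter space
gram = lambda Z: Z @ Z.T
gram_gap = np.linalg.norm(gram(X @ W) - gram(X @ W_rot))  # distance in similarity space
```

Here `weight_gap` is clearly non-zero while `gram_gap` is zero up to floating-point error, so metrics built on similarity matrices, such as CKA and RSA, cannot distinguish the two encoders.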

Downstream Task Performance

Accuracy (%), reported as NCCC / LP for each training method:

Dataset	CL (NCCC/LP)	NSCL (NCCC/LP)	SCL (NCCC/LP)	CE (NCCC/LP)
CIFAR-10	88.37/90.16	94.47/94.09	94.93/94.67	92.97/93.39
CIFAR-100	54.62/65.65	60.14/68.38	64.06/69.52	67.35/68.04
Mini-ImageNet	60.78/65.30	63.92/72.60	74.78/76.00	75.20/74.00
Tiny-ImageNet	40.59/44.61	40.76/45.79	48.63/48.73	48.28/52.57
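For readers unfamiliar with the NCCC metric in the table above, a nearest class-center classifier can be sketched in a few lines. This version assigns each test embedding to the class whose mean training embedding is closest in Euclidean distance; the distance choice and the function name are assumptions of this sketch.

```python
import numpy as np

def nccc_predict(train_z, train_y, test_z):
    """Predict the class whose mean training embedding is nearest to each test point."""
    classes = np.unique(train_y)
    centers = np.stack([train_z[train_y == c].mean(axis=0) for c in classes])
    # Distances from every test embedding to every class center: shape (n_test, n_classes).
    dists = np.linalg.norm(test_z[:, None, :] - centers[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Tiny hypothetical example with two well-separated classes.
train_z = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
train_y = np.array([0, 0, 1, 1])
test_z = np.array([[0.2, 0.3], [9.0, 10.0]])
pred = nccc_predict(train_z, train_y, test_z)
accuracy = float(np.mean(pred == np.array([0, 1])))
```

Unlike a linear probe, this classifier has no trainable parameters, so its accuracy reflects how tightly the frozen embeddings cluster around their class means.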

Related Work

Contrastive Learning Theory

Mutual Information Perspective: Early work linked CL to mutual information maximization, but enforcing tight mutual-information bounds can hurt performance
Geometric Perspective: Alignment and uniformity properties, but cannot fully explain semantic class organization
Cluster Recovery: Most depend on restrictive assumptions, such as conditional independence

Supervised Learning Connections

Linear Models: Self-supervised objectives like VicReg align with supervised quadratic loss
Label-Free Bounds: This work builds on Luthra et al. (2025), who established an explicit coupling between the CL and NSCL losses

Other Theoretical Studies

Other lines of work study feature learning dynamics, the role of augmentations, the projection head, and sample complexity.

Conclusions and Discussion

Main Conclusions

Stability of Representation Alignment: CL and NSCL maintain tight coupling in representation space despite potential parameter divergence
Consistency Between Theory and Practice: Experiments validate theoretical predictions of class count, temperature, and batch size effects
NSCL as Bridge: NSCL tracks CL better than other supervised methods, serving as a principled bridge between self-supervised and supervised learning

Limitations

Tightness of Bounds: Theoretical bounds may be overly loose for large-scale, long-training scenarios
Worst-Case Analysis: Uses uniform high-probability concentration bounds, favoring generality over tightness
Exponential Factor: The exponential factor may invalidate bounds for large-scale training beyond the first few epochs

Future Directions

Tighter Bounds: Exploit data-dependent structure rather than worst-case bounds
Extension to Other SSL Paradigms: Extend framework to non-contrastive methods
Practical Improvements: Improve guarantee practicality while maintaining stability

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: First to establish rigorous theoretical guarantees for CL-NSCL alignment in representation space
Methodological Innovation: Novel and effective approach of shifting analysis from parameter space to similarity space
Comprehensive Experiments: Multi-dataset, multi-perspective validation of theoretical predictions with sound experimental design
Practical Value: Provides new perspective for understanding success mechanisms of self-supervised learning

Weaknesses

Practical Utility of Bounds: Theoretical bounds may be overly loose for practical applications
Assumption Limitations: Shared randomness assumption may be unrealistic in practical applications
Method Scope: Only considers contrastive learning paradigm, does not address other SSL methods

Impact

Theoretical Significance: Important supplement to self-supervised learning theory
Methodological Inspiration: Similarity space analysis method may inspire subsequent research
Practical Guidance: Provides theoretical basis for selecting appropriate supervised proxies

Applicable Scenarios

Research requiring understanding of relationships between self-supervised and supervised learning
Theoretical analysis of contrastive learning methods
Representation learning stability studies

References

Luthra et al. (2025): Self-supervised contrastive learning is approximately supervised contrastive learning
Chen et al. (2020): A simple framework for contrastive learning of visual representations (SimCLR)
Khosla et al. (2020): Supervised contrastive learning
Kornblith et al. (2019): Similarity of neural network representations revisited (CKA)
Kriegeskorte et al. (2008): Representational similarity analysis

Summary: This paper connects self-supervised contrastive learning to supervised learning at the representation level: it proves that, under shared randomness, the similarity matrices of CL and NSCL stay close during training, and it confirms the predicted effects of class count, temperature, and batch size empirically. The bounds may be too loose to be informative for large-scale, long training runs, but the shift from parameter space to similarity space and the accompanying experiments are a useful contribution to the theory of self-supervised learning.