
On the Alignment Between Supervised and Self-Supervised Contrastive Learning

Luthra, Mishra, Galanti
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives? We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at https://github.com/DLFundamentals/understanding_ssl_v2 (code) and https://dlfundamentals.github.io/cl-nscl-representation-alignment/ (project page).

Basic Information

  • Paper ID: 2510.08852
  • Title: On the Alignment Between Supervised and Self-Supervised Contrastive Learning
  • Authors: Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti (Texas A&M University)
  • Category: cs.LG
  • Publication Date: October 9, 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.08852v1

Abstract

Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations comparable to those from supervised pretraining. Recent theoretical work explains this by showing that, as the number of classes grows, the CL loss closely approximates a supervised surrogate: the negatives-only supervised contrastive learning (NSCL) loss. However, this loss-level similarity leaves an open question: do CL and NSCL remain aligned at the representation level throughout training, not merely in the objective function?

This paper addresses this question by analyzing representation alignment between CL and NSCL models trained under shared randomness (identical initialization, batches, and data augmentation). The study demonstrates that their induced representations remain similar: specifically, it proves that under realistic conditions, the similarity matrices of CL and NSCL stay close. The bounds provide high-probability guarantees for alignment metrics such as Centered Kernel Alignment (CKA) and Representational Similarity Analysis (RSA), and clarify how alignment improves with more classes, higher temperature, and its dependence on batch size.

Research Background and Motivation

Core Problem

The core problem addressed in this paper is: Do self-supervised contrastive learning (CL) and negatives-only supervised contrastive learning (NSCL) remain aligned at the representation level during training?

Research Motivation

  1. Gap Between Empirical Success and Theoretical Explanation: While CL performs excellently in practice, why it learns features aligned with semantic class boundaries remains mysterious
  2. Insufficiency of Loss-Level Similarity: Prior work (Luthra et al., 2025) only proved CL and NSCL similarity at the loss function level, which cannot guarantee consistency of optimization trajectories
  3. Importance of Representation Alignment: Loss-level similarity cannot ensure that parameters and representations remain coupled during training, as they may diverge due to differences in curvature, gradient noise, or learning rate scheduling

Limitations of Existing Approaches

  • Mutual Information Maximization Perspective: Early theory linked CL to inter-view mutual information maximization, but excessive constraints reduce downstream performance
  • Alignment and Uniformity: While geometrically intuitive, these criteria cannot fully explain how different semantic classes are organized under CL training
  • Cluster Recovery Theory: Most results rely on restrictive assumptions, such as conditional independence of augmentations given cluster identity

Core Contributions

  1. Theoretical Contributions:
    • Proves that under shared randomness, similarity matrices of CL and NSCL remain close during training
    • Provides high-probability lower bounds for CKA and RSA alignment metrics
    • Reveals how alignment varies with number of classes, temperature parameter, and batch size
  2. Methodological Innovation:
    • Shifts analysis from parameter space to representation space, avoiding inherent instability of parameter space coupling
    • Establishes "similarity descent" proxy dynamics that faithfully track similarity evolution induced by parameter space SGD
  3. Experimental Validation:
    • Validates theoretical predictions on multiple datasets
    • Demonstrates that NSCL is closer to CL than other supervised methods
    • Confirms alignment enhancement with scale and temperature

Detailed Methodology

Task Definition

Given a class-balanced dataset $S = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times [C]$, where each class has $n$ samples ($N = Cn$), an encoder $f_w: \mathcal{X} \to \mathbb{R}^d$ maps inputs to embeddings.

Core Method: Similarity Space Analysis

1. Similarity Matrix Dynamics

Let $\Sigma_t \in [-1,1]^{N \times N}$ be the pairwise similarity matrix of a fixed reference set at step $t$. The analysis tracks the coupled evolution of the CL and NSCL similarities, $\Sigma^{CL}_t, \Sigma^{NSCL}_t \in [-1,1]^{N \times N}$.
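As a minimal sketch of the object being tracked, the snippet below builds a pairwise similarity matrix for a small reference set and measures the Frobenius gap between two such matrices. It assumes cosine similarity (suggested by the entries lying in $[-1,1]$, but not stated explicitly here), and the random embeddings are hypothetical stand-ins for the outputs of the CL and NSCL encoders.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the N x N matrix of pairwise cosine similarities (assumed similarity)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)   # guard against zero-norm rows
    sigma = unit @ unit.T
    return np.clip(sigma, -1.0, 1.0)             # remove tiny floating-point overshoot

# Hypothetical embeddings of the same 8 reference points under two encoders.
rng = np.random.default_rng(0)
z_cl = rng.normal(size=(8, 16))
z_nscl = z_cl + 0.1 * rng.normal(size=(8, 16))

sigma_cl = cosine_similarity_matrix(z_cl)
sigma_nscl = cosine_similarity_matrix(z_nscl)

# Frobenius distance between the two similarity matrices.
gap = np.linalg.norm(sigma_cl - sigma_nscl, ord="fro")
```

Working with these matrices rather than the weights means the comparison is unaffected by rotations of the embedding space, which is one reason representation-level alignment can stay stable while parameters drift.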

2. Proxy Similarity Descent

For a realized minibatch $B_t = \{(x_j, x'_j, y_j)\}_{j=1}^B$, define the batch gradient mappings:
$$G^{CL}_t := \nabla_\Sigma \bar{\ell}^{CL}_{B_t}(\Sigma^{CL}_t), \quad G^{NSCL}_t := \nabla_\Sigma \bar{\ell}^{NSCL}_{B_t}(\Sigma^{NSCL}_t)$$

Proxy updates:
$$\Sigma^{CL}_{t+1} = \Sigma^{CL}_t - \eta_t G^{CL}_t, \quad \Sigma^{NSCL}_{t+1} = \Sigma^{NSCL}_t - \eta_t G^{NSCL}_t$$
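The coupled proxy updates can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the per-anchor loss is taken to be an InfoNCE-style loss written directly in terms of similarities, the NSCL variant is assumed to differ only by dropping same-class samples from the denominator, and the reference set is assumed to hold two views per sample so that each anchor's positive is another row of $\Sigma$. The names `batch_grad` and `coupled_proxy_descent` are hypothetical, and the sketch does not project the iterates back into $[-1,1]$.

```python
import numpy as np

def batch_grad(sigma, anchors, positives, labels, tau, drop_same_class):
    """Gradient w.r.t. sigma of the mean over anchors of the assumed loss
    -sigma[a, p]/tau + log(sum_k exp(sigma[a, k]/tau)), with k ranging over
    the in-batch candidates of anchor a."""
    n = sigma.shape[0]
    grad = np.zeros_like(sigma)
    in_batch = np.zeros(n, dtype=bool)
    in_batch[anchors] = True
    in_batch[positives] = True
    for a, p in zip(anchors, positives):
        cand = in_batch.copy()
        cand[a] = False                      # never contrast an anchor with itself
        if drop_same_class:
            cand &= labels != labels[a]      # NSCL (assumed): remove same-class negatives
        cand[p] = True                       # the positive always stays in the denominator
        logits = sigma[a, cand] / tau
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()             # softmax over the candidates
        grad[a, cand] += weights / tau       # derivative of the log-sum-exp term
        grad[a, p] -= 1.0 / tau              # derivative of the positive term
    return grad / len(anchors)

def coupled_proxy_descent(sigma0, labels, pos, steps, batch_size, lr, tau, seed=0):
    """Run CL and NSCL proxy updates from the same start on the same batches."""
    rng = np.random.default_rng(seed)
    sigma_cl, sigma_nscl = sigma0.copy(), sigma0.copy()
    n = sigma0.shape[0]
    for _ in range(steps):
        anchors = rng.choice(n, size=batch_size, replace=False)  # shared randomness
        positives = pos[anchors]
        sigma_cl -= lr * batch_grad(sigma_cl, anchors, positives, labels, tau, False)
        sigma_nscl -= lr * batch_grad(sigma_nscl, anchors, positives, labels, tau, True)
    return sigma_cl, sigma_nscl

# Toy reference set: 40 views, 5 classes, views 2k and 2k+1 form a positive pair.
rng = np.random.default_rng(0)
num_views = 40
labels = np.repeat(np.arange(5), 8)
pos = np.arange(num_views) ^ 1
z = rng.normal(size=(num_views, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
sigma0 = z @ z.T

sigma_cl, sigma_nscl = coupled_proxy_descent(
    sigma0, labels, pos, steps=50, batch_size=16, lr=0.1, tau=0.5
)
gap = np.linalg.norm(sigma_cl - sigma_nscl)   # Frobenius gap after 50 shared steps
```

In this sketch the two trajectories differ only through the same-class entries that NSCL removes from each denominator, so when every positive pair has its own class the two updates coincide exactly.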

Main Theoretical Results

Theorem 1: Similarity Space Coupling

With probability at least $1-\delta$, for any step size sequence $(\eta_t)_{t=0}^{T-1}$:
$$\|\Sigma^{CL}_T - \Sigma^{NSCL}_T\|_F \leq \exp\left(\frac{1}{2\tau^2 B}\sum_{t=0}^{T-1}\eta_t\right) \frac{1}{\tau\sqrt{B}}\left(\sum_{t=0}^{T-1}\eta_t\right)\Delta_{C,\delta}(B;\tau)$$

where $\Delta_{C,\delta}(B;\tau) = \frac{2e^{2/\tau}\left(\frac{1}{C}+\epsilon_{B,\delta}\right)}{1-\frac{1}{C}-\epsilon_{B,\delta}}$ and $\epsilon_{B,\delta} = \sqrt{\frac{1}{2B}\log\left(\frac{TB}{\delta}\right)}$.
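To get a feel for how the bound behaves, the sketch below evaluates the right-hand side of Theorem 1 numerically. The function names and the example values (100 steps of size 0.01, batch size 1024, $\delta = 0.05$) are illustrative assumptions, not settings taken from the paper.

```python
import math

def epsilon(batch_size: int, steps: int, delta: float) -> float:
    """Concentration term epsilon_{B,delta} = sqrt(log(T*B/delta) / (2*B))."""
    return math.sqrt(math.log(steps * batch_size / delta) / (2.0 * batch_size))

def delta_term(num_classes: int, batch_size: int, steps: int,
               delta: float, tau: float) -> float:
    """Delta_{C,delta}(B; tau) as defined under Theorem 1."""
    slack = 1.0 / num_classes + epsilon(batch_size, steps, delta)
    if slack >= 1.0:
        raise ValueError("bound is vacuous: 1/C + epsilon must be below 1")
    return 2.0 * math.exp(2.0 / tau) * slack / (1.0 - slack)

def similarity_gap_bound(step_sizes, num_classes: int, batch_size: int,
                         delta: float, tau: float) -> float:
    """Right-hand side of the Theorem 1 bound on the Frobenius gap."""
    steps = len(step_sizes)
    total = sum(step_sizes)
    growth = math.exp(total / (2.0 * tau ** 2 * batch_size))
    prefactor = total / (tau * math.sqrt(batch_size))
    return growth * prefactor * delta_term(num_classes, batch_size, steps, delta, tau)

# Illustrative values only.
etas = [0.01] * 100
few_classes = similarity_gap_bound(etas, num_classes=10, batch_size=1024, delta=0.05, tau=0.5)
many_classes = similarity_gap_bound(etas, num_classes=1000, batch_size=1024, delta=0.05, tau=0.5)
low_temp = similarity_gap_bound(etas, num_classes=100, batch_size=1024, delta=0.05, tau=0.1)
high_temp = similarity_gap_bound(etas, num_classes=100, batch_size=1024, delta=0.05, tau=1.0)
```

Evaluated this way, the bound shrinks as $C$ grows (through the $1/C$ term) and as $\tau$ grows (through $e^{2/\tau}$, $1/\tau$, and the exponential factor), which matches the qualitative claims about classes and temperature.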

CKA and RSA Lower Bounds

Corollary 1 (CKA Lower Bound): Under the setting of Theorem 1, with probability at least $1-\delta$: $CKA_T \geq \frac{1-\rho_T}{1+\rho_T}$

Corollary 2 (RSA Lower Bound): Similarly: $RSA_T \geq \frac{1-r_T}{1+r_T}$

Technical Innovations

  1. From Parameter Space to Representation Space: Avoids exponential divergence issues in parameter space
  2. Block Orthogonality Exploitation: Leverages orthogonality of gradients from different anchors to simplify analysis
  3. Temperature-Modulated Stability: The $\frac{1}{\tau^2 B}$ term in the exponential factor makes similarity space more stable than parameter space

Experimental Setup

Datasets

  • CIFAR-10/100: 50,000 training images, 10,000 validation images
  • Mini-ImageNet: 100-class subset of ImageNet-1K
  • Tiny-ImageNet: 100,000 64×64 images, 200 classes
  • ImageNet-1K: Full ImageNet dataset

Evaluation Metrics

  • Linear CKA (Centered Kernel Alignment): Normalized Frobenius inner product of centered similarity matrices
  • RSA (Representational Similarity Analysis): Pearson correlation of off-diagonal elements of representation dissimilarity matrices
  • Nearest Class Center Classifier (NCCC) and Linear Probe (LP) accuracy
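The two alignment metrics can be sketched directly from the descriptions above. This version takes similarity (Gram) matrices as input, assumes the dissimilarity matrix is one minus the similarity, and uses only the upper triangle for RSA since the matrices are symmetric; these choices are assumptions for illustration and may differ from the paper's code.

```python
import numpy as np

def center(k: np.ndarray) -> np.ndarray:
    """Double-center a square matrix: H K H with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def cka(k: np.ndarray, l: np.ndarray) -> float:
    """Normalized Frobenius inner product of the centered similarity matrices."""
    kc, lc = center(k), center(l)
    return float(np.sum(kc * lc) / (np.linalg.norm(kc) * np.linalg.norm(lc)))

def rsa(k: np.ndarray, l: np.ndarray) -> float:
    """Pearson correlation of off-diagonal dissimilarities (assumed 1 - similarity)."""
    iu = np.triu_indices(k.shape[0], k=1)   # strict upper triangle
    d_k, d_l = 1.0 - k[iu], 1.0 - l[iu]
    return float(np.corrcoef(d_k, d_l)[0, 1])

# Hypothetical embeddings standing in for two trained encoders.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(32, 16))
z2 = z1 + 0.3 * rng.normal(size=(32, 16))
z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
sim1, sim2 = z1 @ z1.T, z2 @ z2.T

cka_value = cka(sim1, sim2)
rsa_value = rsa(sim1, sim2)
```

Both scores equal 1 when the two similarity matrices are identical, so values close to 1 indicate that the two models organize the reference set in nearly the same way.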

Comparison Methods

  • NSCL: Negatives-only supervised contrastive learning
  • SCL: Supervised contrastive learning (Khosla et al., 2020)
  • CE: Cross-entropy loss

Implementation Details

  • Architecture: ResNet-50 encoder + two-layer MLP projection head
  • Optimizer: LARS optimizer, momentum 0.9, weight decay 1e-6
  • Batch Size: 1024
  • Learning Rate: Base learning rate 0.3, scaled with batch size
  • Training Strategy: 10 epochs warmup + cosine learning rate schedule
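A warmup-plus-cosine schedule of the kind listed above might look like the sketch below. The exact scaling rule is not given here, so the linear rule `base_lr * batch_size / 256` (a common convention for contrastive pretraining), the linear warmup shape, and decay to zero are all assumptions; `learning_rate` is a hypothetical helper name.

```python
import math

def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  base_lr: float = 0.3, batch_size: int = 1024,
                  ref_batch: int = 256) -> float:
    """Linear warmup to a peak rate, then cosine decay to zero."""
    peak = base_lr * batch_size / ref_batch        # assumed linear batch-size scaling
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps    # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))  # cosine decay

# Example: 10 warmup epochs out of 100, with a hypothetical 50 steps per epoch.
steps_per_epoch = 50
warmup = 10 * steps_per_epoch
total = 100 * steps_per_epoch
schedule = [learning_rate(s, total, warmup) for s in range(total + 1)]
```

The resulting sequence plays the role of the step sizes $\eta_t$ in the analysis, so the sum of the schedule is the quantity that enters the theoretical bound.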

Experimental Results

Main Results

1. Alignment Comparison Across Supervision Methods

NSCL consistently achieves highest alignment with CL across all datasets:

  • Tiny-ImageNet: CL-NSCL CKA reaches 0.87 after 1000 epochs, while CL-SCL CKA reaches only 0.043
  • Alignment Ranking: NSCL > CE > SCL

2. Impact of Number of Classes on Alignment

Validates theoretical predictions: more classes lead to stronger CL-NSCL alignment

  • Across all datasets, RSA and CKA values increase monotonically with the number of training classes $C'$
  • Validated across the full range from 2 to 1000 classes on ImageNet-1K

3. Impact of Temperature Parameter

High temperature improves alignment, validating theoretical analysis:

  • Highest alignment at $\tau = 1.0$
  • Alignment progressively decreases at $\tau = 0.5$ and $\tau = 0.1$
  • Consistent trend across all datasets

4. Impact of Batch Size

How alignment changes with batch size depends on how the learning rate is scaled:

  • O(B) Scaling: Alignment decreases with batch size
  • O(√B), O(∜B), O(1) Scaling: Alignment increases with batch size
  • Results consistent with theoretical bound dependencies
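One way to read these trends off the Theorem 1 bound is sketched below. It assumes a constant step size $\eta_t = \eta_0 B^{\alpha}$ and a fixed number of steps $T$; both are illustrative simplifications introduced here, not the paper's training schedule.

```latex
\sum_{t=0}^{T-1}\eta_t = \eta_0 T B^{\alpha}
\quad\Longrightarrow\quad
\|\Sigma^{CL}_T - \Sigma^{NSCL}_T\|_F
\le
\exp\!\left(\frac{\eta_0 T}{2\tau^2}\, B^{\alpha-1}\right)
\frac{\eta_0 T}{\tau}\, B^{\alpha-\frac{1}{2}}\,
\Delta_{C,\delta}(B;\tau)
```

Under these assumptions, $\alpha = 1$ (O(B) scaling) keeps the exponential factor constant while the prefactor grows like $\sqrt{B}$, so for large $B$ the bound loosens as the batch grows. For $\alpha \le 1/2$ the exponential factor shrinks and the prefactor does not grow, so the bound tightens, helped by $\epsilon_{B,\delta}$ shrinking inside $\Delta_{C,\delta}$. If the epoch budget is fixed instead, $T$ itself scales like $1/B$ and the exponents shift accordingly.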

Parameter Space vs Representation Space

  • Weight Space: Parameters of CL and supervised methods diverge rapidly
  • Representation Space: CKA and RSA maintain high alignment (>0.8)
  • Demonstrates stability of representation alignment contrasted with parameter divergence

Downstream Task Performance

| Dataset | CL (NCCC/LP) | NSCL (NCCC/LP) | SCL (NCCC/LP) | CE (NCCC/LP) |
|---|---|---|---|---|
| CIFAR-10 | 88.37/90.16 | 94.47/94.09 | 94.93/94.67 | 92.97/93.39 |
| CIFAR-100 | 54.62/65.65 | 60.14/68.38 | 64.06/69.52 | 67.35/68.04 |
| Mini-ImageNet | 60.78/65.30 | 63.92/72.60 | 74.78/76.00 | 75.20/74.00 |
| Tiny-ImageNet | 40.59/44.61 | 40.76/45.79 | 48.63/48.73 | 48.28/52.57 |

Related Work

Contrastive Learning Theory

  1. Mutual Information Perspective: Early work linked CL to mutual information maximization, but over-constraint damages performance
  2. Geometric Perspective: Alignment and uniformity properties, but cannot fully explain semantic class organization
  3. Cluster Recovery: Most depend on restrictive assumptions, such as conditional independence

Supervised Learning Connections

  1. Linear Models: Self-supervised objectives like VicReg align with supervised quadratic loss
  2. Label-Free Bounds: Based on Luthra et al. (2025) establishing explicit coupling between CL and NSCL

Other Theoretical Studies

  • Feature learning dynamics, role of augmentation, projection head analysis, sample complexity, etc.

Conclusions and Discussion

Main Conclusions

  1. Stability of Representation Alignment: CL and NSCL maintain tight coupling in representation space despite potential parameter divergence
  2. Consistency Between Theory and Practice: Experiments validate theoretical predictions of class count, temperature, and batch size effects
  3. NSCL as Bridge: NSCL tracks CL better than other supervised methods, serving as a principled bridge between self-supervised and supervised learning

Limitations

  1. Tightness of Bounds: Theoretical bounds may be overly loose for large-scale, long-training scenarios
  2. Worst-Case Analysis: Uses uniform high-probability concentration bounds, favoring generality over tightness
  3. Exponential Factor: The exponential factor may invalidate bounds for large-scale training beyond the first few epochs

Future Directions

  1. Tighter Bounds: Exploit data-dependent structure rather than worst-case bounds
  2. Extension to Other SSL Paradigms: Extend framework to non-contrastive methods
  3. Practical Improvements: Improve guarantee practicality while maintaining stability

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contribution: First to establish rigorous theoretical guarantees for CL-NSCL alignment in representation space
  2. Methodological Innovation: Novel and effective approach of shifting analysis from parameter space to similarity space
  3. Comprehensive Experiments: Multi-dataset, multi-perspective validation of theoretical predictions with sound experimental design
  4. Practical Value: Provides new perspective for understanding success mechanisms of self-supervised learning

Weaknesses

  1. Practical Utility of Bounds: Theoretical bounds may be overly loose for practical applications
  2. Assumption Limitations: Shared randomness assumption may be unrealistic in practical applications
  3. Method Scope: Only considers contrastive learning paradigm, does not address other SSL methods

Impact

  1. Theoretical Significance: Important supplement to self-supervised learning theory
  2. Methodological Inspiration: Similarity space analysis method may inspire subsequent research
  3. Practical Guidance: Provides theoretical basis for selecting appropriate supervised proxies

Applicable Scenarios

  • Research requiring understanding of relationships between self-supervised and supervised learning
  • Theoretical analysis of contrastive learning methods
  • Representation learning stability studies

References

  1. Luthra et al. (2025): Self-supervised contrastive learning is approximately supervised contrastive learning
  2. Chen et al. (2020): A simple framework for contrastive learning of visual representations (SimCLR)
  3. Khosla et al. (2020): Supervised contrastive learning
  4. Kornblith et al. (2019): Similarity of neural network representations revisited (CKA)
  5. Kriegeskorte et al. (2008): Representational similarity analysis

Summary: This paper theoretically establishes deep connections between self-supervised contrastive learning and supervised learning, rigorously proving representation-level alignment through mathematical analysis, providing important insights into understanding the success mechanisms of self-supervised learning. Although the practical utility of theoretical bounds is limited, its methodological innovations and experimental validation make significant contributions to theoretical development in this field.