On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Luthra, Mishra, Galanti
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?}
We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes and higher temperatures and how it depends on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time.
Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code is available at https://github.com/DLFundamentals/understanding_ssl_v2 and the project page at https://dlfundamentals.github.io/cl-nscl-representation-alignment/.
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations comparable to those from supervised pretraining. Recent theoretical work explains this by showing that, as the number of classes grows, the CL loss closely approximates a supervised surrogate, the Negatives-Only Supervised Contrastive Learning (NSCL) loss. However, this loss-level similarity leaves an open question: do CL and NSCL remain aligned at the representation level throughout training, and not merely in their objectives?
This paper addresses this question by analyzing representation alignment between CL and NSCL models trained under shared randomness (identical initialization, batches, and data augmentations). It shows that their induced representations remain similar: specifically, it proves that under realistic conditions, the similarity matrices of CL and NSCL stay close. The bounds provide high-probability guarantees for alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes and higher temperatures and how it depends on batch size.
The core problem addressed in this paper is: do self-supervised contrastive learning (CL) and Negatives-Only Supervised Contrastive Learning (NSCL) remain aligned at the representation level during training?
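To make the two objectives concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a SimCLR-style CL loss over two augmented views, and it assumes that NSCL differs from CL only by dropping same-class samples (other than the anchor's own positive) from the denominator. The function name `contrastive_losses`, the default temperature `tau=0.5`, and the toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def contrastive_losses(z1, z2, labels, tau=0.5):
    """Return (cl_loss, nscl_loss) for two views z1, z2 of shape (B, d).

    Assumption: NSCL keeps the positive pair in the denominator and
    removes every other sample that shares the anchor's class label.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm embeddings
    y = np.concatenate([labels, labels])
    b = z1.shape[0]
    n = 2 * b
    logits = z @ z.T / tau                       # cosine similarity / temperature
    idx = np.arange(n)
    pos = np.concatenate([np.arange(b, n), np.arange(0, b)])  # other view of each anchor

    not_self = ~np.eye(n, dtype=bool)
    is_pos = np.zeros((n, n), dtype=bool)
    is_pos[idx, pos] = True
    diff_class = y[:, None] != y[None, :]

    def loss(mask):
        # Numerically stable log-sum-exp over the entries allowed by the mask.
        masked = np.where(mask, logits, -np.inf)
        m = masked.max(axis=1, keepdims=True)
        log_denom = np.log(np.exp(masked - m).sum(axis=1)) + m[:, 0]
        return float(np.mean(log_denom - logits[idx, pos]))

    cl = loss(not_self)               # CL: contrast against every other sample
    nscl = loss(is_pos | diff_class)  # NSCL: positive plus different-class samples only
    return cl, nscl

# Toy data (hypothetical): 32 samples, 8-dimensional embeddings, 4 classes.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(32, 8))
z2 = z1 + 0.1 * rng.normal(size=(32, 8))
labels = rng.integers(0, 4, size=32)
cl_loss, nscl_loss = contrastive_losses(z1, z2, labels)
```

Under these assumptions the NSCL denominator is a subset of the CL denominator, so `nscl_loss` never exceeds `cl_loss`, and the two coincide when every sample has its own class. This is consistent with the loss-level result that the two objectives become closer as the number of classes grows.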
Gap Between Empirical Success and Theoretical Explanation: While CL performs excellently in practice, it remains unclear why it learns features aligned with semantic class boundaries.
Insufficiency of Loss-Level Similarity: Prior work (Luthra et al., 2025) only proved that CL and NSCL are similar at the loss level, which does not guarantee that their optimization trajectories are consistent.
Importance of Representation Alignment: Loss-level similarity cannot ensure that parameters and representations remain coupled during training, since they may diverge due to differences in curvature, gradient noise, or learning rate scheduling.
Mutual Information Maximization Perspective: Early theory linked CL to maximizing mutual information between views, but excessive constraints reduce downstream performance.
Alignment and Uniformity: While geometrically intuitive, these criteria cannot fully explain how different semantic classes are organized under CL training.
Cluster Recovery Theory: Most results rely on restrictive assumptions, such as conditional independence of augmentations given cluster identity.
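Let $\Sigma_t \in [-1,1]^{N \times N}$ be the pairwise similarity matrix of a fixed reference set at step $t$. The analysis tracks the coupled evolution of the CL and NSCL similarity matrices:
$$\Sigma_t^{\mathrm{CL}},\ \Sigma_t^{\mathrm{NSCL}} \in [-1,1]^{N \times N}$$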
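As an illustration of how such similarity matrices and the alignment metrics can be computed, here is a minimal NumPy sketch. It assumes cosine similarity between embeddings of the reference set, CKA applied directly to the two centered similarity matrices (in the style of Kornblith et al., 2019), and a Pearson-correlation variant of RSA over off-diagonal entries. All function names and the toy embeddings are hypothetical and do not come from the paper's code.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """Pairwise cosine similarities of the rows of z (N x d); entries lie in [-1, 1]."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return np.clip(z @ z.T, -1.0, 1.0)

def cka(k1, k2):
    """Centered kernel alignment between two N x N similarity (kernel) matrices."""
    n = k1.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    k1c, k2c = h @ k1 @ h, h @ k2 @ h
    # Frobenius inner product normalized by the Frobenius norms.
    return float(np.sum(k1c * k2c) / (np.linalg.norm(k1c) * np.linalg.norm(k2c)))

def rsa(s1, s2):
    """Pearson correlation of the strict upper triangles (an assumed RSA variant)."""
    iu = np.triu_indices_from(s1, k=1)
    return float(np.corrcoef(s1[iu], s2[iu])[0, 1])

# Toy stand-ins for the CL and NSCL embeddings of N = 64 reference points.
rng = np.random.default_rng(0)
z_cl = rng.normal(size=(64, 16))
z_nscl = z_cl + 0.1 * rng.normal(size=(64, 16))   # small perturbation of z_cl

sigma_cl = cosine_similarity_matrix(z_cl)
sigma_nscl = cosine_similarity_matrix(z_nscl)
cka_value = cka(sigma_cl, sigma_nscl)
rsa_value = rsa(sigma_cl, sigma_nscl)
```

Both metrics depend on the embeddings only through the similarity matrices, which is why closeness of $\Sigma_t^{\mathrm{CL}}$ and $\Sigma_t^{\mathrm{NSCL}}$ can translate into guarantees on CKA and RSA without requiring the model weights to stay close.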
Luthra et al. (2025): Self-supervised contrastive learning is approximately supervised contrastive learning
Chen et al. (2020): A simple framework for contrastive learning of visual representations (SimCLR)
Khosla et al. (2020): Supervised contrastive learning
Kornblith et al. (2019): Similarity of neural network representations revisited (CKA)
Kriegeskorte et al. (2008): Representational similarity analysis
Summary: This paper establishes a theoretical connection between self-supervised contrastive learning and supervised learning by proving that CL and NSCL remain aligned at the representation level, which offers insight into why self-supervised learning succeeds. Although the practical utility of the theoretical bounds is limited, the paper's methodological innovations and experimental validation make significant contributions to theoretical development in this field.