Data or Language Supervision: What Makes CLIP Better than DINO?
Liu, Zhang, Ghosh et al.
CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
CLIP outperforms self-supervised models like DINO as a visual encoder in vision-language models (VLMs), yet it remains unclear whether this advantage stems from language supervision or from larger-scale training data. To decouple these factors, the authors pretrain CLIP and DINO under controlled settings (identical architecture, dataset, and training configuration) and reach comparable ImageNet accuracy. Embedding analysis reveals that CLIP captures high-level semantics (such as object categories and text), while DINO is more responsive to low-level features like color and style. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-dense tasks, while DINO shows a slight advantage on vision-centric tasks. Language supervision variants (such as sigmoid loss and pretrained language encoders) yield limited improvements.
The fundamental question this study addresses is: Does CLIP's superior performance compared to DINO in vision-language models stem from language supervision or larger-scale training data?
Practical Significance: Visual encoders serve as the "eyes" of VLMs, with their performance directly impacting the entire system's visual understanding capabilities
Theoretical Value: Understanding how different supervision signals influence visual representation learning provides scientific guidance for designing better visual encoders
Confounding Factors: Existing CLIP and DINO models differ by up to 100× in training data scale, making it difficult to isolate the effects of supervision type and data scale
Lack of Controlled Experiments: Previous comparative studies relied on pretrained models with different training setups, preventing fair comparisons
Insufficient Mechanistic Understanding: Limited analysis of how language supervision reshapes visual representation spaces
The core idea is to train CLIP and DINO under identical conditions, within a rigorously controlled experimental design, so that the effect of language supervision on visual encoder performance can be isolated from the effect of data scale.
First Controlled Experiment: Training CLIP and DINO with identical architecture (ViT-B/16), dataset (DataComp 10M subset), and training configurations to enable fair comparison
Embedding Space Analysis: In-depth analysis of how language supervision transforms visual representations, revealing that CLIP focuses on high-level semantics while DINO is sensitive to low-level visual features
VLM Performance Evaluation: Systematic evaluation of both encoders across 20 VQA benchmarks, showing CLIP significantly outperforms DINO on OCR tasks (7.5% improvement)
Supervision Variant Exploration: Verification of limited gains from different language supervision forms (SigLIP loss, pretrained language models)
Scientific Insights: Providing empirically grounded design principles for visual encoder development
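The supervision variants mentioned above differ mainly in how image-text similarities are turned into a loss. As a rough sketch (not the authors' training code), the softmax contrastive objective associated with CLIP and the pairwise sigmoid objective associated with SigLIP can be written in plain Python as follows. The function names, the temperature of 0.07, and the scale and bias of 10 and -10 are assumptions chosen for illustration; in practice these quantities are typically learnable.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(images, texts):
    """Cosine similarity between every image and every text embedding."""
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax_contrastive_loss(images, texts, temperature=0.07):
    """Symmetric softmax (InfoNCE-style) loss: pair i is the positive for row i and column i."""
    sim = similarity_matrix(images, texts)
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text: normalize over each row.
    i2t = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n))
    # Text-to-image: normalize over each column.
    t2i = sum(
        log_sum_exp([logits[i][j] for i in range(n)]) - logits[j][j]
        for j in range(n)
    )
    return (i2t + t2i) / (2 * n)


def sigmoid_pairwise_loss(images, texts, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is an independent binary decision."""
    sim = similarity_matrix(images, texts)
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            # -log(sigmoid(z)) equals softplus(-z).
            total += softplus(-label * (scale * sim[i][j] + bias))
    return total / n


# Toy batch: three orthogonal embeddings, first with matching texts, then with mismatched ones.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = images
mismatched = images[1:] + images[:1]
print(softmax_contrastive_loss(images, matched), softmax_contrastive_loss(images, mismatched))
print(sigmoid_pairwise_loss(images, matched), sigmoid_pairwise_loss(images, mismatched))
```

Both objectives reward matched pairs and penalize mismatched ones. The difference is that the softmax form normalizes over the whole batch, while the sigmoid form scores each pair on its own.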
Input: Image dataset with optional paired text descriptions
Output: Visual encoder capable of mapping images to semantic representation space
Constraints: Modifying only the supervision signal type while controlling all other variables
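One way to make this constraint concrete is to describe each pre-training run as a configuration object and check that the two runs differ only in the supervision field. The following is a minimal sketch under that assumption: `PretrainConfig`, `differing_fields`, `check_controlled`, and the `training_recipe` placeholder are hypothetical names, and only the ViT-B/16 architecture and the DataComp 10M subset come from the summary above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PretrainConfig:
    architecture: str
    dataset: str
    training_recipe: str  # hypothetical placeholder for optimizer, schedule, augmentations
    supervision: str      # the only field allowed to change between runs


def differing_fields(a, b):
    """Return the names of the fields whose values differ between two configs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def check_controlled(a, b, allowed=("supervision",)):
    """Raise if the two runs differ in anything other than the allowed fields."""
    unexpected = [name for name in differing_fields(a, b) if name not in allowed]
    if unexpected:
        raise ValueError(f"uncontrolled differences: {unexpected}")


clip_cfg = PretrainConfig(
    architecture="ViT-B/16",
    dataset="DataComp 10M subset",
    training_recipe="shared",  # illustrative value, not taken from the paper
    supervision="language",
)
# Derive the DINO run from the CLIP run so that shared settings cannot drift apart.
dino_cfg = replace(clip_cfg, supervision="self-supervised")

check_controlled(clip_cfg, dino_cfg)
print(differing_fields(clip_cfg, dino_cfg))
```

Deriving one configuration from the other with `replace` keeps the shared settings in a single place, which is the property a controlled comparison depends on.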
This paper cites important works in vision-language models and visual representation learning, including:
CLIP (Radford et al., 2021)
DINO (Caron et al., 2021)
LLaVA (Liu et al., 2023)
SigLIP (Zhai et al., 2023)
DataComp (Gadre et al., 2023)
Overall Assessment: This is a high-quality empirical paper that addresses an important open question through a rigorously controlled experimental design. The methodology is sound, and the conclusions carry both theoretical and practical weight for vision-language model development.