
Data or Language Supervision: What Makes CLIP Better than DINO?

Liu, Zhang, Ghosh et al.
CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

Basic Information

  • Paper ID: 2510.11835
  • Title: Data or Language Supervision: What Makes CLIP Better than DINO?
  • Authors: Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy (Stanford University, Tsinghua University)
  • Categories: cs.CV cs.AI cs.CL cs.LG cs.MM
  • Publication Date: October 13, 2025
  • Paper Link: https://arxiv.org/abs/2510.11835

Abstract

CLIP outperforms self-supervised models like DINO as a visual encoder in vision-language models (VLMs), yet it remains unclear whether this advantage stems from language supervision or larger-scale training data. To decouple these factors, researchers pretrained CLIP and DINO under controlled settings—using identical architecture, datasets, and training configurations—achieving comparable ImageNet accuracy. Embedding analysis reveals that CLIP captures high-level semantics (such as object categories and text), while DINO is more responsive to low-level features like color and style. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-dense tasks, while DINO shows slight advantages on vision-centric tasks. Language supervision variants (such as sigmoid loss and pretrained language encoders) yield limited improvements.

Research Background and Motivation

Core Research Question

The fundamental question this study addresses is: Does CLIP's superior performance compared to DINO in vision-language models stem from language supervision or larger-scale training data?

Significance of the Problem

  1. Practical Significance: Visual encoders serve as the "eyes" of VLMs, with their performance directly impacting the entire system's visual understanding capabilities
  2. Theoretical Value: Understanding how different supervision signals influence visual representation learning provides scientific guidance for designing better visual encoders
  3. Resource Optimization: Clarifying key factors enables better design choices under resource constraints

Limitations of Existing Approaches

  1. Confounding Factors: Existing CLIP and DINO models differ by up to 100× in training data scale, making it difficult to isolate the effects of supervision type and data scale
  2. Lack of Controlled Experiments: Previous comparative studies relied on pretrained models with different training setups, preventing fair comparisons
  3. Insufficient Mechanistic Understanding: Limited analysis of how language supervision reshapes visual representation spaces

Research Motivation

The study trains CLIP and DINO under identical, rigorously controlled conditions so that the effect of language supervision on visual encoder performance can be separated from the effect of data scale.

Core Contributions

  1. First Controlled Experiment: Training CLIP and DINO with identical architecture (ViT-B/16), dataset (DataComp 10M subset), and training configurations to enable fair comparison
  2. Embedding Space Analysis: In-depth analysis of how language supervision transforms visual representations, revealing that CLIP focuses on high-level semantics while DINO is sensitive to low-level visual features
  3. VLM Performance Evaluation: Systematic evaluation of both encoders across 20 VQA benchmarks, showing CLIP significantly outperforms DINO on OCR tasks (47.5% vs. 40.0%, a gap of 7.5 percentage points)
  4. Supervision Variant Exploration: Verification of limited gains from different language supervision forms (SigLIP loss, pretrained language models)
  5. Scientific Insights: Providing empirically-grounded design principles for visual encoder development

Methodology Details

Task Definition

  • Input: Image dataset with optional paired text descriptions
  • Output: Visual encoder capable of mapping images to semantic representation space
  • Constraints: Modifying only the supervision signal type while controlling all other variables

Controlled Experimental Design

Unified Architecture

  • Backbone Network: ViT-B/16 as the common architecture for both models
  • Parameter Scale: Ensuring consistent model complexity

Unified Dataset

  • Data Source: 10M image subset from the DataComp dataset
  • Preprocessing: Unified center cropping and resizing to 224×224
  • Supervision Difference: CLIP uses image-text pairs; DINO uses images only
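
The shared preprocessing step can be illustrated with a small crop-box computation. This is a minimal sketch, assuming that "center cropping" means taking the largest centered square before resizing to 224×224; the helper name `center_crop_box` is hypothetical, and the summary does not state the exact resize order or interpolation method.

```python
TARGET_SIZE = 224  # output resolution stated in the summary


def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# Example: a 640x480 image is cropped to its central 480x480 square,
# which would then be resized by this factor to reach 224x224.
box = center_crop_box(640, 480)
resize_factor = TARGET_SIZE / (box[2] - box[0])
print(box, round(resize_factor, 4))
```

Because both models see the same crop and resolution, any difference in the learned representations cannot be attributed to input preprocessing.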

Unified Training Configuration

  • Optimizer: AdamW
  • Learning Rate: 1e-3 with cosine decay
  • Training Epochs: 20
  • Hardware: 4 A100 GPUs, 3-day training duration
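
The learning-rate schedule listed above can be sketched as a plain cosine decay. The peak value 1e-3 comes from the summary; the absence of warmup, the final learning rate of zero, and the step counts used here are assumptions made only for illustration.

```python
import math

BASE_LR = 1e-3  # peak learning rate from the summary


def cosine_decay_lr(step: int, total_steps: int,
                    base_lr: float = BASE_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length of 1000 steps, sampled at a few points.
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_decay_lr(step, 1000):.6f}")
```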

Embedding Analysis Methods

Differentiated Image Pair Identification

Two classes of image pairs are defined to analyze where the two models diverge:

g1 = (clip_sim > 0.8) ∧ (dino_sim < 0.5)  # High CLIP similarity, low DINO similarity
g2 = (dino_sim > 0.8) ∧ (clip_sim < 0.5)  # High DINO similarity, low CLIP similarity
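
The two conditions above can be applied to every image pair once both encoders have produced embeddings. The following is a minimal sketch, not the authors' code: the function names and the toy 2-D embeddings are hypothetical, and only the thresholds 0.8 and 0.5 come from the definitions above.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def divergent_pairs(clip_emb, dino_emb, high=0.8, low=0.5):
    """Return (g1, g2): index pairs where the two encoders disagree."""
    g1, g2 = [], []
    for i, j in combinations(range(len(clip_emb)), 2):
        clip_sim = cosine(clip_emb[i], clip_emb[j])
        dino_sim = cosine(dino_emb[i], dino_emb[j])
        if clip_sim > high and dino_sim < low:
            g1.append((i, j))  # similar for CLIP, dissimilar for DINO
        elif dino_sim > high and clip_sim < low:
            g2.append((i, j))  # similar for DINO, dissimilar for CLIP
    return g1, g2


# Toy embeddings for three images (hypothetical values).
clip_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
dino_emb = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
g1, g2 = divergent_pairs(clip_emb, dino_emb)
print(g1, g2)
```

Inspecting the images in g1 and g2 is what allows the qualitative reading of which properties each encoder treats as "the same".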

Quantitative Validation Experiments

  1. Semantic Sensitivity Test: Using images with different letters/numbers to test semantic discrimination ability
  2. Visual Pattern Sensitivity Test: Using simple repetitive visual patterns to test low-level feature sensitivity
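
Both tests can be summarized with a single number per encoder: the average cosine similarity over all pairs in a probe set. A lower average means the encoder separates the probe images more strongly. The sketch below assumes this pairwise-average formulation; the function name and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical embeddings of three probe images (e.g. three different letters).
probe = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = mean_pairwise_similarity(probe)
print(round(score, 4))
```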

VLM Integration Scheme

Framework Selection

  • Base Architecture: LLaVA-1.5
  • Replacement Component: Visual encoder only
  • Training Pipeline: Pretraining + visual instruction tuning
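
The two-stage pipeline can be written down as a small configuration. The stage names follow the list above; which components are updated in each stage is an assumption based on the commonly described LLaVA-1.5 recipe (connector only in pretraining, connector plus language model in instruction tuning, vision encoder frozen throughout) and is not stated in this summary. The `Stage` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_vision_encoder: bool
    train_connector: bool
    train_language_model: bool


# Assumed trainability flags; only the stage names come from the summary.
PIPELINE = [
    Stage("pretraining", False, True, False),
    Stage("visual_instruction_tuning", False, True, True),
]


def trainable_components(stage: Stage) -> list[str]:
    """List the component names that receive gradient updates in a stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "connector": stage.train_connector,
        "language_model": stage.train_language_model,
    }
    return [name for name, enabled in flags.items() if enabled]


for stage in PIPELINE:
    print(stage.name, trainable_components(stage))
```

Under this assumption the visual encoder is the only component that differs between LLaVA-CLIP and LLaVA-DINO, which is what makes the downstream comparison attributable to the encoder.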

Evaluation Benchmarks

  • VMCBench: Unified multi-choice visual question answering benchmark with 20 datasets
  • Task Types: General VQA, reasoning, document/chart understanding, OCR, etc.

Experimental Setup

Datasets

  1. Training Data: DataComp 10M subset
    • Scale: 10 million image-text pairs
    • Preprocessing: Center cropping, 224×224 resolution
  2. Evaluation Datasets:
    • Classification Tasks: ImageNet, CIFAR-10, Stanford Cars, Flowers, CUB, ImageNetV2, CIFAR-10.1
    • VQA Tasks: 20 subsets from VMCBench, including OCRVQA, TextVQA, etc.

Evaluation Metrics

  • Linear Probing Accuracy: Standard method for assessing visual encoder quality
  • VQA Accuracy: Correctness rate for multiple-choice questions
  • Cosine Similarity: Embedding space analysis metric
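
Multiple-choice VQA accuracy is the fraction of questions for which the predicted option matches the correct one, usually reported per task category. A minimal sketch, assuming each record is a `(category, predicted_choice, correct_choice)` tuple; the record layout and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute the fraction of correct answers for each task category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    return {category: correct[category] / total[category] for category in total}


# Hypothetical predictions on four questions.
records = [
    ("OCR", "A", "A"),
    ("OCR", "B", "C"),
    ("Reasoning", "D", "D"),
    ("Reasoning", "A", "A"),
]
scores = accuracy_by_category(records)
print(scores)
```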

Comparison Methods

  • Official Models: Officially released CLIP and DINO pretrained models
  • Controlled Models: CLIP and DINO trained under identical conditions
  • Supervision Variants: SigLIP loss version, pretrained language model version

Implementation Details

  • Checkpoint Selection: Selecting best checkpoint based on validation set performance
  • Evaluation Frequency: Saving and evaluating every 500 steps
  • Statistical Significance: Verifying result stability across multiple random seeds
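
Checkpoint selection and the seed-stability check can be sketched as follows. All numbers are hypothetical placeholders, not results from the paper; the 500-step spacing mirrors the evaluation frequency listed above.

```python
from statistics import mean, stdev


def best_checkpoint(val_accuracy_by_step: dict[int, float]) -> int:
    """Return the training step whose checkpoint has the highest validation accuracy."""
    return max(val_accuracy_by_step, key=val_accuracy_by_step.get)


# Hypothetical validation accuracies, one entry per 500-step checkpoint.
val_accuracy_by_step = {500: 0.41, 1000: 0.47, 1500: 0.46}
best_step = best_checkpoint(val_accuracy_by_step)

# Hypothetical final scores from three random seeds.
seed_scores = [0.412, 0.416, 0.414]
seed_mean, seed_std = mean(seed_scores), stdev(seed_scores)
print(best_step, round(seed_mean, 3), round(seed_std, 3))
```

A small standard deviation relative to the gap between two models is what makes a reported difference credible.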

Experimental Results

Main Results

Classification Task Performance

Model            | ImageNet | CIFAR-10 | Stanford Cars | Flowers | CUB
Controlled CLIP  | 65.8%    | 90.7%    | 74.7%         | 78.7%   | 52.3%
Controlled DINO  | 66.4%    | 92.1%    | 54.1%         | 80.7%   | 43.0%

Key Findings:

  • Comparable performance on general classification tasks
  • CLIP significantly outperforms DINO on fine-grained classification (Stanford Cars: +20.6 percentage points, CUB: +9.3 percentage points)

VLM Task Performance

Task Type       | LLaVA-CLIP | LLaVA-DINO | Difference (CLIP − DINO)
General VQA     | 46.2%      | 46.0%      | +0.2%
Reasoning       | 41.2%      | 41.5%      | -0.3%
Document/Chart  | 33.2%      | 33.1%      | +0.1%
OCR Tasks       | 47.5%      | 40.0%      | +7.5%

Key Findings:

  • Comparable performance on most tasks
  • CLIP significantly outperforms DINO on OCR-related tasks
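
The reported differences are absolute gaps in accuracy (percentage points). The short calculation below recomputes them from the table values and also shows the corresponding relative change, which is a different quantity; the relative figures are derived here for illustration and are not reported in the summary.

```python
# (LLaVA-CLIP accuracy, LLaVA-DINO accuracy) in percent, from the table above.
results = {
    "General VQA": (46.2, 46.0),
    "Reasoning": (41.2, 41.5),
    "Document/Chart": (33.2, 33.1),
    "OCR Tasks": (47.5, 40.0),
}


def gaps(clip_acc: float, dino_acc: float) -> tuple[float, float]:
    """Return (absolute gap in points, relative gap in percent of DINO's score)."""
    absolute = round(clip_acc - dino_acc, 1)
    relative = round(100 * (clip_acc - dino_acc) / dino_acc, 2)
    return absolute, relative


for task, (clip_acc, dino_acc) in results.items():
    absolute, relative = gaps(clip_acc, dino_acc)
    print(f"{task}: {absolute:+.1f} points, {relative:+.2f}% relative")
```

For OCR the 7.5-point gap corresponds to a relative improvement of 18.75% over DINO, so stating the unit matters when quoting the result.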

Embedding Analysis Results

Quantitative Validation

  1. Semantic Content Sensitivity:
    • DINO average similarity: 0.877
    • CLIP average similarity: 0.713 (lower, indicating better semantic discrimination)
  2. Visual Pattern Sensitivity:
    • DINO average similarity: 0.478 (lower, indicating better visual detail discrimination)
    • CLIP average similarity: 0.497

Qualitative Analysis

  • CLIP Advantages: Better capture of object categories and embedded text as high-level semantics
  • DINO Advantages: Greater sensitivity to color, style, and other low-level visual features

Supervision Variant Experiments

Variant                    | VMCBench Average Accuracy
Standard CLIP              | 41.4%
SigLIP Loss                | 40.8%
Pretrained Language Model  | 40.5%

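Conclusion: Neither variant improves on standard CLIP in this comparison (40.8% and 40.5% versus 41.4%), consistent with the finding that alternative forms of language supervision yield limited gains.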
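
To make the "SigLIP loss" variant concrete, the sketch below contrasts the two objectives on a small image-text similarity matrix. It is an illustration of the standard formulations, not the authors' training code: CLIP's loss is a symmetric softmax cross-entropy over rows and columns, while the sigmoid loss treats every image-text pair as an independent binary decision. The temperature 0.07 and the sigmoid parameters t = 10, b = -10 are assumed initial values; in practice they are learned.

```python
import math


def softmax_contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over rows (image->text) and columns (text->image)."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]  # the matching pair sits on the diagonal
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def sigmoid_loss(sim, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: label +1 on the diagonal, -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            x = z * (t * sim[i][j] + b)
            # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably
            total += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matching pairs are most similar
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # matching pairs are least similar
print(softmax_contrastive_loss(aligned), softmax_contrastive_loss(shuffled))
print(sigmoid_loss(aligned), sigmoid_loss(shuffled))
```

Both losses reward the same alignment between matching images and captions, which offers one plausible reading of why swapping one for the other changes downstream accuracy so little.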

Language Model Backbone Experiments

Results with Qwen2-7B replacing Vicuna-7B as the language model backbone:

Model Combination  | General VQA | OCR    | Average
CLIP + Qwen2       | 57.90%      | 51.40% | 49.69%
DINO + Qwen2       | 54.02%      | 47.59% | 47.72%

Related Work

Vision-Language Models

  • Representative Works: LLaVA, Qwen2.5-VL, etc.
  • Architectural Characteristics: Visual encoder + language model + connector module
  • This Paper's Contribution: Systematic analysis focused on the visual encoder component

Visual Representation Learning

  1. Self-Supervised Methods: DINO, SimCLR, etc., learning representations through image augmentation relationship prediction
  2. Language Supervision Methods: CLIP, EVA-CLIP, SigLIP, etc., leveraging image-text alignment
  3. This Paper's Innovation: First systematic comparison of both paradigms under controlled conditions

VLM Design Choice Research

  • Existing Research: Primarily focused on architectural components, data strategies, training configurations
  • Limitations: Based on pretrained models with different training setups, lacking controlled variables
  • This Paper's Advantage: Rigorous controlled experimental design

Conclusions and Discussion

Main Conclusions

  1. Data Scale vs. Supervision Type: With data scale held fixed, language supervision still provides advantages, concentrated in OCR-related tasks and fine-grained classification
  2. Representation Differences: CLIP learns high-level semantic representations; DINO focuses on low-level visual features
  3. Task Specificity: CLIP shows clear advantages on text-dense tasks; on vision-centric tasks the two are comparable, with DINO slightly ahead
  4. Supervision Forms: Different language supervision variants yield limited improvements

Limitations

  1. Data Scale Constraints: Experiments conducted only on 10M image subset; validation on billion-scale data needed
  2. Single Architecture: Only ViT-B/16 tested; conclusions may differ for other architectures
  3. Task Coverage: Primarily focused on VQA tasks; conclusions on other vision-language tasks require verification

Future Directions

  1. Large-Scale Validation: Repeating controlled experiments on billion-scale data
  2. Hybrid Methods: Exploring mixed training strategies combining self-supervision and language supervision
  3. Architecture Exploration: Verifying generalizability of conclusions across different visual architectures

In-Depth Evaluation

Strengths

  1. Rigorous Experimental Design: First truly controlled experiment eliminating confounding factors
  2. Comprehensive Analysis: Multi-level analysis from embedding space to downstream tasks
  3. High Scientific Value: Providing empirically-grounded design guidance for the field
  4. Strong Reproducibility: Detailed experimental setup and open-source code
  5. Clear Writing: Logical structure and accurate conclusion statements

Weaknesses

  1. Scale Limitations: 10M dataset relatively small; may not fully reflect large-scale training scenarios
  2. Task Limitations: Primarily focused on VQA tasks; generalization to other vision-language tasks insufficiently verified
  3. Insufficient Theoretical Analysis: Lacking theoretical explanation for why language supervision produces these differences

Impact

  1. Academic Contribution: Providing scientific foundation for visual encoder design, filling a field gap
  2. Practical Value: Guiding visual encoder selection in actual VLM systems
  3. Methodological Contribution: Controlled experimental design approach applicable to other comparative studies

Applicable Scenarios

  1. VLM Development: Providing evidence for selecting appropriate visual encoders
  2. Research Guidance: Directing visual representation learning research
  3. Resource Optimization: Enabling better design choices under resource constraints

References

This paper cites important works in vision-language models and visual representation learning, including:

  • CLIP (Radford et al., 2021)
  • DINO (Caron et al., 2021)
  • LLaVA (Liu et al., 2023)
  • SigLIP (Zhai et al., 2023)
  • DataComp (Gadre et al., 2023)

Overall Assessment: This is a high-quality empirical research paper that addresses an important scientific question in the field through rigorous controlled experimental design. The research methodology is scientifically sound, and the conclusions hold significant theoretical and practical value, providing valuable guidance for vision-language model development.