Scaling Language-Centric Omnimodal Representation Learning

Xiao, Chan, Zhang et al.
Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scales positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

Basic Information

  • Paper ID: 2510.11693
  • Title: Scaling Language-Centric Omnimodal Representation Learning
  • Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong (DAMO Academy, Alibaba Group)
  • Classification: cs.CL cs.AI cs.CV
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.11693
  • Code Link: https://github.com/LCO-Embedding/LCO-Embedding

Abstract

This paper investigates the fundamental reasons underlying the superiority of embedding methods based on multimodal large language models (MLLMs), discovering that their key advantage stems from implicit cross-modal alignment achieved during generative pretraining. The authors propose a language-centric omnimodal embedding framework (LCO-EMB) and discover the Generation-Representation Scaling Law (GRSL), which demonstrates a positive correlation between representational capacity acquired through contrastive learning and the generative capability of MLLMs. This work achieves state-of-the-art performance across multiple benchmarks and provides theoretical explanations.

Research Background and Motivation

Problem Background

Traditional cross-modal representation alignment primarily relies on large-scale contrastive learning, such as CLIP-style models. However, these methods exhibit performance plateaus on complex tasks, particularly those requiring deep cross-modal understanding, such as multilingual image retrieval, vision-language representation, and interleaved multimodal encoding.

Research Motivation

  1. Performance Bottleneck: CLIP-style models have plateaued even as model size, dataset volume, and batch size are scaled up
  2. Theoretical Gap: While MLLM-based embedding methods demonstrate superior performance, the fundamental reasons for their advantages remain unexplored
  3. Efficiency Concerns: Traditional contrastive learning requires extensive cross-modal paired data with high computational costs

Key Insights

The authors discover that MLLMs achieve implicit cross-modal alignment during generative pretraining, where the language decoder learns to leverage multimodal signals in a shared representation space to generate unimodal outputs.

Core Contributions

  1. Theoretical Discovery: Through anisotropy and kernel similarity structure analysis, empirically confirming the existence of latent cross-modal alignment in MLLM representations
  2. Methodological Innovation: Proposing the language-centric omnimodal embedding framework (LCO-EMB), with contrastive learning serving as a lightweight refinement stage
  3. Scaling Law: Discovering the Generation-Representation Scaling Law (GRSL), establishing a positive correlation between generative and representational capacity
  4. Theoretical Support: Providing theoretical justification for GRSL through PAC-Bayesian generalization bounds
  5. Experimental Validation: Achieving SOTA performance across multiple benchmarks and validating theory on low-resource visual document retrieval tasks

Methodology Details

Latent Cross-Modal Alignment Analysis

Anisotropy Analysis

The authors use anisotropy, the expected cosine similarity between pairs of embeddings, to measure the degeneracy of the embedding space:

$$\text{Anisotropy} := E_{h_i,h_j \sim D}[\cos(\theta_{ij})] = E_{h_i,h_j \sim D}\left[\frac{h_i^T h_j}{\|h_i\| \|h_j\|}\right]$$

Experiments reveal that the anisotropy of non-text modalities decreases (their embeddings become more isotropic) even when contrastive learning uses text alone, which points to latent cross-modal alignment in MLLMs.
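The anisotropy definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the toy vectors are made up, and the expectation is estimated as the mean over all distinct pairs in a small sample.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy data (hypothetical): a narrow cone versus a spread-out set.
narrow = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(round(anisotropy(narrow), 3))  # close to 1: highly anisotropic
print(round(anisotropy(spread), 3))  # much lower: directions are spread out
```

A value near 1 means the embeddings crowd into a narrow cone, while lower values indicate a more isotropic space.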

Kernel Similarity Analysis

The authors use mutual k-nearest neighbors (mutual kNN) to quantify how much the similarity structures of different modalities overlap:

$$m_{NN}(\phi_i, \psi_i) = \frac{1}{k}|S(\phi_i) \cap S(\psi_i)|$$

where $S(\phi_i)$ and $S(\psi_i)$ are the k-nearest neighbor sets of features $\phi_i$ and $\psi_i$, respectively.
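As a rough sketch of the mutual kNN measure, the snippet below computes the per-item overlap and averages it over all items. The Euclidean distance, the averaging step, the function names, and the toy features are assumptions made for illustration; the summary does not specify them.

```python
import math

def knn_indices(features, i, k):
    """Indices of the k nearest neighbours of item i (Euclidean), excluding i."""
    dists = [
        (math.dist(features[i], features[j]), j)
        for j in range(len(features)) if j != i
    ]
    return {j for _, j in sorted(dists)[:k]}

def mutual_knn(phi, psi, k):
    """Average over items of |S(phi_i) & S(psi_i)| / k."""
    n = len(phi)
    overlaps = [
        len(knn_indices(phi, i, k) & knn_indices(psi, i, k)) / k
        for i in range(n)
    ]
    return sum(overlaps) / n

# Toy features (hypothetical): psi_scaled keeps the neighbourhood structure
# of phi, psi_shuffled assigns the same values to different items.
phi = [[0.0], [1.0], [2.5], [10.0], [11.0]]
psi_scaled = [[0.0], [2.0], [5.0], [20.0], [22.0]]
psi_shuffled = [[0.0], [10.0], [1.0], [11.0], [2.5]]

print(mutual_knn(phi, psi_scaled, k=1))
print(mutual_knn(phi, psi_shuffled, k=1))
```

A score near 1 means the two feature spaces agree on which items are neighbours, which is the sense in which two modalities share a similarity structure.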

LCO-EMB Framework

Architecture Design

LCO-EMB is built upon a standard MLLM architecture:

  • Modality-Specific Encoders: Processing different modality inputs
  • Projectors: Aligning modality-specific representations to decoder embedding space
  • Language Decoder: LLM as the core component

Training Strategy

  1. Text-Only Variant: Fine-tuning language decoder exclusively with LoRA, freezing other parameters
  2. Multimodal Variant: Incorporating limited multimodal paired data on top of text training
  3. Parameter Efficiency: Using LoRA to maintain minimal perturbation to pretrained models
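The summary does not spell out the contrastive objective used for refinement. A common choice for this kind of training is an InfoNCE loss with in-batch negatives, sketched below under that assumption; the temperature value, the function name, and the toy similarity matrices are hypothetical.

```python
import math

def info_nce(sim_matrix, temperature=0.05):
    """Mean cross-entropy where row i's positive example is column i."""
    n = len(sim_matrix)
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

# Toy similarity matrices (hypothetical) for a batch of 4 query/target pairs.
uninformative = [[0.0] * 4 for _ in range(4)]
aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(info_nce(uninformative))  # equals log(4): no better than guessing
print(info_nce(aligned))        # near zero: positives clearly separated
```

With uninformative similarities the loss equals log N for a batch of N candidates, and it falls toward zero as each query scores its own positive above the in-batch negatives.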

Data Configuration

  • all-NLI: Combining MNLI and SNLI, approximately 276k triplets
  • Scale-1M: 1M sentence pairs sampled from 20M multilingual parallel corpora
  • Multimodal Data: Approximately 94k synthetic multimodal samples

Generation-Representation Scaling Law (GRSL)

Theoretical Framework

The quality of the generative prior is defined as:

$$I_P(X;Y) := I_{\theta_0}(X;Y) \approx H(Y) - L_g(P)$$

where $L_g(P)$ is the generative loss and $H(Y)$ is the entropy of the target data.

Main Theorem

Theorem 1: Under Assumption 1, with probability at least $1-\delta$, the expected population contrastive risk is bounded by:

$$E_{\theta \sim Q}[L_{pop}^c(\theta)] \leq \log N - I_P(X;Y) + \epsilon_P + \sqrt{\frac{KL(Q\|P) + \log(1/\delta)}{2n}}$$

This bound ties representation quality to generative quality: a stronger generative prior (larger I_P(X;Y), i.e., lower generative loss) lowers the upper bound on the contrastive risk.
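To make the roles of the terms concrete, the right-hand side of the bound can be evaluated numerically. The sketch below uses made-up values for every quantity (entropy, generative loss, KL term, sample count); it only illustrates how the bound moves when the generative loss changes and says nothing about the magnitudes in the paper.

```python
import math

def generative_prior_quality(entropy_y, gen_loss):
    """I_P(X;Y) approximated as H(Y) - L_g(P)."""
    return entropy_y - gen_loss

def grsl_bound(n_candidates, mutual_info, eps_prior, kl_q_p, delta, n_samples):
    """Right-hand side of the Theorem 1 bound, in nats."""
    complexity = math.sqrt((kl_q_p + math.log(1.0 / delta)) / (2.0 * n_samples))
    return math.log(n_candidates) - mutual_info + eps_prior + complexity

# Hypothetical numbers: two priors that differ only in generative loss.
shared = dict(n_candidates=768, eps_prior=0.1, kl_q_p=10.0,
              delta=0.05, n_samples=276_000)
weak = grsl_bound(mutual_info=generative_prior_quality(3.0, 2.0), **shared)
strong = grsl_bound(mutual_info=generative_prior_quality(3.0, 1.5), **shared)

print(weak, strong)
print(weak - strong)  # the bound tightens by the gain in I_P(X;Y)
```

With everything else held fixed, lowering the generative loss by some amount tightens the bound by that same amount, which is the mechanism behind GRSL as stated in the theorem.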

Experimental Setup

Datasets

  • MIEB-Lite: 51 tasks covering 8 categories of image-text embedding evaluation
  • Audio-Text: AudioCaps and Clotho datasets
  • Video-Text: MSR-VTT and ActivityNet datasets
  • SeaDoc: Newly constructed low-resource Southeast Asian language visual document retrieval benchmark

Model Configuration

  • Backbone Models: LLaVA-Next, Qwen2.5-VL, Qwen2.5-Omni
  • Optimizer: AdamW with cosine learning rate scheduling
  • LoRA Settings: rank=64, α=16 (text)/128 (multimodal)
  • Batch Size: 768 (adjustable based on dataset proportions)
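For readers unfamiliar with how the LoRA rank and α interact, the sketch below assumes the standard LoRA convention in which the low-rank update B·A is scaled by α/r before being added to the frozen weight. The helper names and the tiny matrices are hypothetical; only the rank and α values come from the settings listed above.

```python
def lora_scaling(alpha, rank):
    """Scaling factor applied to the low-rank update (standard LoRA convention)."""
    return alpha / rank

def matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, lora_b, lora_a, alpha):
    """W + (alpha / r) * B @ A, with r inferred from the number of rows of A."""
    rank = len(lora_a)
    scale = lora_scaling(alpha, rank)
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

print(lora_scaling(16, 64))   # text-only setting
print(lora_scaling(128, 64))  # multimodal setting

# Toy rank-1 example (hypothetical): with B initialised to zero, the
# effective weight equals the frozen weight.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]
lora_b_zero = [[0.0], [0.0]]
print(lora_effective_weight(w, lora_b_zero, lora_a, alpha=16) == w)
```

Under this convention the text-only setting applies a smaller multiplier to the learned update than the multimodal setting, which fits the stated goal of minimal perturbation to the pretrained model.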

Evaluation Metrics

  • Retrieval Tasks: nDCG@5/10, Recall@1
  • Classification Tasks: Accuracy
  • Similarity Tasks: Spearman correlation coefficient
  • Clustering Tasks: Normalized Mutual Information (NMI)
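The retrieval metrics can be illustrated with a short sketch. It assumes the linear-gain form of nDCG, assumes that the ranked list contains every relevant document, and treats Recall@1 as a per-query hit on the top result; benchmark implementations may differ in these details, and the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def recall_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked item is relevant, else 0.0 (for a single query)."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0

# Toy rankings (hypothetical): relevance grades listed in ranked order.
print(ndcg_at_k([1, 0, 0], k=5))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0], k=5))  # relevant document ranked second
print(recall_at_1(["d2", "d1"], {"d1"}))
```

Per-query scores like these are typically averaged over all queries in a task to produce the reported number.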

Experimental Results

Main Results

MIEB-Lite Benchmark

LCO-EMB achieves significant performance improvements on the MIEB-Lite benchmark with 51 tasks:

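| Model               | Dataset Scale | Avg Performance (47 tasks) | Avg Performance (51 tasks) |
|---------------------|---------------|----------------------------|----------------------------|
| CLIP-ViT-bigG       | 2B            | 56.5                       | 51.3                       |
| SigLIP-so400m       | 9B            | 57.3                       | 53.5                       |
| Voyage Multimodal 3 | -             | 57.7                       | 58.1                       |
| mmE5 (11B)          | 2.1M          | 57.7                       | 61.8                       |
| GME (7B)            | 8.0M          | 63.4                       | 64.5                       |
| LCO-EMB-VL (7B)     | 370k          | 66.2                       | 67.6                       |
| LCO-EMB-Omni (7B)   | 370k          | 67.6                       | 68.8                       |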

Key Findings

  1. Data Efficiency: LCO-EMB achieves SOTA performance using only ~0.37M training pairs (21× fewer than GME)
  2. Cross-Modal Generalization: Text-only variants outperform advanced baselines on multimodal tasks
  3. Consistent Improvement: Excellent performance across all task categories, particularly in multilingual alignment, compositionality, and document understanding

Ablation Studies

Training Strategy Comparison

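| Training Strategy | Training Time | Multilingual Image Retrieval | Visual STS | Document Understanding | Linear Probing | Average |
|-------------------|---------------|------------------------------|------------|------------------------|----------------|---------|
| CLIP-style CL     | ~550 hours    | 18.24                        | 73.92      | 44.89                  | 38.93          | 50.02   |
| Linear Projection | ~8.8 hours    | 40.29                        | 72.05      | 35.69                  | 52.96          | 56.22   |
| Full Fine-tuning  | ~17.3 hours   | 44.05                        | 83.15      | 58.02                  | 53.34          | 66.49   |
| LoRA              | ~9.3 hours    | 56.64                        | 85.05      | 67.49                  | 53.91          | 71.98   |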

Dataset Impact

  • all-NLI Training: Excels in visual STS and document understanding
  • Scale-1M Training: Leads in linear probing and multilingual image retrieval
  • Model Fusion: Combining the strengths of both training sets yields the best overall performance

Generation-Representation Scaling Law Verification

Cross-Modal Verification

A positive correlation between generative and representational capacity is observed across OCR-related, video-text, and audio-text task categories:

  • OCR Tasks: Generative performance 65-80, representational performance 66-74
  • Video-Text: Generative performance 66-72, retrieval performance 38-46
  • Audio-Text: Generative performance 65-71, retrieval performance 23.6-24.3

SeaDoc Verification

On low-resource Southeast Asian language visual document retrieval:

  • Baseline model: nDCG@10 = 24.2
  • After continued generative training: nDCG@10 = 35.8 (+11.6 points, roughly a 48% relative improvement)

Related Work

Omnimodal Representation Learning

Existing methods, such as ImageBind, primarily rely on training modality-specific encoders with large-scale cross-modal paired data. This work explores a different paradigm that leverages the latent alignment already present in MLLMs.

Modality-Centric Representation Learning

  • Vision-Centric: DINOv2 achieves CLIP-comparable OCR performance through data scaling
  • Language-Centric: E5-V leverages pure text learning to generalize to image and compositional retrieval tasks

Representational Capacity Research

MIEB benchmarks show CLIP performance plateaus, making MLLM-based embedding models promising alternatives.

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contribution: Discovering and validating implicit cross-modal alignment in MLLMs
  2. Methodological Innovation: Proposing an efficient language-centric omnimodal embedding framework
  3. Scaling Law: Establishing theoretical connections between generative and representational capacity
  4. Practical Application: Achieving SOTA across multiple benchmarks, validating method effectiveness

Limitations

  1. Computational Cost: While more efficient than traditional methods, still requires MLLMs as backbone networks
  2. Joint Training: Joint training of generative and contrastive losses unexplored due to computational constraints
  3. Theoretical Assumptions: GRSL theoretical analysis relies on specific assumptions requiring broader validation

Future Directions

  1. Joint Optimization: Exploring joint training strategies for generative and contrastive losses
  2. Theoretical Extension: Further refining the GRSL theoretical framework
  3. Application Expansion: Extending methods to additional modalities and task scenarios

In-Depth Evaluation

Strengths

  1. Theoretical Depth: Provides deep understanding of MLLM embedding method superiority
  2. Methodological Innovation: Language-centric training paradigm demonstrates strong novelty
  3. Comprehensive Experiments: Extensive experimental validation across multiple modalities and benchmarks
  4. Theoretical Support: PAC-Bayesian framework provides rigorous theoretical foundation for GRSL
  5. Practical Value: Significant data efficiency improvements with important practical implications

Weaknesses

  1. Assumption Dependency: Theoretical analysis relies on specific assumption conditions
  2. Computational Resources: Still requires large-scale MLLMs as foundation, demanding high computational resources
  3. Generalization Capacity: Limited improvements on certain traditionally strong tasks (clustering, linear probing)

Impact

  1. Academic Contribution: Provides new theoretical perspectives for multimodal representation learning
  2. Practical Value: Significantly improves training efficiency and reduces data requirements
  3. Reproducibility: Complete code and resources provided for easy reproduction and extension

Applicable Scenarios

  1. Resource-Constrained Environments: Suitable for scenarios with limited data or computational resources
  2. Multilingual Applications: Outstanding performance on multilingual multimodal tasks
  3. Document Understanding: Significant advantages in visual document understanding tasks

References

The paper cites 85 relevant references spanning multiple research domains including multimodal learning, contrastive learning, and large language models, providing solid theoretical foundations for the research.


Summary: Through in-depth analysis of latent cross-modal alignment capabilities in MLLMs, this paper proposes an efficient language-centric omnimodal embedding framework and discovers the theoretically significant Generation-Representation Scaling Law. The work not only achieves excellent performance across multiple benchmarks but, more importantly, provides new theoretical insights and practical paradigms for multimodal representation learning.