Scaling Language-Centric Omnimodal Representation Learning

Xiao, Chan, Zhang et al.

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

Basic Information

Paper ID: 2510.11693
Title: Scaling Language-Centric Omnimodal Representation Learning
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong (DAMO Academy, Alibaba Group)
Classification: cs.CL cs.AI cs.CV
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2510.11693
Code Link: https://github.com/LCO-Embedding/LCO-Embedding

Abstract

This paper investigates the fundamental reasons underlying the superiority of embedding methods based on multimodal large language models (MLLMs), discovering that their key advantage stems from implicit cross-modal alignment achieved during generative pretraining. The authors propose a language-centric omnimodal embedding framework (LCO-EMB) and discover the Generation-Representation Scaling Law (GRSL), which demonstrates a positive correlation between representational capacity acquired through contrastive learning and the generative capability of MLLMs. This work achieves state-of-the-art performance across multiple benchmarks and provides theoretical explanations.

Research Background and Motivation

Problem Background

Traditional cross-modal representation alignment primarily relies on large-scale contrastive learning, such as CLIP-style models. However, these methods exhibit performance plateaus on complex tasks, particularly those requiring deep cross-modal understanding, such as multilingual image retrieval, vision-language representation, and interleaved multimodal encoding.

Research Motivation

Performance Bottleneck: CLIP-style models have reached performance plateaus through scaling model size, dataset volume, and batch size
Theoretical Gap: While MLLM-based embedding methods demonstrate superior performance, the fundamental reasons for their advantages remain unexplored
Efficiency Concerns: Traditional contrastive learning requires extensive cross-modal paired data with high computational costs

Key Insights

The authors discover that MLLMs achieve implicit cross-modal alignment during generative pretraining, where the language decoder learns to leverage multimodal signals in a shared representation space to generate unimodal outputs.

Core Contributions

Theoretical Discovery: Through anisotropy and kernel similarity structure analysis, empirically confirming the existence of latent cross-modal alignment in MLLM representations
Methodological Innovation: Proposing the language-centric omnimodal embedding framework (LCO-EMB), with contrastive learning serving as a lightweight refinement stage
Scaling Law: Discovering the Generation-Representation Scaling Law (GRSL), establishing a positive correlation between generative and representational capacity
Theoretical Support: Providing theoretical justification for GRSL through PAC-Bayesian generalization bounds
Experimental Validation: Achieving SOTA performance across multiple benchmarks and validating theory on low-resource visual document retrieval tasks

Methodology Details

Anisotropy Analysis

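The authors use anisotropy, the expected cosine similarity between pairs of embeddings, to measure how degenerate (collapsed toward a common direction) the embedding space is: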

$\text{Anisotropy} := E_{h_i,h_j \sim D}[\cos(\theta_{ij})] = E_{h_i,h_j \sim D}\left[\frac{h_i^T h_j}{\|h_i\| \|h_j\|}\right]$

Experiments reveal that the anisotropy of non-text modalities also decreases (their embeddings become more isotropic) after text-only contrastive learning, which the authors take as evidence of latent cross-modal alignment in MLLMs.
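
The anisotropy measure above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: it assumes the expectation is estimated as the mean cosine similarity over all distinct pairs in a sample, and the two toy embedding sets are made up for the example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy, made-up embeddings: one set collapsed around a single direction,
# one set spread evenly around the origin.
collapsed = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print("collapsed:", anisotropy(collapsed))
print("spread:", anisotropy(spread))
```

A value close to 1 indicates a degenerate space in which most vectors point the same way; values near or below 0 indicate a more isotropic space.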

Kernel Similarity Analysis

The authors use mutual k-nearest neighbors (mutual kNN) to quantify how much the similarity structures of two modalities overlap:

$m_{NN}(\phi_i, \psi_i) = \frac{1}{k}|S(\phi_i) \cap S(\psi_i)|$

where $S(\phi_i)$ and $S(\psi_i)$ are the k-nearest neighbor sets of features $\phi_i$ and $\psi_i$, respectively.
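
As a rough sketch of this metric, the snippet below computes the per-item overlap and averages it over a dataset. It assumes Euclidean distance for the neighbor search and averaging over items, neither of which is specified in this summary; the function names and toy features are hypothetical.

```python
import math


def knn_indices(features, i, k):
    """Indices of the k nearest neighbors of item i (Euclidean), excluding i itself."""
    dists = [(math.dist(features[i], f), j) for j, f in enumerate(features) if j != i]
    dists.sort()
    return {j for _, j in dists[:k]}


def mutual_knn(phi, psi, k):
    """Average over items of |S(phi_i) ∩ S(psi_i)| / k."""
    n = len(phi)
    total = 0.0
    for i in range(n):
        total += len(knn_indices(phi, i, k) & knn_indices(psi, i, k)) / k
    return total / n


# Toy features for the same four items in two "modalities" (made-up values).
phi = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
psi_aligned = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]  # same neighbor structure
psi_unrelated = [[0.0], [10.0], [1.0], [11.0]]  # different neighbor structure

print(mutual_knn(phi, psi_aligned, k=1))
print(mutual_knn(phi, psi_unrelated, k=1))
```

The two feature spaces may have different dimensionality, since only the neighbor index sets are compared.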

LCO-EMB Framework

Architecture Design

LCO-EMB is built upon a standard MLLM architecture:

Modality-Specific Encoders: Processing different modality inputs
Projectors: Aligning modality-specific representations to decoder embedding space
Language Decoder: LLM as the core component

Training Strategy

Text-Only Variant: Fine-tuning language decoder exclusively with LoRA, freezing other parameters
Multimodal Variant: Incorporating limited multimodal paired data on top of text training
Parameter Efficiency: Using LoRA to maintain minimal perturbation to pretrained models
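
The summary does not spell out the contrastive objective, so the following is only a sketch under the assumption of a standard InfoNCE loss with in-batch negatives. The temperature value and function name are hypothetical choices for illustration.

```python
import math


def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE: the positive for query i is row i of `positives`;
    all other rows in the batch act as negatives."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    losses = []
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i with every candidate.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Made-up two-item batch: correctly paired vs. swapped positives.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

In the text-only variant, queries and positives would both be text embeddings from the LoRA-tuned decoder; the loss itself is agnostic to modality.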

Data Configuration

all-NLI: Combining MNLI and SNLI, approximately 276k triplets
Scale-1M: 1M sentence pairs sampled from 20M multilingual parallel corpora
Multimodal Data: Approximately 94k synthetic multimodal samples

Generation-Representation Scaling Law (GRSL)

Theoretical Framework

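The quality of the generative prior $P$ is defined as the mutual information between inputs and targets under the pretrained parameters $\theta_0$: $I_P(X;Y) := I_{\theta_0}(X;Y) \approx H(Y) - L_g(P)$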

where $L_g(P)$ is the generative loss and $H(Y)$ is the entropy of target data.
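
A short derivation, offered as a sketch rather than the paper's own proof, shows why $H(Y) - L_g(P)$ is a natural proxy for mutual information. It assumes $L_g(P)$ is the expected negative log-likelihood of $Y$ given $X$ under the pretrained model; $p_{\theta_0}$ (the model's conditional distribution) and $p^*$ (the true conditional distribution) are notation introduced here.

$I(X;Y) = H(Y) - H(Y|X)$

$L_g(P) = E_{x,y}[-\log p_{\theta_0}(y|x)] = H(Y|X) + E_x[KL(p^*(\cdot|x) \| p_{\theta_0}(\cdot|x))] \geq H(Y|X)$

$\Rightarrow H(Y) - L_g(P) \leq I(X;Y)$

The gap is the KL term, so lowering the generative loss moves the proxy closer to the true mutual information.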

Main Theorem

Theorem 1: Under Assumption 1, with probability at least $1-\delta$, the expected population contrastive risk is bounded by:

$E_{\theta \sim Q}[L_{pop}^c(\theta)] \leq \log N - I_P(X;Y) + \epsilon_P + \sqrt{\frac{KL(Q\|P) + \log(1/\delta)}{2n}}$

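Because $I_P(X;Y)$ enters the bound with a negative sign, a stronger generative prior (a lower generative loss $L_g(P)$) lowers the upper bound on the contrastive risk. In this sense, generative capacity sets the ceiling on achievable representational performance.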
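
To make the dependence on generative loss concrete, the right-hand side of the bound can be evaluated numerically. The sketch below is illustrative only: it reads $N$ as the number of candidates per query and $n$ as the number of training samples (both assumptions, since the summary does not define them), and every numeric value is made up.

```python
import math


def contrastive_risk_bound(num_candidates, entropy_y, generative_loss,
                           eps_p, kl_q_p, n, delta):
    """Right-hand side of the Theorem 1 bound, using I_P(X;Y) ≈ H(Y) - L_g(P)."""
    prior_quality = entropy_y - generative_loss
    complexity = math.sqrt((kl_q_p + math.log(1.0 / delta)) / (2.0 * n))
    return math.log(num_candidates) - prior_quality + eps_p + complexity


# Hypothetical settings shared by both models; only the generative loss differs.
shared = dict(num_candidates=1024, entropy_y=5.0, eps_p=0.1,
              kl_q_p=10.0, n=100_000, delta=0.05)
weak_prior = contrastive_risk_bound(generative_loss=3.0, **shared)
strong_prior = contrastive_risk_bound(generative_loss=2.0, **shared)

print("weaker generator:", weak_prior)
print("stronger generator:", strong_prior)
```

With everything else held fixed, each unit of reduction in generative loss lowers the bound by the same amount, which is the qualitative content of GRSL.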

Experimental Setup

Datasets

MIEB-Lite: 51 tasks covering 8 categories of image-text embedding evaluation
Audio-Text: AudioCaps and Clotho datasets
Video-Text: MSR-VTT and ActivityNet datasets
SeaDoc: Newly constructed low-resource Southeast Asian language visual document retrieval benchmark

Model Configuration

Backbone Models: LLaVA-Next, Qwen2.5-VL, Qwen2.5-Omni
Optimizer: AdamW with cosine learning rate scheduling
LoRA Settings: rank=64, α=16 (text)/128 (multimodal)
Batch Size: 768 (adjustable based on dataset proportions)
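
For readers unfamiliar with LoRA, the reported rank and α values control how strongly the low-rank update perturbs the frozen weights. The sketch below assumes the conventional LoRA scaling factor α / rank; whether the authors' implementation uses exactly this scaling is not stated in this summary. The helper names and the tiny matrices are hypothetical.

```python
def lora_scale(rank, alpha):
    """Conventional LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / rank) * B @ A, where W stays frozen and
    A has shape (rank, in_features), B has shape (out_features, rank)."""
    scale = lora_scale(len(a), alpha)
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Scaling factors implied by the reported settings (rank=64).
print(lora_scale(64, 16))   # text-only setting
print(lora_scale(64, 128))  # multimodal setting

# Tiny rank-1 example with made-up numbers.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
print(lora_effective_weight(w, b, a, alpha=2))
```

Under this convention the multimodal setting applies an update eight times larger per unit of B @ A than the text-only setting.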

Evaluation Metrics

Retrieval Tasks: nDCG@5/10, Recall@1
Classification Tasks: Accuracy
Similarity Tasks: Spearman correlation coefficient
Clustering Tasks: Normalized Mutual Information (NMI)
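
Of the metrics listed, nDCG@k is the least self-explanatory, so a minimal sketch follows. It uses the linear-gain formulation and assumes the ranked list passed in contains every relevant document for the query; benchmark toolkits may differ in these details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# Relevance labels of retrieved documents, in ranked order (made-up examples).
print(ndcg_at_k([1, 0, 0], k=10))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0], k=10))  # relevant document ranked second
```

Moving the single relevant document from rank 1 to rank 2 scales the score by 1 / log2(3), which shows how sharply the metric rewards top-ranked hits.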

Experimental Results

Main Results

MIEB-Lite Benchmark

LCO-EMB achieves significant performance improvements on the MIEB-Lite benchmark with 51 tasks:

Model	Dataset Scale	Avg Performance (47 tasks)	Avg Performance (51 tasks)
CLIP-ViT-bigG	2B	56.5	51.3
SigLIP-so400m	9B	57.3	53.5
Voyage Multimodal 3	-	57.7	58.1
mmE5 (11B)	2.1M	57.7	61.8
GME (7B)	8.0M	63.4	64.5
LCO-EMB-VL (7B)	370k	66.2	67.6
LCO-EMB-Omni (7B)	370k	67.6	68.8

Key Findings

Data Efficiency: LCO-EMB achieves SOTA performance using only ~0.37M training pairs (21× fewer than GME)
Cross-Modal Generalization: Text-only variants outperform advanced baselines on multimodal tasks
Consistent Improvement: Excellent performance across all task categories, particularly in multilingual alignment, compositionality, and document understanding

Ablation Studies

Training Strategy Comparison

Training Strategy	Training Time	Multilingual Image Retrieval	Visual STS	Document Understanding	Linear Probing	Average
CLIP-style CL	~550 hours	18.24	73.92	44.89	38.93	50.02
Linear Projection	~8.8 hours	40.29	72.05	35.69	52.96	56.22
Full Fine-tuning	~17.3 hours	44.05	83.15	58.02	53.34	66.49
LoRA	~9.3 hours	56.64	85.05	67.49	53.91	71.98

Dataset Impact

all-NLI Training: Excels in visual STS and document understanding
Scale-1M Training: Leads in linear probing and multilingual image retrieval
Model Fusion: Combining both training data advantages yields optimal overall performance

Generation-Representation Scaling Law Verification

A positive correlation between generative and representational capacity is observed across OCR-related, video-text, and audio-text task categories:

OCR Tasks: Generative performance 65-80, representational performance 66-74
Video-Text: Generative performance 66-72, retrieval performance 38-46
Audio-Text: Generative performance 65-71, retrieval performance 23.6-24.3

SeaDoc Verification

On low-resource Southeast Asian language visual document retrieval:

Baseline model: nDCG@10 = 24.2
After continued generative training: nDCG@10 = 35.8 (+11.6 points, roughly a 48% relative improvement)

Related Work

Omnimodal Representation Learning

Existing methods, such as ImageBind, primarily rely on training modality-specific encoders with large-scale cross-modal paired data. This work explores a different paradigm that leverages the latent alignment already present in MLLMs.

Modality-Centric Representation Learning

Vision-Centric: DINOv2 achieves CLIP-comparable OCR performance through data scaling
Language-Centric: E5-V leverages pure text learning to generalize to image and compositional retrieval tasks

Representational Capacity Research

MIEB benchmarks show CLIP performance plateaus, making MLLM-based embedding models promising alternatives.

Conclusions and Discussion

Main Conclusions

Theoretical Contribution: Discovering and validating implicit cross-modal alignment in MLLMs
Methodological Innovation: Proposing an efficient language-centric omnimodal embedding framework
Scaling Law: Establishing theoretical connections between generative and representational capacity
Practical Application: Achieving SOTA across multiple benchmarks, validating method effectiveness

Limitations

Computational Cost: While more efficient than traditional methods, still requires MLLMs as backbone networks
Joint Training: Joint training with generative and contrastive losses remains unexplored due to computational constraints
Theoretical Assumptions: GRSL theoretical analysis relies on specific assumptions requiring broader validation

Future Directions

Joint Optimization: Exploring joint training strategies for generative and contrastive losses
Theoretical Extension: Further refining the GRSL theoretical framework
Application Expansion: Extending methods to additional modalities and task scenarios

In-Depth Evaluation

Strengths

Theoretical Depth: Provides deep understanding of MLLM embedding method superiority
Methodological Innovation: Language-centric training paradigm demonstrates strong novelty
Comprehensive Experiments: Extensive experimental validation across multiple modalities and benchmarks
Theoretical Support: PAC-Bayesian framework provides rigorous theoretical foundation for GRSL
Practical Value: Significant data efficiency improvements with important practical implications

Weaknesses

Assumption Dependency: Theoretical analysis relies on specific assumption conditions
Computational Resources: Still requires large-scale MLLMs as foundation, demanding high computational resources
Generalization Capacity: Limited improvements on certain traditionally strong tasks (clustering, linear probing)

Impact

Academic Contribution: Provides new theoretical perspectives for multimodal representation learning
Practical Value: Significantly improves training efficiency and reduces data requirements
Reproducibility: Complete code and resources provided for easy reproduction and extension

Applicable Scenarios

Resource-Constrained Environments: Suitable for scenarios with limited data or computational resources
Multilingual Applications: Outstanding performance on multilingual multimodal tasks
Document Understanding: Significant advantages in visual document understanding tasks

References

The paper cites 85 relevant references spanning multiple research domains including multimodal learning, contrastive learning, and large language models, providing solid theoretical foundations for the research.

Summary: Through in-depth analysis of latent cross-modal alignment capabilities in MLLMs, this paper proposes an efficient language-centric omnimodal embedding framework and discovers the theoretically significant Generation-Representation Scaling Law. The work not only achieves excellent performance across multiple benchmarks but, more importantly, provides new theoretical insights and practical paradigms for multimodal representation learning.