Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities emerges as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
This paper investigates why embedding methods based on multimodal large language models (MLLMs) outperform alternatives, and finds that their key advantage stems from implicit cross-modal alignment achieved during generative pretraining. The authors propose a language-centric omnimodal embedding framework (LCO-Emb) and identify the Generation-Representation Scaling Law (GRSL): the representational capacity acquired through contrastive learning correlates positively with the generative capability of the underlying MLLM. The work achieves state-of-the-art performance across multiple benchmarks and provides a theoretical explanation of GRSL.
Traditional cross-modal representation alignment primarily relies on large-scale contrastive learning, such as CLIP-style models. However, these methods exhibit performance plateaus on complex tasks, particularly those requiring deep cross-modal understanding, such as multilingual image retrieval, vision-language representation, and interleaved multimodal encoding.
The authors discover that MLLMs achieve implicit cross-modal alignment during generative pretraining, where the language decoder learns to leverage multimodal signals in a shared representation space to generate unimodal outputs.
Empirical Discovery: Through anisotropy and kernel similarity structure analysis, empirically confirming the existence of latent cross-modal alignment in MLLM representations
Methodological Innovation: Proposing the language-centric omnimodal embedding framework (LCO-Emb), with contrastive learning serving as a lightweight refinement stage
Scaling Law: Discovering the Generation-Representation Scaling Law (GRSL), establishing a positive correlation between generative and representational capacity
Theoretical Support: Providing theoretical justification for GRSL through PAC-Bayesian generalization bounds
Experimental Validation: Achieving SOTA performance across multiple benchmarks and validating theory on low-resource visual document retrieval tasks
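The contrastive refinement stage named in the list above can be illustrated with a standard in-batch InfoNCE objective. This is a minimal sketch of a generic contrastive loss, not the paper's training code: the function name `info_nce_loss`, the temperature value, and the NumPy implementation are assumptions made for illustration.

```python
import numpy as np


def info_nce_loss(queries, positives, temperature=0.05):
    """Generic in-batch InfoNCE loss (illustrative; not the authors' code).

    queries, positives: arrays of shape (batch, dim). Row i of `positives`
    is the positive for row i of `queries`; all other rows act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Similarity matrix scaled by the temperature (an assumed value).
    logits = q @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for row i is column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))
```

In this formulation the loss is low when each query is most similar to its own positive and high when positives are unrelated to their queries, which is the behavior a refinement stage would exploit if the embeddings are already roughly aligned.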
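Experiments show that the anisotropy of non-text modalities improves (their embeddings become less concentrated along a few dominant directions) even when contrastive learning is performed on text only. Because no non-text data is involved in that training, the improvement indicates that latent cross-modal alignment already exists in MLLMs.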
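To make the two diagnostics concrete, the sketch below shows one common way to estimate each. It assumes anisotropy is measured as the mean pairwise cosine similarity of a set of embeddings, and it uses linear CKA as a stand-in for a kernel similarity measure. The paper's exact estimators are not given here, so both function names and both definitions are assumptions.

```python
import numpy as np


def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs (assumed estimator).

    Values near 0 suggest an isotropic space; values near 1 suggest the
    embeddings are squeezed into a narrow cone.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    # Exclude the diagonal (self-similarity is always 1).
    off_diag = sims.sum() - np.trace(sims)
    return float(off_diag / (n * (n - 1)))


def linear_cka(a, b):
    """Linear centered kernel alignment between two sets of representations.

    a, b: arrays of shape (n_samples, dim_a) and (n_samples, dim_b), where
    row i of each describes the same sample. Returns a value in [0, 1].
    """
    a = a - a.mean(axis=0, keepdims=True)
    b = b - b.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(a.T @ b, "fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, "fro")
    norm_b = np.linalg.norm(b.T @ b, "fro")
    return float(cross / (norm_a * norm_b))
```

Under these assumed definitions, one could compute `anisotropy` on image or audio embeddings before and after text-only contrastive learning, and compare the similarity structure of two models' representations on the same inputs with `linear_cka`.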
Data Efficiency: LCO-Emb achieves SOTA performance using only ~0.37M training pairs (21× fewer than GME)
Cross-Modal Generalization: Text-only variants outperform advanced baselines on multimodal tasks
Consistent Improvement: Excellent performance across all task categories, particularly in multilingual alignment, compositionality, and document understanding
Existing methods, such as ImageBind, primarily rely on training modality-specific encoders with large-scale cross-modal paired data. This work instead explores a paradigm that leverages the latent alignment already present in MLLMs.
The paper cites 85 relevant references spanning multiple research domains including multimodal learning, contrastive learning, and large language models, providing solid theoretical foundations for the research.
Summary: Through in-depth analysis of latent cross-modal alignment capabilities in MLLMs, this paper proposes an efficient language-centric omnimodal embedding framework and discovers the theoretically significant Generation-Representation Scaling Law. The work not only achieves excellent performance across multiple benchmarks but, more importantly, provides new theoretical insights and practical paradigms for multimodal representation learning.