Sparsely Multimodal Data Fusion

Bjorgaard

Multimodal data fusion is essential for applications requiring the integration of diverse data sources, especially in the presence of incomplete or sparsely available modalities. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), to evaluate their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro across ranking, recall, regression, and classification tasks and outperforms EAO across regression and classification tasks. MCA achieves superior performance by maintaining robust uniformity across unimodal and fusion embeddings. While EAO performs best in ranking metrics due to its approach of forming fusion embeddings post-inference, it underperforms in downstream tasks requiring multimodal interactions. These results highlight the importance of contrasting all modality combinations in constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.

Basic Information

Paper ID: 2403.20280
Title: Sparsely Multimodal Data Fusion
Author: Josiah A. Bjorgaard (Syntensor, Inc.)
Classification: cs.LG cs.AI
Publication Date: March 2024 (arXiv v2: January 2025)
Paper Link: https://arxiv.org/abs/2403.20280

Abstract

This paper investigates the problem of sparsely multimodal data fusion and proposes the Modal Channel Attention (MCA) method, conducting systematic comparisons with two existing approaches: Zorro and Everything at Once (EAO). MCA achieves flexible and efficient data fusion by creating fused embeddings for all modality combinations and using attention masks to create distinct attention channels. Experiments on two four-modal datasets, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro on ranking, recall, regression, and classification tasks, and surpasses EAO on regression and classification tasks.

Research Background and Motivation

Problem Definition

With the development of multimodal deep learning, real-world applications frequently face the challenge of modal-incomplete data. When datasets contain three or more modalities, samples with missing modalities become more prevalent, forming sparsely multimodal datasets.

Research Significance

Practical Demand: Multi-sensor fusion, bioinformatics, home monitoring systems, and other domains frequently encounter data with missing modalities
Technical Challenges: Existing multimodal fusion models often fail to effectively handle samples with incomplete modalities
Application Value: Improving model robustness and practicality in real-world scenarios

Limitations of Existing Methods

FLAVA and similar methods can handle missing modalities but cannot generate multimodal fused embedding spaces
EAO requires multiple forward passes, resulting in low computational efficiency
Zorro uses only a single fusion channel, failing to fully exploit information from different modality combinations

Core Contributions

Proposed MCA Method: Introduces a modal channel attention mechanism to create fused embeddings for all possible modality combinations
Systematic Comparative Study: Comprehensively evaluates MCA, Zorro, and EAO on sparsely multimodal data
Performance Improvement: MCA outperforms existing methods on most tasks, particularly excelling in downstream tasks
Theoretical Insights: Reveals the importance of contrasting all modality combinations in constructing embedding spaces

Methodology Details

Task Definition

Input: Dataset containing 4 modalities with varying degrees of modal sparsity (0-0.8)
Output: Unified fused embedding space supporting retrieval and downstream tasks
Constraints: Handle incomplete modality samples while maintaining computational efficiency

Model Architecture

MCA Core Design

Fused Embedding Generation: Creates fused embeddings for all possible modality combinations (as shown in Figure 3a)
Modal Channel Attention Masks: Uses block attention masks to create distinct attention channels (as shown in Figure 3b)
Single Forward Pass: Processes all modality combinations in a single forward pass

Attention Mask Design

For four-modal datasets, MCA creates 11 attention channels:

4 unimodal channels: (1), (2), (3), (4)
6 bimodal channels: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
1 full-modal channel: (1,2,3,4)
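
The channel list above can be enumerated and turned into a block attention mask. The following is a minimal sketch, not the paper's implementation: the token layout (modality tokens first, then one fusion token per channel), the rule that input tokens attend only within their own modality, and the names `make_channels` and `build_channel_mask` are all assumptions made for illustration.

```python
from itertools import combinations


def make_channels(num_modalities=4):
    """Channels as listed above: unimodal, bimodal, and one full-modal channel."""
    mods = range(1, num_modalities + 1)
    channels = [(m,) for m in mods]
    channels += list(combinations(mods, 2))
    if num_modalities > 2:  # avoid duplicating the bimodal channel when there are only 2 modalities
        channels.append(tuple(mods))
    return channels


def build_channel_mask(tokens_per_modality, channels):
    """Boolean mask[q][k]: True means query position q may attend to key position k.

    Assumed layout (hypothetical): all modality tokens first (modality 1, then 2, ...),
    followed by one fusion token per channel.
    """
    # owner[i] is the modality id of input token i
    owner = [m for m, n in enumerate(tokens_per_modality, start=1) for _ in range(n)]
    n_in = len(owner)
    size = n_in + len(channels)
    mask = [[False] * size for _ in range(size)]
    # Assumption: input tokens attend only to tokens of their own modality.
    for q in range(n_in):
        for k in range(n_in):
            mask[q][k] = owner[q] == owner[k]
    # Each fusion token attends to itself and to the tokens of the modalities in its channel.
    for c, channel in enumerate(channels):
        q = n_in + c
        mask[q][q] = True
        for k in range(n_in):
            mask[q][k] = owner[k] in channel
    return mask


channels = make_channels(4)
mask = build_channel_mask([2, 2, 2, 2], channels)  # 2 tokens per modality, 11 fusion tokens
```

Because every fusion token sits in the same sequence and differs only in which keys it may see, one pass through the transformer yields an embedding for each channel.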

Loss Function Strategy

Employs sample and loss masking strategies:

Missing modalities are replaced with padding tokens
Loss is computed for corresponding fused tokens as long as at least one modality is present
Uses Noise Contrastive Estimation (NCE) loss
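
The masking rule above can be illustrated with a small contrastive loss that skips samples whose channel has no modality present. This is a sketch under stated assumptions: it uses the common InfoNCE form of an NCE loss with in-batch negatives, and the temperature value, the `present` dictionary, and the function names are hypothetical rather than taken from the paper.

```python
import math


def channel_valid(present, channel):
    """A channel's fused token counts toward the loss if any of its modalities is present."""
    return any(present[m] for m in channel)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def masked_info_nce(anchors, targets, valid, temperature=0.1):
    """InfoNCE over a batch, skipping samples whose `valid` flag is False.

    anchors[i] and targets[i] are embeddings of the same sample (the positive pair);
    the other valid targets in the batch act as negatives.
    """
    idx = [i for i, ok in enumerate(valid) if ok]
    if not idx:
        return 0.0  # nothing to contrast in this batch
    total = 0.0
    for i in idx:
        logits = [dot(anchors[i], targets[j]) / temperature for j in idx]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - dot(anchors[i], targets[i]) / temperature
    return total / len(idx)


# Sample with only modality 1 present: channel (1, 2) is kept, channel (3, 4) is masked out.
present = {1: True, 2: False, 3: False, 4: False}
keep_12 = channel_valid(present, (1, 2))
keep_34 = channel_valid(present, (3, 4))
```

Masked samples are removed from both the positives and the negatives, so padding-only embeddings never enter the softmax denominator.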

Technical Innovations

Multi-Channel Fusion: Unlike Zorro's single channel, MCA supports fusion of all modality combinations
Computational Efficiency: Requires only one forward pass compared to EAO's multiple passes
Flexibility: Can handle missing modalities in arbitrary combinations
Unified Framework: Enables fair comparison of all three methods within a single framework

Experimental Setup

Datasets

CMU-MOSEI

Scale: 23,248 samples, 2,324 test samples
Modalities: 4 preprocessed modalities (GloVe vectors, OpenFace, COVAREP, FACET encoders)
Task: Sentiment analysis regression (0-1 range)
Preprocessing: Linear layer transformation + layer normalization + positional embeddings

TCGA (The Cancer Genome Atlas)

Scale: 7,017 samples, 707 test samples
Modalities: Gene expression (800 genes), protein arrays (198 proteins), DNA methylation (800 sites), miRNA (662)
Task: 32-class cancer type classification
Preprocessing: 2-layer MLP encoding + learnable embeddings

Modal sparsity $S$ is defined as

$S = \frac{1}{N_S}\sum_{i=1}^{N_S} \frac{M_i}{M_T}$

where $N_S$ is the number of samples, $M_i$ is the number of modalities for sample $i$, and $M_T$ is the total number of modalities. Experiments set $S$ = 0, 0.2, 0.4, 0.6, 0.8.

Evaluation Metrics

Embedding Quality Metrics

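Alignment: $L_a = \mathbb{E}_{x,y}\left[\lVert f(x)-f(y)\rVert_2^2\right]$, taken over positive pairs $(x, y)$
Uniformity: $L_u = \mathbb{E}_{x,y}\left[e^{-2\lVert f(x)-f(y)\rVert_2^2}\right]$, taken over all pairs of samples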
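
Both quantities are simple to compute from a set of embeddings. The sketch below follows the two formulas as written in this summary; note that Wang & Isola (2020) report the logarithm of the uniformity expectation, which this version omits to match the text. The function names and the choice to average over all distinct pairs are assumptions for illustration.

```python
import math
from itertools import combinations


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(positive_pairs):
    """Mean squared distance over positive pairs (x, y); lower means better aligned."""
    return sum(sq_dist(x, y) for x, y in positive_pairs) / len(positive_pairs)


def uniformity(embeddings, t=2.0):
    """Mean Gaussian potential exp(-t * ||x - y||^2) over all distinct pairs.

    Lower values mean the embeddings are spread more evenly.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(math.exp(-t * sq_dist(x, y)) for x, y in pairs) / len(pairs)


spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clumped = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
u_spread = uniformity(spread)
u_clumped = uniformity(clumped)  # identical points give the maximum value of 1.0
```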

Retrieval Task Metrics

Median Ranking: Median rank of correct matches
Recall Rate: R@1, R@5, R@10
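
Given a query-by-candidate similarity matrix, both retrieval metrics follow from the rank of the correct match. This is a minimal sketch assuming the correct candidate for query i sits at index i and that ties are resolved in the query's favor; the names `retrieval_ranks` and `recall_at_k` are illustrative.

```python
from statistics import median


def retrieval_ranks(similarity):
    """similarity[i][j] is the score of query i against candidate j; the correct match is j == i.

    Returns the 1-based rank of the correct candidate for every query.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scoring strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
ranks = retrieval_ranks(sim)
median_rank = median(ranks)
r_at_1 = recall_at_k(ranks, 1)
```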

Downstream Task Metrics

Regression: Correlation coefficient (CMU-MOSEI)
Classification: Average AUPR (TCGA)

Implementation Details

Model Parameters: Hidden size 512, 8 attention heads, 4× feedforward multiplier
Training Setup: Batch size 32, learning rate 1e-4, cosine scheduling
Hardware: MCA/Zorro use 4×A10G GPUs (17GB), EAO uses 4×A100 GPUs (41GB)
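
For reference, a cosine learning-rate schedule starting from the listed rate of 1e-4 might look like the sketch below. The summary does not state the warmup, the final learning rate, or the number of training steps, so the absence of warmup, the decay to zero, and the `total_steps` value are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length of 1000 steps.
schedule = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
```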

Experimental Results

Main Results

Embedding Quality Analysis (Figure 4)

Uniformity: MCA maintains superior fused embedding uniformity in most cases
Alignment: EAO exhibits the best alignment but inferior uniformity
Sparsity Impact: All methods show degraded uniformity when modal sparsity exceeds 0.4

Ranking and Recall Performance (Figure 5)

EAO Optimal: Performs best on ranking metrics, benefiting from its post-inference fusion strategy
MCA Outperforms Zorro: MCA's median ranking and recall rates exceed Zorro in most cases
Dataset Differences: Differences are more pronounced on the larger CMU-MOSEI dataset

Downstream Task Performance (Figure 6)

Regression Task: MCA reaches a baseline of 0.54 on CMU-MOSEI sentiment analysis, outperforming Zorro and EAO
Classification Task: MCA performs best on TCGA cancer classification
Sparsity Robustness: MCA maintains relatively stable performance under high sparsity

Key Findings

Uniformity vs. Alignment Trade-off: Better uniformity benefits downstream tasks, while better alignment benefits retrieval tasks
Multi-Channel Advantages: Contrasting all modality combinations significantly improves embedding quality
Computational Efficiency: MCA needs a single forward pass, substantially reducing computational cost relative to EAO's multiple passes while maintaining performance

Related Work

Non-Contrastive Learning Methods

Interleaved Data Methods: Such as Flamingo, using autoregressive or masked language objectives
Late Fusion Masking: Handling incomplete modalities through masked representations

Contrastive Learning Methods

FLAVA: Multi-loss model but cannot generate fused embedding spaces
LoReTTa: Predicts a third modality; requires bimodal pairs

Pure Contrastive Learning Methods

EAO: Multiple forward passes with combined contrastive losses
Zorro: Block attention masks with single forward pass

Conclusions and Discussion

Main Conclusions

MCA Effectiveness: MCA achieves the best overall performance on sparsely multimodal data
Task Specificity: Different methods have respective advantages on different task types
Design Importance: Contrasting all modality combinations is crucial for constructing robust embedding spaces

Limitations

Computational Complexity: While more efficient than EAO, still more complex than single-channel methods
Hyperparameter Sensitivity: Requires careful tuning of attention channel numbers
Dataset Scale: Advantages are less pronounced on smaller datasets

Future Directions

Adaptive Channel Selection: Dynamically adjust attention channels based on data characteristics
Extended Modalities: Validate performance on more modalities (>4)
Theoretical Analysis: Deepen understanding of the theoretical relationship between uniformity and alignment

In-Depth Evaluation

Strengths

Problem Importance: Addresses a critical problem in real-world applications
Method Innovation: Cleverly combines advantages of EAO and Zorro
Experimental Comprehensiveness: Systematic comparative experiments and ablation studies
Theoretical Insights: Provides valuable embedding quality analysis

Weaknesses

Dataset Limitations: Validation on only two datasets; generalization remains to be verified
Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
Incomplete Computational Analysis: Lacks detailed analysis of computational complexity for different methods

Impact

Academic Contribution: Provides new solutions for sparsely multimodal learning
Practical Value: Directly applicable to multi-sensor fusion, medical informatics, and other domains
Reproducibility: Provides detailed implementation details and hyperparameter settings

Applicable Scenarios

Multi-Sensor Systems: IoT devices, robotic perception
Medical Informatics: Multi-omics data fusion
Multimedia Retrieval: Content retrieval with incomplete modalities
Industrial Monitoring: Multi-source data fusion analysis

References

The paper cites multiple important multimodal learning works, including:

CLIP (Radford et al., 2021): Foundational work in multimodal contrastive learning
EAO (Shvetsova et al., 2022): Important method for multimodal retrieval
Zorro (Recasens et al., 2023): Masked multimodal Transformer
Wang & Isola (2020): Theoretical work on alignment and uniformity in contrastive learning

This paper makes important contributions to the field of sparsely multimodal data fusion. The proposed MCA method significantly improves performance while maintaining computational efficiency, providing an effective solution for handling incomplete multimodal data in real-world applications.