Multimodal data fusion is essential for applications requiring the integration of diverse data sources, especially in the presence of incomplete or sparsely available modalities. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), to evaluate their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro across ranking, recall, regression, and classification tasks and outperforms EAO across regression and classification tasks. MCA achieves superior performance by maintaining robust uniformity across unimodal and fusion embeddings. While EAO performs best in ranking metrics due to its approach of forming fusion embeddings post-inference, it underperforms in downstream tasks requiring multimodal interactions. These results highlight the importance of contrasting all modality combinations in constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.
- Paper ID: 2403.20280
- Title: Sparsely Multimodal Data Fusion
- Author: Josiah A. Bjorgaard (Syntensor, Inc.)
- Classification: cs.LG cs.AI
- Publication Date: March 2024 (arXiv v2: January 2025)
- Paper Link: https://arxiv.org/abs/2403.20280
This paper investigates the problem of sparsely multimodal data fusion and proposes the Modal Channel Attention (MCA) method, conducting systematic comparisons with two existing approaches: Zorro and Everything at Once (EAO). MCA achieves flexible and efficient data fusion by creating fused embeddings for all modality combinations and using attention masks to create distinct attention channels. Experiments on two four-modal datasets, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro on ranking, recall, regression, and classification tasks, and surpasses EAO on regression and classification tasks.
As multimodal deep learning matures, real-world applications frequently face data with incomplete modalities. When datasets contain three or more modalities, samples with missing modalities become more prevalent, forming sparsely multimodal datasets.
- Practical Demand: Multi-sensor fusion, bioinformatics, home monitoring systems, and other domains frequently encounter data with missing modalities
- Technical Challenges: Existing multimodal fusion models often fail to effectively handle samples with incomplete modalities
- Application Value: Improving model robustness and practicality in real-world scenarios
- FLAVA and similar methods can handle missing modalities but cannot generate multimodal fused embedding spaces
- EAO requires multiple forward passes, resulting in low computational efficiency
- Zorro uses only a single fusion channel, failing to fully exploit information from different modality combinations
- Proposed MCA Method: Introduces a modal channel attention mechanism to create fused embeddings for all possible modality combinations
- Systematic Comparative Study: Comprehensively evaluates MCA, Zorro, and EAO on sparsely multimodal data
- Performance Improvement: MCA outperforms existing methods on most tasks, particularly excelling in downstream tasks
- Theoretical Insights: Reveals the importance of contrasting all modality combinations in constructing embedding spaces
Input: Dataset containing 4 modalities with varying degrees of modal sparsity (0-0.8)
Output: Unified fused embedding space supporting retrieval and downstream tasks
Constraints: Handle incomplete modality samples while maintaining computational efficiency
- Fused Embedding Generation: Creates fused embeddings for all possible modality combinations (as shown in Figure 3a)
- Modal Channel Attention Masks: Uses block attention masks to create distinct attention channels (as shown in Figure 3b)
- Single Forward Pass: Processes all modality combinations in a single forward pass
For four-modal datasets, MCA creates 11 attention channels:
- 4 unimodal channels: (1), (2), (3), (4)
- 6 bimodal channels: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
- 1 full-modal channel: (1,2,3,4)
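The channel list above can be turned into a block attention mask roughly as follows. This is a minimal sketch, not the paper's implementation: it assumes the sequence is laid out as modality tokens followed by one token per channel, that modality tokens attend only within their own modality, and that each channel token attends to itself and to the tokens of its member modalities. The function names `mca_channels` and `block_attention_mask` are hypothetical.

```python
from itertools import combinations

def mca_channels(n_modalities=4):
    """Channel list as enumerated above: unimodal, bimodal, and the full set."""
    mods = range(1, n_modalities + 1)
    unimodal = [(m,) for m in mods]
    bimodal = list(combinations(mods, 2))
    full = [tuple(mods)]
    return unimodal + bimodal + full

def block_attention_mask(tokens_per_modality, channels):
    """Return mask[q][k], True where query position q may attend to key position k.

    Assumed layout: all modality tokens first (in modality order),
    then one channel token per entry of `channels`.
    """
    # Record which modality owns each input position.
    owner = []
    for m, n_tokens in enumerate(tokens_per_modality, start=1):
        owner.extend([m] * n_tokens)
    n_inputs = len(owner)
    size = n_inputs + len(channels)
    mask = [[False] * size for _ in range(size)]

    # Modality tokens attend only within their own modality.
    for q in range(n_inputs):
        for k in range(n_inputs):
            mask[q][k] = owner[q] == owner[k]

    # Each channel token attends to itself and to its member modalities.
    for c, members in enumerate(channels):
        q = n_inputs + c
        mask[q][q] = True
        for k in range(n_inputs):
            mask[q][k] = owner[k] in members
    return mask

channels = mca_channels(4)
mask = block_attention_mask([2, 2, 2, 2], channels)
print(len(channels), len(mask))  # prints: 11 19
```

Because every channel is expressed as rows of a single mask, all channel tokens can be computed in one pass over one sequence, which is the property the single-forward-pass claim relies on.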
Employs sample and loss masking strategies:
- Missing modalities are replaced with padding tokens
- Loss is computed for corresponding fused tokens as long as at least one modality is present
- Uses Noise Contrastive Estimation (NCE) loss
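A minimal sketch of how loss masking can combine with an NCE-style objective is given below. It uses the common InfoNCE form with in-batch negatives, which is an assumption here since the summary does not give the exact loss. The temperature value, the dictionary-based presence format, and the names `channel_valid` and `masked_info_nce` are all illustrative.

```python
import math

def channel_valid(present, members):
    """A channel contributes to the loss if at least one member modality is present."""
    return any(present[m] for m in members)

def masked_info_nce(a, b, valid, temperature=0.1):
    """InfoNCE over a batch where a[i] and b[i] form the positive pair.

    Samples with valid[i] == False are dropped from both the positives
    and the negatives, so padded inputs never influence the loss.
    """
    idx = [i for i, v in enumerate(valid) if v]
    if len(idx) < 2:
        return 0.0  # not enough valid samples to contrast

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    total = 0.0
    for i in idx:
        logits = [dot(a[i], b[j]) / temperature for j in idx]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - dot(a[i], b[i]) / temperature
    return total / len(idx)

# Three samples, bimodal channel (1, 2); the last sample has neither modality.
presence = [{1: True, 2: True}, {1: True, 2: False}, {1: False, 2: False}]
valid = [channel_valid(p, (1, 2)) for p in presence]  # [True, True, False]
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(masked_info_nce(a, b, valid))
```

In the usage example the third sample is excluded entirely, so its mismatched pair does not raise the loss.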
- Multi-Channel Fusion: Unlike Zorro's single channel, MCA supports fusion of all modality combinations
- Computational Efficiency: Requires only one forward pass compared to EAO's multiple passes
- Flexibility: Can handle missing modalities in arbitrary combinations
- Unified Framework: Enables fair comparison of all three methods within a single framework
- Scale: 23,248 samples, 2,324 test samples
- Modalities: 4 preprocessed modalities (Glove vectors, OpenFace, COVAREP, FACET encoders)
- Task: Sentiment analysis regression (0-1 range)
- Preprocessing: Linear layer transformation + layer normalization + positional embeddings
- Scale: 7,017 samples, 707 test samples
- Modalities: Gene expression (800 genes), protein arrays (198 proteins), DNA methylation (800 sites), miRNA (662)
- Task: 32-class cancer type classification
- Preprocessing: 2-layer MLP encoding + learnable embeddings
$$S = \frac{1}{N_S} \sum_{i=1}^{N_S} \frac{M_i}{M_T}$$
where $N_S$ is the number of samples, $M_i$ is the number of modalities for sample $i$, and $M_T$ is the total number of modalities. Experiments set $S \in \{0, 0.2, 0.4, 0.6, 0.8\}$.
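One reading of the sparsity definition above can be sketched as follows. The sketch assumes that M_i counts the modalities missing from sample i, so that S = 0 corresponds to fully observed data as in the experimental settings. The boolean presence matrix and the function name `modal_sparsity` are illustrative, not taken from the paper.

```python
def modal_sparsity(presence):
    """Average fraction of missing modalities per sample.

    presence[i][m] is True when sample i has modality m.
    Assumption: M_i is the count of missing modalities for sample i.
    """
    n_samples = len(presence)
    n_modalities = len(presence[0])
    missing = sum(1 for row in presence for has_modality in row if not has_modality)
    return missing / (n_samples * n_modalities)

presence = [
    [True, True, True, True],    # all four modalities present
    [True, True, False, False],  # two modalities missing
]
print(modal_sparsity(presence))  # prints: 0.25
```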
- Alignment: $L_a = \mathbb{E}_{x,y}\left[\lVert f(x) - f(y) \rVert_2^2\right]$
- Uniformity: $L_u = \mathbb{E}_{x,y}\left[e^{-2\lVert f(x) - f(y) \rVert_2^2}\right]$
- Median Ranking: Median rank of correct matches
- Recall Rate: R@1, R@5, R@10
- Regression: Correlation coefficient (CMU-MOSEI)
- Classification: Average AUPR (TCGA)
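The embedding-quality and retrieval metrics listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: alignment is averaged over paired embeddings, uniformity over all distinct pairs within one set, and retrieval assumes a square similarity matrix whose diagonal holds the correct matches. Uniformity follows the formula as written above, without the logarithm that Wang & Isola (2020) apply to the expectation. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(xs, ys):
    """Mean squared distance between paired embeddings xs[i] and ys[i]."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)

def uniformity(zs):
    """Mean of exp(-2 * squared distance) over all distinct pairs in zs."""
    pairs = list(combinations(zs, 2))
    return sum(math.exp(-2.0 * sq_dist(u, v)) for u, v in pairs) / len(pairs)

def retrieval_ranks(sim):
    """1-based rank of the correct candidate (index i) for each query i.

    Rank is one plus the number of candidates scoring strictly higher,
    so ties are resolved in favour of the correct match.
    """
    return [1 + sum(1 for s in row if s > row[i]) for i, row in enumerate(sim)]

def recall_at_k(ranks, k):
    """Fraction of queries whose correct match is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.4, 0.2],
]
ranks = retrieval_ranks(sim)  # ranks == [1, 2, 3]
print(median(ranks), recall_at_k(ranks, 1), recall_at_k(ranks, 5))
```

Lower values are better for both embedding metrics in this form: zero alignment means positive pairs coincide, and a small uniformity value means embeddings are spread far apart.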
- Model Parameters: Hidden size 512, 8 attention heads, 4× feedforward multiplier
- Training Setup: Batch size 32, learning rate 1e-4, cosine scheduling
- Hardware: MCA/Zorro use 4×A10G GPUs (17GB), EAO uses 4×A100 GPUs (41GB)
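The cosine scheduling mentioned in the training setup can be illustrated with the standard cosine decay formula. The summary gives only the base learning rate of 1e-4, so the absence of warmup, the floor of zero, and the total step count used below are assumptions made for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 1000-step run: learning rate at the start, midpoint, and end.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, 1000))
```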
- Uniformity: MCA maintains superior fused embedding uniformity in most cases
- Alignment: EAO exhibits the best alignment but inferior uniformity
- Sparsity Impact: All methods show degraded uniformity when modal sparsity exceeds 0.4
- EAO Optimal: Performs best on ranking metrics, benefiting from its post-inference fusion strategy
- MCA Outperforms Zorro: MCA's median ranking and recall rates exceed Zorro in most cases
- Dataset Differences: Differences are more pronounced on the larger CMU-MOSEI dataset
- Regression Task: MCA achieves 0.54 baseline on CMU-MOSEI sentiment analysis, outperforming Zorro and EAO
- Classification Task: MCA performs best on TCGA cancer classification
- Sparsity Robustness: MCA maintains relatively stable performance under high sparsity
- Uniformity vs. Alignment Trade-off: Better uniformity benefits downstream tasks, while better alignment benefits retrieval tasks
- Multi-Channel Advantages: Contrasting all modality combinations significantly improves embedding quality
- Computational Efficiency: MCA substantially reduces computational cost while maintaining performance
- Interleaved Data Methods: Such as Flamingo, using autoregressive or masked language objectives
- Late Fusion Masking: Handling incomplete modalities through masked representations
- FLAVA: Multi-loss model but cannot generate fused embedding spaces
- LoReTTa: Predicts third modality, requires bimodal pairs
- EAO: Multiple forward passes with combined contrastive losses
- Zorro: Block attention masks with single forward pass
- MCA Effectiveness: MCA achieves the best overall performance on sparsely multimodal data
- Task Specificity: Different methods have respective advantages on different task types
- Design Importance: Contrasting all modality combinations is crucial for constructing robust embedding spaces
- Computational Complexity: More efficient than EAO, but still more complex than single-channel methods
- Hyperparameter Sensitivity: Requires careful tuning of attention channel numbers
- Dataset Scale: Advantages are less pronounced on smaller datasets
- Adaptive Channel Selection: Dynamically adjust attention channels based on data characteristics
- Extended Modalities: Validate performance on more modalities (>4)
- Theoretical Analysis: Deepen understanding of the theoretical relationship between uniformity and alignment
- Problem Importance: Addresses a critical problem in real-world applications
- Method Innovation: Cleverly combines advantages of EAO and Zorro
- Experimental Comprehensiveness: Systematic comparative experiments and ablation studies
- Theoretical Insights: Provides valuable embedding quality analysis
- Dataset Limitations: Validation on only two datasets; generalization remains to be verified
- Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
- Incomplete Computational Analysis: Lacks detailed analysis of computational complexity for different methods
- Academic Contribution: Provides new solutions for sparsely multimodal learning
- Practical Value: Directly applicable to multi-sensor fusion, medical informatics, and other domains
- Reproducibility: Provides detailed implementation details and hyperparameter settings
- Multi-Sensor Systems: IoT devices, robotic perception
- Medical Informatics: Multi-omics data fusion
- Multimedia Retrieval: Content retrieval with incomplete modalities
- Industrial Monitoring: Multi-source data fusion analysis
The paper cites multiple important multimodal learning works, including:
- CLIP (Radford et al., 2021): Foundational work in multimodal contrastive learning
- EAO (Shvetsova et al., 2022): Important method for multimodal retrieval
- Zorro (Recasens et al., 2023): Masked multimodal Transformer
- Wang & Isola (2020): Theoretical work on alignment and uniformity in contrastive learning
This paper makes important contributions to the field of sparsely multimodal data fusion. The proposed MCA method significantly improves performance while maintaining computational efficiency, providing an effective solution for handling incomplete multimodal data in real-world applications.