Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.
- Paper ID: 2510.12603
- Title: Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
- Authors: Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie
- Classification: cs.CV cs.AI cs.CL
- Publication Date/Venue: arXiv, October 14, 2025
- Paper Link: https://arxiv.org/abs/2510.12603
Multimodal reasoning aims to enhance the capabilities of multimodal large language models (MLLMs) by incorporating intermediate reasoning steps before arriving at final answers. The field has evolved from pure text-based reasoning to the integration of visual information, enabling reasoning processes to be conveyed through both images and text. Despite their effectiveness, current multimodal reasoning methods rely on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, the paper introduces multimodal latent reasoning, which offers advantages in multimodal representation, reduced annotation requirements, and inference efficiency. Specifically, the authors propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects visual and textual information into the reasoning process within latent space. IVT-LR represents each reasoning step by combining two implicit components: latent text (hidden states from the previous step) and latent vision (a set of selected image embeddings). The authors also introduce a progressive multi-stage training strategy that enables MLLMs to perform these multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA show that IVT-LR achieves an average accuracy improvement of 5.45% while delivering a speedup of over 5× compared to existing approaches.
Current multimodal reasoning faces three core challenges:
- High Annotation Cost: Existing methods require extensive manual annotation of vision-text interleaved reasoning data
- Large Reasoning Latency: Explicit generation of lengthy reasoning steps results in slow inference speed
- Limited Representational Capacity: Explicit text-based reasoning struggles to adequately express complex multimodal information
Multimodal reasoning is a key technology for enhancing MLLM capabilities, with important applications in visual question answering (VQA), scientific question answering, and other tasks. Improving reasoning efficiency and accuracy is crucial for practical deployment.
- Text-based Reasoning Methods: Early approaches primarily performed pure text reasoning, failing to effectively leverage visual information
- Vision-Text Interleaved Reasoning: While incorporating visual information, these methods require explicit generation of intermediate steps, increasing computational overhead
- Latent Reasoning: Existing latent reasoning approaches primarily target single modalities, lacking multimodal fusion
Inspired by the success of latent reasoning in large language models, the authors argue that latent reasoning holds greater potential in multimodal scenarios:
- Multimodal Representation Potential: Latent space can better represent rich multimodal information
- Reduced Annotation Requirements: Decreases dependence on explicit vision-text interleaved data
- Reasoning Efficiency: Avoids generating lengthy explicit reasoning chains
- First Fully Multimodal Latent Reasoning Framework: Proposes IVT-LR, enabling joint reasoning of textual and visual information in latent space
- Novel Training Paradigm: Introduces a progressive multi-stage training strategy that is both data-efficient and computationally efficient
- Significant Performance Gains: Achieves new state-of-the-art results in both accuracy and reasoning efficiency
- In-depth Mechanism Analysis: Reveals the intrinsic mechanisms of latent reasoning through attention analysis
Given a text sequence $X = (x_1, \dots, x_I)$ and a set of visual embeddings $Z = (z_1, \dots, z_J)$, a standard VLM predicts the conditional distribution of the next token:
$$M(x_{t+1} \mid x_{1:t}, Z) = \mathrm{softmax}(W \cdot e_t^{\text{fused}})$$
where $e_t^{\text{fused}} = f(e_{1:t}^{\text{text}}, Z)$ is the hidden state after fusing textual and visual features.
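The prediction rule above can be illustrated with a minimal pure-Python sketch. The fusion function f is not modeled here: the fused hidden state is simply passed in as a vector, and the matrix `W`, the vector `e_fused`, and all numbers are hypothetical toy values chosen only to show the shape of the computation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(W, e_fused):
    """Return softmax(W . e_fused), where W has one row per vocabulary entry."""
    logits = [sum(w * e for w, e in zip(row, e_fused)) for row in W]
    return softmax(logits)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, hidden size 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_fused = [0.5, -0.5]  # stands in for the fused text-vision hidden state
probs = next_token_distribution(W, e_fused)
```

The logits here are 0.5, -0.5, and 0.0, so the first vocabulary entry receives the highest probability.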
The core of IVT-LR is reasoning in latent space, where each reasoning step comprises two components:
- Latent Text: Uses the hidden state $h_{t-1}^{\text{hidden}}$ from the previous step to replace explicit text tokens
- Latent Vision: Selects the k most relevant image embeddings based on attention scores
Specifically, the input at step t is:
$$E_t = [e_1, \dots, e_N, h_1^{\text{latent}}, z_1^{\text{selected}}, \dots, h_{t-1}^{\text{latent}}, z_{t-1}^{\text{selected}}]$$
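As a rough illustration of how this interleaved input could be assembled, the sketch below concatenates the question embeddings with one latent text state and one set of selected image embeddings per earlier step. The function name `build_step_input` and the string placeholders standing in for vectors are hypothetical and not taken from the paper's code.

```python
def build_step_input(question_embeds, latent_texts, selected_visions):
    """Assemble [e_1..e_N, h_1, z_1^sel, ..., h_{t-1}, z_{t-1}^sel] as a flat list."""
    if len(latent_texts) != len(selected_visions):
        raise ValueError("each earlier step needs one latent text and one vision set")
    sequence = list(question_embeds)
    for h, z_selected in zip(latent_texts, selected_visions):
        sequence.append(h)           # latent text: one hidden state per step
        sequence.extend(z_selected)  # latent vision: k selected image embeddings
    return sequence

# Strings stand in for embedding vectors (assumption for readability).
question_embeds = ["e1", "e2", "e3"]
latent_texts = ["h1", "h2"]
selected_visions = [["z1a", "z1b"], ["z2a", "z2b"]]
E_3 = build_step_input(question_embeds, latent_texts, selected_visions)
```

With N = 3 input embeddings, two earlier steps, and k = 2 selected embeddings per step, the input at step 3 has 3 + 2 × (1 + 2) = 9 positions.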
The latent vision component employs an attention mechanism to dynamically select critical visual features:
- Computes the sum of attention weights across all layers
- Selects k image embedding positions with the highest cumulative scores
- Concatenates selected features with hidden states
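The three selection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: each layer contributes a single attention row from the current position to all input positions (heads are assumed to be already aggregated), and the names `select_latent_vision`, `attn_per_layer`, and `image_positions` are hypothetical.

```python
def select_latent_vision(attn_per_layer, image_positions, embeddings, hidden_state, k):
    """Pick the k image positions with the highest attention summed over layers."""
    # Cumulative score per image position: sum of attention weights over all layers.
    scores = {p: sum(layer[p] for layer in attn_per_layer) for p in image_positions}
    # Keep the k positions with the highest cumulative score.
    top_positions = sorted(image_positions, key=lambda p: scores[p], reverse=True)[:k]
    # Concatenate the hidden state (latent text) with the selected embeddings (latent vision).
    step = [hidden_state] + [embeddings[p] for p in top_positions]
    return top_positions, step

# Toy data: 2 layers, 5 input positions, of which positions 1-3 hold image embeddings.
attn_per_layer = [
    [0.1, 0.4, 0.1, 0.3, 0.1],
    [0.2, 0.1, 0.1, 0.5, 0.1],
]
embeddings = ["e0", "img1", "img2", "img3", "e4"]  # strings stand in for vectors
top, step = select_latent_vision(attn_per_layer, [1, 2, 3], embeddings, "h_t", k=2)
```

In this toy case the cumulative scores are 0.5, 0.2, and 0.8 for positions 1, 2, and 3, so positions 3 and 1 are selected in that order.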
Training proceeds through N stages:
- Stage 0: Standard CoT supervision with all reasoning steps generated explicitly
- Stages 1-N: Progressively replace explicit steps with latent reasoning, starting from the first step
Training loss is computed only for remaining explicit steps and final answers, avoiding over-alignment of latent representations with explicit reasoning.
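The staged schedule and the loss restriction can be sketched as a small planning helper. This is an illustrative reading of the description above, not the authors' training code: it assumes three reasoning steps, that stage s turns the first s steps latent, and that the loss mask has one entry per reasoning step plus one for the final answer. The names `stage_plan` and `loss_mask` are hypothetical.

```python
def stage_plan(stage, num_reasoning_steps):
    """Per-step mode at a given stage: the first `stage` steps are latent, the rest explicit."""
    latent = min(stage, num_reasoning_steps)
    return ["latent"] * latent + ["explicit"] * (num_reasoning_steps - latent)

def loss_mask(plan):
    """1 where loss is computed (explicit steps and the final answer), 0 for latent steps."""
    return [0 if mode == "latent" else 1 for mode in plan] + [1]

# Assumed setting: 3 reasoning steps, stages 0 through 3.
schedule = {s: stage_plan(s, 3) for s in range(4)}
masks = {s: loss_mask(plan) for s, plan in schedule.items()}
```

Under these assumptions, stage 0 supervises every step, while the last stage supervises only the final answer, so latent steps are never forced to match explicit reasoning text.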
Dynamic selection of key visual regions brings three benefits:
- Avoids the computational overhead of full-image processing
- Focuses on task-relevant visual information
- Supports progressive visual understanding
- M3CoT: Large-scale multimodal chain-of-thought reasoning benchmark covering science, commonsense, mathematics, and other domains
- ScienceQA: Diverse science question-answering dataset including natural science, language science, and social science
- Accuracy: Exact match answer accuracy
- Autoregressive Steps: Number of tokens required to generate answers
- Average Response Time: Reasoning latency per question
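A minimal sketch of how the accuracy and timing metrics above might be computed is shown below. The normalization used for exact match (trimming whitespace and ignoring case) and the helper names are assumptions for illustration. The summary does not specify the evaluation script.

```python
import time

def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # Assumption: matching ignores surrounding whitespace and letter case.
    normalize = lambda s: s.strip().lower()
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

def average_response_time(answer_fn, questions):
    """Mean wall-clock seconds per question for a callable `answer_fn`."""
    start = time.perf_counter()
    for q in questions:
        answer_fn(q)
    return (time.perf_counter() - start) / len(questions)

acc = exact_match_accuracy(["A", "b ", "C"], ["A", "B", "D"])
avg_s = average_response_time(lambda q: q.upper(), ["q1", "q2"])  # dummy model call
```

The autoregressive-step metric would simply count the tokens generated per answer, which depends on the tokenizer and is therefore left out of this sketch.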
- Text Reasoning: CCoT
- Vision-Text Reasoning: Chain-of-Focus, SCAFFOLD, ICoT, Multimodal-CoT
- No Reasoning Baseline: No-CoT
- Backbone Models: Qwen2-VL-7B and Chameleon-7B
- Number of Training Stages: N=4 (3 reasoning steps)
- Batch Size: 4
- Learning Rate: 4×10^-5
- Hardware: 4 NVIDIA A6000 GPUs
| Backbone Model | Method | M3CoT Accuracy (%) | ScienceQA Accuracy (%) | Autoregressive Steps | Avg Time (s) |
|---|---|---|---|---|---|
| Qwen2-VL | Chain-of-Focus | 64.3 | 91.2 | 185.7 | 2.63 |
| Qwen2-VL | IVT-LR | 71.8 | 94.6 | 10.0 | 0.65 |
| Chameleon | Chain-of-Focus | 36.5 | 61.2 | 739.4 | 3.09 |
| Chameleon | IVT-LR | 41.8 | 64.0 | 10.0 | 1.13 |
- Accuracy Improvement: Compared to the strongest baseline, Chain-of-Focus, IVT-LR improves M3CoT accuracy by 5.3 points (Chameleon) to 7.5 points (Qwen2-VL)
- Significant Efficiency Gains: Reduces autoregressive steps by at least 9×, achieving 3-8× speedup in inference time
- Cross-Model Consistency: Demonstrates significant improvements across different backbone models
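The comparison against Chain-of-Focus can be reproduced directly from the table rows with a few lines of arithmetic. The sketch below uses only the values listed in the table. The dictionary layout and the function name `compare` are illustrative choices.

```python
# Values copied from the results table (M3CoT accuracy, autoregressive steps, avg time in s).
rows = {
    ("Qwen2-VL", "Chain-of-Focus"): {"m3cot": 64.3, "steps": 185.7, "time": 2.63},
    ("Qwen2-VL", "IVT-LR"): {"m3cot": 71.8, "steps": 10.0, "time": 0.65},
    ("Chameleon", "Chain-of-Focus"): {"m3cot": 36.5, "steps": 739.4, "time": 3.09},
    ("Chameleon", "IVT-LR"): {"m3cot": 41.8, "steps": 10.0, "time": 1.13},
}

def compare(backbone):
    """Gain in M3CoT points, step reduction factor, and time speedup vs. Chain-of-Focus."""
    base = rows[(backbone, "Chain-of-Focus")]
    ours = rows[(backbone, "IVT-LR")]
    return {
        "m3cot_gain_points": round(ours["m3cot"] - base["m3cot"], 2),
        "step_reduction": round(base["steps"] / ours["steps"], 2),
        "speedup": round(base["time"] / ours["time"], 2),
    }

qwen = compare("Qwen2-VL")
chameleon = compare("Chameleon")
```

Against Chain-of-Focus alone, this yields gains of 7.5 and 5.3 points, step reductions of about 18.6× and 73.9×, and response-time speedups of about 4.05× and 2.73× for Qwen2-VL and Chameleon respectively. Ranges quoted relative to other baselines cannot be checked from this table, since those rows are not listed.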
| Variant | M3CoT | ScienceQA |
|---|---|---|
| IVT-LR | 71.83 | 94.1 |
| w/o Latent Text | 52.20 (-19.63) | 84.7 (-9.8) |
| w/o Latent Vision | 46.64 (-25.19) | 82.3 (-11.8) |
| w/o Entire Latent Component | 58.02 (-13.81) | 86.4 (-7.7) |
Key Findings:
- Latent vision contributes most: removing it lowers M3CoT accuracy by 25.19 points
- Latent text also plays an important role: removing it lowers M3CoT accuracy by 19.63 points
- Both components work synergistically for optimal performance
Accuracy steadily improves with increasing latent vision length per step, indicating that longer latent vision sequences provide richer visual cues.
| Latent Stage | Science | Commonsense | Mathematics | Overall |
|---|---|---|---|---|
| 1 | 56.66% | 64.40% | 38.59% | 56.30% |
| 2 | 61.71% | 70.11% | 43.57% | 61.48% |
| 3 | 70.90% | 79.78% | 63.07% | 71.83% |
Mathematics benefits most, gaining 24.48 points from stage 1 to stage 3, and science shows the next-largest relative gain (about 25%), indicating that structured reasoning tasks are particularly well suited to latent space reasoning.
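The per-domain gains between the first and last latent stage can be read off the table with simple arithmetic. The sketch below computes both the absolute gain in percentage points and the gain relative to the stage 1 score. The variable names are illustrative.

```python
# Accuracy (%) per domain, copied from the latent-stage table.
stage_acc = {
    1: {"Science": 56.66, "Commonsense": 64.40, "Mathematics": 38.59, "Overall": 56.30},
    2: {"Science": 61.71, "Commonsense": 70.11, "Mathematics": 43.57, "Overall": 61.48},
    3: {"Science": 70.90, "Commonsense": 79.78, "Mathematics": 63.07, "Overall": 71.83},
}

# Absolute gain in percentage points from stage 1 to stage 3.
abs_gain = {d: round(stage_acc[3][d] - stage_acc[1][d], 2) for d in stage_acc[1]}
# Gain relative to the stage 1 score, in percent.
rel_gain = {d: round(100 * abs_gain[d] / stage_acc[1][d], 1) for d in stage_acc[1]}

largest = max(abs_gain, key=abs_gain.get)
```

The absolute gains come out to 14.24 (Science), 15.38 (Commonsense), 24.48 (Mathematics), and 15.53 (Overall) points, with relative gains of roughly 25.1%, 23.9%, 63.4%, and 27.6%.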
- Dynamic Attention Distribution: In latent reasoning mode, attention gradually shifts from vision to text
- Enhanced Attention Focus: Attention becomes increasingly concentrated during reasoning steps, similar to human problem-solving processes
- Text-based Reasoning: Converts visual information to text descriptions before reasoning
- Vision-Text Interleaved Reasoning: Simultaneously uses images and text during the reasoning process
- Special Token Methods: Use special tokens to guide reasoning
- Continuous Hidden State Methods: Directly uses hidden states for reasoning
- Multimodal Extensions: Extends latent reasoning to visual domains
- IVT-LR implements the first fully multimodal latent reasoning framework
- Significantly outperforms existing methods in both accuracy and efficiency
- Latent space reasoning provides a new solution paradigm for multimodal tasks
- Fixed Token Overhead: Each step requires additional latent vision tokens
- Training Complexity: Requires specialized multi-stage training strategy
- Fixed Number of Stages: Currently uses a fixed number of reasoning steps
- Adaptive Reasoning Steps: Dynamically determine reasoning steps based on problem complexity
- Broader Applications: Extend to planning and decision-making in sequential multimodal tasks
- More Efficient Visual Selection: Develop more fine-grained visual attention mechanisms
- Strong Novelty: First implementation of fully multimodal latent reasoning with novel technical approach
- Comprehensive Experiments: Validation across multiple datasets and backbone models with thorough ablation studies
- Significant Results: Substantial improvements in both accuracy and efficiency
- In-depth Analysis: Reveals intrinsic mechanisms through attention analysis
- Limited Applicability: Primarily targets VQA tasks; applicability to other multimodal tasks remains to be verified
- Increased Training Complexity: The multi-stage strategy makes the training pipeline more complex
- Limited Interpretability: The latent reasoning process produces no explicit intermediate steps that can be inspected
- Academic Value: Provides new research direction for multimodal reasoning
- Practical Value: Significant efficiency improvements are important for practical deployment
- Reproducibility: Provides detailed implementation details and code
- Resource-Constrained Environments: Mobile or edge computing scenarios requiring efficient inference
- Real-time Applications: Interactive systems with strict inference speed requirements
- Large-scale Deployment: Online services requiring processing of large request volumes
Overall Assessment: The IVT-LR method proposed in this paper has significant innovative value in the multimodal reasoning field. Through clever latent space design and progressive training strategy, it substantially improves reasoning efficiency while maintaining high accuracy. Despite certain limitations, it provides valuable new insights for the development of this field.