Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Yao, Huang, Wu et al.

In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved till the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into "tree search" for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code will be available at https://github.com/HJYao00/Mulberry

Basic Information

Paper ID: 2412.18319
Title: Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Authors: Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, Dacheng Tao
Categories: cs.CV cs.AI
Publication Date: December 31, 2024 (arXiv v2)
Paper Link: https://arxiv.org/abs/2412.18319

Abstract

This research aims to develop a multimodal large language model (MLLM) capable of understanding and solving problems by learning to create each intermediate step in the reasoning process. To this end, the authors propose Collective Monte Carlo Tree Search (CoMCTS), a novel MLLM reasoning learning method that introduces collective learning concepts into "tree search," enabling effective and efficient reasoning path search and learning. The core idea of CoMCTS is to leverage the collective knowledge of multiple models through four iterative operations—expansion, simulation and error positioning, backpropagation, and selection—to collaboratively conjecture, search, and identify effective reasoning paths leading to correct answers. Based on CoMCTS, the authors construct the Mulberry-260k dataset and train the Mulberry model series with o1-like step-by-step reasoning and reflection capabilities.

Research Background and Motivation

Problem Definition

Current multimodal large language models (MLLMs) fail markedly more often as tasks demand more complex reasoning. Most existing MLLMs follow a "direct prediction" paradigm: they output a brief final answer without explicit, well-defined intermediate reasoning steps.

Significance

As Feynman stated: "What I cannot create, I do not understand." MLLMs should be capable of creating and deeply understanding each step in the reasoning process, which is crucial for solving complex tasks.

Limitations of Existing Methods

Search Effectiveness Problem: Traditional MCTS methods are self-guided, but current MLLMs are typically trained without explicit intermediate reasoning steps, so the search tends to get stuck in homogeneous, low-quality nodes within a single MLLM's reasoning space.
Search Efficiency Problem: Traditional MCTS expands only one subsequent reasoning node per search iteration and therefore needs many iterations, which is inefficient for computation-intensive MLLMs.

Research Motivation

Inspired by recent advances such as OpenAI's o1, the authors sought to apply "tree search" methods to MLLMs. However, direct application proved ineffective, necessitating the design of new collective learning mechanisms to address search challenges.

Core Contributions

Proposes the CoMCTS Method: Introduces collective learning into MCTS for the first time, leveraging collective knowledge to collaboratively conjecture, search, and identify effective and reflective reasoning paths for MLLMs.
Constructs the Mulberry-260k Dataset: Provides a resource for advancing research on step-by-step reasoning and reflection in MLLMs.
Develops the Mulberry Model Series: MLLMs with strong step-by-step reasoning and reflection capabilities.
Experimental Validation: Demonstrates the superiority of the method across multiple benchmarks.

Methodology Details

Task Definition

Given a multimodal input question Q (e.g., a text instruction accompanied by an image), the goal is to generate a sequence of intermediate reasoning states (s₁, s₂, s₃, ..., sₘ) that leads to the correct answer.

CoMCTS Core Architecture

CoMCTS leverages the collective knowledge of a group of K MLLMs {π₁, π₂, ..., π_K} and iteratively searches reasoning paths through four key operations:

(a) Expansion

Starting from the current leaf node s^k_m, use the MLLMs in parallel to expand a set of diverse and complementary candidate reasoning paths:

S^j_candidate ~ π_j(· | Q, Parent(s^k_m), s^k_m),  j = 1, ..., K
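
The expansion step can be pictured as one call per model, issued in parallel from the same path prefix. The following is a minimal sketch under stated assumptions: the `Policy` callable interface, the `expand` helper, and the toy models `model_a` and `model_b` are hypothetical stand-ins for real MLLM calls and are not taken from the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: (question, path so far) -> remaining reasoning steps.
Policy = Callable[[str, Sequence[str]], List[str]]


def expand(question: str, path: Sequence[str],
           policies: Dict[str, Policy]) -> Dict[str, List[str]]:
    """Ask every model to continue the current path; one candidate path per model."""
    with ThreadPoolExecutor(max_workers=len(policies)) as pool:
        futures = {name: pool.submit(policy, question, path)
                   for name, policy in policies.items()}
        return {name: future.result() for name, future in futures.items()}


# Toy stand-ins for MLLM calls (assumption: real models would return text steps).
def model_a(question: str, path: Sequence[str]) -> List[str]:
    return [f"A-step{len(path) + 1}", "A-answer"]


def model_b(question: str, path: Sequence[str]) -> List[str]:
    return [f"B-step{len(path) + 1}", f"B-step{len(path) + 2}", "B-answer"]


candidates = expand("What is the area?", ["root-step"],
                    {"a": model_a, "b": model_b})
```

Because each model continues the path independently, the candidate paths may differ in length and content, which is what makes the pooled candidate set diverse.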

(b) Simulation and Error Positioning

Leverage collective knowledge to evaluate candidate node values, identifying and filtering erroneous reasoning nodes:

R(s^j_i) = (1/K) ∑_{l=1}^{K} π_l(· | prompt_eval, Q, Parent(s^j_i), s^j_i)
S*_candidate = {s^j_i ∈ S_candidate | R(s^j_i) ≥ t}

Here K is the number of models in the group and t is the filtering threshold.
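
A minimal sketch of this scoring and filtering step follows. The `Evaluator` interface, the toy evaluators `strict` and `lenient`, the [0, 1] score scale, and the threshold 0.5 are all illustrative assumptions. Cutting a path at its first low-scoring step (so later steps are discarded with it) is also an assumption of this sketch rather than something the formulas above state.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: (question, path ending in the node to judge) -> score.
Evaluator = Callable[[str, Sequence[str]], float]


def node_reward(question: str, path: Sequence[str],
                evaluators: Sequence[Evaluator]) -> float:
    """R(s): average of the scores that all evaluators give to the last node."""
    scores = [evaluate(question, path) for evaluate in evaluators]
    return sum(scores) / len(scores)


def position_errors(question: str, prefix: Sequence[str], candidate: Sequence[str],
                    evaluators: Sequence[Evaluator],
                    t: float) -> Tuple[List[str], Dict[str, float]]:
    """Keep the steps of one candidate path up to the first step with R < t."""
    kept: List[str] = []
    rewards: Dict[str, float] = {}
    for step in candidate:
        reward = node_reward(question, list(prefix) + kept + [step], evaluators)
        if reward < t:
            break  # assumption: everything after an erroneous step is dropped too
        kept.append(step)
        rewards[step] = reward
    return kept, rewards


# Toy evaluators (assumption: scores in [0, 1]).
def strict(question: str, path: Sequence[str]) -> float:
    return 0.0 if "wrong" in path[-1] else 1.0


def lenient(question: str, path: Sequence[str]) -> float:
    return 0.2 if "wrong" in path[-1] else 0.8


kept, rewards = position_errors(
    "Q", ["root"], ["read the chart", "wrong sum", "answer"],
    [strict, lenient], t=0.5)
```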

(c) Backpropagation

Update the visit count N and node value V of each node in the reasoning tree from bottom to top:

V(s) ← [N(s)·V(s) + ∑_{sₗ∈Child(s)} R(sₗ)] / [N(s) + CountChild(S*_candidate, s)]
N(s) ← N(s) + CountChild(S*_candidate, s)
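
The update is a running average: the old value is weighted by the old visit count, and each newly retained child contributes its reward once. The sketch below assumes that the sum runs over the children retained in the current iteration, which is one reading of the formula; `backprop_update` is a hypothetical helper name.

```python
from typing import List, Tuple


def backprop_update(value: float, visits: int,
                    child_rewards: List[float]) -> Tuple[float, int]:
    """Return the updated (V(s), N(s)) given rewards of newly retained children."""
    count = len(child_rewards)  # plays the role of CountChild(S*_candidate, s)
    if count == 0:
        return value, visits  # nothing new under this node
    new_visits = visits + count
    new_value = (visits * value + sum(child_rewards)) / new_visits
    return new_value, new_visits


# A node visited twice with value 0.5 gains two children with rewards 0.9 and 0.7.
v, n = backprop_update(value=0.5, visits=2, child_rewards=[0.9, 0.7])

# A fresh node simply takes the mean reward of its first children.
v0, n0 = backprop_update(value=0.0, visits=0, child_rewards=[0.8])
```

Applying this function to every node on the path from the deepest updated node up to the root gives the bottom-to-top pass described above.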

(d) Selection

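Select the next starting node with the Upper Confidence Bound (UCB) criterion:

s^{k*}_m = argmax_{s ∈ S*_candidate} [ V(s) + c·√(log N(ŝ) / (1 + N(s))) ]

Here ŝ is the parent node of s, and the constant c controls the level of exploration.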
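
To see how the exploration term changes the choice, the selection rule can be sketched as follows. The values of c, the candidate statistics, and the helper names `ucb` and `select` are illustrative assumptions; ŝ is read here as the parent of s.

```python
import math
from typing import Dict, Tuple


def ucb(value: float, visits: int, parent_visits: int, c: float) -> float:
    """V(s) + c * sqrt(log N(parent) / (1 + N(s)))."""
    # Guard against log(0) for an unvisited parent (an assumption of this sketch).
    return value + c * math.sqrt(math.log(max(parent_visits, 1)) / (1 + visits))


def select(stats: Dict[str, Tuple[float, int]], parent_visits: int, c: float) -> str:
    """Return the candidate with the highest UCB score; stats maps name -> (V, N)."""
    return max(stats, key=lambda name: ucb(*stats[name], parent_visits, c))


stats = {"a": (0.65, 4), "b": (0.60, 0)}        # (value, visit count)
explore = select(stats, parent_visits=5, c=0.5)  # the unvisited node gets a bonus
exploit = select(stats, parent_visits=5, c=0.0)  # pure value comparison
```

With c = 0.5 the rarely visited node "b" wins despite its lower value; with c = 0 the rule reduces to picking the highest value, here "a".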

Reflective Reasoning Extension

Based on the unified reasoning tree constructed by CoMCTS, identify negative sample sibling nodes and construct reflective reasoning paths:

Negative Sample Sibling Node Identification:

s_neg = argmin_{sₗ∈Sibling(s)} UCB(sₗ) - UCB(s)

Reflective Reasoning Path Construction:

Y_reflect = Replace(Y, s, (s_neg, prompt_reflect, s))
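
The two formulas amount to picking the lowest-scoring sibling of a step and splicing it, followed by a reflection prompt, in front of that step. The sketch below uses hypothetical step strings, UCB values, and reflection prompt text; none of them come from the paper.

```python
from typing import Dict, List

# Hypothetical reflection prompt; the actual wording is not given in this summary.
PROMPT_REFLECT = "The previous step looks wrong. Let me re-check it."


def pick_negative_sibling(s: str, siblings: List[str], ucb: Dict[str, float]) -> str:
    """s_neg: the sibling whose UCB falls furthest below that of s."""
    return min(siblings, key=lambda sibling: ucb[sibling] - ucb[s])


def build_reflective_path(path: List[str], s: str, s_neg: str,
                          prompt_reflect: str = PROMPT_REFLECT) -> List[str]:
    """Replace(Y, s, (s_neg, prompt_reflect, s)): splice the detour in before s."""
    i = path.index(s)
    return path[:i] + [s_neg, prompt_reflect, s] + path[i + 1:]


path = ["read the chart", "sum = 12", "answer: 12"]
ucb_scores = {"sum = 12": 0.9, "sum = 10": 0.4, "sum = 21": 0.1}

s_neg = pick_negative_sibling("sum = 12", ["sum = 10", "sum = 21"], ucb_scores)
y_reflect = build_reflective_path(path, "sum = 12", s_neg)
```

The resulting path contains an erroneous step, an explicit reflection, and then the original correct step, so a model trained on it sees an example of recovering from a mistake.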

Collective Supervised Fine-tuning (CoSFT)

Train models using data searched by CoMCTS:

L_CoSFT(πₖ) = ∑_{(Q,Y)∈D} log πₖ(Y|Q)
L_CoSFT-Re(πₖ) = ∑_{(Q,Y_reflect)∈D} log πₖ(Y_reflect|Q)
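
Both objectives are sums of sequence log-likelihoods over the searched data. As a small numeric illustration, assuming the usual autoregressive factorization in which log π(Y|Q) is the sum of per-token log-probabilities, the objective can be computed as below. The token probabilities are made up, and in practice a training loop would typically minimize the negative of this quantity, which is a standard convention and not something stated in the summary.

```python
import math
from typing import List


def sequence_log_likelihood(token_probs: List[float]) -> float:
    """log pi(Y | Q) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def cosft_objective(dataset_token_probs: List[List[float]]) -> float:
    """Sum of log pi(Y | Q) over all (Q, Y) pairs in the dataset."""
    return sum(sequence_log_likelihood(probs) for probs in dataset_token_probs)


# Two toy reasoning paths with made-up per-token probabilities.
toy_dataset = [[0.5, 0.5], [0.25]]
objective = cosft_objective(toy_dataset)
loss = -objective  # the quantity a typical optimizer would minimize
```

The same function applies to the reflective objective by passing the token probabilities of the reflective paths instead.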

Experimental Setup

Datasets

Mulberry-260k Dataset Composition:

55K mathematical data (GLLaVA, GEOS, UniGeo, etc.)
116K chart understanding data (DVQA, DocVQA, ChartQA, etc.)
41K math word problem data (IconQA, TabMWP, CLEVR, etc.)
2K medical data (VQA-RAD, PMC-VQA)
17K scientific data (TQA, AI2D, ScienceQA)
24K natural world QA data (VQA-AS, A-OKVQA, etc.)
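
As a quick arithmetic check on the composition, the listed category sizes can be summed and converted to shares. The category keys below are shorthand chosen for this sketch. The counts add up to 255K, slightly below the 260k in the dataset name; the summary does not explain the gap, so the per-category figures are best treated as approximate.

```python
# Category sizes in thousands, as listed above.
composition_k = {
    "math": 55,
    "chart understanding": 116,
    "math word problems": 41,
    "medical": 2,
    "science": 17,
    "natural world QA": 24,
}

total_k = sum(composition_k.values())
shares = {name: round(100 * count / total_k, 1)
          for name, count in composition_k.items()}
largest = max(composition_k, key=composition_k.get)

for name, share in shares.items():
    print(f"{name}: {share}%")
```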

Evaluation Benchmarks

The models are evaluated on 8 widely used, challenging benchmarks: MathVista, MMStar, MMMU, ChartQA, DynaMath, HallBench, MM-Math, and MME.

Comparison Methods

Closed-source models: GPT-4o, Claude-3.5 Sonnet
Open-source models: DeepSeek-VL, InternVL2, MiniCPM-V, etc.
Reasoning models: LLaVA-CoT, LLaVA-Reasoner, Insight-V

Implementation Details

Collective learning uses 4 models: GPT-4o, Qwen2-VL-7B, LLaMA-3.2-11B-Vision-Instruct, Qwen2-VL-72B
Maximum search iterations: 20
Batch size: 128, Learning rate: 1e-5, Training epochs: 2
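
These settings can be collected in a small configuration object. The sketch below is hypothetical: the class names and the `optimizer_steps` helper are invented for illustration, and the step count assumes that all 260,000 samples are used once per epoch with no gradient accumulation, which the summary does not confirm.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SearchConfig:
    # Models taking part in the collective search, as listed above.
    models: Tuple[str, ...] = (
        "GPT-4o",
        "Qwen2-VL-7B",
        "LLaMA-3.2-11B-Vision-Instruct",
        "Qwen2-VL-72B",
    )
    max_search_iterations: int = 20


@dataclass(frozen=True)
class SFTConfig:
    batch_size: int = 128
    learning_rate: float = 1e-5
    epochs: int = 2


def optimizer_steps(num_samples: int, cfg: SFTConfig) -> int:
    """Total optimizer steps, assuming one step per batch of cfg.batch_size samples."""
    return math.ceil(num_samples / cfg.batch_size) * cfg.epochs


search_cfg = SearchConfig()
sft_cfg = SFTConfig()
steps = optimizer_steps(260_000, sft_cfg)  # assumption: full dataset per epoch
```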

Experimental Results

Main Results

Comparison with Baseline Models:

Mulberry-7B shows average improvement of 4.2% over Qwen2-VL-7B
Mulberry-11B shows average improvement of 7.5% over LLaMA-3.2-11B-Vision-Instruct
Mulberry-2B shows average improvement of 5.4% over Qwen2-VL-2B
Mulberry-8B shows average improvement of 11.0% over LLaVA-NeXT-8B

Comparison with Reasoning Response Models:

On MathVista, Mulberry shows improvements of 5.7% and 6.5% over LLaVA-Reasoner-8B and Insight-V-8B respectively
On MMMU, improvements of 3.0% and 1.0% respectively

Comparison with SOTA Models: Mulberry outperforms most open-source MLLMs and comes close to closed-source models on certain benchmarks.

Ablation Studies

CoMCTS Component Analysis (Table 2):

GPT-4o direct prediction: 58.2% search success rate
CoMCTS with GPT-4o only: 63.8%
The search success rate keeps rising as more models are added to the collective
Complete CoMCTS: 80.2% search success rate

Reflective Data Contribution (Table 3): On MathVista, adding reflective reasoning data improves performance by 0.8%, showing that effective and reflective reasoning data are complementary.

Tree Search Method Comparison

CoMCTS demonstrates significant superiority over other tree search methods:

Search success rate: 80.2% vs 66.2% (Omega-MCTS)
Average search iterations: 12.7 vs 24.3 (Omega-MCTS)
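
To put the two comparisons on a common footing, the gap in success rate and the relative saving in iterations can be computed directly from the reported figures. The variable names are chosen for this sketch only.

```python
# Figures reported above for CoMCTS and Omega-MCTS.
comcts = {"success_rate": 80.2, "avg_iterations": 12.7}
omega_mcts = {"success_rate": 66.2, "avg_iterations": 24.3}

# Absolute gain in search success rate, in percentage points.
gain_points = comcts["success_rate"] - omega_mcts["success_rate"]

# Relative reduction in the average number of search iterations.
iteration_reduction = 1 - comcts["avg_iterations"] / omega_mcts["avg_iterations"]

print(f"success rate gain: {gain_points:.1f} points")
print(f"iteration reduction: {100 * iteration_reduction:.1f}%")
```

On these numbers CoMCTS succeeds on 14.0 percentage points more questions while using roughly 48% fewer search iterations on average.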

Case Analysis

Qualitative comparison shows that Mulberry generates rich, explicit, and well-defined reasoning steps, while baseline models generate relatively brief predictions prone to errors.

Related Work

Multimodal Large Language Models

MLLMs have achieved significant progress in general vision-language understanding but still face challenges in complex tasks requiring deep reasoning.

Large Language Model Reasoning

Reasoning methods can be categorized into three types:

Prompt-based Methods: Such as Chain-of-Thought (CoT)
Planning-based Methods: Such as Tree/Graph-of-thought
Learning-based Methods: Such as OpenAI o1, STaR, Iter-MCTS, etc.

Conclusions

CoMCTS effectively addresses the search efficiency and effectiveness problems of traditional MCTS on MLLMs through collective learning.
The Mulberry-260k dataset provides valuable resources for multimodal reasoning research.
The Mulberry model demonstrates excellent step-by-step reasoning and reflection capabilities across multiple benchmarks.

Limitations

Computational Cost: Requires multiple models participating in collective search, resulting in significant computational overhead.
Model Dependency: Search quality depends on the quality of models participating in collective learning.
Domain Adaptability: Performance in specific domains may be limited by training data distribution.

Future Directions

Explore more efficient collective learning mechanisms.
Extend to more modalities and task types.
Investigate adaptive reasoning step allocation strategies.

In-depth Evaluation

Strengths

Strong Method Innovation: First to introduce collective learning into MCTS for MLLMs, addressing key problems of traditional methods.
Comprehensive Experiments: Conducts thorough evaluation across multiple datasets and models, including ablation studies and comparative analysis.
High Practical Value: The constructed dataset and models have significant value for the community.
Complete Technical Details: Clear method description with sufficient implementation details.

Weaknesses

Computational Efficiency: While improved compared to traditional MCTS, still requires multi-model collaboration with relatively high computational costs.
Generalization Capability: Primarily validated on mathematical and chart understanding tasks; performance on other complex reasoning tasks requires further verification.
Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why collective learning is effective.

Impact

Academic Contribution: Provides new research directions for multimodal reasoning and tree search methods.
Practical Value: The Mulberry-260k dataset and models can promote development of related research.
Reproducibility: Authors commit to open-sourcing code, facilitating method dissemination.

Applicable Scenarios

Mathematical Reasoning Tasks: Particularly suitable for mathematical problems requiring multi-step reasoning.
Chart Understanding: Demonstrates excellent performance in chart analysis and data visualization comprehension.
Scientific Question Answering: Applicable to scientific problem-solving requiring step-by-step analysis.
Educational Applications: Can be used to construct educational AI systems with reasoning capabilities.
