2025-11-19T08:40:14.124836

Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models

Ma, Li, Tang et al.

Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap in training tests. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely-used strategies, e.g., zero-shot and supervised fine-tuning (SFT) to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.

academic

Basic Information

Paper ID: 2510.09358
Title: Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Authors: Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen, Yingyi Zhang, Chao Feng, Jiao Ran
Institution: ByteDance Douyin Content Group
Category: cs.CV
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09358
Code: https://github.com/bytedance/DynamicCoT

Abstract

Multi-modal keyphrase prediction (MMKP) aims to transcend pure text-based approaches by integrating multi-modal input information to generate a set of conclusive phrases. Traditional multi-modal methods exhibit significant limitations in handling absence scenarios and unseen scenarios. Furthermore, existing benchmarks overestimate model capabilities due to severe train-test set overlap. This paper proposes leveraging vision-language models (VLMs) to address the MMKP task. First, we evaluate the lower-bound performance of VLMs using both zero-shot and supervised fine-tuning (SFT) strategies. Subsequently, we employ the Fine-tune-CoT approach, utilizing high-quality chain-of-thought (CoT) reasoning data generated by teacher models to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, enabling models to flexibly leverage reasoning capabilities during inference.

Research Background and Motivation

Problem Definition and Significance

The multi-modal keyphrase prediction (MMKP) task aims to generate concise, information-rich key phrases (such as hashtags) for social media content containing both text and images. This task holds significant value in applications including social media content understanding, recommendation systems, and content classification.

Limitations of Existing Methods

Constraints of Traditional Multi-modal Approaches: Existing methods such as M3H-ATT and MM-MKP primarily rely on designing cross-modal fusion architectures but perform poorly in complex scenarios, particularly:
- Absence Scenario: Predicted keyphrases do not exist in the input text, requiring robust cross-modal interaction capabilities
- Unseen Scenario: Predicted keyphrases have not appeared in the training set, demanding strong generalization ability from the model
Dataset Issues: Public MMKP datasets suffer from severe train-test overlap problems, with 97.32% of test set keyphrases appearing in the training set, whereas this ratio is only 45.28% in real production environments
Model Capacity Limitations: Traditional methods are constrained by limited model capacity and world knowledge, making it difficult to handle content involving memes, current events, and other topics requiring external knowledge

Core Contributions

First Systematic Study: To the authors' knowledge, this is the first comprehensive investigation of VLMs' potential in multi-modal keyphrase prediction tasks
Dynamic CoT Strategy: Proposes a dynamic chain-of-thought strategy enabling VLMs to adaptively select CoT reasoning for difficult unseen samples, better suited for production environments requiring efficient decoding
Dataset Reconstruction: Constructs MMKP-V2 and MMKP-360k datasets that better reflect real-world distributions
Comprehensive Experimental Validation: Conducts rigorous analysis across multiple datasets, verifying method effectiveness and robustness

Methodology Details

Task Definition

Given multi-modal input (text T and image I), the MMKP task requires generating a set of key phrases K = {k₁, k₂, ..., kₙ} that capture the core information of the input content.

Traditional Method Analysis

Traditional multi-modal models employ multi-task loss functions:

L(θ) = -∑[log P_cls(y^n) + γ · ∑log P_gen(y^n_t)]

where the first term represents classification loss and the second term represents keyphrase generation loss. This approach limits open-set generation capabilities.

VLMs Foundation Methods

1. Supervised Fine-Tuning (SFT)

Uses multi-modal content as input prompts and ground-truth keyphrases as responses, employing next-token prediction loss:

L_sft = -1/T ∑log P(y^s_t | y^s_<t, v; θ)

2. Fine-tune-CoT

Constructs multi-modal CoT data with reasoning processes generated by GPT-4o, formatted as:

<think>thinking process</think><answer>keyphrases</answer>

Loss function:

L_cot = -1/T ∑log P(y^c_t | y^c_<t, v; θ)

Core Innovation: Dynamic CoT Strategy

Motivation

Fine-tune-CoT exhibits two problems:

Overthinking Phenomenon: Generates overly generic keyphrases for simple samples
Content Redundancy: Posts with identical keyphrases receive highly similar reasoning paths

Method Design

Dynamic CoT classifies samples into simple and difficult categories based on SFT loss:

L_d = -1/T ∑log P(y^d_t | y^d_<t, v; θ)

where:

y^d = {
  y^c  if L_sft < γ
  y^s  if L_sft ≥ γ
}

When sample loss falls below threshold γ, the method switches to CoT supervision; otherwise, standard SFT supervision is used.

Experimental Setup

Datasets

MMKP Dataset: 53,701 English samples with 97.32% train-test overlap
MMKP-V2 Dataset: Reconstructed MMKP dataset with overlap reduced to 44.92%
MMKP-360k Dataset: 330,614 training samples and 36,736 test samples with 45.28% overlap

Evaluation Metrics

MMKP and MMKP-V2: F1@1
MMKP-360k: F1@M (M is the number of keyphrases predicted by the model)

Experimental Configuration

Optimizer: AdamW
Learning Rate: 5×10⁻⁵ (MMKP), 3×10⁻⁵ (MMKP-360k)
Training Epochs: 5 epochs for 2B/3B parameter models, 3 epochs for larger models
Dynamic CoT Threshold: γ = 0.4
CoT Data Generation: GPT-4o-2024-05-13 (MMKP), Doubao-1.5-pro (MMKP-360k)

Experimental Results

Main Comparative Results

Model	MMKP All	MMKP-V2 All	MMKP-V2 Absent	MMKP-V2 Unseen	MMKP-360k All	Average
MM-MKP (SOTA)	48.19	-	-	-	-	-
Qwen2.5-VL-7B Zero-shot	6.61	7.75	2.75	8.38	14.34	9.57
Qwen2.5-VL-7B SFT	60.83	30.49	20.90	7.90	43.70	45.01
Qwen2.5-VL-7B Dynamic CoT	63.58	33.56	22.32	13.36	50.66	49.27

Key Findings

VLMs Significantly Outperform Traditional Methods: SFT-based VLMs achieve over 20% improvement compared to SOTA multi-modal methods
Dynamic CoT Effectively Improves Generalization: Achieves 20-30% improvement in unseen scenarios while maintaining overall performance
Substantially Reduced Inference Length: Dynamic CoT reduces computational overhead by 38.48% compared to Fine-tune-CoT

Ablation Study Results

Method	MMKP-V2 All	MMKP-V2 Unseen	Unseen Scenario Improvement
SFT Baseline	30.49	7.90	-
Fine-tune-CoT	33.53	13.42	+69.87%
Multi-task	31.87	9.48	+20.00%
Dynamic CoT	33.56	12.24	+54.94%

Early approaches fall into three categories: extraction-based, classification-based, and generation-based. Following the emergence of LLMs, most methods remain limited to text input. NoteLLM2 employs MLLMs for zero-shot compression but does not explore generating more comprehensive and accurate keyphrases.

Vision-Language Models

Development has progressed from early joint embedding spaces (CLIP) to generative models (Flamingo, BLIP-2), and subsequently to large-scale models (GPT-4V, Qwen-VL, InternVL), with VLMs continuously advancing in cross-modal understanding capabilities.

Reasoning Capabilities

With increased attention to reasoning models, inference-time computation is recognized as an effective method for unleashing LLM potential, with growing research integrating reasoning capabilities into VLMs.

Conclusions and Discussion

Main Conclusions

VLMs demonstrate strong potential in multi-modal keyphrase prediction tasks, significantly outperforming traditional methods
The dynamic CoT strategy effectively balances common pattern learning and generalization ability, particularly excelling in unseen scenarios
Significant discrepancies exist between real data distributions and existing benchmarks, necessitating more realistic evaluation methodologies

Limitations

Empirical Threshold Determination: The threshold γ in dynamic CoT still requires empirical setting, with adaptive strategies showing limited effectiveness
High Computational Overhead: VLMs have large parameter counts (2B+), resulting in higher inference costs than traditional methods
High CoT Data Cost: Generating high-quality CoT data requires substantial computational resources

Future Directions

Explore more intelligent dynamic threshold selection strategies
Investigate model compression techniques to reduce inference overhead
Develop more efficient CoT data generation methods

In-Depth Evaluation

Strengths

Accurate Problem Identification: Precisely identifies issues with existing benchmarks and challenges in real-world scenarios
Ingenious Method Design: The dynamic CoT strategy maintains reasoning capabilities while avoiding overthinking
Comprehensive Experimentation: Comparative validation across multiple datasets and models demonstrates method robustness
High Practical Value: The method has been deployed in ByteDance's production environment

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation and convergence analysis of the dynamic CoT strategy
Limited Human Evaluation: Human evaluation covers only 20 samples per dataset, potentially insufficient
Unverified Cross-Domain Generalization: Method effectiveness has not been validated in other domains (e.g., academic papers, news)

Impact

Academic Contribution: First systematic investigation of VLMs' application in MMKP tasks, establishing foundation for subsequent research
Practical Value: Provides directly applicable solutions for production environments
Methodological Inspiration: Dynamic CoT strategy is generalizable to other tasks requiring efficiency-performance balance

Applicable Scenarios

Social Media Platforms: Automatic hashtag and label generation
Content Recommendation Systems: Multi-modal content understanding for precise recommendations
Advertising Deployment: Automatic keyphrase extraction for targeted advertising
Content Moderation: Auxiliary identification and classification of multi-modal content

References

This paper cites important works in multi-modal learning, vision-language models, and reasoning capabilities, providing solid theoretical foundations. Particularly noteworthy are representative models such as CLIP, GPT-4V, and InternVL, as well as recent advances in CoT reasoning.

Overall Assessment: This is a high-quality applied research paper that accurately identifies practical problems, proposes effective solutions, and validates method effectiveness across multiple datasets. The dynamic CoT strategy design is ingenious, maintaining model reasoning capabilities while improving inference efficiency, demonstrating strong practical value. The paper's primary contribution lies in successfully applying VLMs to multi-modal keyphrase prediction tasks and proposing optimization strategies suitable for production environments.