Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Ma, Li, Tang et al.
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, namely zero-shot inference and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
Multi-modal keyphrase prediction (MMKP) aims to transcend pure text-based approaches by integrating multi-modal input information to generate a set of conclusive phrases. Traditional multi-modal methods exhibit significant limitations in handling absence scenarios and unseen scenarios. Furthermore, existing benchmarks overestimate model capabilities due to severe train-test set overlap. This paper proposes leveraging vision-language models (VLMs) to address the MMKP task. First, we evaluate the lower-bound performance of VLMs using both zero-shot and supervised fine-tuning (SFT) strategies. Subsequently, we employ the Fine-tune-CoT approach, utilizing high-quality chain-of-thought (CoT) reasoning data generated by teacher models to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, enabling models to flexibly leverage reasoning capabilities during inference.
The multi-modal keyphrase prediction (MMKP) task aims to generate concise, information-rich key phrases (such as hashtags) for social media content containing both text and images. This task holds significant value in applications including social media content understanding, recommendation systems, and content classification.
Constraints of Traditional Multi-modal Approaches: Existing methods such as M3H-ATT and MM-MKP primarily rely on designing cross-modal fusion architectures but perform poorly in complex scenarios, particularly:
Absence Scenario: The target keyphrases do not appear in the input text, so predicting them requires robust cross-modal interaction capabilities
Unseen Scenario: The target keyphrases never appear in the training set, so predicting them demands strong generalization ability from the model
Dataset Issues: Public MMKP datasets suffer from severe train-test overlap: 97.32% of test-set keyphrases also appear in the training set, whereas this ratio is only 45.28% in real production environments
Model Capacity Limitations: Traditional methods are constrained by limited model capacity and world knowledge, making it difficult to handle content involving memes, current events, and other topics requiring external knowledge
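The absence and unseen definitions, and the train-test overlap ratio quoted above, can be made concrete with a small sketch. The function names, the case-insensitive substring test for "appears in the text", and the toy data are all assumptions made for illustration; the exact matching rules used to obtain the 97.32% and 45.28% figures are not stated here.

```python
from typing import Iterable, Set


def is_absent(keyphrase: str, text: str) -> bool:
    """True if the keyphrase does not literally occur in the input text.

    Assumption: a case-insensitive substring match stands in for whatever
    matching rule the benchmark actually uses.
    """
    return keyphrase.lower() not in text.lower()


def is_unseen(keyphrase: str, train_vocab: Set[str]) -> bool:
    """True if the keyphrase never occurs among the training-set keyphrases."""
    return keyphrase.lower() not in train_vocab


def overlap_ratio(test_keyphrases: Iterable[str], train_vocab: Set[str]) -> float:
    """Fraction of test keyphrase occurrences that also occur in training."""
    test_keyphrases = list(test_keyphrases)
    if not test_keyphrases:
        return 0.0
    seen = sum(1 for k in test_keyphrases if not is_unseen(k, train_vocab))
    return seen / len(test_keyphrases)


# Toy data (hypothetical, not taken from any MMKP dataset).
train_vocab = {"travel", "foodie", "sunset"}
test_keyphrases = ["travel", "sunset", "worldcup", "foodie"]
post_text = "Watching the sunset after a long day of travel"

ratio = overlap_ratio(test_keyphrases, train_vocab)  # 3 of 4 are seen: 0.75
```

A benchmark with a ratio near 1.0 mostly rewards memorizing the training vocabulary, which is why a high overlap can overstate how well a model handles unseen keyphrases.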
First Systematic Study: To the authors' knowledge, this is the first comprehensive investigation of VLMs' potential in multi-modal keyphrase prediction tasks
Dynamic CoT Strategy: Proposes a dynamic chain-of-thought strategy enabling VLMs to adaptively select CoT reasoning for difficult unseen samples, better suited for production environments requiring efficient decoding
Dataset Reconstruction: Constructs MMKP-V2 and MMKP-360k datasets that better reflect real-world distributions
Comprehensive Experimental Validation: Conducts rigorous analysis across multiple datasets, verifying method effectiveness and robustness
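One way to picture "adaptively injecting CoT data during training" is as a per-sample choice of training target: hard samples get a rationale followed by the answer, easy samples get the answer alone. The sketch below is only an illustration under stated assumptions. The `difficulty` score, the `threshold` parameter, the `<think>` tags, and all names are hypothetical; the paper's actual selection criterion and output format are not described in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str
    keyphrases: List[str]
    cot: str           # rationale produced by a teacher model
    difficulty: float  # hypothetical score in [0, 1]; higher means harder


def build_target(sample: Sample, threshold: float) -> str:
    """Return the training target string for one sample.

    Assumption: samples at or above the threshold are trained with the
    teacher rationale prepended, all others with the direct answer only.
    """
    answer = ", ".join(sample.keyphrases)
    if sample.difficulty >= threshold:
        return f"<think>{sample.cot}</think> {answer}"
    return answer


easy = Sample("Sunset on the beach", ["sunset", "travel"], "Text mentions sunset.", 0.2)
hard = Sample("He did it again!!", ["worldcup"], "The image shows a football match.", 0.9)

easy_target = build_target(easy, threshold=0.5)
hard_target = build_target(hard, threshold=0.5)
```

Training on a mix of both target formats is what would let a model skip the rationale at inference time for easy inputs, which matches the stated goal of efficient decoding in production.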
Given multi-modal input (text T and image I), the MMKP task requires generating a set of key phrases K = {k₁, k₂, ..., kₙ} that capture the core information of the input content.
Traditional multi-modal models employ multi-task loss functions:
$$L(\theta) = -\sum_{n} \left[ \log P_{\mathrm{cls}}(y^{n}) + \gamma \sum_{t} \log P_{\mathrm{gen}}(y^{n}_{t}) \right]$$
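where the first term is the classification loss, the second term is the keyphrase generation loss, and γ weights the generation term relative to the classification term. Because the classification head can only score a fixed label set, this approach limits open-set generation capabilities.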
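As a worked example of the multi-task objective, the following sketch evaluates the loss on toy probabilities. It assumes the outer sum runs over training samples n and the inner sum over generated tokens t; the function name and the input layout are illustrative, not taken from any of the cited implementations.

```python
import math
from typing import List, Sequence, Tuple


def multitask_loss(samples: Sequence[Tuple[float, List[float]]], gamma: float) -> float:
    """Negative log-likelihood combining classification and generation terms.

    Each sample is (p_cls, p_gen_tokens):
      p_cls         probability the classifier assigns to the gold keyphrase
      p_gen_tokens  per-token probabilities the decoder assigns to the gold sequence
    """
    total = 0.0
    for p_cls, p_gen_tokens in samples:
        cls_term = math.log(p_cls)
        gen_term = sum(math.log(p) for p in p_gen_tokens)
        total += cls_term + gamma * gen_term
    return -total


# Toy probabilities (hypothetical).
samples = [(0.5, [0.5, 0.5]), (0.25, [1.0])]

loss = multitask_loss(samples, gamma=1.0)           # 5 * ln 2
cls_only_loss = multitask_loss(samples, gamma=0.0)  # 3 * ln 2
```

Setting `gamma` to zero removes the generation term entirely, which makes the role of γ as a balance between the two objectives easy to see.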
Early keyphrase prediction approaches fall into three categories: extraction-based, classification-based, and generation-based. Even after the emergence of LLMs, most methods remain limited to text input. NoteLLM2 employs MLLMs for zero-shot compression but does not explore generating more comprehensive and accurate keyphrases.
Vision-language models have progressed from early joint embedding spaces (CLIP) to generative models (Flamingo, BLIP-2), and subsequently to large-scale models (GPT-4V, Qwen-VL, InternVL), with steadily improving cross-modal understanding.
As reasoning models attract more attention, inference-time computation is increasingly recognized as an effective way to unlock LLM potential, and a growing body of research integrates reasoning capabilities into VLMs.
Empirical Threshold Determination: The dynamic CoT threshold γ (a different quantity from the loss weight γ in the multi-task objective) must still be set empirically, and adaptive alternatives show limited effectiveness
High Computational Overhead: VLMs have large parameter counts (2B+), resulting in higher inference costs than traditional methods
High CoT Data Cost: Generating high-quality CoT data requires substantial computational resources
This paper cites important works in multi-modal learning, vision-language models, and reasoning capabilities, providing solid theoretical foundations. Particularly noteworthy are representative models such as CLIP, GPT-4V, and InternVL, as well as recent advances in CoT reasoning.
Overall Assessment: This is a high-quality applied research paper that accurately identifies practical problems, proposes effective solutions, and validates method effectiveness across multiple datasets. The dynamic CoT strategy design is ingenious, maintaining model reasoning capabilities while improving inference efficiency, demonstrating strong practical value. The paper's primary contribution lies in successfully applying VLMs to multi-modal keyphrase prediction tasks and proposing optimization strategies suitable for production environments.