Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLMs' performance on hallucination-specific benchmarks and generalize effectively.
- Paper ID: 2510.10466
- Title: When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
- Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi (MAPLE Lab, Westlake University)
- Category: cs.CV (Computer Vision)
- Publication Date: October 12, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.10466v1
Vision-Language Models (VLMs) demonstrate exceptional performance in multimodal understanding but frequently suffer from hallucination problems—generating linguistically fluent yet visually irrelevant responses. This paper analyzes how language bias induces hallucinations and proposes Cross-Modal Guidance (CMG), a training-free decoding method that addresses hallucinations by contrasting output distributions between the original model and a vision-language attention-degraded model. CMG disrupts vision-language perception through adaptive masking of attention weights for the most influential image tokens in selected transformer layers, reinforcing visual context awareness and significantly reducing language bias without compromising VLM capabilities.
Despite their powerful multimodal understanding capabilities, VLMs suffer from severe hallucination issues:
- Language Bias-Driven Hallucinations: Models tend to generate responses based on linguistic patterns while disregarding visual information
- Imbalanced Attention Weights: Attention weights for image tokens drop sharply in the shallow layers and stay low in deeper layers
- Underutilized Visual Information: Despite image tokens typically outnumbering text tokens, their influence is underestimated
Motivation:
- VLM hallucinations impede widespread deployment and introduce uncontrollable risks
- Users require reliable multimodal AI systems that accurately understand and respond to visual content
- Existing solutions either require additional training or have limited effectiveness
Limitations of existing approaches:
- VCD Method: Directly adds Gaussian noise to input images, but such perturbations become uncontrollable in deeper networks
- ConVis Method: Requires invoking expensive auxiliary models to enhance visual information
- Prompt Engineering Methods: Limited effectiveness and insufficient generalizability
- Post-training Methods: Require human feedback data and additional training costs
Contributions:
- Proposes CMG Method: A training-free inference approach that effectively reduces model hallucinations through adaptive attention masking
- Identifies Hallucination Root Causes: Shows that weak attention connections to visual tokens are a critical factor in hallucination generation, supported by an analysis of attention weights
- Comprehensive Experimental Validation: Quantitatively evaluates CMG effectiveness across multiple benchmarks, demonstrating its generalization capability
- Refined Theoretical Framework: Establishes theoretical foundations for contrastive decoding based on Pointwise Mutual Information (PMI)
Given textual input $x = \{x_1, x_2, \ldots, x_n\}$ and visual input $I = \{I_1, I_2, \ldots, I_m\}$, a VLM must generate a text sequence $y = \{y_1, y_2, \ldots, y_k\}$ of length $k$. The generation process follows an autoregressive pattern:
$$p_\theta(y \mid x, I) = \prod_{t=1}^{k} p_\theta(y_t \mid y_{<t}, x, I)$$
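As a small numeric illustration of the factorization above, the sketch below accumulates a sequence likelihood by summing per-step log-probabilities, which is how such products are usually computed in practice. The probability values are made up for illustration and do not come from any model.

```python
import math

def sequence_log_prob(step_probs):
    """Return log p(y | x, I) as the sum of per-step log-probabilities.

    step_probs[t] stands for p(y_t | y_<t, x, I); the values are assumed to
    come from some model's softmax output and must lie in (0, 1].
    """
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities for a 3-token answer.
step_probs = [0.9, 0.5, 0.8]
log_p = sequence_log_prob(step_probs)
sequence_prob = math.exp(log_p)  # equals the product 0.9 * 0.5 * 0.8
```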
Research reveals significant language bias in VLMs:
- Attention Weight Decay: Image token attention weights sharply decline in shallow layers and remain low in deeper layers
- Text Token Dominance: System token attention weights even exceed those of question tokens containing critical information
- Sequence Length Impact: Image attention weights gradually decrease as generated sequences lengthen
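One way to quantify the observations above is to measure how much attention mass each token group receives in a given layer. The following is a minimal sketch under assumed inputs: `attn_row` is a single row of attention weights (one query token attending to all context tokens), and the group boundaries and values are hypothetical rather than taken from the paper.

```python
def attention_share(attn_row, groups):
    """Fraction of attention mass each token group receives.

    attn_row: attention weights from the current query token to every
              context token (one head, or an average over heads).
    groups:   mapping name -> (start, end) half-open index range.
    """
    total = sum(attn_row)
    return {name: sum(attn_row[start:end]) / total
            for name, (start, end) in groups.items()}

# Toy row: 2 system tokens, 4 image tokens, 2 question tokens (made-up values).
attn_row = [0.30, 0.20, 0.05, 0.05, 0.05, 0.05, 0.20, 0.10]
groups = {"system": (0, 2), "image": (2, 6), "question": (6, 8)}
shares = attention_share(attn_row, groups)
```

In this toy row the image tokens outnumber the others yet receive the smallest share, which mirrors the pattern described in the list. Repeating the computation per layer and per generation step would give the kind of decay curve the analysis refers to.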
The self-attention map $A$ can be partitioned into three parts:
- Intra-vision attention $A_{iv}$
- Intra-text attention $A_{it}$
- Cross-modal attention $A_{cr}$
$$A = A_{iv} \cup A_{it} \cup A_{cr}$$
An amateur model is constructed by masking portions of cross-modal and intra-vision attention weights:
$$\mathrm{SA}(Q, K, V; M) = \mathrm{Softmax}(A \odot M)\, V$$
where $M := M_{cr} \cup M_{iv}$ is the mask applied to the attention map.
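The masked attention formula can be sketched in plain Python as follows. This follows the formula literally, multiplying the pre-softmax scores by a binary mask, so masked entries become zero logits rather than being excluded. Whether the actual implementation zeroes scores or removes them entirely is not stated in this summary, so treat that choice, the scaling by the square root of the dimension, and all function names as assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def masked_self_attention(Q, K, V, M):
    """Compute Softmax(A * M) V for list-of-lists Q, K, V and binary mask M.

    A is taken to be the scaled dot-product score matrix (an assumption);
    M[i][j] = 0 degrades the connection from query i to key j.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        masked = [s * m for s, m in zip(scores, M[i])]
        weights = softmax(masked)
        out.append([sum(weights[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

# Toy example: one query, two keys; V is the identity so the output equals
# the attention weights themselves.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out_full = masked_self_attention(Q, K, V, [[1, 1]])    # unmasked
out_masked = masked_self_attention(Q, K, V, [[0, 0]])  # both scores zeroed
```

With both scores zeroed, the query can no longer distinguish the keys and the weights become uniform, which is the sense in which masking degrades perception in this sketch.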
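CMG then adjusts the output distribution of the original VLM by contrasting it with the amateur model:
$$p_\theta(y \mid x, I) \propto q_\theta(y) \left( \frac{q_\theta(y)}{q_\theta(y; M)} \right)^{\alpha}$$
where:
- $q_\theta(y) := p_\theta(y \mid x, I; A_{cr}, A_{iv}, A_{it})$ (original model)
- $q_\theta(y; M) := p_\theta(y \mid x, I; A_{cr} \odot M_{cr}, A_{iv} \odot M_{iv}, A_{it})$ (amateur model)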
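In log space, contrasting the original model against the amateur model amounts to computing $(1 + \alpha) \log q_\theta(y) - \alpha \log q_\theta(y; M)$ for each candidate token and renormalizing. The sketch below shows this for one decoding step. The function name `cmg_adjust` and the toy distributions are hypothetical, and the code assumes the ratio is original over amateur, as is standard in contrastive decoding.

```python
import math

def cmg_adjust(q, q_masked, alpha):
    """Combine original and amateur next-token distributions.

    q, q_masked: probability lists over the same vocabulary (all entries > 0).
    alpha:       contrast strength; alpha = 0 returns q unchanged.
    """
    logits = [(1 + alpha) * math.log(a) - alpha * math.log(b)
              for a, b in zip(q, q_masked)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocabulary of 3 tokens. Token 0 is favored even more once vision is
# degraded (language prior), token 1 depends on the image.
q = [0.5, 0.4, 0.1]
q_masked = [0.7, 0.2, 0.1]
adjusted = cmg_adjust(q, q_masked, alpha=0.3)
```

In this toy case the token whose probability rises when visual attention is degraded is down-weighted, and the visually grounded token overtakes it after adjustment.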
Dynamic Attention Masking: Masks the largest γ proportion of attention weights in Aiv and Acr:
$$\mathrm{SA}(Q, K, V; M) = \mathrm{Softmax}(A \odot M(\gamma))\, V$$
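A minimal sketch of how such a mask $M(\gamma)$ might be built for a single attention row is shown below. Applying the top-$\gamma$ selection per row, restricting it to image-token positions, and rounding the count down are all assumptions made for illustration. The summary does not specify these details.

```python
def top_gamma_mask(attn_row, image_positions, gamma):
    """Binary mask that zeroes the largest gamma fraction of image-token weights.

    attn_row:        attention weights for one query token.
    image_positions: indices of image tokens within attn_row.
    gamma:           proportion of image-token weights to mask, in [0, 1].
    """
    k = int(gamma * len(image_positions))  # rounding down is an assumption
    ranked = sorted(image_positions, key=lambda j: attn_row[j], reverse=True)
    to_mask = set(ranked[:k])
    return [0 if j in to_mask else 1 for j in range(len(attn_row))]

# Toy row: position 0 is a text token, positions 1-4 are image tokens.
attn_row = [0.10, 0.40, 0.05, 0.30, 0.15]
mask = top_gamma_mask(attn_row, image_positions=[1, 2, 3, 4], gamma=0.5)
```

Here the two most attended image tokens (positions 1 and 3) are masked, while the text token and the weakly attended image tokens are left intact.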
Dynamic Layer Selection: Selects important layers based on cosine similarity:
$$s(i) = \cos(X_i, Y_i) = \frac{X_i \cdot Y_i}{\lVert X_i \rVert_2 \, \lVert Y_i \rVert_2}$$
Masks the τ proportion of layers with the smallest similarity.
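The layer selection rule can be sketched as follows. This summary does not define $X_i$ and $Y_i$. The sketch assumes they are a layer's input and output hidden-state vectors, which is an assumption rather than a statement from the paper, and the function names are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def select_layers(X, Y, tau):
    """Return indices of the tau fraction of layers with the smallest s(i).

    X[i], Y[i]: the two vectors compared for layer i (assumed here to be the
                layer's input and output representations).
    """
    sims = [cosine(x, y) for x, y in zip(X, Y)]
    k = int(tau * len(sims))  # rounding down is an assumption
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:k]), sims

# Four toy layers with 2-d representations (made-up values).
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
selected, sims = select_layers(X, Y, tau=0.5)
```

Under the stated assumption, a low similarity means the layer changes the representation substantially, so layers 1 and 2 are the ones selected for masking in this toy example.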
- Internal Attention Mechanism Operation: Directly manipulates attention weights within transformers rather than perturbing inputs
- Adaptive Masking Strategy: Dynamically selects the most influential attention weights and layers for masking
- Theory-Driven Design: Constructs contrastive decoding framework based on PMI theory
- Training-Free: Operates entirely during inference without requiring additional training
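The link between the contrastive rule and PMI can be made explicit with a short derivation. It rests on an idealizing assumption that is not taken from the paper: that the masked model $q_\theta(y; M)$ behaves like a language-only predictor $p(y \mid x)$. In practice CMG masks attention only partially, so the correspondence should be read as approximate.

```latex
\begin{align*}
\log p_\theta(y \mid x, I)
  &= \log q_\theta(y)
     + \alpha \left[ \log q_\theta(y) - \log q_\theta(y; M) \right]
     + \text{const} \\
\mathrm{PMI}(y; I \mid x)
  &= \log \frac{p(y \mid x, I)}{p(y \mid x)} \\
\text{if } q_\theta(y; M) \approx p(y \mid x): \quad
\log q_\theta(y) - \log q_\theta(y; M)
  &\approx \mathrm{PMI}(y; I \mid x)
\end{align*}
```

Under this reading, $\alpha$ controls how strongly tokens that share information with the image are rewarded relative to tokens that the language prior alone would predict.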
- Hallucination-Specific Benchmarks: HallusionBench, POPE
- Comprehensive Evaluation Benchmarks: MME
- POPE: Recall, Accuracy, Precision, Overall Score
- HallusionBench: Question-Pair Accuracy (qAcc), Figure Accuracy (fAcc), All Accuracy (aAcc)
- MME: Scores on 14 sub-tasks for perception and reasoning capabilities
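For yes/no benchmarks such as POPE, the accuracy, precision, and recall listed above can be computed from predictions and labels as sketched below. The F1 score is included because it is commonly reported alongside them. It is not named in this summary, and the function name and toy data are hypothetical.

```python
def binary_metrics(preds, labels, positive="yes"):
    """Accuracy, precision, recall, and F1 for yes/no answers."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, l in pairs if p == positive and l == positive)
    fp = sum(1 for p, l in pairs if p == positive and l != positive)
    fn = sum(1 for p, l in pairs if p != positive and l == positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up answers to five "Is there a <object> in the image?" questions.
preds = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no", "no", "yes", "yes"]
metrics = binary_metrics(preds, labels)
```

A model that hallucinates objects tends to answer "yes" too often, which shows up as low precision even when recall is high.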
Baselines:
- VCD: Constructs amateur model by adding Gaussian noise to input images
- ConVis: Uses text-to-image models to regenerate images and leverages differences to guide generation
Implementation details:
- Backbone Models: LLaVA-v1.5-7B, InstructBLIP-7B, Qwen2-VL-7B, InternVL2.5-8B
- Parameter Settings:
  - Hallucination-specific benchmarks: $\alpha = 0.3$, $\gamma = 0.5$, $\tau = 0.5$
  - General benchmark MME: $\alpha = 0.1$, $\gamma = 0.5$, $\tau = 0.1$
- Decoding Parameters: top-p = 0.9, beam size = 5, temperature = 0.7
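To show what the top-p and temperature settings control, here is a minimal sketch of temperature-scaled nucleus sampling for one decoding step. It does not cover beam search, and how the paper combines these options is not described in this summary. The function names and toy logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

def sample_token(logits, top_p=0.9, temperature=0.7, rng=random):
    """Draw one token id from the temperature-scaled nucleus."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
token = sample_token([2.0, 1.0, 0.0, -5.0], rng=random.Random(0))
```

With a contrastive method such as CMG, the adjusted distribution would take the place of `probs` before filtering. That ordering is an assumption of this sketch.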
On LLaVA-v1.5-7B, CMG achieves an overall accuracy of 85.48, surpassing VCD and ConVis. Notably, CMG still yields a gain on a newer architecture (from 89.0 to 89.3 on InternVL2.5), whereas the earlier methods show performance degradation on newer architectures.
CMG surpasses VCD by +7.1 points and ConVis by +6.3 points in accuracy, achieving leading performance among training-free inference methods.
On the perception-related sub-tasks of MME, CMG's total score exceeds VCD by 62.08 points and ConVis by 7.30 points. It achieves the highest scores on subsets where language bias is particularly prevalent, such as "color," "scene," and "landmark."
CMG demonstrates robust performance improvements across models of varying parameter scales (2B, 7B, 13B, 26B), exhibiting good scalability and architectural adaptability.
Ablation experiments compare several strategies for constructing the amateur model:
- Complete Visual Attention Removal: Severe performance degradation (fAcc: 12.14)
- Noise Replacement: Lower performance (fAcc: 29.48)
- Text Replacement: Slightly higher performance (fAcc: 29.77)
- CMG Method: Best performance (fAcc: 30.06)
The paper presents two representative cases:
- Painting Understanding Task: Original model incorrectly associates "hat" with character clothing; CMG successfully corrects this and identifies "bandana"
- T-shirt Color Identification: Facing interference from a black hat, CMG accurately identifies the T-shirt color by adjusting the PMI ratio
VLM hallucination has become an important research direction, with existing methods primarily including:
- Prompt engineering approaches
- Post-training based on human feedback
- Alternative inference strategies
Decoding strategies for language models fall into three groups:
- Search Methods: Greedy search and beam search, which give accurate results but can be repetitive
- Sampling Methods: Nucleus sampling and similar methods, which give better diversity but can produce unnatural topic transitions
- Contrastive Decoding: Leverages the difference between two output distributions to construct an enhanced output distribution
Main conclusions:
- CMG Effectiveness: Significantly reduces VLM hallucinations without requiring training
- Language Bias Impact: Confirms that language bias is a critical factor in hallucination generation
- Attention Mechanism Importance: Manipulating attention weights can effectively improve model behavior
- Broad Applicability: Method demonstrates excellent performance across multiple model architectures and benchmarks
Limitations:
- Hyperparameter Sensitivity: Requires careful hyperparameter tuning for different scenarios, such as masking ratios related to n0 in Equation 12
- Dynamic Tuning Requirements: Obtaining optimal results currently requires dynamic hyperparameter tuning, increasing usage complexity
- Computational Overhead: Requires running both original and amateur models simultaneously, increasing inference time
Future directions:
- Automatic Hyperparameter Tuning: Develop adaptive parameter selection mechanisms
- Efficiency Optimization: Reduce computational overhead and improve inference efficiency
- Theoretical Refinement: Further enhance theoretical foundations of contrastive decoding
Strengths:
- Strong Innovation: Addresses VLM hallucinations by intervening directly on the attention mechanism, providing novel research insights
- Solid Theoretical Foundation: Contrastive decoding framework based on PMI possesses rigorous theoretical grounding
- Comprehensive Experiments: Sufficient validation across multiple benchmarks and model types
- High Practical Value: Applicable without training, lowering usage barriers
- In-Depth Analysis: Analysis of language bias generation mechanisms provides important insights
Weaknesses:
- High Complexity: Involves multiple hyperparameters and dynamic selection strategies, increasing usage complexity
- Computational Cost: Requires running two models simultaneously, increasing inference costs
- Parameter Sensitivity: Performance is relatively sensitive to hyperparameter selection, potentially affecting practical applications
- Limited Scope: Primarily targets transformer-based VLMs; applicability to other architectures remains unknown
Impact:
- Academic Contribution: Provides novel solutions to VLM hallucination problems, potentially inspiring subsequent research
- Practical Value: Training-free nature facilitates easy deployment in existing systems
- Reproducibility: Detailed method descriptions and clear experimental settings ensure good reproducibility
Applicable scenarios:
- Applications requiring high-quality visual understanding
- Safety-critical applications sensitive to hallucination problems
- Resource-constrained environments unable to perform additional training
- Commercial applications requiring rapid deployment
The paper cites 62 relevant references covering important works in VLMs, hallucination detection, contrastive decoding, and related fields, providing sufficient theoretical foundation and comparative benchmarks.
Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important problem of VLM hallucinations. The method has solid theoretical foundations and strong experimental performance, and is valuable for both academia and industry. Despite certain limitations, its contributions and impact are noteworthy.