
When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Cao, Chen, Wang et al.
Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In the experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLMs' performance on hallucination-specific benchmarks and generalize effectively.

Basic Information

  • Paper ID: 2510.10466
  • Title: When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
  • Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi (MAPLE Lab, Westlake University)
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10466v1

Abstract

Vision-Language Models (VLMs) demonstrate exceptional performance in multimodal understanding but frequently suffer from hallucination problems—generating linguistically fluent yet visually irrelevant responses. This paper analyzes how language bias induces hallucinations and proposes Cross-Modal Guidance (CMG), a training-free decoding method that addresses hallucinations by contrasting output distributions between the original model and a vision-language attention-degraded model. CMG disrupts vision-language perception through adaptive masking of attention weights for the most influential image tokens in selected transformer layers, reinforcing visual context awareness and significantly reducing language bias without compromising VLM capabilities.

Research Background and Motivation

Core Problem

Despite their powerful multimodal understanding capabilities, VLMs suffer from severe hallucination issues:

  1. Language Bias-Driven Hallucinations: Models tend to generate responses based on linguistic patterns while disregarding visual information
  2. Imbalanced Attention Weights: Attention weights for image tokens sharply decline in deeper network layers
  3. Underutilized Visual Information: Despite image tokens typically outnumbering text tokens, their influence is underestimated

Problem Significance

  • VLM hallucinations impede widespread deployment and introduce uncontrollable risks
  • Users require reliable multimodal AI systems that accurately understand and respond to visual content
  • Existing solutions either require additional training or have limited effectiveness

Limitations of Existing Methods

  1. VCD Method: Directly adds Gaussian noise to input images, but such perturbations become uncontrollable in deeper networks
  2. ConVis Method: Requires invoking expensive auxiliary models to enhance visual information
  3. Prompt Engineering Methods: Limited effectiveness and insufficient generalizability
  4. Post-training Methods: Require human feedback data and additional training costs

Core Contributions

  1. Proposes CMG Method: A training-free inference approach that reduces model hallucinations through adaptive masking of the most influential image-token attention weights
  2. Identifies Hallucination Root Causes: Finds that insufficient vision-attention connections are a critical factor in hallucination generation, supported by an analysis of attention weights
  3. Comprehensive Experimental Validation: Quantitatively evaluates CMG effectiveness across multiple benchmarks, demonstrating its generalization capability
  4. Refined Theoretical Framework: Establishes theoretical foundations for contrastive decoding based on Pointwise Mutual Information (PMI)

Methodology Details

Task Definition

Given textual input $x = \{x_1, x_2, \ldots, x_n\}$ and visual input $I = \{I_1, I_2, \ldots, I_m\}$, a VLM must generate a text sequence $y = \{y_1, y_2, \ldots, y_k\}$ of length $k$. The generation process follows an autoregressive pattern:

$$p_\theta(y|x,I) = \prod_{t=1}^k p_\theta(y_t|y_{<t}, x, I)$$
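
The factorization can be checked numerically with a minimal sketch. The per-step probabilities below are hypothetical placeholders standing in for what a model would assign to each chosen token; summing logs is used instead of multiplying probabilities, which is the usual way to avoid underflow on long sequences.

```python
import math

def sequence_log_prob(step_probs):
    """Sum per-token log-probabilities: log p(y|x,I) = sum_t log p(y_t | y_<t, x, I)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities of the chosen tokens y_1, y_2, y_3 (not from the paper).
step_probs = [0.5, 0.8, 0.25]
log_p = sequence_log_prob(step_probs)
seq_prob = math.exp(log_p)  # equals the product of the per-step probabilities
```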

Language Bias Analysis

Research reveals significant language bias in VLMs:

  1. Attention Weight Decay: Image token attention weights sharply decline in shallow layers and remain low in deeper layers
  2. Text Token Dominance: System token attention weights even exceed those of question tokens containing critical information
  3. Sequence Length Impact: Image attention weights gradually decrease as generated sequences lengthen
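
One way to quantify observations like these is to measure how much of a query token's attention mass falls on each token group. The sketch below is an illustration only: the token layout and the weights are hypothetical, and `attention_share` is a helper name introduced here, not something from the paper.

```python
def attention_share(attn_row, groups):
    """Return the fraction of one query's attention mass that lands on each token group.

    attn_row: post-softmax attention weights over the key positions.
    groups:   dict mapping a group name to the list of key indices in that group.
    """
    total = sum(attn_row)
    return {name: sum(attn_row[i] for i in idx) / total for name, idx in groups.items()}

# Hypothetical layout: 1 system token, 4 image tokens, 2 question tokens.
attn_row = [0.50, 0.05, 0.05, 0.05, 0.05, 0.15, 0.15]
groups = {"system": [0], "image": [1, 2, 3, 4], "question": [5, 6]}
shares = attention_share(attn_row, groups)
```

In this toy row the four image tokens together receive less attention than the single system token, which is the kind of imbalance the analysis describes.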

CMG Core Architecture

1. Amateur Model Construction

The self-attention map $A$ comprises three types of attention:

  • Intra-vision attention $A_{iv}$
  • Intra-text attention $A_{it}$
  • Cross-modal attention $A_{cr}$

$$A = A_{iv} \cup A_{it} \cup A_{cr}$$

An amateur model is constructed by masking portions of cross-modal and intra-vision attention weights:

$$SA(Q,K,V;M) = \text{Softmax}(A \odot M)V$$

where $M := M_{cr} \cup M_{iv}$ is the mask applied to the attention map.
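
A minimal pure-Python sketch of the masked attention formula, taken literally as written: the mask multiplies the attention scores element-wise before the softmax. Note that under this literal reading a masked score becomes 0 rather than being excluded from the softmax; whether the original implementation instead masks after the softmax or sets scores to negative infinity is not stated here, so treat this as an assumption. The function names and toy values are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def masked_self_attention(scores, mask, values):
    """Compute Softmax(A * M) V row by row (element-wise mask on the scores).

    scores: attention scores A, shape (n_queries, n_keys)
    mask:   M with entries in {0, 1}, same shape as A
    values: V, shape (n_keys, d)
    """
    out = []
    for a_row, m_row in zip(scores, mask):
        weights = softmax([a * m for a, m in zip(a_row, m_row)])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: one query, two keys; key 0 plays the role of an image token.
scores = [[2.0, 0.0]]
values = [[1.0], [3.0]]
plain_out = masked_self_attention(scores, [[1, 1]], values)   # original model
masked_out = masked_self_attention(scores, [[0, 1]], values)  # amateur model
```

Masking the dominant key flattens the attention row, so the output shifts away from that key's value vector.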

2. Contrastive Decoding Strategy

Adjusts the output distribution of the original VLM:

$$p_\theta(y|x,I) \propto q_\theta(y) \left(\frac{q_\theta(y)}{q_\theta(y;M)}\right)^\alpha$$

where:

  • $q_\theta(y) := p_\theta(y|x,I;A_{cr}, A_{iv}, A_{it})$ (original model)
  • $q_\theta(y;M) := p_\theta(y|x,I;A_{cr} \odot M_{cr}, A_{iv} \odot M_{iv}, A_{it})$ (amateur model)
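
In log space the adjustment reads as (1 + α) log q(y) − α log q(y; M), followed by renormalization. The sketch below applies it to a single next-token distribution; the two input distributions are hypothetical, and `cmg_adjust` is a name chosen for illustration rather than the authors' code.

```python
import math

def cmg_adjust(q, q_masked, alpha):
    """Return the normalized distribution proportional to q * (q / q_masked) ** alpha."""
    logits = [(1 + alpha) * math.log(a) - alpha * math.log(b)
              for a, b in zip(q, q_masked)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical next-token distributions over a 3-token vocabulary.
q = [0.5, 0.3, 0.2]          # original model
q_masked = [0.6, 0.1, 0.3]   # amateur model with degraded visual attention
p = cmg_adjust(q, q_masked, alpha=0.3)
```

Token 1 loses most of its probability when visual attention is degraded, so it is treated as visually grounded and boosted; tokens the amateur model prefers are suppressed. Setting `alpha=0` recovers the original distribution.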

3. Dynamic Masking Strategy

Dynamic Attention Masking: Masks the largest $\gamma$ proportion of attention weights in $A_{iv}$ and $A_{cr}$:

$$SA(Q,K,V;M) = \text{Softmax}(A \odot M(\gamma))V$$
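
A sketch of how such a mask could be built for one query row: rank the image-key positions by attention weight and zero the top γ share. The per-row granularity, the floor rounding, and the helper name `top_gamma_mask` are assumptions made for illustration.

```python
import math

def top_gamma_mask(attn_row, image_idx, gamma):
    """Build a 0/1 mask for one query row that zeros the largest `gamma` share
    of the weights at image-key positions. Non-image positions are always kept.
    """
    n_masked = math.floor(gamma * len(image_idx))  # rounding rule is an assumption
    ranked = sorted(image_idx, key=lambda i: attn_row[i], reverse=True)
    dropped = set(ranked[:n_masked])
    return [0 if i in dropped else 1 for i in range(len(attn_row))]

# Hypothetical weights; index 0 is a text token, indices 1-4 are image tokens.
attn_row = [0.30, 0.05, 0.25, 0.10, 0.30]
mask = top_gamma_mask(attn_row, image_idx=[1, 2, 3, 4], gamma=0.5)
```

With γ = 0.5 the two most attended image positions (indices 4 and 2) are masked, while the text token at index 0 is untouched even though its weight is equally large.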

Dynamic Layer Selection: Selects important layers based on cosine similarity:

$$s(i) = \cos(X_i, Y_i) = \frac{X_i \cdot Y_i}{\|X_i\|_2 \|Y_i\|_2}$$

Masks the $\tau$ proportion of layers with the smallest similarity.
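
The layer-selection rule can be sketched as follows. What X_i and Y_i denote for layer i is not spelled out in this summary, so the code simply takes two flat vectors per layer; the floor rounding and the function names are likewise assumptions.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def select_layers(xs, ys, tau):
    """Return (indices of the `tau` share of layers with the smallest s(i), all s(i))."""
    sims = [cosine(x, y) for x, y in zip(xs, ys)]
    n_selected = math.floor(tau * len(sims))  # rounding rule is an assumption
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:n_selected]), sims

# Hypothetical per-layer vectors for a 4-layer model.
xs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
selected, sims = select_layers(xs, ys, tau=0.5)
```

Layers 1 and 2 have the lowest similarity in this toy setup, so they are the ones chosen for masking.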

Technical Innovations

  1. Internal Attention Mechanism Operation: Directly manipulates attention weights within transformers rather than perturbing inputs
  2. Adaptive Masking Strategy: Dynamically selects the most influential attention weights and layers for masking
  3. Theory-Driven Design: Constructs contrastive decoding framework based on PMI theory
  4. Training-Free: Operates entirely during inference without requiring additional training

Experimental Setup

Datasets

  • Hallucination-Specific Benchmarks: HallusionBench, POPE
  • Comprehensive Evaluation Benchmarks: MME

Evaluation Metrics

  • POPE: Recall, Accuracy, Precision, Overall Score
  • HallusionBench: Question Pair Accuracy (qAcc), Figure Accuracy (fAcc), Overall Accuracy (aAcc)
  • MME: Scores on 14 sub-tasks for perception and reasoning capabilities
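
For the yes/no style of question used in POPE, accuracy, precision, and recall can be computed as in the sketch below. The predictions and labels are made up for illustration, "yes" is assumed to be the positive class, and the overall score listed above is not reproduced because its definition is not given here.

```python
def binary_metrics(preds, labels):
    """Accuracy, precision, and recall for yes/no answers ("yes" = positive class)."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical model answers and ground-truth labels.
preds = ["yes", "yes", "no", "no", "yes"]
labels = ["yes", "no", "no", "yes", "yes"]
metrics = binary_metrics(preds, labels)
```

A model that hallucinates objects tends to answer "yes" too often, which shows up as low precision relative to recall.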

Comparison Methods

  • VCD: Constructs amateur model by adding Gaussian noise to input images
  • ConVis: Uses text-to-image models to regenerate images and leverages differences to guide generation

Implementation Details

  • Backbone Models: LLaVA-v1.5-7B, InstructBLIP-7B, Qwen2-VL-7B, InternVL2.5-8B
  • Parameter Settings:
    • Hallucination-specific benchmarks: $\alpha=0.3$, $\gamma=0.5$, $\tau=0.5$
    • General benchmark MME: $\alpha=0.1$, $\gamma=0.5$, $\tau=0.1$
  • Sampling Parameters: top-p=0.9, beam search=5, temperature=0.7
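
The two reported hyperparameter settings could be kept side by side in a small configuration object, as in this sketch. The class and constant names are hypothetical; only the numeric values come from the settings listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CMGSettings:
    alpha: float  # contrast strength in the decoding formula
    gamma: float  # share of the largest attention weights that is masked
    tau: float    # share of layers (lowest similarity) that is masked

# Values as reported for the two benchmark groups.
HALLUCINATION_BENCHMARKS = CMGSettings(alpha=0.3, gamma=0.5, tau=0.5)
MME = CMGSettings(alpha=0.1, gamma=0.5, tau=0.1)
```

The general-purpose benchmark uses a weaker contrast and masks fewer layers, which fits the stated goal of reducing language bias without harming general capability.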

Experimental Results

Main Results

POPE Benchmark

On LLaVA-v1.5-7B, CMG achieves an overall accuracy of 85.48, surpassing VCD and ConVis. Notably, CMG still yields gains on newer architectures (improving from 89.0 to 89.3 on InternVL2.5), while the other methods show performance degradation on them.

HallusionBench Benchmark

CMG surpasses VCD by +7.1 points and ConVis by +6.3 points in accuracy, achieving leading performance among training-free inference methods.

MME Benchmark

On perception-related sub-tasks, CMG's total score exceeds VCD by +62.08 points and ConVis by +7.30 points. It achieves the highest scores on subsets where language bias is particularly prevalent, such as "color," "scene," and "landmark."

Results Across Different Model Scales

CMG demonstrates robust performance improvements across models of varying parameter scales (2B, 7B, 13B, 26B), exhibiting good scalability and architectural adaptability.

Ablation Studies

Experiments validate several amateur model construction strategies:

  • Complete Visual Attention Removal: Severe performance degradation (fAcc: 12.14)
  • Noise Replacement: Limited performance (fAcc: 29.48)
  • Text Replacement: Moderate effectiveness (fAcc: 29.77)
  • CMG Method: Optimal performance (fAcc: 30.06)

Case Analysis

The paper presents two representative cases:

  1. Painting Understanding Task: Original model incorrectly associates "hat" with character clothing; CMG successfully corrects this and identifies "bandana"
  2. T-shirt Color Identification: Facing interference from a black hat, CMG accurately identifies the T-shirt color by adjusting the PMI ratio

Hallucination Problem Research

VLM hallucination has become an important research direction, with existing methods primarily including:

  • Prompt engineering approaches
  • Post-training based on human feedback
  • Alternative inference strategies

Content-Aware Decoding

  • Search Methods: Greedy search and beam search give accurate results but can be repetitive
  • Sampling Methods: Nucleus sampling and similar methods give better diversity but can produce unnatural topic transitions
  • Contrastive Decoding: Leverages the difference between two output distributions to construct an enhanced output distribution

Conclusions and Discussion

Main Conclusions

  1. CMG Effectiveness: Significantly reduces VLM hallucinations without requiring training
  2. Language Bias Impact: Confirms that language bias is a critical factor in hallucination generation
  3. Attention Mechanism Importance: Manipulating attention weights can effectively improve model behavior
  4. Broad Applicability: Method demonstrates excellent performance across multiple model architectures and benchmarks

Limitations

  1. Hyperparameter Sensitivity: Requires careful hyperparameter tuning for different scenarios, such as masking ratios related to $n_0$ in Equation 12
  2. Dynamic Tuning Requirements: Obtaining optimal results currently requires dynamic hyperparameter tuning, increasing usage complexity
  3. Computational Overhead: Requires running both original and amateur models simultaneously, increasing inference time

Future Directions

  1. Automatic Hyperparameter Tuning: Develop adaptive parameter selection mechanisms
  2. Efficiency Optimization: Reduce computational overhead and improve inference efficiency
  3. Theoretical Refinement: Further enhance theoretical foundations of contrastive decoding

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to address VLM hallucinations from the attention mechanism perspective, providing novel research insights
  2. Solid Theoretical Foundation: Contrastive decoding framework based on PMI possesses rigorous theoretical grounding
  3. Comprehensive Experiments: Sufficient validation across multiple benchmarks and model types
  4. High Practical Value: Applicable without training, lowering usage barriers
  5. In-Depth Analysis: Analysis of language bias generation mechanisms provides important insights

Weaknesses

  1. High Complexity: Involves multiple hyperparameters and dynamic selection strategies, increasing usage complexity
  2. Computational Cost: Requires running two models simultaneously, increasing inference costs
  3. Parameter Sensitivity: Performance is relatively sensitive to hyperparameter selection, potentially affecting practical applications
  4. Limited Scope: Primarily targets transformer-based VLMs; applicability to other architectures remains unknown

Impact

  1. Academic Contribution: Provides novel solutions to VLM hallucination problems, potentially inspiring subsequent research
  2. Practical Value: Training-free nature facilitates easy deployment in existing systems
  3. Reproducibility: Detailed method descriptions and clear experimental settings ensure good reproducibility

Applicable Scenarios

  • Applications requiring high-quality visual understanding
  • Safety-critical applications sensitive to hallucination problems
  • Resource-constrained environments unable to perform additional training
  • Commercial applications requiring rapid deployment

References

The paper cites 62 relevant references covering important works in VLMs, hallucination detection, contrastive decoding, and related fields, providing sufficient theoretical foundation and comparative benchmarks.


Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important research direction of VLM hallucinations. The method possesses solid theoretical foundations and excellent experimental performance, holding significant value for both academia and industry. Despite certain limitations, its contributions and impact are noteworthy.