When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Cao, Chen, Wang et al.

Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In the experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLMs' performance on hallucination-specific benchmarks and generalize effectively.

Basic Information

Paper ID: 2510.10466
Title: When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi (MAPLE Lab, Westlake University)
Category: cs.CV (Computer Vision)
Publication Date: October 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10466v1

Abstract

Vision-Language Models (VLMs) demonstrate exceptional performance in multimodal understanding but frequently hallucinate, generating linguistically fluent yet visually irrelevant responses. This paper analyzes how language bias induces hallucinations and proposes Cross-Modal Guidance (CMG), a training-free decoding method that addresses hallucinations by contrasting the output distribution of the original model with that of a copy whose vision-language attention has been degraded. CMG degrades vision-language perception by adaptively masking the attention weights of the most influential image tokens in selected transformer layers. Contrasting against this degraded model reinforces visual context awareness and significantly reduces language bias without compromising VLM capabilities.

Research Background and Motivation

Core Problem

Despite their powerful multimodal understanding capabilities, VLMs suffer from severe hallucination issues:

Language Bias-Driven Hallucinations: Models tend to generate responses based on linguistic patterns while disregarding visual information
Imbalanced Attention Weights: Attention weights for image tokens drop sharply after the shallow layers and stay low in deeper network layers
Underutilized Visual Information: Despite image tokens typically outnumbering text tokens, their influence is underestimated

Problem Significance

VLM hallucinations impede widespread deployment and introduce uncontrollable risks
Users require reliable multimodal AI systems that accurately understand and respond to visual content
Existing solutions either require additional training or have limited effectiveness

Limitations of Existing Methods

VCD Method: Directly adds Gaussian noise to input images, but such perturbations become uncontrollable in deeper networks
ConVis Method: Requires invoking expensive auxiliary models to enhance visual information
Prompt Engineering Methods: Limited effectiveness and insufficient generalizability
Post-training Methods: Require human feedback data and additional training costs

Core Contributions

Proposes CMG Method: A training-free inference approach that reduces model hallucinations through adaptive attention masking
Identifies Hallucination Root Causes: Finds that weak attention connections to image tokens are a critical factor in hallucination generation, supported by an analysis of attention weights
Comprehensive Experimental Validation: Quantitatively evaluates CMG effectiveness across multiple benchmarks, demonstrating its generalization capability
Refined Theoretical Framework: Establishes theoretical foundations for contrastive decoding based on Pointwise Mutual Information (PMI)

Methodology Details

Task Definition

Given textual input $x = \{x_1, x_2, \dots, x_n\}$ and visual input $I = \{I_1, I_2, \dots, I_m\}$, a VLM must generate a text sequence $y = \{y_1, y_2, \dots, y_k\}$ of length $k$. The generation process follows an autoregressive pattern:

$p_\theta(y|x,I) = \prod_{t=1}^k p_\theta(y_t|y_{<t}, x, I)$
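The factorization above is usually evaluated in log space, where the product over decoding steps becomes a sum. A minimal sketch, assuming the per-step probabilities of the chosen tokens are already available (the numbers are hypothetical, not taken from the paper):

```python
import math

def sequence_log_prob(step_probs):
    """Return log p(y | x, I) as the sum of per-step log-probabilities."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical values of p(y_t | y_<t, x, I) for a 3-token response.
step_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(step_probs)
joint = math.exp(log_p)  # same value as multiplying the three probabilities
print(f"log p = {log_p:.4f}, p = {joint:.4f}")
```

Summing logs avoids the numerical underflow that a direct product would hit for long sequences.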

Language Bias Analysis

The paper's attention analysis reveals significant language bias in VLMs:

Attention Weight Decay: Image token attention weights sharply decline in shallow layers and remain low in deeper layers
Text Token Dominance: System token attention weights even exceed those of question tokens containing critical information
Sequence Length Impact: Image attention weights gradually decrease as generated sequences lengthen

CMG Core Architecture

1. Amateur Model Construction

The self-attention map of a VLM can be split into three parts:

Intra-vision attention $A_{iv}$
Intra-text attention $A_{it}$
Cross-modal attention $A_{cr}$

$A = A_{iv} \cup A_{it} \cup A_{cr}$

An amateur model is constructed by masking portions of cross-modal and intra-vision attention weights:

$SA(Q,K,V;M) = \text{Softmax}(A \odot M)V$

where $M := M_{cr} \cup M_{iv}$ is the mask applied to the attention map.
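The masked attention step can be illustrated for a single query token. The sketch below is not the authors' implementation: the formula writes the mask as an element-wise product, while this sketch assumes a masked entry is removed by setting its pre-softmax score to negative infinity so that it receives zero weight. The scores, values, and token layout are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_row(scores, values, keep):
    """Attention output for one query, ignoring keys where keep[j] is False.

    Assumption: masking means excluding the key from the softmax entirely.
    """
    if not any(keep):
        raise ValueError("at least one key must remain unmasked")
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    weights = softmax(masked)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy layout: keys 0-2 are image tokens, keys 3-4 are text tokens.
scores = [2.0, 0.5, 1.0, 0.3, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
keep = [False, True, True, True, True]  # drop the strongest image token

weights, output = masked_attention_row(scores, values, keep)
print(weights, output)
```

Dropping the most attended image token forces the remaining weights to renormalize, which is what makes the resulting "amateur" model rely less on the image.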

2. Contrastive Decoding Strategy

CMG adjusts the output distribution of the original VLM by contrasting it with the amateur model:

$p_\theta(y|x,I) \propto q_\theta(y) \left(\frac{q_\theta(y)}{q_\theta(y;M)}\right)^\alpha$

where:

$q_\theta(y) := p_\theta(y|x,I;A_{cr}, A_{iv}, A_{it})$ (original model)
$q_\theta(y;M) := p_\theta(y|x,I;A_{cr} \odot M_{cr}, A_{iv} \odot M_{iv}, A_{it})$ (amateur model)
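In log space the adjustment above reads $(1+\alpha)\log q_\theta(y) - \alpha \log q_\theta(y;M)$ up to a normalizing constant. The following sketch applies that rule to one decoding step over a three-token vocabulary. The logits are hypothetical and chosen so that the amateur model strongly prefers the token favored by the language prior; `cmg_adjust` is an illustrative name, not a function from the paper's code.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_z for l in logits]

def cmg_adjust(original_logits, amateur_logits, alpha):
    """Return probabilities proportional to q(y) * (q(y) / q(y; M)) ** alpha."""
    log_q = log_softmax(original_logits)
    log_q_m = log_softmax(amateur_logits)
    combined = [(1 + alpha) * a - alpha * b for a, b in zip(log_q, log_q_m)]
    return [math.exp(v) for v in log_softmax(combined)]  # renormalize

# Hypothetical logits: token 0 is favored by the language prior,
# token 1 is the visually grounded answer.
original_logits = [2.0, 1.9, 0.0]
amateur_logits = [2.5, 0.5, 0.0]   # masked model leans harder on the prior

adjusted = cmg_adjust(original_logits, amateur_logits, alpha=0.3)
best_before = max(range(3), key=lambda i: original_logits[i])
best_after = max(range(3), key=lambda i: adjusted[i])
print(best_before, best_after, adjusted)
```

Tokens that become more likely once visual attention is degraded are penalized, so in this toy case the preferred token shifts from the prior-driven one to the grounded one.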

3. Dynamic Masking Strategy

Dynamic Attention Masking: Masks the largest $\gamma$ proportion of attention weights in $A_{iv}$ and $A_{cr}$:

$SA(Q,K,V;M) = \text{Softmax}(A \odot M(\gamma))V$

Dynamic Layer Selection: Selects important layers based on cosine similarity:

$s(i) = \cos(X_i, Y_i) = \frac{X_i \cdot Y_i}{\|X_i\|_2 \|Y_i\|_2}$

Masking is applied to the $\tau$ proportion of layers with the smallest similarity.
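Both selection rules can be sketched in a few lines. This summary does not define $X_i$ and $Y_i$; the sketch assumes they are the input and output hidden states of layer $i$, and it rounds the number of masked weights and selected layers down. Both choices, along with the function names and toy numbers, are assumptions rather than details from the paper.

```python
import math

def top_gamma_mask(image_weights, gamma):
    """Keep-flags that drop the largest `gamma` fraction of attention weights."""
    n_mask = int(gamma * len(image_weights))  # assumption: round down
    ranked = sorted(range(len(image_weights)),
                    key=lambda j: image_weights[j], reverse=True)
    dropped = set(ranked[:n_mask])
    return [j not in dropped for j in range(len(image_weights))]

def cosine(x, y):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def select_layers(layer_inputs, layer_outputs, tau):
    """Indices of the `tau` fraction of layers with the smallest s(i)."""
    sims = [cosine(x, y) for x, y in zip(layer_inputs, layer_outputs)]
    n_select = int(tau * len(sims))  # assumption: round down
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(ranked[:n_select]), sims

# Toy attention weights of one query over four image tokens.
keep = top_gamma_mask([0.4, 0.1, 0.3, 0.2], gamma=0.5)

# Toy hidden states for four layers (assumed input X_i and output Y_i).
layer_inputs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
selected, sims = select_layers(layer_inputs, layer_outputs, tau=0.5)
print(keep, selected)
```

Under the stated assumption, a low similarity means the layer changes the representation the most, so masking there has the largest effect on the amateur model.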

Technical Innovations

Internal Attention Mechanism Operation: Directly manipulates attention weights within transformers rather than perturbing inputs
Adaptive Masking Strategy: Dynamically selects the most influential attention weights and layers for masking
Theory-Driven Design: Constructs contrastive decoding framework based on PMI theory
Training-Free: Operates entirely during inference without requiring additional training

Experimental Setup

Datasets

Hallucination-Specific Benchmarks: HallusionBench, POPE
Comprehensive Evaluation Benchmarks: MME

Evaluation Metrics

POPE: Recall, Accuracy, Precision, Overall Score
HallusionBench: Question Pair Accuracy (qAcc), Figure Accuracy (fAcc), All Accuracy (aAcc)
MME: Scores on 14 sub-tasks for perception and reasoning capabilities
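For reference, the POPE-style metrics can be computed from yes/no answers as below. This is a minimal sketch assuming POPE's binary question format with "yes" as the positive class. F1 is included as the usual combined score, although this summary only lists an "Overall Score". The predictions and labels are made up for illustration.

```python
def binary_metrics(predictions, labels, positive="yes"):
    """Accuracy, precision, recall and F1 for yes/no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and l == positive for p, l in pairs)
    fp = sum(p == positive and l != positive for p, l in pairs)
    fn = sum(p != positive and l == positive for p, l in pairs)
    tn = sum(p != positive and l != positive for p, l in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical model answers and ground-truth labels for six questions.
predictions = ["yes", "yes", "no", "no", "yes", "no"]
labels = ["yes", "no", "no", "yes", "yes", "no"]

metrics = binary_metrics(predictions, labels)
print(metrics)
```

A model that answers "yes" too readily, a typical symptom of object hallucination, shows up here as high recall with low precision.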

Comparison Methods

VCD: Constructs amateur model by adding Gaussian noise to input images
ConVis: Uses text-to-image models to regenerate images and leverages differences to guide generation

Implementation Details

Backbone Models: LLaVA-v1.5-7B, InstructBLIP-7B, Qwen2-VL-7B, InternVL2.5-8B
Parameter Settings:
- Hallucination-specific benchmarks: $\alpha=0.3, \gamma=0.5, \tau=0.5$
- General benchmark MME: $\alpha=0.1, \gamma=0.5, \tau=0.1$
Sampling Parameters: top-p=0.9, beam size=5, temperature=0.7

Experimental Results

Main Results

POPE Benchmark

On LLaVA-v1.5-7B, CMG achieves an overall accuracy of 85.48, surpassing VCD and ConVis. Notably, CMG still improves results on newer architectures (from 89.0 to 89.3 on InternVL2.5), whereas the compared methods degrade performance on these newer models.

HallusionBench Benchmark

CMG surpasses VCD by 7.1 points and ConVis by 6.3 points in accuracy, achieving leading performance among training-free inference methods.

MME Benchmark

On perception-related sub-tasks, CMG's total score exceeds VCD by 62.08 points and ConVis by 7.30 points. It achieves the highest scores on subsets where language bias is particularly prevalent, such as "color," "scene," and "landmark."

Results Across Different Model Scales

CMG demonstrates robust performance improvements across models of varying parameter scales (2B, 7B, 13B, 26B), exhibiting good scalability and architectural adaptability.

Ablation Studies

Experiments validate several amateur model construction strategies:

Complete Visual Attention Removal: Severe performance degradation (fAcc: 12.14)
Noise Replacement: Limited performance (fAcc: 29.48)
Text Replacement: Moderate effectiveness (fAcc: 29.77)
CMG Method: Optimal performance (fAcc: 30.06)

Case Analysis

The paper presents two representative cases:

Painting Understanding Task: Original model incorrectly associates "hat" with character clothing; CMG successfully corrects this and identifies "bandana"
T-shirt Color Identification: Facing interference from a black hat, CMG accurately identifies the T-shirt color by adjusting the PMI ratio

Related Work

Hallucination Problem Research

VLM hallucination has become an important research direction, with existing methods primarily including:

Prompt engineering approaches
Post-training based on human feedback
Alternative inference strategies

Content-Aware Decoding

Search Methods: Greedy search and beam search give accurate results but can be repetitive
Sampling Methods: Nucleus sampling and similar methods give better diversity but can produce unnatural topic transitions
Contrastive Decoding: Leverages the difference between two output distributions to construct an enhanced output distribution

Conclusions and Discussion

Main Conclusions

CMG Effectiveness: Significantly reduces VLM hallucinations without requiring training
Language Bias Impact: Confirms that language bias is a critical factor in hallucination generation
Attention Mechanism Importance: Manipulating attention weights can effectively improve model behavior
Broad Applicability: Method demonstrates excellent performance across multiple model architectures and benchmarks

Limitations

Hyperparameter Sensitivity: Requires careful hyperparameter tuning for different scenarios, such as masking ratios related to $n_0$ in Equation 12
Dynamic Tuning Requirements: Obtaining optimal results currently requires dynamic hyperparameter tuning, increasing usage complexity
Computational Overhead: Requires running both the original model and the amateur (attention-masked) model, increasing inference time

Future Directions

Automatic Hyperparameter Tuning: Develop adaptive parameter selection mechanisms
Efficiency Optimization: Reduce computational overhead and improve inference efficiency
Theoretical Refinement: Further enhance theoretical foundations of contrastive decoding

In-Depth Evaluation

Strengths

Strong Innovation: Addresses VLM hallucinations by manipulating the attention mechanism directly rather than perturbing inputs, offering a fresh research angle
Solid Theoretical Foundation: Contrastive decoding framework based on PMI possesses rigorous theoretical grounding
Comprehensive Experiments: Sufficient validation across multiple benchmarks and model types
High Practical Value: Applicable without training, lowering usage barriers
In-Depth Analysis: Analysis of language bias generation mechanisms provides important insights

Weaknesses

High Complexity: Involves multiple hyperparameters and dynamic selection strategies, increasing usage complexity
Computational Cost: Requires running the original and amateur versions of the model together, increasing inference costs
Parameter Sensitivity: Performance is relatively sensitive to hyperparameter selection, potentially affecting practical applications
Limited Scope: Primarily targets transformer-based VLMs; applicability to other architectures remains unknown

Impact

Academic Contribution: Provides novel solutions to VLM hallucination problems, potentially inspiring subsequent research
Practical Value: Training-free nature facilitates easy deployment in existing systems
Reproducibility: Detailed method descriptions and clear experimental settings support reproducibility

Applicable Scenarios

Applications requiring high-quality visual understanding
Safety-critical applications sensitive to hallucination problems
Resource-constrained environments unable to perform additional training
Commercial applications requiring rapid deployment

References

The paper cites 62 relevant references covering important works in VLMs, hallucination detection, contrastive decoding, and related fields, providing sufficient theoretical foundation and comparative benchmarks.

Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important problem of VLM hallucinations. The method has solid theoretical foundations and strong experimental performance, and it holds significant value for both academia and industry. Despite certain limitations, its contributions and impact are noteworthy.