
When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

Cao, Chen, Wang et al.

Vision-Language Models (VLMs) have shown solid multimodal understanding of both visual and language contexts. However, existing VLMs often suffer from severe hallucinations: they tend to generate responses that are linguistically fluent but irrelevant to the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and of a variant with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers to corrupt visual-language perception as a concrete type of degradation. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. Comprehensive experiments demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also show quantitatively that CMG improves the performance of different VLMs on hallucination-specific benchmarks and generalizes effectively.

academic

Basic information

Paper ID: 2510.10466
Title: When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi (MAPLE Lab, Westlake University)
Category: cs.CV (Computer Vision)
Published: October 12, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.10466v1

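Summary

Vision-language models (VLMs) perform well at multimodal understanding but frequently hallucinate: they produce answers that are fluent in language yet unrelated to the image content. The paper analyzes how language bias leads to hallucinations and proposes Cross-Modal Guidance (CMG), a training-free decoding method that addresses hallucinations by contrasting the output distribution of the original model with that of a model whose visual-language attention has been degraded. CMG corrupts visual-language perception by adaptively masking the attention weights of the most influential image tokens in selected transformer layers. Contrasting against this degraded model strengthens the perception of visual context and significantly reduces language bias without harming the capabilities of the VLM.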

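Background and motivation

Core problem

Although VLMs are strong at multimodal understanding, they suffer from severe hallucinations:

Language-bias-driven hallucination: the model tends to answer from language patterns and ignores visual information
Imbalanced attention weights: the attention weights on image tokens drop sharply in the shallow layers and stay low in the deeper layers
Underuse of visual information: image tokens usually far outnumber text tokens, yet their influence on the output is disproportionately small

Why the problem matters

Hallucinations hinder the wide deployment of VLMs and introduce uncontrollable risks
Users need reliable multimodal AI systems that understand and respond to visual content accurately
Existing solutions either require additional training or have limited effect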

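Limitations of existing methods

VCD: adds Gaussian noise directly to the input image, but the effect of this perturbation becomes uncontrollable in the deeper layers of the network
ConVis: requires calling an expensive additional model to enhance visual information
Prompt engineering: limited effect and not general enough
Post-training methods: require human feedback data and additional training cost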

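Core contributions

CMG: a training-free inference method that reduces model hallucinations through adaptive attention masking
Root cause of hallucinations: insufficient visual-language attention connections are identified as an important cause of hallucinations, with supporting evidence
Comprehensive experimental validation: CMG is evaluated quantitatively on multiple benchmarks and shown to generalize
Theoretical framework: contrastive decoding is grounded in pointwise mutual information (PMI)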

Method

Task definition

Given a text input $x = \{x_1, x_2, ..., x_n\}$ and a visual input $I = \{I_1, I_2, ..., I_m\}$, the VLM generates a text sequence $y = \{y_1, y_2, ..., y_k\}$ of length $k$. Generation is autoregressive:

$p_\theta(y|x,I) = \prod_{t=1}^k p_\theta(y_t|y_{<t}, x, I)$
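The factorization above is usually evaluated in log space, where the product becomes a sum of per-step log-probabilities. The following is a minimal sketch of that arithmetic; the function name `sequence_log_prob` and the three step probabilities are made up for illustration and do not come from the paper.

```python
import math

def sequence_log_prob(step_probs):
    """Return log p(y | x, I) as the sum of per-step log-probabilities."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities p(y_t | y_<t, x, I) of three generated tokens.
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)
joint = math.exp(log_p)  # equals the product of the per-step probabilities
```

Summing logs avoids the numerical underflow that a direct product runs into for long sequences.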

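Language bias analysis

The study finds pronounced language bias in VLMs:

Attention weight decay: the attention weights on image tokens drop sharply in the shallow layers and stay low in the deeper layers
Text token dominance: system tokens receive even more attention than the question tokens that carry the key information
Effect of sequence length: as the generated sequence grows longer, the attention on the image gradually decreases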

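CMG core architecture

1. Amateur model construction

Self-attention comprises three types of attention:

Intra-visual attention $A_{iv}$
Intra-text attention $A_{it}$
Cross-modal attention $A_{cr}$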

$A = A_{iv} \cup A_{it} \cup A_{cr}$

The amateur model is built by masking part of the cross-modal and intra-visual attention weights:

$SA(Q,K,V;M) = \text{Softmax}(A \odot M)V$

where $M := M_{cr} \cup M_{iv}$ is the mask applied to the attention map.
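The three-way split of the attention map and the effect of a mask on one attention row can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the token layout, the helper names `attention_type` and `masked_softmax`, and in particular the choice to realize the mask by sending a score to negative infinity before the softmax are all assumptions (the notes only write the mask as an element-wise product with $A$).

```python
import math

def attention_type(query_is_image, key_is_image):
    """Label one (query, key) pair as intra-visual, intra-text, or cross-modal."""
    if query_is_image and key_is_image:
        return "iv"
    if not query_is_image and not key_is_image:
        return "it"
    return "cr"

def masked_softmax(scores, keep):
    """Softmax over one attention row; entries with keep == 0 get zero weight.

    Assumption: a masked entry has its score set to -inf before the softmax.
    At least one entry must be kept.
    """
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical layout: two image tokens followed by two text tokens.
is_image = [True, True, False, False]
types = [[attention_type(q, k) for k in is_image] for q in is_image]

# Row of the last text token: mask its cross-modal link to image token 0.
scores = [2.0, 1.0, 0.5, 0.0]
keep = [0, 1, 1, 1]
weights = masked_softmax(scores, keep)
```

In this toy row the strongest image connection is removed and the remaining weights are renormalized, which is the kind of degraded visual perception the amateur model is meant to have.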

2. Contrastive decoding strategy

The output distribution of the original VLM is adjusted as:

$p_\theta(y|x,I) \propto q_\theta(y) \left(\frac{q_\theta(y)}{q_\theta(y;M)}\right)^\alpha$

where:

$q_\theta(y) := p_\theta(y|x,I;A_{cr}, A_{iv}, A_{it})$ (original model)
$q_\theta(y;M) := p_\theta(y|x,I;A_{cr} \odot M_{cr}, A_{iv} \odot M_{iv}, A_{it})$ (amateur model)
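Taking logarithms, the adjusted distribution satisfies $\log p_\theta \propto (1+\alpha)\log q_\theta(y) - \alpha \log q_\theta(y;M)$ up to a normalizing constant. The sketch below applies that rule to a single decoding step over a toy three-token vocabulary. The function names and the logit values are hypothetical, and any extra safeguards a real decoder might add (for example restricting the candidate set) are not covered here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - lse for v in logits]

def cmg_adjust(logits_orig, logits_amateur, alpha):
    """Return p proportional to q * (q / q_M) ** alpha for one decoding step."""
    log_q = log_softmax(logits_orig)
    log_qm = log_softmax(logits_amateur)
    combined = [(1 + alpha) * a - alpha * b for a, b in zip(log_q, log_qm)]
    return [math.exp(v) for v in log_softmax(combined)]

# Hypothetical logits: the original model favors token 0, while the amateur
# model (degraded visual attention) favors token 1.
logits_orig = [2.0, 1.0, 0.0]
logits_amateur = [1.0, 2.0, 0.0]

q = cmg_adjust(logits_orig, logits_amateur, alpha=0.0)  # reduces to q itself
p = cmg_adjust(logits_orig, logits_amateur, alpha=0.3)
```

Tokens that the amateur model prefers more than the original model does are pushed down, and tokens that depend on the intact visual attention are pushed up; with `alpha=0.0` the original distribution is recovered.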

3. Dynamic masking strategy

Dynamic attention masking: mask the largest $\gamma$ fraction of the attention weights in $A_{iv}$ and $A_{cr}$:

$SA(Q,K,V;M) = \text{Softmax}(A \odot M(\gamma))V$
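One way to build such a mask for a single attention row is sketched below. It is an assumption-laden illustration: the helper name `top_gamma_keep_mask`, the `maskable` flag (marking entries that belong to $A_{iv}$ or $A_{cr}$), the per-row granularity, and rounding the count down with `int` are all choices made here, not details taken from the paper.

```python
def top_gamma_keep_mask(weights, maskable, gamma):
    """Return a 0/1 keep-mask that drops the largest gamma fraction of the
    maskable entries (those in A_iv or A_cr); other entries are always kept."""
    candidates = [i for i, m in enumerate(maskable) if m]
    n_mask = int(gamma * len(candidates))  # assumption: round down
    largest = sorted(candidates, key=lambda i: weights[i], reverse=True)[:n_mask]
    dropped = set(largest)
    return [0 if i in dropped else 1 for i in range(len(weights))]

# Hypothetical attention row: four maskable entries and one intra-text entry.
weights = [0.4, 0.1, 0.3, 0.05, 0.15]
maskable = [True, True, True, True, False]
mask = top_gamma_keep_mask(weights, maskable, gamma=0.5)
```

With `gamma=0.5` the two largest maskable weights (positions 0 and 2) are dropped, while the intra-text entry at position 4 is never touched.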

Dynamic layer selection: select important layers by cosine similarity:

$s(i) = \cos(X_i, Y_i) = \frac{X_i \cdot Y_i}{\|X_i\|_2 \|Y_i\|_2}$

The $\tau$ fraction of layers with the lowest similarity is selected for masking.
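The selection rule can be sketched as below. These notes do not define $X_i$ and $Y_i$, so the sketch simply treats them as two vectors associated with layer $i$ (for instance the layer's input and output hidden states, which is an assumption). The function names, the toy vectors, and rounding the layer count down are likewise illustrative.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def select_layers(xs, ys, tau):
    """Return the indices of the tau fraction of layers with the lowest s(i)."""
    sims = [cosine(x, y) for x, y in zip(xs, ys)]
    n_select = int(tau * len(sims))  # assumption: round down
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:n_select]), sims

# Hypothetical per-layer vectors for a 4-layer model.
xs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected, sims = select_layers(xs, ys, tau=0.5)
```

Here layers 1 and 3 have the lowest similarity and are the ones chosen for masking.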

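Technical innovations

Operating on internal attention: CMG manipulates the attention weights inside the transformer directly rather than perturbing the input
Adaptive masking strategy: the most influential attention weights and layers are selected dynamically for masking
Theory-driven design: the contrastive decoding framework is built on PMI
No training cost: the method works entirely at inference time and needs no additional training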

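Experimental setup

Datasets

Hallucination benchmarks: HallusionBench, POPE
Comprehensive evaluation benchmark: MME

Evaluation metrics

POPE: recall, accuracy, precision, and overall score
HallusionBench: question-pair accuracy (qAcc), figure accuracy (fAcc), and overall accuracy (aAcc)
MME: scores on 14 subtasks covering perception and reasoning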
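For a yes/no benchmark such as POPE, accuracy, precision, and recall follow from the confusion counts. The sketch below shows the standard definitions on made-up predictions, treating "yes" as the positive class. The "overall" score listed above is not defined in these notes, so it is left out.

```python
def yes_no_metrics(preds, labels):
    """Accuracy, precision, and recall with 1 = "yes" as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical answers to six "is there a <object> in the image?" questions.
labels = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0]
metrics = yes_no_metrics(preds, labels)
```

In this setting a false positive is an object the model claims to see that is not in the image, so precision is the metric most directly tied to hallucinated objects.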

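Baselines

VCD: builds the amateur model by adding Gaussian noise to the input image
ConVis: regenerates the image with a text-to-image model and uses the difference to guide generation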

Implementation details

Backbone models: LLaVA-v1.5-7B, InstructBLIP-7B, Qwen2-VL-7B, InternVL2.5-8B
Parameter settings:
- Hallucination-specific benchmarks: $\alpha=0.3, \gamma=0.5, \tau=0.5$
- General benchmark MME: $\alpha=0.1, \gamma=0.5, \tau=0.1$
Sampling parameters: top-p=0.9, beam search=5, temperature=0.7
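The two hyperparameter settings above can be captured in a small configuration object. The sketch below is only an illustration: the class name `CMGConfig`, the helper `layers_to_mask`, the 32-layer depth, and rounding down are assumptions, not values stated in these notes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CMGConfig:
    alpha: float  # strength of the contrastive adjustment
    gamma: float  # fraction of the largest attention weights to mask
    tau: float    # fraction of layers (lowest similarity) to mask

# Settings as listed in the implementation details.
HALLUCINATION_BENCHMARKS = CMGConfig(alpha=0.3, gamma=0.5, tau=0.5)
MME_BENCHMARK = CMGConfig(alpha=0.1, gamma=0.5, tau=0.1)

def layers_to_mask(config, num_layers):
    """Number of layers selected for masking (assumption: round down)."""
    return int(config.tau * num_layers)

# Assuming a hypothetical 32-layer decoder.
n_hallucination = layers_to_mask(HALLUCINATION_BENCHMARKS, 32)
n_mme = layers_to_mask(MME_BENCHMARK, 32)
```

Under these assumptions the general-benchmark setting touches far fewer layers and uses a weaker contrast, which is consistent with the aim of not harming general capability.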