
Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Saraf, Boroujeni, Beaudry et al.

Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.



Basic Information

Paper ID: 2508.21164
Title: Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Categories: cs.CL, cs.AI
Published: October 9, 2025 (arXiv v3)
Paper link: https://arxiv.org/abs/2508.21164v3

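Summary

This study investigates systematic bias in self- and cross-evaluations across three mainstream large language models (ChatGPT, Gemini, and Claude). In a controlled experiment, each model evaluated blog posts written by every model under four labeling conditions: no label, true label, and two false-label scenarios. Evaluations combined holistic preference voting with fine-grained quality ratings on three dimensions (coherence, informativeness, and conciseness), with all scores normalized to percentages for direct comparison. The judgments proved markedly asymmetric: the "Claude" label raised scores regardless of the actual author, while the "Gemini" label systematically lowered them. False labels frequently reversed preference rankings, shifting voting outcomes by up to 50 percentage points and quality ratings by up to 12 percentage points.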
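The four labeling conditions can be sketched as a mapping from each post's true author to the label shown to the evaluator. This is a minimal illustration, not the authors' code: the function name `label_conditions` and the condition names are hypothetical, and the construction of the two false-label scenarios as cyclic shifts of the model list is an assumption, since this summary does not say how false labels were assigned.

```python
MODELS = ["ChatGPT", "Gemini", "Claude"]


def label_conditions(models):
    """Return {condition_name: {true_author: displayed_label}}.

    A displayed label of None means the post is shown without attribution.
    """
    n = len(models)
    conditions = {
        "no_label": {m: None for m in models},
        "true_label": {m: m for m in models},
    }
    # Assumption: each false-label scenario is a cyclic shift of the model
    # list, so every post carries a wrong label and every label is used once.
    for shift in range(1, n):
        conditions[f"false_label_{shift}"] = {
            models[i]: models[(i + shift) % n] for i in range(n)
        }
    return conditions


if __name__ == "__main__":
    for name, mapping in label_conditions(MODELS).items():
        print(name, mapping)
```

With three models, the two non-trivial shifts yield exactly two false-label scenarios, which matches the count of conditions described above.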

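Research Background and Motivation

Core Questions

As large language models are increasingly deployed to evaluate text quality, the validity of their judgments remains underexplored. The study addresses the following questions:

LLM evaluation bias: can LLMs evaluate outputs impartially, or are they swayed by the perceived identity of the author?
Label-induced bias: does the model name affect evaluation results independently of actual quality?
Self-preference bias: do models tend to rate their own outputs more highly?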

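Importance

The question matters for the following reasons:

The LLM-as-judge paradigm is increasingly common in automated text evaluation
Evaluation bias can distort benchmark results
It affects the fairness of model comparison and selection
It challenges the reliability and transparency of AI systems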

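Limitations of Existing Research

Existing research mainly addresses a single type of bias or a limited number of models, and lacks:

Controlled comparative analysis across multiple models and multiple conditions
Quantitative evidence comparing label effects on preference and on quality dimensions
Systematic recommendations for mitigating bias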

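Core Contributions

Controlled multi-condition analysis: provides a controlled, multi-condition framework for analyzing bias in self- and cross-model evaluation
Quantitative evidence of bias: provides quantitative evidence comparing label effects on preference and on quality dimensions
Bias mitigation recommendations: offers recommendations for mitigating bias through blind evaluation or multi-model evaluation protocols
Dual scoring approach: uses two complementary methods, percentage-based preference scores and point-based quality ratings
Label asymmetry finding: finds that the "Claude" label consistently raises scores, while the "Gemini" label systematically lowers them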
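The two scoring methods become comparable once both are expressed as percentages, and a label effect can then be read as a difference in percentage points. The sketch below illustrates that arithmetic only. The helper names and the `max_score` parameter are hypothetical, the rating scale is an assumption because this summary does not state it, and the numbers are made-up inputs rather than results from the paper.

```python
def to_percent(score, max_score):
    """Normalize a point-based rating to a percentage of the maximum."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return 100.0 * score / max_score


def vote_share(votes, candidate):
    """Percentage of preference votes that went to `candidate`."""
    if not votes:
        raise ValueError("votes must not be empty")
    return 100.0 * votes.count(candidate) / len(votes)


def shift_pp(baseline_pct, labeled_pct):
    """Label effect in percentage points, relative to the baseline condition."""
    return labeled_pct - baseline_pct


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    print(to_percent(7, 10))                    # 70.0
    print(vote_share(["A", "B", "A", "A"], "A"))  # 75.0
    print(shift_pp(70.0, 60.0))                 # -10.0
```

A negative shift means the label lowered the score relative to the baseline, for example the unlabeled condition.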

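Method Details

Experimental Design

The study uses a three-stage, controlled, multi-model, multi-condition design:

Stage 1: Blog Generation

Models: ChatGPT-4o, Gemini 2.5 Flash, Claude Sonnet 4
Task: generate blog posts of approximately 200 words using a fixed prompt template
Prompt template: "You are a professional blog writer. Write a concise blog post (around 200 words) for the title ''. The style should be engaging and suitable for an online audience. Return only the blog content, no extra text."
Data: 10 distinct topic titles, with each model generating one blog post per title, for 30 blog posts in total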
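The Stage 1 setup can be sketched as a grid of generation jobs, one per model and title. This is a hedged illustration only: the `{title}` placeholder is an assumption about where the title is substituted into the template, the titles are hypothetical stand-ins because the actual ten titles are not listed here, and `build_jobs` is an illustrative helper that prepares prompts without calling any model API.

```python
from itertools import product

# Prompt wording is taken from the template above; the {title} slot is an
# assumption about where the title was inserted.
PROMPT_TEMPLATE = (
    "You are a professional blog writer. Write a concise blog post "
    "(around 200 words) for the title '{title}'. The style should be "
    "engaging and suitable for an online audience. "
    "Return only the blog content, no extra text."
)

MODELS = ["ChatGPT-4o", "Gemini 2.5 Flash", "Claude Sonnet 4"]

# Hypothetical stand-ins: the actual titles are not given in this summary.
TITLES = [f"Placeholder title {i}" for i in range(1, 11)]


def build_jobs(models, titles):
    """Return one generation job per (model, title) pair."""
    return [
        {"model": m, "title": t, "prompt": PROMPT_TEMPLATE.format(title=t)}
        for m, t in product(models, titles)
    ]


jobs = build_jobs(MODELS, TITLES)
print(len(jobs))  # 3 models x 10 titles = 30
```

Keeping the prompt fixed and varying only the model and title means that differences between posts come from the models themselves rather than from the instructions.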