Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Saraf, Boroujeni, Beaudry et al.

Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.

Basic Information

Paper ID: 2508.21164
Title: Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Classification: cs.CL, cs.AI
Publication Date: October 9, 2025 (arXiv v3)
Paper Link: https://arxiv.org/abs/2508.21164v3

Abstract

This study investigates systematic biases in self-evaluation and cross-evaluation among three mainstream large language models (ChatGPT, Gemini, and Claude). The research employs a controlled experiment where each model evaluates blog articles generated by various models under four label conditions (no label, true label, and two false label scenarios). Evaluation employs both holistic preference voting and fine-grained quality scoring across three dimensions (coherence, informativeness, conciseness), with all scores normalized to percentages for direct comparison. The study reveals significant asymmetries in model judgment: the "Claude" label consistently elevates scores regardless of actual authorship, while the "Gemini" label systematically depresses scores. False labels frequently reverse preference rankings, producing variations of up to 50 percentage points in voting results and up to 12 percentage points in quality scores.

Research Background and Motivation

Core Research Questions

As large language models are increasingly deployed as text quality assessment tools, the validity of their judgments remains insufficiently explored. This study addresses the following key questions:

LLM Evaluation Bias Problem: Can LLMs fairly evaluate outputs, or are they influenced by perceived authorship?
Label-Induced Bias: Do model names affect evaluation results independent of actual quality?
Self-Preference Bias: Do models tend to assign higher scores to their own outputs?

Significance

The importance of this problem is evident in:

The growing prevalence of the LLM-as-judge paradigm in automated text evaluation
How evaluation bias can distort benchmark results
Impact on fairness in model comparison and selection
Challenges to the reliability and transparency of AI systems

Limitations of Existing Research

Existing studies focus primarily on a single type of bias or a small number of models, and lack:

Controlled comparative analysis across multiple models and conditions
Quantitative evidence comparing label effects across preference and quality dimensions
Systematic recommendations for bias mitigation

Core Contributions

Controlled Multi-Condition Analysis: Provides a controlled, multi-condition analytical framework for self- and cross-model evaluation bias
Quantitative Bias Evidence: Furnishes quantitative evidence comparing label effects across preference and quality dimensions
Bias Mitigation Recommendations: Offers suggestions for mitigating bias through blind evaluation or multi-model evaluation protocols
Dual Scoring Methodology: Employs complementary percentage preference scoring and point-based quality scoring methods
Label Asymmetry Findings: Discovers that "Claude" labels consistently elevate scores while "Gemini" labels systematically depress scores

Methodology Details

Experimental Design

This study employs a three-stage controlled multi-model, multi-condition design:

Stage 1: Blog Generation

Models: ChatGPT-4o, Gemini 2.5 Flash, Claude Sonnet 4
Task: Generate approximately 200-word blog articles using fixed prompt templates
Prompt Template: "You are a professional blog writer. Write a concise blog post (around 200 words) for the title ''. The style should be engaging and suitable for an online audience. Return only the blog content, no extra text."
Data: 10 different topic titles, with each model generating one blog per title, totaling 30 blogs

Stage 2: Label Condition Setup

Four label conditions:

No Label: No author attribution
True Label: Correct attribution
False Label Scenario 1: ChatGPT labeled as Gemini, Gemini as Claude, Claude as ChatGPT
False Label Scenario 2: ChatGPT labeled as Claude, Gemini as ChatGPT, Claude as Gemini
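The four label conditions can be written down as a lookup from true author to displayed label. The sketch below is illustrative only: the condition keys (`no_label`, `false_label_1`, and so on) and the helper names are assumptions introduced here, not identifiers from the paper.

```python
from itertools import product

MODELS = ["ChatGPT", "Gemini", "Claude"]

# Displayed label for each true author under each condition.
# None means that no attribution is shown to the judge.
LABEL_CONDITIONS = {
    "no_label": {m: None for m in MODELS},
    "true_label": {m: m for m in MODELS},
    "false_label_1": {"ChatGPT": "Gemini", "Gemini": "Claude", "Claude": "ChatGPT"},
    "false_label_2": {"ChatGPT": "Claude", "Gemini": "ChatGPT", "Claude": "Gemini"},
}


def displayed_label(condition, author):
    """Return the label a judge would see for a post by `author`."""
    return LABEL_CONDITIONS[condition][author]


def evaluation_grid():
    """Yield every (judge, condition, author, shown_label) combination."""
    for judge, condition, author in product(MODELS, LABEL_CONDITIONS, MODELS):
        yield judge, condition, author, displayed_label(condition, author)
```

Both false-label scenarios are cyclic shifts of the model names, so every post carries a wrong label and, across the two scenarios, each author is shown under both of the other two names.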

Stage 3: Dual Scoring System

Percentage Preference Scoring: Measures how often each output is selected as "best", expressed as a percentage of votes
Point-Based Quality Scoring: Scores each output from 0 to 10 on three dimensions (coherence, informativeness, conciseness), then converts the scores to percentages
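A minimal sketch of the two scoring methods follows. The function names are hypothetical, and averaging the three dimensions into one overall figure is an assumption made for illustration; the summary only states that 0-10 scores are converted to percentages.

```python
from collections import Counter


def preference_percentages(votes, candidates):
    """Percentage of votes in which each candidate was picked as best."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {c: 100.0 * counts[c] / len(votes) for c in candidates}


def dimension_percentages(scores, max_score=10):
    """Convert each 0-10 dimension score to a percentage."""
    return {dim: 100.0 * value / max_score for dim, value in scores.items()}


def overall_quality(scores, max_score=10):
    """Mean of the per-dimension percentages (aggregation is an assumption)."""
    per_dim = dimension_percentages(scores, max_score)
    return sum(per_dim.values()) / len(per_dim)
```

For example, four votes split as Claude, Claude, ChatGPT, Gemini give Claude a 50% preference share, and scores of 8, 7, and 9 on the three dimensions average to 80%.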

Analysis Levels

Within-Condition Analysis: Compares how the judges score each model's output within a single label condition
Cross-Condition Analysis: Tracks how those scores change from one label condition to another
Metric-Specific Analysis: Examining bias effects on each criterion

Experimental Setup

Dataset Characteristics

Scale: 30 blog articles (3 models × 10 titles)
Topics: Diverse topics with similar complexity
Length: Approximately 200 words, suitable for online audiences

Evaluation Metrics

Overall Preference Voting: Percentage frequency of "best choice" selection
Quality Dimension Scoring:
- Coherence: Logical structure and fluency of the article
- Informativeness: Information value and depth of content
- Conciseness: Efficiency and precision of expression

Comparison Conditions

No-label condition as baseline
True-label condition
Two false-label scenarios

Experimental Results

Major Findings

No-Label Condition Baseline

All three models exhibit slight self-preference
ChatGPT self-selection frequency: 50%
Gemini: 45.3%
Claude: 46.7%
Gemini is consistently underrated in cross-model scoring (7%-12%)
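One way to read the self-selection figures is against a uniform-chance baseline. The sketch below assumes that each vote picks exactly one winner among three candidate posts, so that an unbiased judge would select its own post about 33.3% of the time; that assumption is not stated in this summary.

```python
# Chance level if one of three posts is picked uniformly at random (assumption).
CHANCE = 100.0 / 3

# Self-selection frequencies reported for the no-label condition.
SELF_SELECTION = {"ChatGPT": 50.0, "Gemini": 45.3, "Claude": 46.7}


def margin_over_chance(observed, chance=CHANCE):
    """Percentage points by which an observed share exceeds the chance level."""
    return round(observed - chance, 1)


MARGINS = {model: margin_over_chance(v) for model, v in SELF_SELECTION.items()}
```

Under that assumption, all three judges pick their own post roughly 12 to 17 percentage points more often than chance, even with no label shown.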

Bias Amplification Under True-Label Condition

Claude Self-Preference Enhancement: Self-evaluation score increases from 46.7% to 60%
Gemini Severe Self-Deprecation:
- Scoring from Claude: 0%
- Scoring from ChatGPT: 1.34%
- Self-scoring: 11.32%
ChatGPT Moderate Self-Preference: 44.66%, but severely penalizes Gemini

Strong Impact of False Labels

Scenario 1 Results:

Gemini's preference for content labeled as Claude increases from 11.32% to 51.35%
Claude's preference for content labeled as ChatGPT reaches 54.15%
Informativeness scores increase 8-10 percentage points under false "self" labels

Scenario 2 Results:

"Claude" label produces highest single-item scores: Gemini rates ChatGPT-as-Claude at 60.7%
"Gemini" label again depresses scores: Claude-as-Gemini drops from 60% under true label to 18.48%

Quantitative Bias Effects

Preference Voting Variation: Swings of up to 50 percentage points
Quality Score Variation: Changes of up to 12 percentage points
Most Sensitive Dimension: Informativeness scores most sensitive to labels
Most Stable Dimension: Conciseness scores relatively stable

Model-Specific Behavioral Patterns

Claude: Strongest self-preference under true label (+13 points), severe penalty when mislabeled as Gemini (-28 points)
Gemini: Harsh self-assessment under true label, but substantial bonus for "Claude"-labeled content (up to +21 points)
ChatGPT: Consistent penalty for Gemini-labeled content across conditions
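The point shifts quoted for Claude can be checked against the percentages reported earlier in this summary. The sketch below assumes that all three figures are Claude's own preference share for Claude-authored posts, and that the -28 figure is measured from the no-label baseline rather than from the true-label value; the dictionary keys are illustrative names.

```python
# Claude judging Claude-authored posts, as reported in this summary.
CLAUDE_ON_CLAUDE = {
    "no_label": 46.7,
    "true_label": 60.0,
    "labeled_as_gemini": 18.48,
}


def shift(after, before):
    """Change in percentage points between two conditions."""
    return round(after - before, 2)


# Self-preference gain when the true label is shown.
gain_true_label = shift(CLAUDE_ON_CLAUDE["true_label"], CLAUDE_ON_CLAUDE["no_label"])

# Penalty when Claude's post is labeled as Gemini, relative to each reference.
penalty_vs_no_label = shift(CLAUDE_ON_CLAUDE["labeled_as_gemini"], CLAUDE_ON_CLAUDE["no_label"])
penalty_vs_true_label = shift(CLAUDE_ON_CLAUDE["labeled_as_gemini"], CLAUDE_ON_CLAUDE["true_label"])
```

On these numbers the gain is about 13.3 points and the penalty is about 28.2 points against the no-label baseline, or about 41.5 points against the true-label value, so the size of the quoted penalty depends on which reference condition is used.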

Related Work

Self-Preference Bias Research

Panickssery et al. demonstrate that LLMs prefer their own outputs and have a measurable ability to recognize them
Wataoka et al. investigate self-preference bias in LLM-as-judge contexts

Label-Induced Evaluation Bias

Wang et al. demonstrate that systematic bias based on response position can be used to manipulate rankings
Chen et al. investigate whether self-preference reflects true superiority or signals bias

Evaluation Dynamics Research

Inconsistencies between implicit and explicit evaluation dynamics
Structural bias issues in deep learning systems

Conclusions and Discussion

Main Conclusions

Label Identity Supersedes Content Quality: Perceived model identity can significantly distort judgment independent of actual content quality
Asymmetric Label Effects: "Claude" labels consistently elevate scores while "Gemini" labels systematically depress scores
Evaluation Level Differences: High-level "best choice" judgments are more susceptible to bias than detailed quality assessments
Dimension Sensitivity Differences: Informativeness is most susceptible to label influence; conciseness is relatively stable

Limitations

Limited Model Scope: Only three models studied; generalizability requires verification
Single Task Domain: Only blog writing tasks employed
Limited Evaluation Dimensions: Only three quality dimensions considered
Unexplored Bias Sources: Does not investigate whether the bias originates in training data or alignment procedures

Practical Recommendations

Blind Evaluation Protocols: Conceal model identity to prevent anchoring based on model names
Multi-Model Consensus: Employ multi-model or consensus-based evaluation systems
Separate Evaluation Types: Distinguish preference judgments from detailed quality scoring
Bias-Aware Adjustments: Develop bias-aware score adjustment mechanisms
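The first two recommendations can be combined in a short sketch: author names are replaced with neutral ids before judging, and a winner is accepted only when a strict majority of judges agree. The function names, the "Post N" ids, and the judge interface (a callable that receives the anonymized posts and returns one id) are assumptions for illustration; in practice each judge would wrap a call to a model.

```python
import random
from collections import Counter


def blind_candidates(texts_by_author, rng):
    """Shuffle posts and replace author names with neutral ids.

    Returns the anonymized posts and a private key mapping id -> author.
    """
    authors = list(texts_by_author)
    rng.shuffle(authors)
    anonymized = {}
    key = {}
    for i, author in enumerate(authors, start=1):
        post_id = f"Post {i}"
        anonymized[post_id] = texts_by_author[author]
        key[post_id] = author
    return anonymized, key


def consensus_vote(judges, anonymized):
    """Return the id chosen by a strict majority of judges, else None."""
    votes = Counter(judge(anonymized) for judge in judges)
    winner, count = votes.most_common(1)[0]
    return winner if count > len(judges) / 2 else None
```

The key stays with the experimenter and is used only after voting to map the winning id back to its author, so no judge sees a model name at any point.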

In-Depth Evaluation

Strengths

Rigorous Experimental Design: Controlled multi-condition, multi-model design ensures result reliability
Methodological Innovation: Dual scoring system (preference + quality) provides comprehensive perspective
Significant Findings: Reveals systematic bias in LLM evaluation with important implications for AI assessment
Comprehensive Quantitative Analysis: Provides detailed numerical evidence and statistical analysis
High Practical Value: Offers concrete recommendations for improving LLM evaluation

Weaknesses

Limited Sample Size: 30 blog articles represent a relatively small sample
Single Task Type: Limited to blog writing; lacks task diversity validation
Unexplored Bias Mechanisms: Insufficient investigation into root causes of asymmetric bias
Unknown Long-Term Effects: Does not consider how bias patterns change over time

Impact Assessment

Academic Contribution: Provides important empirical evidence for LLM evaluation bias research
Practical Value: Directly impacts LLM benchmark design and evaluation protocol development
Policy Significance: Provides scientific basis for AI system fairness and transparency policies
Reproducibility: Clear methodology description facilitates reproduction and extension

Applicable Scenarios

LLM Benchmarking: Improves fairness of existing evaluation frameworks
Automated Evaluation Systems: Designs unbiased text quality assessment tools
Model Comparison Research: Ensures objectivity in model performance comparison
AI Ethics Research: Provides methods for AI system bias detection and mitigation

Future Research Directions

Extended Model Coverage: Include more LLMs for broader bias pattern investigation
Multi-Task Validation: Verify label effect generalization across different task types
Bias Source Exploration: Investigate how training data and alignment procedures influence bias formation
Mitigation Strategy Development: Design and test more effective bias mitigation techniques
Dynamic Bias Research: Study bias pattern changes over time and model updates

Summary: Through a controlled experimental design, this study reveals serious label-induced bias in LLM evaluation and provides empirical evidence for improving the fairness and reliability of AI assessment. The findings have academic value and offer direct guidance for deploying and evaluating AI systems in practice.