
Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Saraf, Boroujeni, Beaudry et al.
Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.

Basic Information

  • Paper ID: 2508.21164
  • Title: Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
  • Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
  • Classification: cs.CL, cs.AI
  • Publication Date: October 9, 2025 (arXiv v3)
  • Paper Link: https://arxiv.org/abs/2508.21164v3

Abstract

This study investigates systematic biases in self-evaluation and cross-evaluation among three mainstream large language models (ChatGPT, Gemini, and Claude). The research employs a controlled experiment where each model evaluates blog articles generated by various models under four label conditions (no label, true label, and two false label scenarios). Evaluation employs both holistic preference voting and fine-grained quality scoring across three dimensions (coherence, informativeness, conciseness), with all scores normalized to percentages for direct comparison. The study reveals significant asymmetries in model judgment: the "Claude" label consistently elevates scores regardless of actual authorship, while the "Gemini" label systematically depresses scores. False labels frequently reverse preference rankings, producing variations of up to 50 percentage points in voting results and up to 12 percentage points in quality scores.

Research Background and Motivation

Core Research Questions

As large language models are increasingly deployed as text quality assessment tools, the validity of their judgments remains insufficiently explored. This study addresses the following key questions:

  1. LLM Evaluation Bias Problem: Can LLMs fairly evaluate outputs, or are they influenced by perceived authorship?
  2. Label-Induced Bias: Do model names affect evaluation results independent of actual quality?
  3. Self-Preference Bias: Do models tend to assign higher scores to their own outputs?

Significance

The importance of this problem is evident in:

  • The growing prevalence of the LLM-as-judge paradigm in automated text evaluation
  • The potential for evaluation bias to distort benchmark results
  • The impact on fairness in model comparison and selection
  • Challenges to the reliability and transparency of AI systems

Limitations of Existing Research

Existing studies primarily focus on a single type of bias or a small number of models, and lack:

  1. Controlled comparative analysis across multiple models and conditions
  2. Quantitative evidence comparing label effects across preference and quality dimensions
  3. Systematic recommendations for bias mitigation

Core Contributions

  1. Controlled Multi-Condition Analysis: Provides a controlled, multi-condition analytical framework for self- and cross-model evaluation bias
  2. Quantitative Bias Evidence: Furnishes quantitative evidence comparing label effects across preference and quality dimensions
  3. Bias Mitigation Recommendations: Offers suggestions for mitigating bias through blind evaluation or multi-model evaluation protocols
  4. Dual Scoring Methodology: Employs complementary percentage preference scoring and point-based quality scoring methods
  5. Label Asymmetry Findings: Discovers that "Claude" labels consistently elevate scores while "Gemini" labels systematically depress scores

Methodology Details

Experimental Design

This study employs a three-stage controlled multi-model, multi-condition design:

Stage 1: Blog Generation

  • Models: ChatGPT-4o, Gemini 2.5 Flash, Claude Sonnet 4
  • Task: Generate approximately 200-word blog articles using fixed prompt templates
  • Prompt Template: "You are a professional blog writer. Write a concise blog post (around 200 words) for the title ''. The style should be engaging and suitable for an online audience. Return only the blog content, no extra text."
  • Data: 10 different topic titles, with each model generating one blog per title, totaling 30 blogs
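
The generation stage can be sketched as a simple grid of model-title pairs. This is a minimal illustration, not the authors' code: the titles are hypothetical placeholders (the ten real titles are not listed here), and the `{title}` slot is an assumption about where the title is substituted, since the template above shows only an empty pair of quotes.

```python
from itertools import product

MODELS = ["ChatGPT-4o", "Gemini 2.5 Flash", "Claude Sonnet 4"]

# Hypothetical placeholder titles; the real topic titles are not given here.
TITLES = [f"Placeholder title {i}" for i in range(1, 11)]

# Assumption: the title is inserted between the quotes of the fixed template.
PROMPT_TEMPLATE = (
    "You are a professional blog writer. Write a concise blog post "
    "(around 200 words) for the title '{title}'. The style should be engaging "
    "and suitable for an online audience. Return only the blog content, "
    "no extra text."
)


def build_generation_jobs(models, titles):
    """Return one job (model, title, prompt) for every model-title pair."""
    return [
        {"model": model, "title": title,
         "prompt": PROMPT_TEMPLATE.format(title=title)}
        for model, title in product(models, titles)
    ]


jobs = build_generation_jobs(MODELS, TITLES)
print(len(jobs))  # 3 models x 10 titles = 30 jobs
```

Each job would then be sent to the corresponding model; the API call itself is omitted because it depends on the provider.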

Stage 2: Label Condition Setup

Four label conditions:

  1. No Label: No author attribution
  2. True Label: Correct attribution
  3. False Label Scenario 1: ChatGPT labeled as Gemini, Gemini as Claude, Claude as ChatGPT
  4. False Label Scenario 2: ChatGPT labeled as Claude, Gemini as ChatGPT, Claude as Gemini
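
The four conditions amount to a lookup from the true author to the label shown to the evaluator. The sketch below encodes the two false-label scenarios exactly as listed above; the condition names (`"none"`, `"true"`, `"false_1"`, `"false_2"`) and the function name are illustrative choices, not taken from the paper.

```python
AUTHORS = ("ChatGPT", "Gemini", "Claude")

# False-label mappings, true author -> displayed label, as listed above.
FALSE_LABELS_1 = {"ChatGPT": "Gemini", "Gemini": "Claude", "Claude": "ChatGPT"}
FALSE_LABELS_2 = {"ChatGPT": "Claude", "Gemini": "ChatGPT", "Claude": "Gemini"}


def shown_label(author, condition):
    """Return the label displayed to the evaluator, or None for no label."""
    if condition == "none":
        return None
    if condition == "true":
        return author
    if condition == "false_1":
        return FALSE_LABELS_1[author]
    if condition == "false_2":
        return FALSE_LABELS_2[author]
    raise ValueError(f"unknown condition: {condition!r}")


for author in AUTHORS:
    print(author, [shown_label(author, c)
                   for c in ("none", "true", "false_1", "false_2")])
```

Both scenarios are cyclic shifts in which no text keeps its true label, and together they show every author's text under each of the other two names exactly once.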

Stage 3: Dual Scoring System

  1. Percentage Preference Scoring: Measures how frequently each output is selected as "best"
  2. Point-Based Quality Scoring: Scores 0-10 across three dimensions (coherence, informativeness, conciseness), converted to percentages
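
A minimal sketch of the two scores follows. It assumes the preference score is the share of "best" picks per candidate and that the 0-10 ratings are rescaled linearly and averaged over the three dimensions; the summary does not spell out the exact normalization, so both are assumptions, as are the function names.

```python
from collections import Counter


def preference_percentages(best_picks, candidates):
    """Share of "best" selections per candidate, in percent."""
    if not best_picks:
        raise ValueError("need at least one pick")
    counts = Counter(best_picks)
    return {c: 100.0 * counts[c] / len(best_picks) for c in candidates}


def quality_percentage(scores, max_score=10):
    """Mean of the per-dimension scores, rescaled linearly to 0-100."""
    return 100.0 * sum(scores.values()) / (max_score * len(scores))


# Made-up example data, for illustration only.
picks = ["Claude", "Claude", "ChatGPT", "Gemini"]
print(preference_percentages(picks, ["ChatGPT", "Gemini", "Claude"]))
# {'ChatGPT': 25.0, 'Gemini': 25.0, 'Claude': 50.0}

ratings = {"coherence": 8, "informativeness": 7, "conciseness": 9}
print(quality_percentage(ratings))  # 80.0
```

Putting both measures on a percentage scale is what allows preference swings and quality-score changes to be compared directly.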

Analysis Levels

  • Within-Condition Analysis: Comparisons within conditions
  • Cross-Condition Analysis: Tracking changes across conditions
  • Metric-Specific Analysis: Examining bias effects on each criterion

Experimental Setup

Dataset Characteristics

  • Scale: 30 blog articles (3 models × 10 titles)
  • Topics: Diverse topics with similar complexity
  • Length: Approximately 200 words, suitable for online audiences

Evaluation Metrics

  1. Overall Preference Voting: Percentage frequency of "best choice" selection
  2. Quality Dimension Scoring:
    • Coherence: Logical structure and fluency of the article
    • Informativeness: Information value and depth of content
    • Conciseness: Efficiency and precision of expression

Comparison Conditions

  • No-label condition as baseline
  • True-label condition
  • Two false-label scenarios

Experimental Results

Major Findings

No-Label Condition Baseline

  • All three models exhibit slight self-preference
  • Self-selection frequency: ChatGPT 50%, Gemini 45.3%, Claude 46.7%
  • Gemini is consistently underrated in cross-model scoring (7%-12%)

Bias Amplification Under True-Label Condition

  • Claude Self-Preference Enhancement: Self-evaluation score increases from 46.7% to 60%
  • Gemini Severe Self-Deprecation:
    • Scoring from Claude: 0%
    • Scoring from ChatGPT: 1.34%
    • Self-scoring: 11.32%
  • ChatGPT Moderate Self-Preference: 44.66%, but severely penalizes Gemini

Strong Impact of False Labels

Scenario 1 Results:

  • Gemini's preference for its own content, when labeled as Claude, increases from 11.32% (true label) to 51.35%
  • Claude's preference for content labeled as ChatGPT reaches 54.15%
  • Informativeness scores increase 8-10 percentage points under false "self" labels

Scenario 2 Results:

  • "Claude" label produces highest single-item scores: Gemini rates ChatGPT-as-Claude at 60.7%
  • "Gemini" label again depresses scores: Claude-as-Gemini drops from 60% under true label to 18.48%

Quantitative Bias Effects

  • Preference Voting Variation: Swings of up to 50 percentage points
  • Quality Score Variation: Changes of up to 12 percentage points
  • Most Sensitive Dimension: Informativeness scores most sensitive to labels
  • Most Stable Dimension: Conciseness scores relatively stable

Model-Specific Behavioral Patterns

  1. Claude: Strongest self-preference under true label (+13 points), severe penalty when mislabeled as Gemini (-28 points)
  2. Gemini: Harsh self-assessment under true label, but substantial bonus for "Claude"-labeled content (up to +21 points)
  3. ChatGPT: Consistent penalty for Gemini-labeled content across conditions
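
The point shifts quoted above can be checked against the percentages reported in the earlier result lists. The sketch below uses only those reported figures; the variable names are illustrative, and which baseline each quoted shift refers to is inferred from the arithmetic rather than stated in the summary.

```python
def shift_pp(before, after):
    """Change in percentage points, rounded to two decimals."""
    return round(after - before, 2)


# Preference percentages reported in the results above.
claude_self_no_label = 46.7
claude_self_true_label = 60.0
claude_as_gemini = 18.48
gemini_self_true_label = 11.32
gemini_as_claude = 51.35

print(shift_pp(claude_self_no_label, claude_self_true_label))  # 13.3
print(shift_pp(claude_self_no_label, claude_as_gemini))        # -28.22
print(shift_pp(claude_self_true_label, claude_as_gemini))      # -41.52
print(shift_pp(gemini_self_true_label, gemini_as_claude))      # 40.03
```

Under this reading, Claude's +13 and -28 figures both appear to be measured against the no-label baseline of 46.7%; measured against the true-label value of 60%, the drop for Claude-as-Gemini is about 41.5 points.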

Related Work

Self-Preference Bias Research

  • Panickssery et al. show that LLMs prefer their own outputs and have a measurable ability to recognize them
  • Wataoka et al. investigate self-preference bias in LLM-as-judge contexts

Label-Induced Evaluation Bias

  • Wang et al. demonstrate that systematic bias based on response position can be used to manipulate rankings
  • Chen et al. investigate whether self-preference reflects true superiority or signals bias

Evaluation Dynamics Research

  • Inconsistencies between implicit and explicit evaluation dynamics
  • Structural bias issues in deep learning systems

Conclusions and Discussion

Main Conclusions

  1. Label Identity Supersedes Content Quality: Perceived model identity can significantly distort judgment independent of actual content quality
  2. Asymmetric Label Effects: "Claude" labels consistently elevate scores while "Gemini" labels systematically depress scores
  3. Evaluation Level Differences: High-level "best choice" judgments are more susceptible to bias than detailed quality assessments
  4. Dimension Sensitivity Differences: Informativeness is most susceptible to label influence; conciseness is relatively stable

Limitations

  1. Limited Model Scope: Only three models studied; generalizability requires verification
  2. Single Task Domain: Only blog writing tasks employed
  3. Limited Evaluation Dimensions: Only three quality dimensions considered
  4. Unexplored Bias Sources: Insufficient investigation into whether the bias originates in training data or alignment procedures

Practical Recommendations

  1. Blind Evaluation Protocols: Conceal model identity to prevent anchoring based on model names
  2. Multi-Model Consensus: Employ multi-model or consensus-based evaluation systems
  3. Separate Evaluation Types: Distinguish preference judgments from detailed quality scoring
  4. Bias-Aware Adjustments: Develop bias-aware score adjustment mechanisms
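
The first two recommendations can be combined in a small helper, sketched here under stated assumptions: texts are given neutral IDs in shuffled order before judging, and per-item scores are then averaged across judges. The function names, the "Text A" ID scheme, and the use of a plain mean as the consensus rule are all illustrative; the paper does not prescribe an implementation.

```python
import random
from statistics import mean


def blind(entries, seed=None):
    """Hide authorship: shuffle (author, text) pairs and assign neutral IDs.

    Returns the anonymized (id, text) list and a private id -> author key.
    """
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    items, key = [], {}
    for i, (author, text) in enumerate(shuffled):
        anon_id = f"Text {chr(ord('A') + i)}"
        items.append((anon_id, text))
        key[anon_id] = author
    return items, key


def consensus(scores_by_judge):
    """Average each item's score over all judges that rated it."""
    item_ids = set().union(*(s.keys() for s in scores_by_judge.values()))
    return {
        item: mean(s[item] for s in scores_by_judge.values() if item in s)
        for item in sorted(item_ids)
    }


items, key = blind(
    [("ChatGPT", "post a"), ("Gemini", "post b"), ("Claude", "post c")], seed=0
)
# Judges see only `items`; `key` is used afterwards to map scores to authors.
scores = {
    "judge_1": {"Text A": 80, "Text B": 60, "Text C": 70},
    "judge_2": {"Text A": 70, "Text B": 90, "Text C": 70},
}
print(consensus(scores))
```

Blinding of this kind removes only the explicit name; stylistic cues in the text itself remain, so it addresses label-induced bias rather than every form of self-preference.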

In-Depth Evaluation

Strengths

  1. Rigorous Experimental Design: Controlled multi-condition, multi-model design ensures result reliability
  2. Methodological Innovation: Dual scoring system (preference + quality) provides comprehensive perspective
  3. Significant Findings: Reveals systematic bias in LLM evaluation with important implications for AI assessment
  4. Comprehensive Quantitative Analysis: Provides detailed numerical evidence and statistical analysis
  5. High Practical Value: Offers concrete recommendations for improving LLM evaluation

Weaknesses

  1. Limited Sample Size: 30 blog articles represent a relatively small sample
  2. Single Task Type: Limited to blog writing; lacks task diversity validation
  3. Unexplored Bias Mechanisms: Insufficient investigation into root causes of asymmetric bias
  4. Unknown Long-Term Effects: Does not consider how bias patterns change over time

Impact Assessment

  1. Academic Contribution: Provides important empirical evidence for LLM evaluation bias research
  2. Practical Value: Directly impacts LLM benchmark design and evaluation protocol development
  3. Policy Significance: Provides scientific basis for AI system fairness and transparency policies
  4. Reproducibility: Clear methodology description facilitates reproduction and extension

Applicable Scenarios

  1. LLM Benchmarking: Improving the fairness of existing evaluation frameworks
  2. Automated Evaluation Systems: Designing unbiased text quality assessment tools
  3. Model Comparison Research: Ensuring objectivity in model performance comparison
  4. AI Ethics Research: Providing methods for AI system bias detection and mitigation

Future Research Directions

  1. Extended Model Coverage: Include more LLMs for broader bias pattern investigation
  2. Multi-Task Validation: Verify label effect generalization across different task types
  3. Bias Source Exploration: Investigate the influence of training data and alignment procedures on bias formation
  4. Mitigation Strategy Development: Design and test more effective bias mitigation techniques
  5. Dynamic Bias Research: Study bias pattern changes over time and model updates

Summary: Through a controlled experimental design, this study reveals serious label-induced bias in LLM evaluation, providing important evidence for improving the fairness and reliability of AI assessment. The findings have significant academic value and offer direct guidance for deploying and evaluating AI systems in practice.