
Large Language Models are overconfident and amplify human bias

Sun, Li, Wang et al.
Large language models (LLMs) are revolutionizing every aspect of society. They are increasingly used in problem-solving tasks to substitute human assessment and reasoning. LLMs are trained on what humans write and are thus exposed to human bias. We evaluate whether LLMs inherit one of the most widespread human biases: overconfidence. We algorithmically construct reasoning problems with known ground truths. We prompt LLMs to answer these problems and assess the confidence in their answers, closely following similar protocols in human experiments. We find that all five LLMs we study are overconfident: they overestimate the probability that their answer is correct by between 20% and 60%. Humans have accuracy similar to the more advanced LLMs, but far lower overconfidence. Although humans and LLMs are similarly biased on questions that they are certain they answered correctly, a key difference emerges between them: LLM bias increases sharply relative to humans as the LLMs become less sure that their answers are correct. We also show that LLM input has ambiguous effects on human decision making: LLM input leads to an increase in accuracy, but it more than doubles the extent of overconfidence in the answers.


Basic Information

  • Paper ID: 2505.02151
  • Title: Large Language Models are Overconfident and Amplify Human Bias
  • Authors: Fengfei Sun, Ningke Li, Kailong Wang, Lorenz Goette
  • Classification: cs.SE (Software Engineering), cs.CY (Computers and Society)
  • Publication Date: May 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2505.02151v2

Abstract

Large Language Models (LLMs) are transforming many aspects of society and are increasingly deployed for problem-solving tasks in place of human evaluation and reasoning. Because LLMs are trained on human-authored content, they are exposed to human biases. This study evaluates whether LLMs have inherited one of the most prevalent human biases: overconfidence. The researchers algorithmically constructed reasoning problems with known ground-truth answers, prompted LLMs to answer them, and elicited the models' confidence in their responses. All five LLMs studied are substantially overconfident: their stated confidence exceeds their actual accuracy by roughly 20 to 60 percentage points. Human accuracy is comparable to that of the more advanced LLMs, but humans are far less overconfident. The gap widens as certainty falls: when LLMs are less sure of their answers, their bias relative to humans increases sharply. The study also shows that LLM input has mixed effects on human decision-making: it improves accuracy but more than doubles overconfidence.

Research Background and Motivation

Problem Definition

The core research question addressed is: Do large language models inherit and amplify human overconfidence bias? This question is significant for several reasons:

  1. Widespread Application Scenarios: LLMs are increasingly deployed for problem-solving tasks requiring careful reasoning and evaluation
  2. Training Data Bias: LLMs trained on human-authored content are inherently exposed to human biases
  3. Decision-Making Impact: Overconfidence has been demonstrated to affect multiple domains of professional and everyday decision-making

Research Significance

Overconfidence is one of the most prevalent biases in human judgment and has produced negative effects across multiple domains:

  • Professional Domains: Overconfident managers are more likely to pursue unprofitable mergers and acquisitions
  • Daily Behavior: Influences exercise habits, dietary choices, and financial investment decisions
  • Learning Capacity: May lead to persistent bias rather than learning from feedback

Limitations of Existing Research

Current research on LLM calibration primarily suffers from the following issues:

  1. Relies mainly on standard question-answering datasets, which LLMs likely encountered during training
  2. Lacks investigation of confidence levels for questions requiring reasoning abilities
  3. Insufficiently explores the impact of LLM confidence on human decision-making

Core Contributions

  1. First Systematic Assessment: Comprehensive evaluation of overconfidence bias in five mainstream LLMs
  2. Innovative Experimental Design: Construction of 10,000 algorithmically-generated reasoning problems to ensure minimal training contamination
  3. Human-Machine Comparative Analysis: Direct comparison of LLM and human performance on identical tasks
  4. Confidence Gradient Findings: Revelation of the "Dunning-Kruger effect" where LLM bias increases dramatically under uncertainty
  5. Human Decision-Making Impact Study: Quantification of dual effects of LLM input on human accuracy and bias
  6. Welfare Effect Analysis: Establishment of theoretical models analyzing welfare impacts of LLM exposure

Methodology Details

Task Definition

The study designed three interconnected experiments:

  1. LLM Overconfidence Assessment: Measuring accuracy and confidence levels of LLMs on reasoning tasks
  2. Human Benchmark Testing: Evaluating human performance on identical tasks
  3. LLM Exposure Experiment: Testing the impact of LLM input on human decision-making

Problem Generation Method

Triple Extraction

Extraction of structured triples (subject, predicate, object) from Wikidata, covering ten popular categories.

Logical Reasoning Rules

Implementation of five reasoning types:

  1. Negation Reasoning: Deriving the validity of negations from factual knowledge
  2. Symmetric Reasoning: Exchanging subject and object in symmetric relations
  3. Inverse Reasoning: Connecting subject and object through inverse relations
  4. Transitive Reasoning: Chain reasoning to generate new triples
  5. Composite Reasoning: Combining multiple basic reasoning rules
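The five rule types above can be illustrated with a minimal Python sketch. The relation names, the `SYMMETRIC`, `INVERSE`, and `TRANSITIVE` tables, and the closed-world treatment of negation are assumptions chosen for illustration; the paper's own rules run in a Prolog engine and are not reproduced here.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical relation metadata, for illustration only.
SYMMETRIC = {"spouse"}
INVERSE = {"contains": "located_in"}
TRANSITIVE = {"located_in"}


def symmetric(facts: Set[Triple]) -> Set[Triple]:
    """Swap subject and object for symmetric predicates."""
    return {(o, p, s) for (s, p, o) in facts if p in SYMMETRIC}


def inverse(facts: Set[Triple]) -> Set[Triple]:
    """Rewrite (s, p, o) as (o, inverse_of_p, s) where an inverse is known."""
    return {(o, INVERSE[p], s) for (s, p, o) in facts if p in INVERSE}


def transitive(facts: Set[Triple]) -> Set[Triple]:
    """Chain one hop: (a, p, b) and (b, p, c) give (a, p, c)."""
    derived = set()
    for (a, p, b) in facts:
        if p not in TRANSITIVE:
            continue
        for (b2, p2, c) in facts:
            if p2 == p and b2 == b and c != a:
                derived.add((a, p, c))
    return derived - facts  # keep only triples that are new


def negation_holds(triple: Triple, facts: Set[Triple]) -> bool:
    """Assumed closed-world rule: 'not triple' holds iff triple is not a known fact."""
    return triple not in facts


def composite(facts: Set[Triple]) -> Set[Triple]:
    """Combine rules: expand with inverse and symmetric triples, then chain one hop."""
    expanded = facts | inverse(facts) | symmetric(facts)
    return transitive(expanded)
```

For example, from ("Paris", "located_in", "France") and ("Europe", "contains", "France"), the inverse rule yields ("France", "located_in", "Europe"), and the composite rule then derives ("Paris", "located_in", "Europe"), a triple whose truth value is known by construction.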

Problem Validation

Candidate problems are derived automatically with a Prolog inference engine, and predicate components are validated manually; 476 predicates and their corresponding triples are retained.

Confidence Measurement

Using specially designed prompts to simultaneously obtain:

  • Confidence in answer correctness
  • Confidence in factual knowledge correctness
  • Confidence in reasoning process correctness
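A minimal sketch of how the three confidence values might be elicited and parsed in one call. The prompt wording, the JSON reply format, and the key names are all assumptions made for illustration; the paper's actual prompts are not shown in this summary.

```python
import json

# Hypothetical prompt wording; not the paper's prompt.
PROMPT_PREFIX = (
    "Answer the question with yes or no. Then report, as percentages from 0 to 100, "
    "how confident you are that (a) your answer is correct, (b) the facts you relied "
    "on are correct, and (c) your reasoning is correct. Reply with a JSON object using "
    "the keys answer, answer_confidence, fact_confidence, reasoning_confidence.\n\n"
    "Question: "
)
CONFIDENCE_KEYS = ("answer_confidence", "fact_confidence", "reasoning_confidence")


def build_prompt(question: str) -> str:
    """Attach the question to the fixed elicitation instructions."""
    return PROMPT_PREFIX + question


def parse_response(raw: str) -> dict:
    """Parse a JSON reply into an answer plus three probabilities in [0, 1]."""
    data = json.loads(raw)
    answer = str(data["answer"]).strip().lower()
    if answer not in ("yes", "no"):
        raise ValueError(f"unexpected answer: {answer!r}")
    parsed = {"answer": answer}
    for key in CONFIDENCE_KEYS:
        value = float(data[key])
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{key} out of range: {value}")
        parsed[key] = value / 100.0  # convert percent to probability
    return parsed
```

Asking for all three values in one structured reply keeps them tied to the same answer; a real pipeline would also need to handle replies that are not valid JSON.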

Similarity Assessment

Development of algorithms to compute similarity between LLM responses and standard answers:

  • Factual Similarity: Based on subject matching and object similarity
  • Reasoning Similarity: Evaluating predicate and object matching
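The summary does not specify the similarity algorithms, so the following is only one plausible reading of the two bullet points, written as a hedged sketch. The use of `difflib.SequenceMatcher` for object similarity and the equal weighting of predicate and object matches are assumptions, not the paper's method.

```python
from difflib import SequenceMatcher
from typing import Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def factual_similarity(response: Triple, reference: Triple) -> float:
    """Assumed rule: 0 if subjects differ, otherwise string similarity of the objects."""
    r_subj, _, r_obj = response
    g_subj, _, g_obj = reference
    if _norm(r_subj) != _norm(g_subj):
        return 0.0
    return SequenceMatcher(None, _norm(r_obj), _norm(g_obj)).ratio()


def reasoning_similarity(response: Triple, reference: Triple) -> float:
    """Assumed rule: average of exact predicate match and exact object match."""
    _, r_pred, r_obj = response
    _, g_pred, g_obj = reference
    pred_match = 1.0 if _norm(r_pred) == _norm(g_pred) else 0.0
    obj_match = 1.0 if _norm(r_obj) == _norm(g_obj) else 0.0
    return (pred_match + obj_match) / 2.0
```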

Experimental Setup

Dataset

  • Scale: 10,000 balanced reasoning problems
  • Distribution: 5 reasoning types × 10 knowledge domains, 200 problems per combination
  • Human Benchmark: 2,000 problems selected for human experiments
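The balanced design can be checked with a short sketch: 5 reasoning types crossed with 10 domains gives 50 cells, and 200 problems per cell gives 10,000 in total. The domain names below are placeholders, since the summary does not list the ten categories.

```python
from itertools import product

REASONING_TYPES = ["negation", "symmetric", "inverse", "transitive", "composite"]
DOMAINS = [f"domain_{i}" for i in range(10)]  # placeholder names, not the paper's categories
PER_CELL = 200


def cell_quotas() -> dict:
    """Map every (reasoning type, domain) pair to its problem quota."""
    return {(r, d): PER_CELL for r, d in product(REASONING_TYPES, DOMAINS)}


def total_problems() -> int:
    """Sum the quotas over all cells of the design."""
    return sum(cell_quotas().values())
```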

Model Selection

Five representative LLMs tested:

  • Closed-source Models: GPT-3.5, GPT-4o, GPT-o1
  • Open-source Models: Llama 3.1 8B, Llama 3.2 3B

Evaluation Metrics

  • Accuracy: Proportion of correct answers
  • Confidence: Model's self-reported probability of correctness
  • Bias: Difference between confidence and accuracy
  • Confidence Gradient: Rate of change in accuracy relative to confidence
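The four metrics can be computed from per-question records as in the sketch below. Treating the confidence gradient as the least-squares slope of correctness on stated confidence is an assumption about how "rate of change" is estimated; the function and key names are illustrative.

```python
from typing import Dict, Sequence


def calibration_summary(confidences: Sequence[float], correct: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, mean confidence, bias, and confidence gradient.

    confidences: stated probabilities of being correct, each in [0, 1].
    correct: 1 if the answer was right, 0 otherwise.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    accuracy = sum(correct) / n
    confidence = sum(confidences) / n
    bias = confidence - accuracy  # positive values mean overconfidence
    # Least-squares slope of correctness on confidence (assumed gradient estimator).
    var = sum((c - confidence) ** 2 for c in confidences)
    cov = sum((c - confidence) * (y - accuracy) for c, y in zip(confidences, correct))
    gradient = cov / var if var > 0 else float("nan")
    return {"accuracy": accuracy, "confidence": confidence, "bias": bias, "gradient": gradient}
```

A perfectly calibrated respondent has zero bias and a gradient of 1: each point of lost confidence corresponds to one point of lost accuracy.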

Human Experiment Design

  • Platform: Prolific online experimental platform
  • Incentive Mechanism: Real monetary incentives, following the mechanism of Danz et al. (2022)
  • Sample: 588 participants in baseline experiment, 1,161 in exposure experiment

Experimental Results

LLM Overconfidence Performance

Main Findings

All five LLMs exhibited significant overconfidence:

  • GPT-3.5: Accuracy 35%, Confidence 94%, Bias 59%
  • GPT-4o: Accuracy 63%, Confidence 94%, Bias 30%
  • GPT-o1: Accuracy 73%, Confidence 95%, Bias 22%
  • Llama 3.1: Accuracy 63%, Confidence 86%, Bias 23%
  • Llama 3.2: Accuracy 61%, Confidence 94%, Bias 33%
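Since bias is defined as confidence minus accuracy, the reported figures can be cross-checked with a few lines of Python. The values are copied from the list above; the interpretation of any mismatch as rounding of unrounded underlying values is an assumption.

```python
# (accuracy %, confidence %, reported bias in percentage points), as listed above.
REPORTED = {
    "GPT-3.5": (35, 94, 59),
    "GPT-4o": (63, 94, 30),
    "GPT-o1": (73, 95, 22),
    "Llama 3.1": (63, 86, 23),
    "Llama 3.2": (61, 94, 33),
}


def recomputed_bias(table: dict) -> dict:
    """Recompute bias as confidence minus accuracy for each model."""
    return {model: conf - acc for model, (acc, conf, _) in table.items()}


def mismatches(table: dict) -> dict:
    """Return models whose recomputed bias differs from the reported bias."""
    return {
        model: (conf - acc, bias)
        for model, (acc, conf, bias) in table.items()
        if conf - acc != bias
    }
```

Four of the five rows match exactly. For GPT-4o the rounded figures give 31 rather than the reported 30, a one-point difference that is presumably a rounding effect.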

Confidence Gradient Analysis

More advanced models display stronger confidence gradients:

  • GPT-4o and GPT-o1: a 10 percentage point decrease in confidence corresponds to a decrease of approximately 25 percentage points in accuracy
  • Llama 3.1: a 10 percentage point decrease in confidence corresponds to a decrease of approximately 13 percentage points in accuracy

Human-Machine Comparison Results

Performance Comparison

  • Human Accuracy: 66% (comparable to GPT-4o and Llama 3.1)
  • Human Confidence: 70% (overconfidence of only 4 percentage points)
  • Key Difference: Human bias shrinks as stated confidence falls, whereas LLM bias grows

Dunning-Kruger Effect

LLMs exhibit a stronger Dunning-Kruger effect than humans:

  • When completely confident, LLM accuracy is 79-85%, which still leaves a bias of 15-21 percentage points
  • Humans are slightly underconfident when uncertain (54% accuracy against a stated confidence of 50%)
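The comparison above conditions on stated confidence. A minimal sketch of that breakdown groups answers by the confidence level reported and computes accuracy and bias within each group; the function name and the toy data in the usage note are illustrative, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Sequence


def bias_by_confidence(confidences: Sequence[float], correct: Sequence[int]) -> Dict[float, dict]:
    """Group answers by stated confidence and report accuracy and bias per group."""
    groups = defaultdict(list)
    for c, y in zip(confidences, correct):
        groups[c].append(y)
    result = {}
    for c in sorted(groups):
        outcomes = groups[c]
        accuracy = sum(outcomes) / len(outcomes)
        # Bias within the group: stated confidence minus realized accuracy.
        result[c] = {"n": len(outcomes), "accuracy": accuracy, "bias": c - accuracy}
    return result
```

With five answers stated at 100% confidence of which four are correct, the group shows 80% accuracy and a bias of 20 percentage points, the same pattern as the 79-85% accuracy reported for fully confident LLM answers.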

Impact of LLM Exposure on Humans

Accuracy Improvement

  • LLM Answer Group: 5.6 percentage point accuracy improvement
  • LLM Answer + Confidence Group: 7.0 percentage point accuracy improvement

Bias Amplification

  • LLM Answer Group: 4.2 percentage point bias increase (doubled)
  • LLM Answer + Confidence Group: 7.6 percentage point bias increase (nearly tripled)
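The "doubled" and "nearly tripled" labels follow from simple arithmetic, sketched below. It assumes the unexposed group's bias equals the roughly 4 percentage point baseline from the human benchmark (70% confidence minus 66% accuracy); the summary does not state the control group's bias directly.

```python
BASELINE_BIAS_PP = 70 - 66  # assumed baseline: human confidence minus human accuracy
INCREASES_PP = {"LLM answer": 4.2, "LLM answer + confidence": 7.6}


def exposed_bias(baseline_pp: float, increase_pp: float) -> tuple:
    """Return (bias after exposure, ratio to baseline), both rounded for display."""
    total = baseline_pp + increase_pp
    return round(total, 1), round(total / baseline_pp, 2)


def exposure_table() -> dict:
    """Apply the arithmetic to both treatment groups."""
    return {group: exposed_bias(BASELINE_BIAS_PP, inc) for group, inc in INCREASES_PP.items()}
```

Under this assumption, bias rises from 4 to about 8.2 percentage points with LLM answers alone (about 2.05 times the baseline) and to about 11.6 percentage points when the LLM's confidence is also shown (about 2.9 times).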

Heterogeneous Effects

Participants with low baseline confidence benefit most:

  • Accuracy improvement of 8.6-11.9 percentage points
  • But bias also increases by 7.0-14.1 percentage points

Related Work

LLM Calibration Research

Existing research primarily employs three methods for measuring LLM confidence:

  1. Logit-based Estimation: Requires internal model access
  2. Direct Confidence Elicitation: Direct questioning through prompts
  3. Auxiliary Model Approach: Uses a separate model to estimate confidence, ranging from single-model prediction to multi-source integration

The innovation of this research lies in using algorithmically-generated problems to ensure minimal training contamination.

Overconfidence Research

Effects of overconfidence across multiple domains:

  • Corporate Decision-Making: Influences financing choices and M&A decisions
  • Personal Behavior: Influences health choices and investment decisions
  • Learning Processes: May lead to persistent bias rather than adaptive learning

Human-Machine Interaction

Emerging research explores how individuals respond to (potentially biased) AI input. This study contributes by measuring how LLM answers and stated confidence change human accuracy and overconfidence.

Conclusions and Discussion

Main Conclusions

  1. Universal Overconfidence: All tested LLMs exhibit significant overconfidence, far exceeding human levels
  2. Dunning-Kruger Effect: LLM bias increases dramatically under uncertainty, lacking awareness of knowledge boundaries
  3. Dual Impact: While LLM input improves human accuracy, it significantly increases overconfidence
  4. Welfare Complexity: In environments requiring investment decisions, increased bias may offset accuracy gains

Theoretical Insights

Dunning-Kruger Mechanism

LLMs are "trapped" within their prediction models:

  • Unable to perceive knowledge absent from training data
  • Form accuracy estimates based on training data
  • Lack the intuitive human sense of the limits of one's own knowledge

Welfare Theoretical Model

Establishment of welfare models considering both accuracy and bias:

  • When investments have high elasticity to success probability, negative effects of overconfidence are greater
  • Even with improved accuracy, LLM exposure may reduce overall welfare
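A toy example can show how welfare may fall even when accuracy rises. This is not the paper's welfare model: the all-or-nothing investment rule, the cost of 0.8, and the payoff of 1 are assumptions chosen as an extreme case of investment that is highly elastic to perceived success probability. The probabilities combine figures reported above (66% accuracy and 70% confidence at baseline; accuracy up 7.0 and bias up 7.6 percentage points with LLM answer and confidence).

```python
def realized_welfare(true_p: float, believed_p: float, cost: float, payoff: float = 1.0) -> float:
    """Expected net payoff of an agent who invests only if it looks profitable.

    The agent decides using believed_p but is paid according to true_p.
    """
    invests = believed_p * payoff > cost
    return true_p * payoff - cost if invests else 0.0


def compare_exposure(cost: float = 0.8) -> dict:
    """Compare welfare before and after LLM exposure under the toy rule."""
    baseline = realized_welfare(true_p=0.66, believed_p=0.70, cost=cost)
    # After exposure: accuracy 0.66 + 0.07, bias 0.04 + 0.076, so belief is 0.73 + 0.116.
    exposed = realized_welfare(true_p=0.73, believed_p=0.846, cost=cost)
    return {"baseline": baseline, "exposed": exposed}
```

At a cost of 0.8 the baseline agent believes success has probability 0.70 and stays out, earning zero. The exposed agent believes 0.846 and invests, but the true success probability of 0.73 is below the cost, so expected welfare is negative despite the higher accuracy.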

Limitations

  1. Task Scope: Limited to binary-choice reasoning problems
  2. Model Versions: Results may change with model updates
  3. Cultural Differences: Human experiments primarily based on English speakers
  4. Temporal Effects: Does not consider long-term learning and adaptation effects

Practical Implications

Guidance for Users

  • Provides new benchmarks for evaluating LLM reasoning capabilities
  • Emphasizes the need for appropriate skepticism toward LLM recommendations

Recommendations for Developers

  • Current training objectives prioritize fluency over accuracy
  • Need to develop built-in uncertainty correction mechanisms
  • Recommend integrating verification mechanisms to check reasoning processes

Inspiration for Research

  • Emphasizes the importance of evaluating behavioral biases in LLMs
  • Provides a paradigm for research on other cognitive biases
  • Promotes interdisciplinary collaboration between behavioral science and computer science

In-Depth Evaluation

Strengths

  1. Methodological Innovation:
    • Algorithmically-generated problems minimize training contamination
    • Multi-dimensional confidence measurement (answer, fact, reasoning)
    • Rigorous human-machine comparative experimental design
  2. Experimental Sufficiency:
    • Large-scale experiments (10,000 LLM problems, 5,000+ human responses)
    • Robustness checks across multiple models and temperature settings
    • Detailed ablation studies and reproducibility verification
  3. Theoretical Contributions:
    • First revelation of Dunning-Kruger effect in LLMs
    • Establishment of welfare analysis framework for LLM exposure
    • New perspective on confidence calibration
  4. Practical Value:
    • Important safety considerations for LLM applications
    • Direct guidance for AI system design
    • Scientific evidence for regulatory policy formulation

Limitations

  1. Task Limitations:
    • Considers only binary-choice problems, which may not fully represent real-world application scenarios
    • Relatively simple reasoning types, lacking more complex multi-step reasoning
  2. Measurement Methods:
    • Confidence measurement relies on self-report, potentially subject to prompt sensitivity
    • Similarity assessment algorithm may introduce subjectivity
  3. Sample Representativeness:
    • Human experiments primarily based on online platform users
    • Lacks diversity across different cultural backgrounds and professional domains
  4. Long-term Effects:
    • Does not consider learning effects from repeated exposure
    • Lacks ecological validity verification in actual decision-making environments

Impact Assessment

Academic Impact

  • Theoretical Contribution: Opens new directions for research on LLM behavioral biases
  • Methodological Value: Provides replicable experimental paradigm
  • Interdisciplinary Significance: Connects AI, cognitive science, and behavioral economics

Practical Impact

  • Industry Application: Influences LLM product design and deployment strategies
  • Educational Value: Increases public awareness of AI system limitations
  • Policy Formulation: Provides scientific evidence for AI governance

Applicable Scenarios

  1. High-Risk Decision-Making: Medical diagnosis, financial investment and other scenarios requiring accuracy assessment
  2. Educational Applications: Need to consider overconfidence effects on learning outcomes
  3. Human-Machine Collaboration: Design better confidence communication mechanisms
  4. AI Safety: Develop more reliable uncertainty quantification methods

Future Research Directions

  1. Extended Task Types: Research more complex reasoning tasks and open-ended questions
  2. Cross-Cultural Validation: Verify universality of findings across different cultural backgrounds
  3. Intervention Mechanisms: Develop training and prompting methods to reduce overconfidence
  4. Long-term Effects: Study learning and adaptation processes in repeated interactions
  5. Other Biases: Systematically investigate other cognitive biases in LLMs

References

The paper cites literature covering:

  • Overconfidence research in behavioral economics (Kahneman, 2011; Moore and Healy, 2008)
  • LLM calibration and uncertainty quantification (Tian et al., 2023; Wei et al., 2024)
  • Human-machine interaction and AI bias (Barocas and Selbst, 2016; Rambachan and Roth, 2020)
  • Classical research on the Dunning-Kruger effect (Kruger and Dunning, 1999)

This research offers insight into the reliability of large language models, with implications for AI safety and human-machine collaboration. By documenting overconfidence in LLMs, the study points toward more trustworthy AI systems.