Large Language Models are overconfident and amplify human bias

Sun, Li, Wang et al.

Large language models (LLMs) are revolutionizing every aspect of society. They are increasingly used in problem-solving tasks to substitute human assessment and reasoning. LLMs are trained on what humans write and are thus exposed to human bias. We evaluate whether LLMs inherit one of the most widespread human biases: overconfidence. We algorithmically construct reasoning problems with known ground truths. We prompt LLMs to answer these problems and assess the confidence in their answers, closely following similar protocols in human experiments. We find that all five LLMs we study are overconfident: they overestimate the probability that their answer is correct between 20% and 60%. Humans have accuracy similar to the more advanced LLMs, but far lower overconfidence. Although humans and LLMs are similarly biased in questions which they are certain they answered correctly, a key difference emerges between them: LLM bias increases sharply relative to humans if they become less sure that their answers are correct. We also show that LLM input has ambiguous effects on human decision making: LLM input leads to an increase in the accuracy, but it more than doubles the extent of overconfidence in the answers.

Basic Information

Paper ID: 2505.02151
Title: Large Language Models are Overconfident and Amplify Human Bias
Authors: Fengfei Sun, Ningke Li, Kailong Wang, Lorenz Goette
Classification: cs.SE (Software Engineering), cs.CY (Computers and Society)
Publication Date: May 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2505.02151v2

Abstract

Large Language Models (LLMs) are fundamentally transforming various aspects of society and are increasingly being deployed for problem-solving tasks that replace human evaluation and reasoning. Since LLMs are trained on human-authored content, they are exposed to human biases. This study evaluates whether LLMs have inherited one of the most prevalent human biases: overconfidence. The researchers algorithmically constructed reasoning problems with known ground truth answers, prompted LLMs to answer these questions, and assessed the confidence levels of their responses. The findings reveal that all five LLMs studied exhibit significant overconfidence: they overestimate the probability of their answers being correct by 20% to 60%. While human accuracy is comparable to more advanced LLMs, humans display substantially lower levels of overconfidence. When LLMs are less certain about their answers, their bias relative to humans increases dramatically. The study also demonstrates that LLM inputs have complex effects on human decision-making: while improving accuracy, they more than double the level of overconfidence.

Research Background and Motivation

Problem Definition

The core research question addressed is: Do large language models inherit and amplify human overconfidence bias? This question is significant for several reasons:

Widespread Application Scenarios: LLMs are increasingly deployed for problem-solving tasks requiring careful reasoning and evaluation
Training Data Bias: LLMs trained on human-authored content are inherently exposed to human biases
Decision-Making Impact: Overconfidence has been demonstrated to affect multiple domains of professional and everyday decision-making

Research Significance

Overconfidence is one of the most prevalent biases in human judgment and has produced negative effects across multiple domains:

Professional Domains: Overconfident managers are more likely to pursue unprofitable mergers and acquisitions
Daily Behavior: Influences exercise habits, dietary choices, and financial investment decisions
Learning Capacity: May lead to persistent bias rather than learning from feedback

Limitations of Existing Research

Current research on LLM calibration primarily suffers from the following issues:

Relies mainly on standard question-answering datasets, which LLMs likely encountered during training
Lacks investigation of confidence levels for questions requiring reasoning abilities
Insufficiently explores the impact of LLM confidence on human decision-making

Core Contributions

First Systematic Assessment: Comprehensive evaluation of overconfidence bias in five mainstream LLMs
Innovative Experimental Design: Construction of 10,000 algorithmically-generated reasoning problems to ensure minimal training contamination
Human-Machine Comparative Analysis: Direct comparison of LLM and human performance on identical tasks
Confidence Gradient Findings: Evidence of a Dunning-Kruger-like effect in which LLM bias rises sharply as the model's stated confidence falls
Human Decision-Making Impact Study: Quantification of dual effects of LLM input on human accuracy and bias
Welfare Effect Analysis: Establishment of theoretical models analyzing welfare impacts of LLM exposure

Methodology Details

Task Definition

The study designed three interconnected experiments:

LLM Overconfidence Assessment: Measuring accuracy and confidence levels of LLMs on reasoning tasks
Human Benchmark Testing: Evaluating human performance on identical tasks
LLM Exposure Experiment: Testing the impact of LLM input on human decision-making

Problem Generation Method

Triple Extraction

Extraction of structured triples (subject, predicate, object) from Wikidata, covering ten popular categories.

Logical Reasoning Rules

Implementation of five reasoning types:

Negation Reasoning: Deriving the validity of negations from factual knowledge
Symmetric Reasoning: Exchanging subject and object in symmetric relations
Inverse Reasoning: Connecting subject and object through inverse relations
Transitive Reasoning: Chain reasoning to generate new triples
Composite Reasoning: Combining multiple basic reasoning rules
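
The reasoning rules above can be sketched as small functions over (subject, predicate, object) triples. This is a minimal illustration only: the relation names (spouse, located_in, contains), the helper functions, and the example facts are hypothetical and are not taken from the paper's pipeline, which derives answers with a Prolog engine.

```python
# Illustrative sketch: the relation names and helper functions below are
# hypothetical and are not taken from the paper's pipeline.

SYMMETRIC = {"spouse"}                 # p(a, b) implies p(b, a)
INVERSE = {"located_in": "contains"}   # p(a, b) implies q(b, a)
TRANSITIVE = {"located_in"}            # p(a, b) and p(b, c) imply p(a, c)


def symmetric(facts):
    """Swap subject and object for symmetric predicates."""
    return {(o, p, s) for s, p, o in facts if p in SYMMETRIC}


def inverse(facts):
    """Rewrite each triple through its inverse predicate."""
    return {(o, INVERSE[p], s) for s, p, o in facts if p in INVERSE}


def transitive(facts):
    """Chain two triples that share a transitive predicate (one step)."""
    return {
        (s1, p1, o2)
        for s1, p1, o1 in facts
        for s2, p2, o2 in facts
        if p1 == p2 and p1 in TRANSITIVE and o1 == s2 and s1 != o2
    }


def negation(fact):
    """A known fact makes its negation false: (negated statement, truth value)."""
    s, p, o = fact
    return (f"not {p}({s}, {o})", False)


facts = {
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
    ("Marie Curie", "spouse", "Pierre Curie"),
}

print(symmetric(facts))
print(transitive(facts))
print(inverse(transitive(facts)))  # composite: transitive, then inverse
print(negation(("Paris", "located_in", "France")))
```

Because each derived triple follows mechanically from stored facts, its truth value is known by construction, which is what lets the generated questions carry a ground truth.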

Problem Validation

Ground-truth answers are derived automatically with a Prolog inference engine, and the predicates are validated manually. After validation, 476 predicates and their corresponding triples are retained.

Confidence Measurement

Using specially designed prompts to simultaneously obtain:

Confidence in answer correctness
Confidence in factual knowledge correctness
Confidence in reasoning process correctness

Similarity Assessment

Development of algorithms to compute similarity between LLM responses and standard answers:

Factual Similarity: Based on subject matching and object similarity
Reasoning Similarity: Evaluating predicate and object matching

Experimental Setup

Dataset

Scale: 10,000 balanced reasoning problems
Distribution: 5 reasoning types × 10 knowledge domains, 200 problems per combination
Human Benchmark: 2,000 problems selected for human experiments

Model Selection

Five representative LLMs tested:

Closed-source Models: GPT-3.5, GPT-4o, GPT-o1
Open-source Models: Llama 3.1 8B, Llama 3.2 3B

Evaluation Metrics

Accuracy: Proportion of correct answers
Confidence: Model's self-reported probability of correctness
Bias: Difference between confidence and accuracy
Confidence Gradient: Rate of change in accuracy relative to confidence
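
The four metrics above can be computed from per-question records of stated confidence and correctness. The sketch below is a hedged illustration: the function names and the toy data are made up, and the confidence gradient is taken here to be the ordinary least squares slope of correctness on stated confidence, which is an assumption since this summary does not state the paper's exact estimator.

```python
from statistics import mean


def accuracy(correct):
    """Proportion of correct answers."""
    return mean([1.0 if c else 0.0 for c in correct])


def bias(confidence, correct):
    """Mean stated confidence minus accuracy (positive = overconfident)."""
    return mean(confidence) - accuracy(correct)


def confidence_gradient(confidence, correct):
    """Assumed estimator: OLS slope of correctness (0/1) on stated confidence."""
    ys = [1.0 if c else 0.0 for c in correct]
    mx, my = mean(confidence), mean(ys)
    sxx = sum((x - mx) ** 2 for x in confidence)
    sxy = sum((x - mx) * (y - my) for x, y in zip(confidence, ys))
    return sxy / sxx


# Toy data, invented for illustration (confidence as a probability in [0, 1]).
confidence = [1.0, 1.0, 0.9, 0.9, 0.8, 0.8]
correct = [True, True, True, False, False, False]

print(accuracy(correct))
print(bias(confidence, correct))
print(confidence_gradient(confidence, correct))
```

Under this reading, a gradient of 1 would mean accuracy falls one-for-one with stated confidence, while a gradient of 2.5 would correspond to a 25-point accuracy drop for a 10-point confidence drop.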

Human Experiment Design

Platform: Prolific online experimental platform
Incentive Mechanism: Following the real incentive mechanism of Danz et al. (2022)
Sample: 588 participants in baseline experiment, 1,161 in exposure experiment

Experimental Results

LLM Overconfidence Performance

Main Findings

All five LLMs exhibited significant overconfidence:

GPT-3.5: Accuracy 35%, Confidence 94%, Bias 59%
GPT-4o: Accuracy 63%, Confidence 94%, Bias 30%
GPT-o1: Accuracy 73%, Confidence 95%, Bias 22%
Llama 3.1: Accuracy 63%, Confidence 86%, Bias 23%
Llama 3.2: Accuracy 61%, Confidence 94%, Bias 33%
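
As a quick consistency check, the bias column can be recomputed as confidence minus accuracy from the figures listed above. The snippet below uses only those reported numbers; the variable names are illustrative. It flags a one-point gap for GPT-4o (94 - 63 = 31 versus the reported 30), which is presumably a rounding artifact of the underlying unrounded values.

```python
# (accuracy %, confidence %, reported bias in percentage points), as listed above.
reported = {
    "GPT-3.5": (35, 94, 59),
    "GPT-4o": (63, 94, 30),
    "GPT-o1": (73, 95, 22),
    "Llama 3.1": (63, 86, 23),
    "Llama 3.2": (61, 94, 33),
}

# Bias = confidence - accuracy, in percentage points.
recomputed = {model: conf - acc for model, (acc, conf, _) in reported.items()}

# Models whose recomputed bias differs from the reported one: (recomputed, reported).
mismatch = {
    model: (recomputed[model], rep)
    for model, (_, _, rep) in reported.items()
    if recomputed[model] != rep
}

print(recomputed)
print(mismatch)
```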

Confidence Gradient Analysis

More advanced models display stronger confidence gradients:

GPT-4o and GPT-o1: A 10% decrease in confidence corresponds to an approximately 25% decrease in accuracy
Llama 3.1: A 10% decrease in confidence corresponds to an approximately 13% decrease in accuracy

Human-Machine Comparison Results

Performance Comparison

Human Accuracy: 66% (comparable to GPT-4o and Llama 3.1)
Human Confidence: 70% (a bias of only 4 percentage points)
Key Difference: Humans show reduced bias when uncertain; LLMs show the opposite

Dunning-Kruger Effect

LLMs exhibit a stronger Dunning-Kruger effect than humans:

When completely confident, LLM accuracy is 79-85% (a bias of 15-21 percentage points remains)
Humans slightly underestimate themselves when uncertain (54% accuracy at a stated confidence of 50%)
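
The comparison above rests on measuring accuracy separately at each stated confidence level. A minimal sketch of that grouping follows; the function name and the toy records are invented for illustration and do not reproduce the paper's data or code.

```python
from collections import defaultdict


def calibration_by_level(confidence, correct):
    """Group answers by stated confidence; return {level: (accuracy, bias, n)}."""
    buckets = defaultdict(list)
    for conf, ok in zip(confidence, correct):
        buckets[round(conf, 2)].append(1.0 if ok else 0.0)
    result = {}
    for level in sorted(buckets):
        hits = buckets[level]
        acc = sum(hits) / len(hits)
        # Bias at this level: stated confidence minus observed accuracy.
        result[level] = (acc, level - acc, len(hits))
    return result


# Toy records, invented for illustration: five answers given with full
# confidence (four correct) and four given at 50% confidence (one correct).
confidence = [1.0] * 5 + [0.5] * 4
correct = [True, True, True, True, False, True, False, False, False]

table = calibration_by_level(confidence, correct)
for level, (acc, gap, n) in table.items():
    print(f"confidence={level:.2f} accuracy={acc:.2f} bias={gap:+.2f} n={n}")
```

A well-calibrated respondent would show a bias near zero at every level. In the toy data the bias is larger at the lower confidence level, which is the direction of the pattern reported for LLMs.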

Impact of LLM Exposure on Humans

Accuracy Improvement

LLM Answer Group: 5.6 percentage point accuracy improvement
LLM Answer + Confidence Group: 7.0 percentage point accuracy improvement

Bias Amplification

LLM Answer Group: 4.2 percentage point bias increase (doubled)
LLM Answer + Confidence Group: 7.6 percentage point bias increase (nearly tripled)
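
The "doubled" and "nearly tripled" labels follow from the human baseline reported earlier (66% accuracy, 70% confidence). The arithmetic below assumes that this 4-point baseline also applies to the control group of the exposure experiment, which this summary does not state explicitly.

```python
# Baseline human bias in percentage points: confidence 70 - accuracy 66.
baseline_bias = 70 - 66

# Reported bias increases (percentage points) in the two exposure groups.
increase = {"LLM answer": 4.2, "LLM answer + confidence": 7.6}

# Resulting bias level and its ratio to the baseline.
level = {group: baseline_bias + inc for group, inc in increase.items()}
ratio = {group: level[group] / baseline_bias for group in increase}

for group in increase:
    print(f"{group}: bias {level[group]:.1f} pp, {ratio[group]:.2f}x baseline")
```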

Heterogeneous Effects

Participants with low baseline confidence benefit most:

Accuracy improvement of 8.6-11.9 percentage points
But bias also increases by 7.0-14.1 percentage points

Related Work

LLM Calibration Research

Existing research primarily employs three methods for measuring LLM confidence:

Logit-based Estimation: Requires internal model access
Direct Confidence Elicitation: Direct questioning through prompts
Auxiliary Model Approach: A separate model estimates confidence, with methods ranging from single-model prediction to multi-source integration

The innovation of this research lies in using algorithmically-generated problems to ensure minimal training contamination.

Overconfidence Research

Effects of overconfidence across multiple domains:

Corporate Decision-Making: Influences financing choices and M&A decisions
Personal Behavior: Influences health choices and investment decisions
Learning Processes: May lead to persistent bias rather than adaptive learning

Human-Machine Interaction

Emerging research explores how individuals respond to (potentially biased) AI input. This study contributes by measuring how exposure to LLM answers and LLM-stated confidence changes both human accuracy and human overconfidence.

Conclusions and Discussion

Main Conclusions

Universal Overconfidence: All tested LLMs exhibit significant overconfidence, far exceeding human levels
Dunning-Kruger Effect: LLM bias increases dramatically under uncertainty, suggesting that LLMs lack awareness of their own knowledge boundaries
Dual Impact: While LLM input improves human accuracy, it significantly increases overconfidence
Welfare Complexity: In environments requiring investment decisions, increased bias may offset accuracy gains

Theoretical Insights

Dunning-Kruger Mechanism

LLMs are "trapped" within their prediction models:

Unable to perceive knowledge absent from training data
Form accuracy estimates based on training data
Lack human intuitive recognition of knowledge limitations

Welfare Theoretical Model

Establishment of welfare models considering both accuracy and bias:

When investments have high elasticity to success probability, negative effects of overconfidence are greater
Even with improved accuracy, LLM exposure may reduce overall welfare
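
The trade-off described above can be illustrated with a toy model. This is not the paper's welfare model: the cost function x^k / k, the parameter k, and the function names are assumptions chosen for simplicity. An agent who believes the success probability is p + bias picks the investment x that maximizes (p + bias) * x - x^k / k, but the realized payoff uses the true p. The elasticity of investment with respect to the believed probability is 1 / (k - 1), so k close to 1 means high elasticity. The accuracy and bias inputs combine figures from this summary (66% accuracy and 4-point bias at baseline; +7.0 and +7.6 points with LLM answer plus confidence).

```python
def chosen_investment(believed_p, k):
    """Investment maximizing believed_p * x - x**k / k (first-order condition)."""
    return believed_p ** (1.0 / (k - 1.0))


def realized_welfare(true_p, bias, k):
    """Payoff at the true success probability when investing on a biased belief."""
    x = chosen_investment(true_p + bias, k)
    return true_p * x - x ** k / k


baseline = (0.66, 0.04)    # accuracy, bias without LLM input
exposed = (0.73, 0.116)    # accuracy, bias with LLM answer + confidence

for k in (2.0, 1.1):       # low elasticity (1) vs. high elasticity (about 10)
    w_base = realized_welfare(*baseline, k)
    w_exp = realized_welfare(*exposed, k)
    print(f"k={k}: baseline={w_base:.5f} exposed={w_exp:.5f}")
```

With k = 2 the welfare expression reduces to (p^2 - bias^2) / 2, and the accuracy gain dominates, so welfare rises. With k = 1.1 the agent reacts strongly to the inflated belief and realized welfare falls despite the higher accuracy, which matches the direction of the claim above under these assumed functional forms.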

Limitations

Task Scope: Limited to binary-choice reasoning problems
Model Versions: Results may change with model updates
Cultural Differences: Human experiments primarily based on English speakers
Temporal Effects: Does not consider long-term learning and adaptation effects

Practical Implications

Guidance for Users

Provides new benchmarks for evaluating LLM reasoning capabilities
Emphasizes the need for appropriate skepticism toward LLM recommendations

Recommendations for Developers

Current training objectives prioritize fluency over accuracy
Need to develop built-in uncertainty correction mechanisms
Recommend integrating verification mechanisms to check reasoning processes

Inspiration for Research

Emphasizes the importance of evaluating behavioral biases in LLMs
Provides a paradigm for research on other cognitive biases
Promotes interdisciplinary collaboration between behavioral science and computer science

In-Depth Evaluation

Strengths

Methodological Innovation:
- Algorithmically-generated problems minimize training contamination
- Multi-dimensional confidence measurement (answer, fact, reasoning)
- Rigorous human-machine comparative experimental design
Experimental Sufficiency:
- Large-scale experiments (10,000 LLM problems, 5,000+ human responses)
- Robustness checks across multiple models and temperature settings
- Detailed ablation studies and reproducibility verification
Theoretical Contributions:
- First revelation of Dunning-Kruger effect in LLMs
- Establishment of welfare analysis framework for LLM exposure
- New perspective on confidence calibration
Practical Value:
- Important safety considerations for LLM applications
- Direct guidance for AI system design
- Scientific evidence for regulatory policy formulation

Limitations

Task Limitations:
- Considers only binary-choice problems, may not fully represent real-world application scenarios
- Relatively simple reasoning types, lacking more complex multi-step reasoning
Measurement Methods:
- Confidence measurement relies on self-report, potentially subject to prompt sensitivity
- Similarity assessment algorithm may introduce subjectivity
Sample Representativeness:
- Human experiments primarily based on online platform users
- Lacks diversity across different cultural backgrounds and professional domains
Long-term Effects:
- Does not consider learning effects from repeated exposure
- Lacks ecological validity verification in actual decision-making environments

Impact Assessment

Academic Impact

Theoretical Contribution: Opens new directions for research on LLM behavioral biases
Methodological Value: Provides replicable experimental paradigm
Interdisciplinary Significance: Connects AI, cognitive science, and behavioral economics

Practical Impact

Industry Application: Influences LLM product design and deployment strategies
Educational Value: Increases public awareness of AI system limitations
Policy Formulation: Provides scientific evidence for AI governance

Applicable Scenarios

High-Risk Decision-Making: Medical diagnosis, financial investment and other scenarios requiring accuracy assessment
Educational Applications: Need to consider overconfidence effects on learning outcomes
Human-Machine Collaboration: Design better confidence communication mechanisms
AI Safety: Develop more reliable uncertainty quantification methods

Future Research Directions

Extended Task Types: Research more complex reasoning tasks and open-ended questions
Cross-Cultural Validation: Verify universality of findings across different cultural backgrounds
Intervention Mechanisms: Develop training and prompting methods to reduce overconfidence
Long-term Effects: Study learning and adaptation processes in repeated interactions
Other Biases: Systematically investigate other cognitive biases in LLMs

References

The paper's references span the following areas:

Overconfidence research in behavioral economics (Kahneman, 2011; Moore and Healy, 2008)
LLM calibration and uncertainty quantification (Tian et al., 2023; Wei et al., 2024)
Human-machine interaction and AI bias (Barocas and Selbst, 2016; Rambachan and Roth, 2020)
Classical research on the Dunning-Kruger effect (Kruger and Dunning, 1999)

This research provides important insights for understanding and improving the reliability of large language models, with profound implications for AI safety and human-machine collaboration. By revealing the overconfidence problem in LLMs, the study points the way toward developing more trustworthy AI systems.