Large language models (LLMs) are revolutionizing every aspect of society. They are increasingly used in problem-solving tasks as a substitute for human assessment and reasoning. LLMs are trained on what humans write and are thus exposed to human bias. We evaluate whether LLMs inherit one of the most widespread human biases: overconfidence. We algorithmically construct reasoning problems with known ground truths. We prompt LLMs to answer these problems and assess the confidence in their answers, closely following similar protocols in human experiments. We find that all five LLMs we study are overconfident: they overestimate the probability that their answer is correct by between 20 and 60 percentage points. Humans have accuracy similar to the more advanced LLMs, but far lower overconfidence. Although humans and LLMs are similarly biased on questions they are certain they answered correctly, a key difference emerges between them: LLM bias increases sharply relative to humans as they become less sure that their answers are correct. We also show that LLM input has ambiguous effects on human decision making: it increases accuracy, but it more than doubles the extent of overconfidence in the answers.
Large Language Models are Overconfident and Amplify Human Bias
- Paper ID: 2505.02151
- Title: Large Language Models are Overconfident and Amplify Human Bias
- Authors: Fengfei Sun, Ningke Li, Kailong Wang, Lorenz Goette
- Classification: cs.SE (Software Engineering), cs.CY (Computers and Society)
- Publication Date: May 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2505.02151v2
Large Language Models (LLMs) are transforming many aspects of society and are increasingly deployed for problem-solving tasks in place of human evaluation and reasoning. Since LLMs are trained on human-authored content, they are exposed to human biases. This study evaluates whether LLMs have inherited one of the most prevalent human biases: overconfidence. The researchers algorithmically constructed reasoning problems with known ground-truth answers, prompted LLMs to answer them, and assessed the confidence the models reported in their responses. All five LLMs studied are overconfident: they overestimate the probability that their answers are correct by 20 to 60 percentage points. Human accuracy is comparable to that of the more advanced LLMs, but humans display substantially lower overconfidence. As LLMs become less certain about their answers, their bias relative to humans increases sharply. The study also shows that LLM input has mixed effects on human decision-making: it improves accuracy but more than doubles overconfidence.
The core research question addressed is: Do large language models inherit and amplify human overconfidence bias? This question is significant for several reasons:
- Widespread Application Scenarios: LLMs are increasingly deployed for problem-solving tasks requiring careful reasoning and evaluation
- Training Data Bias: LLMs trained on human-authored content are inherently exposed to human biases
- Decision-Making Impact: Overconfidence has been demonstrated to affect multiple domains of professional and everyday decision-making
Overconfidence is one of the most prevalent biases in human judgment and has produced negative effects across multiple domains:
- Professional Domains: Overconfident managers are more likely to pursue unprofitable mergers and acquisitions
- Daily Behavior: Influences exercise habits, dietary choices, and financial investment decisions
- Learning Capacity: May lead to persistent bias rather than learning from feedback
Current research on LLM calibration primarily suffers from the following issues:
- Relies mainly on standard question-answering datasets, which LLMs likely encountered during training
- Lacks investigation of confidence levels for questions requiring reasoning abilities
- Insufficiently explores the impact of LLM confidence on human decision-making
- First Systematic Assessment: Comprehensive evaluation of overconfidence bias in five mainstream LLMs
- Innovative Experimental Design: Construction of 10,000 algorithmically-generated reasoning problems to ensure minimal training contamination
- Human-Machine Comparative Analysis: Direct comparison of LLM and human performance on identical tasks
- Confidence Gradient Findings: Revelation of the "Dunning-Kruger effect" where LLM bias increases dramatically under uncertainty
- Human Decision-Making Impact Study: Quantification of dual effects of LLM input on human accuracy and bias
- Welfare Effect Analysis: Establishment of theoretical models analyzing welfare impacts of LLM exposure
The study designed three interconnected experiments:
- LLM Overconfidence Assessment: Measuring accuracy and confidence levels of LLMs on reasoning tasks
- Human Benchmark Testing: Evaluating human performance on identical tasks
- LLM Exposure Experiment: Testing the impact of LLM input on human decision-making
Extraction of structured triples (subject, predicate, object) from Wikidata, covering ten popular categories.
Implementation of five reasoning types:
- Negation Reasoning: Deriving the validity of negations from factual knowledge
- Symmetric Reasoning: Exchanging subject and object in symmetric relations
- Inverse Reasoning: Connecting subject and object through inverse relations
- Transitive Reasoning: Chain reasoning to generate new triples
- Composite Reasoning: Combining multiple basic reasoning rules
Automatic inference with a Prolog engine, followed by manual validation of the predicates; 476 predicates and their corresponding triples are ultimately retained.
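The basic rule types listed above can be illustrated with a small sketch. This is a minimal Python illustration, not the study's Prolog pipeline: the predicate names, the `SYMMETRIC`, `INVERSE`, and `TRANSITIVE` tables, the example facts, and the closed-world reading of negation are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical predicate metadata; the study's validated predicate list is not reproduced here.
SYMMETRIC: Set[str] = {"sibling_of"}
INVERSE: Dict[str, str] = {"parent_of": "child_of"}
TRANSITIVE: Set[str] = {"located_in"}


def symmetric(facts: Set[Triple]) -> Set[Triple]:
    """Swap subject and object for symmetric predicates."""
    return {(o, p, s) for (s, p, o) in facts if p in SYMMETRIC}


def inverse(facts: Set[Triple]) -> Set[Triple]:
    """Rewrite a triple through its inverse predicate."""
    return {(o, INVERSE[p], s) for (s, p, o) in facts if p in INVERSE}


def transitive(facts: Set[Triple]) -> Set[Triple]:
    """Derive new triples by one chaining step over transitive predicates."""
    derived = set()
    for (a, p, b) in facts:
        if p not in TRANSITIVE:
            continue
        for (b2, p2, c) in facts:
            if p2 == p and b2 == b and c != a:
                derived.add((a, p, c))
    return derived - facts


def negation_holds(facts: Set[Triple], triple: Triple) -> bool:
    """Assumption: closed world, so 'not triple' is true when the triple is absent."""
    return triple not in facts


FACTS: Set[Triple] = {
    ("Alice", "sibling_of", "Bob"),
    ("Alice", "parent_of", "Carol"),
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
}

for rule in (symmetric, inverse, transitive):
    print(rule.__name__, sorted(rule(FACTS)))
```

Composite reasoning would chain these functions, for example by applying `inverse` to the output of `transitive`.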
Using specially designed prompts to simultaneously obtain:
- Confidence in answer correctness
- Confidence in factual knowledge correctness
- Confidence in reasoning process correctness
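As an illustration of how the three confidence values might be elicited in a single prompt, the sketch below builds a prompt and parses a structured reply. The prompt wording, the JSON reply format, the yes/no answer format, and the key names are hypothetical; the study's actual prompts are not reproduced here, and no model API is called.

```python
import json

# Hypothetical prompt wording; the study's own prompts may differ substantially.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reply with a JSON object containing the keys "
    '"answer" ("yes" or "no"), "answer_confidence", "fact_confidence", '
    'and "reasoning_confidence". Each confidence is a number from 0 to 100.'
)

CONFIDENCE_KEYS = ("answer_confidence", "fact_confidence", "reasoning_confidence")


def build_prompt(question: str) -> str:
    """Fill the question into the elicitation prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_reply(reply: str) -> dict:
    """Parse a JSON reply and rescale the three confidences to the 0-1 range."""
    data = json.loads(reply)
    parsed = {"answer": str(data["answer"]).strip().lower()}
    for key in CONFIDENCE_KEYS:
        value = float(data[key])
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{key} out of range: {value}")
        parsed[key] = value / 100.0
    return parsed


example_reply = (
    '{"answer": "Yes", "answer_confidence": 90, '
    '"fact_confidence": 80, "reasoning_confidence": 95}'
)
print(build_prompt("Is Paris located in Europe?"))
print(parse_reply(example_reply))
```

Validating the range at parse time keeps malformed self-reports out of the later accuracy and bias calculations.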
Development of algorithms to compute similarity between LLM responses and ground-truth answers:
- Factual Similarity: Based on subject matching and object similarity
- Reasoning Similarity: Evaluating predicate and object matching
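The summary does not specify the similarity algorithms, so the following is only a rough sketch of what such scores could look like. The use of `difflib.SequenceMatcher` for string similarity, the exact-match requirement on the subject, and the equal weighting of predicate and object are all assumptions, not the study's method.

```python
from difflib import SequenceMatcher
from typing import Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed stand-in metric)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def factual_similarity(predicted: Triple, reference: Triple) -> float:
    """Assumption: the subject must match exactly, then the objects are compared."""
    if predicted[0].strip().lower() != reference[0].strip().lower():
        return 0.0
    return text_similarity(predicted[2], reference[2])


def reasoning_similarity(predicted: Triple, reference: Triple) -> float:
    """Assumption: equal-weight average of predicate and object similarity."""
    predicate_score = text_similarity(predicted[1], reference[1])
    object_score = text_similarity(predicted[2], reference[2])
    return 0.5 * predicate_score + 0.5 * object_score


reference = ("Paris", "located_in", "France")
print(factual_similarity(("paris", "located_in", "France"), reference))
print(factual_similarity(("Lyon", "located_in", "France"), reference))
print(reasoning_similarity(("Paris", "located_in", "France"), reference))
```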
- Scale: 10,000 balanced reasoning problems
- Distribution: 5 reasoning types × 10 knowledge domains, 200 problems per combination
- Human Benchmark: 2,000 problems selected for human experiments
Five representative LLMs tested:
- Closed-source Models: GPT-3.5, GPT-4o, GPT-o1
- Open-source Models: Llama 3.1 8B, Llama 3.2 3B
- Accuracy: Proportion of correct answers
- Confidence: Model's self-reported probability of correctness
- Bias: Difference between confidence and accuracy
- Confidence Gradient: Rate of change in accuracy relative to confidence
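The four metrics can be computed from per-question records as sketched below. Accuracy, confidence, and bias follow the definitions above directly; estimating the confidence gradient as the ordinary least-squares slope of correctness on confidence is an assumption about how "rate of change" is measured, and the sample data are made up.

```python
from statistics import mean
from typing import Sequence


def accuracy(correct: Sequence[int]) -> float:
    """Share of correct answers (each entry is 1 for correct, 0 for incorrect)."""
    return mean(correct)


def mean_confidence(confidence: Sequence[float]) -> float:
    """Average self-reported probability of being correct."""
    return mean(confidence)


def bias(correct: Sequence[int], confidence: Sequence[float]) -> float:
    """Overconfidence: mean confidence minus accuracy."""
    return mean(confidence) - mean(correct)


def confidence_gradient(correct: Sequence[int], confidence: Sequence[float]) -> float:
    """Assumed estimator: OLS slope of correctness regressed on confidence."""
    mc, my = mean(confidence), mean(correct)
    variance = sum((c - mc) ** 2 for c in confidence)
    if variance == 0:
        raise ValueError("confidence does not vary, so the slope is undefined")
    covariance = sum((c - mc) * (y - my) for c, y in zip(confidence, correct))
    return covariance / variance


# Made-up records: a perfectly calibrated respondent has bias 0 and gradient 1.
correct = [0, 1, 1, 1]
confidence = [0.5, 0.5, 1.0, 1.0]
print(accuracy(correct), mean_confidence(confidence))
print(bias(correct, confidence), confidence_gradient(correct, confidence))
```

Under this reading, a gradient of 1 means accuracy moves one-for-one with stated confidence, and a positive bias means stated confidence exceeds accuracy on average.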
- Platform: Prolific online experimental platform
- Incentive Mechanism: Real (monetary) incentives, following the mechanism of Danz et al. (2022)
- Sample: 588 participants in baseline experiment, 1,161 in exposure experiment
All five LLMs exhibited significant overconfidence:
- GPT-3.5: Accuracy 35%, Confidence 94%, Bias 59 percentage points
- GPT-4o: Accuracy 63%, Confidence 94%, Bias 30 percentage points
- GPT-o1: Accuracy 73%, Confidence 95%, Bias 22 percentage points
- Llama 3.1: Accuracy 63%, Confidence 86%, Bias 23 percentage points
- Llama 3.2: Accuracy 61%, Confidence 94%, Bias 33 percentage points
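Because bias is defined as confidence minus accuracy, the reported figures can be checked with simple arithmetic. The sketch below uses only the rounded values listed above. For GPT-4o the rounded inputs give 31 rather than the reported 30, which presumably reflects rounding of the underlying unrounded values (an assumption, since those values are not given here).

```python
# (accuracy %, confidence %, reported bias in percentage points), as listed above.
REPORTED = {
    "GPT-3.5": (35, 94, 59),
    "GPT-4o": (63, 94, 30),
    "GPT-o1": (73, 95, 22),
    "Llama 3.1": (63, 86, 23),
    "Llama 3.2": (61, 94, 33),
}


def recomputed_bias(accuracy_pct: int, confidence_pct: int) -> int:
    """Bias in percentage points: confidence minus accuracy."""
    return confidence_pct - accuracy_pct


def inconsistent_rows(rows: dict) -> list:
    """Models whose recomputed bias differs from the reported one."""
    return [
        model
        for model, (acc, conf, reported) in rows.items()
        if recomputed_bias(acc, conf) != reported
    ]


for model, (acc, conf, reported) in REPORTED.items():
    print(f"{model}: {conf} - {acc} = {recomputed_bias(acc, conf)} (reported {reported})")
print("Rows differing from the reported bias:", inconsistent_rows(REPORTED))
```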
More advanced models display stronger confidence gradients:
- GPT-4o and GPT-o1: a 10-percentage-point decrease in confidence corresponds to an approximately 25-percentage-point decrease in accuracy
- Llama 3.1: a 10-percentage-point decrease in confidence corresponds to an approximately 13-percentage-point decrease in accuracy
- Human Accuracy: 66% (comparable to GPT-4o and Llama 3.1)
- Human Confidence: 70% (only 4 percentage points of overconfidence)
- Key Difference: Humans show reduced bias when uncertain; LLMs show the opposite
LLMs exhibit a stronger Dunning-Kruger effect than humans:
- When completely confident, LLM accuracy is 79-85% (a bias of 15-21 percentage points remains)
- Humans slightly underestimate themselves when uncertain (54% accuracy at 50% stated confidence)
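Comparisons of this kind rest on computing accuracy and bias separately for each level of stated confidence. The sketch below shows one way to do that. The equal-width binning, the function name `bias_by_confidence`, and the sample data are assumptions for illustration, not the study's procedure.

```python
from collections import defaultdict
from typing import Dict, Sequence


def bias_by_confidence(
    correct: Sequence[int], confidence: Sequence[float], n_bins: int = 5
) -> Dict[int, dict]:
    """Group answers into equal-width confidence bins and report bias per bin."""
    bins = defaultdict(list)
    for y, c in zip(correct, confidence):
        # A confidence of exactly 1.0 falls into the top bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((y, c))

    summary = {}
    for index in sorted(bins):
        ys = [y for y, _ in bins[index]]
        cs = [c for _, c in bins[index]]
        acc = sum(ys) / len(ys)
        conf = sum(cs) / len(cs)
        summary[index] = {
            "n": len(ys),
            "accuracy": acc,
            "confidence": conf,
            "bias": conf - acc,  # positive values mean overconfidence
        }
    return summary


# Made-up data: calibrated at 50% confidence, overconfident at 100% confidence.
correct = [1, 0, 1, 1, 0, 1]
confidence = [0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
for index, row in bias_by_confidence(correct, confidence).items():
    print(index, row)
```

A negative bias in a low-confidence bin corresponds to the underestimation described for humans, while a positive bias in the top bin corresponds to the gap that remains for LLMs even at full confidence.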
Showing participants LLM output raises human accuracy:
- LLM Answer Group: 5.6 percentage point accuracy improvement
- LLM Answer + Confidence Group: 7.0 percentage point accuracy improvement
It also raises human overconfidence:
- LLM Answer Group: 4.2 percentage point bias increase (doubled)
- LLM Answer + Confidence Group: 7.6 percentage point bias increase (nearly tripled)
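The "doubled" and "nearly tripled" descriptions can be reproduced with a short calculation. The sketch assumes that the human benchmark figures (66% accuracy, 70% confidence, hence 4 percentage points of bias) serve as the baseline for the exposure experiment. That is an assumption; the control group of the exposure experiment may differ slightly.

```python
# Assumed baseline taken from the human benchmark: 66% accuracy, 70% confidence.
BASELINE_ACCURACY = 66.0
BASELINE_BIAS = 4.0  # percentage points (70 - 66)

# Reported treatment effects in percentage points.
TREATMENTS = {
    "LLM answer": {"accuracy_gain": 5.6, "bias_gain": 4.2},
    "LLM answer + confidence": {"accuracy_gain": 7.0, "bias_gain": 7.6},
}


def implied_levels(accuracy_gain: float, bias_gain: float) -> dict:
    """Implied accuracy, bias, and confidence after exposure, under the assumed baseline."""
    accuracy = BASELINE_ACCURACY + accuracy_gain
    bias = BASELINE_BIAS + bias_gain
    return {
        "accuracy": accuracy,
        "bias": bias,
        "confidence": accuracy + bias,
        "bias_ratio": bias / BASELINE_BIAS,
    }


for name, effect in TREATMENTS.items():
    levels = implied_levels(**effect)
    print(name, {key: round(value, 2) for key, value in levels.items()})
```

Under this baseline, bias rises from 4 to about 8.2 percentage points (a ratio of roughly 2.05) in the answer-only group and to about 11.6 percentage points (a ratio of roughly 2.9) when the LLM's confidence is also shown.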
Participants with low baseline confidence benefit most:
- Accuracy improvement of 8.6-11.9 percentage points
- But bias also increases by 7.0-14.1 percentage points
Existing research primarily employs three methods for measuring LLM confidence:
- Logit-based Estimation: Requires internal model access
- Direct Confidence Elicitation: Direct questioning through prompts
- Auxiliary Model Approach: A separate model estimates confidence, ranging from single-model prediction to multi-source integration
The innovation of this research lies in using algorithmically-generated problems to ensure minimal training contamination.
Effects of overconfidence across multiple domains:
- Corporate Decision-Making: Influences financing choices and M&A decisions
- Personal Behavior: Influences health choices and investment decisions
- Learning Processes: May lead to persistent bias rather than adaptive learning
Emerging research explores how individuals respond to (potentially biased) AI input, and this study makes important contributions to this field.
- Universal Overconfidence: All tested LLMs exhibit significant overconfidence, far exceeding human levels
- Dunning-Kruger Effect: LLM bias increases dramatically under uncertainty, lacking awareness of knowledge boundaries
- Dual Impact: While LLM input improves human accuracy, it significantly increases overconfidence
- Welfare Complexity: In environments requiring investment decisions, increased bias may offset accuracy gains
LLMs are "trapped" within their prediction models:
- Unable to perceive knowledge absent from training data
- Form accuracy estimates based on training data
- Lack human intuitive recognition of knowledge limitations
Establishment of welfare models considering both accuracy and bias:
- When investments have high elasticity to success probability, negative effects of overconfidence are greater
- Even with improved accuracy, LLM exposure may reduce overall welfare
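A toy model can show why both statements may hold at once. The sketch below is not the paper's welfare model. It assumes an agent who invests `x` to maximize perceived expected payoff `q * x - x**k / k`, where `q` is the perceived success probability and `k = 1 + 1/elasticity`, while realized welfare uses the true probability `p`. The functional form, the function names, and the probability values (taken from the assumed 66% and 70% human baseline and the reported answer-plus-confidence effects) are all illustrative assumptions.

```python
def chosen_investment(perceived_p: float, elasticity: float) -> float:
    """Investment that maximizes perceived_p * x - x**k / k, with k = 1 + 1/elasticity.

    The first-order condition perceived_p = x**(k - 1) gives x = perceived_p**elasticity.
    """
    return perceived_p ** elasticity


def realized_welfare(true_p: float, perceived_p: float, elasticity: float) -> float:
    """Payoff evaluated at the true success probability for the chosen investment."""
    k = 1.0 + 1.0 / elasticity
    x = chosen_investment(perceived_p, elasticity)
    return true_p * x - x ** k / k


# Assumed inputs: baseline (66% accuracy, 70% confidence) versus
# exposure to LLM answer + confidence (73% accuracy, 84.6% confidence).
BASELINE = (0.66, 0.70)
EXPOSED = (0.73, 0.846)

for elasticity in (1.0, 10.0):
    before = realized_welfare(*BASELINE, elasticity)
    after = realized_welfare(*EXPOSED, elasticity)
    print(f"elasticity={elasticity}: baseline={before:.4f}, exposed={after:.4f}")
```

In this toy setting, welfare rises with exposure when the elasticity is 1, because the accuracy gain dominates, and falls when the elasticity is 10, because the overconfident agent over-invests heavily. The direction of the effect therefore depends on how strongly investment responds to perceived success probability.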
- Task Scope: Limited to binary-choice reasoning problems
- Model Versions: Results may change with model updates
- Cultural Differences: Human experiments primarily based on English speakers
- Temporal Effects: Does not consider long-term learning and adaptation effects
- Provides new benchmarks for evaluating LLM reasoning capabilities
- Emphasizes the need for appropriate skepticism toward LLM recommendations
- Current training objectives prioritize fluency over accuracy
- Need to develop built-in uncertainty correction mechanisms
- Recommend integrating verification mechanisms to check reasoning processes
- Emphasizes the importance of evaluating behavioral biases in LLMs
- Provides a paradigm for research on other cognitive biases
- Promotes interdisciplinary collaboration between behavioral science and computer science
- Methodological Innovation:
  - Algorithmically-generated problems minimize training contamination
  - Multi-dimensional confidence measurement (answer, fact, reasoning)
  - Rigorous human-machine comparative experimental design
- Experimental Sufficiency:
  - Large-scale experiments (10,000 LLM problems, 5,000+ human responses)
  - Robustness checks across multiple models and temperature settings
  - Detailed ablation studies and reproducibility verification
- Theoretical Contributions:
  - First revelation of Dunning-Kruger effect in LLMs
  - Establishment of welfare analysis framework for LLM exposure
  - New perspective on confidence calibration
- Practical Value:
  - Important safety considerations for LLM applications
  - Direct guidance for AI system design
  - Scientific evidence for regulatory policy formulation
- Task Limitations:
  - Considers only binary-choice problems, may not fully represent real-world application scenarios
  - Relatively simple reasoning types, lacking more complex multi-step reasoning
- Measurement Methods:
  - Confidence measurement relies on self-report, potentially subject to prompt sensitivity
  - Similarity assessment algorithm may introduce subjectivity
- Sample Representativeness:
  - Human experiments primarily based on online platform users
  - Lacks diversity across different cultural backgrounds and professional domains
- Long-term Effects:
  - Does not consider learning effects from repeated exposure
  - Lacks ecological validity verification in actual decision-making environments
- Theoretical Contribution: Opens new directions for research on LLM behavioral biases
- Methodological Value: Provides replicable experimental paradigm
- Interdisciplinary Significance: Connects AI, cognitive science, and behavioral economics
- Industry Application: Influences LLM product design and deployment strategies
- Educational Value: Increases public awareness of AI system limitations
- Policy Formulation: Provides scientific evidence for AI governance
- High-Risk Decision-Making: Medical diagnosis, financial investment and other scenarios requiring accuracy assessment
- Educational Applications: Need to consider overconfidence effects on learning outcomes
- Human-Machine Collaboration: Design better confidence communication mechanisms
- AI Safety: Develop more reliable uncertainty quantification methods
- Extended Task Types: Research more complex reasoning tasks and open-ended questions
- Cross-Cultural Validation: Verify universality of findings across different cultural backgrounds
- Intervention Mechanisms: Develop training and prompting methods to reduce overconfidence
- Long-term Effects: Study learning and adaptation processes in repeated interactions
- Other Biases: Systematically investigate other cognitive biases in LLMs
The paper draws on literature covering:
- Overconfidence research in behavioral economics (Kahneman, 2011; Moore and Healy, 2008)
- LLM calibration and uncertainty quantification (Tian et al., 2023; Wei et al., 2024)
- Human-machine interaction and AI bias (Barocas and Selbst, 2016; Rambachan and Roth, 2020)
- Classical research on the Dunning-Kruger effect (Kruger and Dunning, 1999)
This research clarifies how reliable the self-reported confidence of large language models is, with implications for AI safety and human-machine collaboration. By documenting the overconfidence problem in LLMs, the study points toward the development of more trustworthy AI systems.