Causal learning is the cognitive process of developing the capability of making causal inferences based on available information, often guided by normative principles. This process is prone to errors and biases, such as the illusion of causality, in which people perceive a causal relationship between two variables despite lacking supporting evidence. This cognitive bias has been proposed to underlie many societal problems, including social prejudice, stereotype formation, misinformation, and superstitious thinking. In this work, we examine whether large language models are prone to developing causal illusions when faced with a classic cognitive science paradigm: the contingency judgment task. To investigate this, we constructed a dataset of 1,000 null contingency scenarios (in which the available information is not sufficient to establish a causal relationship between variables) within medical contexts and prompted LLMs to evaluate the effectiveness of potential causes. Our findings show that all evaluated models systematically inferred unwarranted causal relationships, revealing a strong susceptibility to the illusion of causality. While there is ongoing debate about whether LLMs genuinely understand causality or merely reproduce causal language without true comprehension, our findings support the latter hypothesis and raise concerns about the use of language models in domains where accurate causal reasoning is essential for informed decision-making.
- Paper ID: 2510.13985
- Title: Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment
- Authors: María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Giovanni Franco Gabriel Marraffini, Mario Alejandro Leiva, Gerardo I. Simari, María Vanina Martinez
- Classification: cs.AI
- Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: First Workshop on CogInterp
- Paper Link: https://arxiv.org/abs/2510.13985
Causal learning is the cognitive process of making causal inferences from available information, typically following normative principles. This process is susceptible to errors and biases such as causal illusions: the perception of a causal relationship between two variables in the absence of supporting evidence. Such biases have been proposed to underlie numerous social problems, including social prejudice, stereotype formation, misinformation, and superstitious thinking. This study examines whether Large Language Models (LLMs) are susceptible to causal illusions using a classical cognitive science paradigm, the contingency judgment task. The authors construct a dataset of 1,000 zero-contingency scenarios (where the available information is insufficient to establish a causal relationship between variables) in a medical context and prompt LLMs to rate the effectiveness of potential causes. All evaluated models systematically infer unwarranted causal relationships, demonstrating strong susceptibility to causal illusions.
The core research question is: Do Large Language Models exhibit causal illusion biases similar to humans when confronted with classical cognitive science paradigms?
- Social Impact: Causal illusions have been proposed to underlie social prejudice, stereotype formation, the spread of misinformation, and superstitious thinking
- Practical Application: In critical domains such as healthcare, accurate causal reasoning is essential for informed decision-making
- AI Safety: With the widespread application of LLMs in decision-making systems, understanding their cognitive biases has become critically important
- Lack of systematic evaluation of LLMs' performance on contingency judgment tasks
- Ongoing debate about whether LLMs truly "understand" causality or merely replicate causal language
- Existing research primarily focuses on erroneous inferences from correlation to causation, rather than causal illusions in zero-contingency scenarios
The objective is to assess LLMs' causal reasoning abilities through the classical contingency judgment task and to provide empirical evidence about their cognitive biases.
- First Adaptation of Contingency Judgment Tasks to LLM Evaluation: This is the first study applying the classical contingency judgment task from experimental psychology to Large Language Models
- Construction of Large-Scale Zero-Contingency Scenario Dataset: Creation of 1,000 zero-contingency scenarios in a medical context, encompassing four variable types
- Discovery of Universal Causal Illusions in LLMs: All evaluated models systematically infer causal relationships in zero-contingency scenarios
- Revelation of Inconsistent Causal Judgment Standards Across Models: Different models employ different causal reasoning standards, lacking consistency
The contingency judgment task is a classical paradigm in cognitive science for assessing causal learning:
- Input: A series of trials, each recording whether a potential cause was present or absent and whether the outcome occurred
- Output: An effectiveness rating for the potential cause (0-100 scale, where 0 indicates ineffective and 100 indicates completely effective)
- Zero-Contingency Condition: The probability of the outcome is the same whether the cause is present or absent, so the trials provide no evidence of a causal relationship
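The zero-contingency condition can be made concrete with the ΔP index, a standard contingency measure in the causal-learning literature: ΔP = P(outcome | cause) − P(outcome | no cause). The summary does not state which index the paper itself uses, so the following is a minimal sketch under that assumption; the class name, function name, and trial counts are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """Trial counts for one scenario (names are illustrative)."""
    a: int  # cause present, outcome occurred
    b: int  # cause present, outcome did not occur
    c: int  # cause absent, outcome occurred
    d: int  # cause absent, outcome did not occur


def delta_p(t: ContingencyTable) -> float:
    """Return P(outcome | cause) - P(outcome | no cause)."""
    if t.a + t.b == 0 or t.c + t.d == 0:
        raise ValueError("need at least one trial with and without the cause")
    return t.a / (t.a + t.b) - t.c / (t.c + t.d)


# Hypothetical counts: the outcome occurs in 80% of trials in both groups.
null_table = ContingencyTable(a=32, b=8, c=16, d=4)
# Hypothetical counts with a genuine positive contingency, for contrast.
positive_table = ContingencyTable(a=32, b=8, c=4, d=16)

null_dp = delta_p(null_table)
positive_dp = delta_p(positive_table)
```

Under this index, a normative rating for `null_table` is 0 even though the outcome is frequent; a high rating there is what the task counts as a causal illusion.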
- Variable Types (4 categories, 100 variable pairs total):
- Fictional disease and treatment names (e.g., "Glimber medicine" and "Drizzlemorn disorder")
- Uncertain variables (e.g., "Disease X" and "Medicine Y")
- Alternative and pseudo-medical variables (e.g., "Acupuncture Process")
- Validated scientific medications (e.g., "Paracetamol")
- Scenario Generation:
- 1,000 zero-contingency scenarios
- 20-100 trials per scenario
- 80/20 distribution employed to ensure zero-contingency
- Temperature Settings:
- Experiment 1: Temperature = 1, 10 repetitions per scenario
- Experiment 2: Temperature = 0 (deterministic)
- Experiment 3: Default temperature settings
- Evaluated Models:
- GPT-4o-Mini
- Claude-3.5-Sonnet
- Gemini-1.5-Pro
- Task Adaptation: Adaptation of sequential presentation methods from human cognitive experiments to natural language list format
- Role-Playing: Enhancement of task authenticity through role-playing (doctor, researcher)
- Variable Control: Strict control of zero-contingency conditions to ensure internal validity
- Scale: 1,000 zero-contingency scenarios
- Number of Trials: 20-100 trials per scenario
- Variable Pairs: 100 pairs of medical-related variables
- Distribution Control: 80/20 distribution ensuring zero-contingency
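As a rough sketch of how such a scenario could be generated and rendered as a natural-language list, the code below assumes that "80/20" means the outcome occurs in 80% of trials regardless of whether the cause is present. That reading is an assumption, as are the function names, the sentence template, and the group sizes; the summary does not give the paper's actual generator or prompt wording.

```python
import random
from fractions import Fraction


def make_null_scenario(n_present, n_absent, outcome_rate=Fraction(4, 5), seed=0):
    """Build (cause_present, outcome) trials with an identical outcome rate
    in both groups, so the contingency is exactly zero."""
    trials = []
    for cause_present, n in ((True, n_present), (False, n_absent)):
        n_outcome = outcome_rate * n
        if n_outcome.denominator != 1:
            raise ValueError("group size must yield a whole number of outcomes")
        k = int(n_outcome)
        trials += [(cause_present, True)] * k
        trials += [(cause_present, False)] * (n - k)
    # Shuffle with a fixed seed so the presentation order is reproducible.
    random.Random(seed).shuffle(trials)
    return trials


def format_trials(trials, cause="Medicine Y"):
    """Render trials as a numbered list (hypothetical sentence template)."""
    lines = []
    for i, (cause_present, outcome) in enumerate(trials, start=1):
        took = "took" if cause_present else "did not take"
        result = "recovered" if outcome else "did not recover"
        lines.append(f"Patient {i}: {took} {cause}; {result}.")
    return "\n".join(lines)


# 40 trials in total, within the 20-100 range described above.
scenario = make_null_scenario(n_present=20, n_absent=20)
prompt_body = format_trials(scenario)
```

Using exact fractions for the outcome rate avoids rounding a group to, say, 79% versus 81%, which would introduce a small nonzero contingency.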
- Primary Metric: Effectiveness ratings on a 0-100 scale
- Statistical Tests:
- Wilcoxon signed-rank test (testing deviation from 0)
- Friedman test (comparing differences between models)
- Cochran's Q test (comparing zero-response probabilities)
- Prompt Engineering: Prompts designed based on best practices from experimental psychology
- Repeated Experiments: Multiple temperature settings ensure result robustness
- Statistical Analysis: Non-parametric tests employed to handle non-normally distributed data
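To illustrate the zero-response comparison, here is a minimal pure-Python sketch of Cochran's Q for a scenarios × models table of binary indicators (1 if the model gave a rating of 0). The data are toy values invented for the example, not the paper's results, and the closed-form p-value helper applies only to the three-model case (two degrees of freedom).

```python
import math


def cochrans_q(table):
    """Cochran's Q for rows = scenarios (blocks), columns = models.

    Each cell is 0 or 1. Returns (Q, degrees_of_freedom).
    """
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    grand_total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("every scenario has identical responses across models")
    return numerator / denominator, k - 1


def chi2_sf_two_df(x):
    """Chi-square survival function for exactly 2 degrees of freedom."""
    return math.exp(-x / 2)


# Toy data: columns are three hypothetical models, rows are scenarios.
zero_response = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]

q_stat, dof = cochrans_q(zero_response)
p_value = chi2_sf_two_df(q_stat)  # valid here because dof == 2
```

For more than three models the p-value would need a general chi-square survival function, for example from a statistics library.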
| Model | Mean | Median | Standard Deviation |
|---|---|---|---|
| GPT-4o-Mini | 75.74 | 75.7 | 11.41 |
| Claude-3.5-Sonnet | 40.54 | 50.0 | 19.67 |
| Gemini-1.5-Pro | 33.07 | 45.0 | 23.72 |
- Universal Causal Illusions: All models' medians significantly exceed 0 (p < 0.001)
- Low Zero-Response Rates (share of responses giving the normatively correct rating of 0):
- GPT-4o-Mini: 0%
- Claude-3.5-Sonnet: 4.6%
- Gemini-1.5-Pro: 20.5%
- Significant Differences Between Models: Friedman test reveals significant differences between models (χ² = 1516.99, p < 0.001)
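The Friedman statistic reported above compares models by ranking their ratings within each scenario. The sketch below computes that statistic in pure Python, including the usual correction for tied ratings; the rating rows are toy values chosen for illustration and do not reproduce the paper's data.

```python
def average_ranks(values):
    """Rank values from 1..k, giving tied values their average rank.

    Returns (ranks, tie_group_sizes).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def friedman_statistic(rows):
    """Tie-corrected Friedman chi-square for rows = scenarios, cols = models.

    Returns (statistic, degrees_of_freedom).
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        ranks, ties = average_ranks(row)
        for j, r in enumerate(ranks):
            rank_sums[j] += r
        tie_term += sum(t ** 3 - t for t in ties)
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all ratings are tied within every scenario")
    return stat / correction, k - 1


# Toy ratings (0-100) for three hypothetical models across four scenarios.
toy_ratings = [
    [80, 50, 30],
    [75, 40, 45],
    [70, 55, 20],
    [85, 50, 10],
]

chi2_f, dof_f = friedman_statistic(toy_ratings)
```

Because the test uses only within-scenario ranks, it makes no assumption that the ratings are normally distributed, which matches the non-parametric approach described in the summary.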
The models show no significant differences in causal ratings across variable types (fictional, uncertain, alternative medicine, conventional medicine); if anything, ratings tend to be higher for fictional variables.
Results remain consistent under temperature = 0 and default temperature conditions, demonstrating the robustness of findings.
- Gao et al. (2023): Evaluation of LLMs' causal reasoning abilities
- Liu et al. (2023): Causal reasoning in code domains
- Jin et al. (2024): Inference from correlation to causation
- Keshmirian et al. (2024): Biased causal judgments in LLMs
- Carro et al. (2024): Correlation-causation exaggeration in news headlines
- Jin et al. (2022): Logical fallacy detection
This study is the first to apply contingency judgment tasks to LLMs, filling an important gap between cognitive science and AI evaluation.
- Universal Causal Illusions in LLMs: All evaluated models systematically infer causal relationships in zero-contingency scenarios
- Lack of Unified Causal Judgment Standards: Different models employ different evaluation criteria
- Support for "Language Replication" Hypothesis: Results support the hypothesis that LLMs merely replicate causal language rather than truly understanding causality
- Absence of Human Baseline: No corresponding human experiments conducted for comparison
- Limited External Validity: Although experimental design follows psychological best practices, it may not fully represent real-world usage scenarios
- Rating Bias: LLMs may exhibit bias in responding to extreme values
- Internal Validity Issues: The 0-100 rating scale may not be the most suitable format for AI evaluation
- Prompt Techniques: Exploration of chain-of-thought and other prompting techniques
- Diversified Scenarios: Inclusion of positive and negative contingency scenarios
- Trial Order Effects: Investigation of trial presentation order effects on results
- Alternative Task Formats: Use of binary or multi-classification formats
- High Innovation: First application of the classical contingency judgment task to LLM evaluation
- Rigorous Methodology: Experimental design follows psychological best practices with comprehensive statistical analysis
- Consistent Results: Results remain consistent across multiple temperature settings, enhancing credibility of findings
- Practical Significance: Important implications for AI safety and applications
- Limited Sample: Only three models evaluated; generalizability to other models remains unexplored
- Domain Limitations: Testing limited to medical domain; generalizability to other domains unknown
- Insufficient Mechanism Analysis: Lack of deep analysis of underlying mechanisms causing biases
- Absence of Solutions: No specific methods provided for mitigating causal illusions
- Academic Value: Provides new evaluation framework for AI cognitive bias research
- Practical Value: Alerts to the need for caution when using LLMs in critical decision-making domains
- Reproducibility: Complete code and data provided for reproduction and extension
This research is particularly applicable to:
- AI Safety Assessment: Evaluation of cognitive biases in AI systems
- Medical AI Applications: Risk assessment in medical decision-making systems
- Education and Training: Raising awareness of AI limitations
This study cites important literature from cognitive science, experimental psychology, and AI evaluation, particularly the foundational work of Matute et al. (2015) on causal illusions and recent research on LLMs' causal reasoning abilities.
Overall Assessment: This is a high-quality interdisciplinary research paper that successfully applies classical cognitive science paradigms to AI evaluation, revealing important deficiencies in LLMs' causal reasoning. The research methodology is rigorous, and the results have significant theoretical and practical implications, providing valuable insights for future AI safety research.