2025-11-13T14:19:10.992196

Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Yamin, Ghosal, Wilder

Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability -- often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM's abilities to re-purpose parametric knowledge in novel settings.

academic

Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Basic Information

Paper ID: 2506.15732
Title: Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning?
Authors: Khurram Yamin*, Gaurav Ghosal*, Bryan Wilder (Carnegie Mellon University)
Classification: cs.AI cs.LG
Publication Time/Venue: ICLR 2026
Paper Link: https://arxiv.org/abs/2506.15732v2

Abstract

Large Language Models (LLMs) contain rich world knowledge embedded in their parameters and demonstrate strong performance on many knowledge-intensive tasks. However, when deployed in new environments, LLMs frequently encounter situations where they must reconcile parametric knowledge with novel or unfamiliar information. This study investigates whether LLMs can integrate contextual knowledge with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real-world experiments on multi-hop reasoning problems, the research reveals that LLMs face widespread difficulties in counterfactual reasoning, often relying solely on their parametric knowledge. Furthermore, simple post-hoc fine-tuning struggles to instill counterfactual reasoning capabilities and frequently leads to degradation of stored parametric knowledge. Ultimately, this work exposes significant limitations in current LLMs' ability to repurpose parametric knowledge in new settings.

Research Background and Motivation

Core Research Question

The central question this study addresses is: Can modern LLMs selectively integrate parametric knowledge with counterfactual premises provided in context to correctly answer multi-hop questions?

Problem Significance

Practical Application Demands: Many real-world scenarios require LLMs to combine pre-trained knowledge with novel or hypothetical information provided at inference time
Knowledge Conflict Challenges: Retrieval-augmented generation faces difficulties when external documents conflict with internal knowledge
Safety-Critical Applications: Accurate conditional reasoning is crucial in interactive systems, retrieval-augmented pipelines, and safety-critical applications

Limitations of Existing Approaches

Existing multi-hop QA benchmarks primarily evaluate models' ability to recall stored facts or compose chains of parametric knowledge, not testing dual requirements
Knowledge conflict research lacks systematic exploration of counterfactual multi-hop reasoning
RAG methods, while capable of merging external information, cannot handle the unique challenges of counterfactual reasoning

Research Motivation

Through the specific task of counterfactual reasoning, systematically investigate LLMs' performance when facing knowledge conflicts, particularly their ability to simultaneously perform contextual override and selective retrieval.

Core Contributions

Counterfactual QA Benchmark: Introduces tasks based on synthetic graphs and real-world causal reasoning scenarios, isolating (i) reinforcing, (ii) adding, (iii) contradicting, and (iv) irrelevant contextual information relative to pre-trained knowledge graphs
Empirical Analysis: Through experiments with GPT-4o and other SOTA models, identifies two primary failure modes: (a) context neglect (models default to using stored facts) and (b) context overfitting (models blindly follow prompts)
Fine-tuning Pitfall Analysis: Demonstrates that simple post-hoc fine-tuning often yields only marginal gains on counterfactual examples and may degrade performance on standard factual benchmarks by inducing unexpected heuristics
Practical Implications: Discusses research findings' implications for interactive systems, retrieval-augmented pipelines, and safety-critical applications

Methodology Details

Task Definition

The study defines a counterfactual multi-hop reasoning task requiring models to:

Contextual Override: Temporarily suppress default facts and accept hypothetical premises
Selective Retrieval: Retrieve and utilize relevant associations stored in weights, even when some information has been modified

Example: "If Paris were located in Italy, in which country would the Eiffel Tower be?"

Requires overriding the parametric knowledge that "Paris is in France"
Requires retaining the association "Eiffel Tower is in Paris"

Experimental Design

Real-World LLMs Experiments

Contextual information is categorized into four scenarios:

Scenario 1 (Reinforcing Prior Knowledge): Provides relationships already existing in the parametric knowledge graph
Scenario 2 (Adding New Information): Provides information needed to answer queries but missing from the parametric knowledge graph
Scenario 3 (Contradicting Prior Knowledge): Provides information strongly conflicting with existing parametric knowledge
Scenario 4 (Irrelevant Information): Provides information unrelated to the query

Synthetic Environment Experiments

In controlled synthetic knowledge graph settings:

Randomly generate directed graph G with vertices representing entities and edges representing relations
Distinguish between atomic facts (single edges) and inferred facts (two-hop combinations)
Test three counterfactual types:
- Hop 1 relevant: counterfactual premise modifies the first hop of inferred facts
- Hop 2 relevant: counterfactual premise modifies the link between bridge entity and final answer
- Irrelevant counterfactual: counterfactual premise completely unrelated to multi-hop query

Prompting Strategies

Compare three strategies:

Standard: Direct causal query
CoT: Chain-of-thought prompting
FT: Fine-tuning on counterfactual examples with CoT explanations

Experimental Setup

Datasets

Real-world Experiments: Binary classification tasks based on causal relationships with 50% random baseline
Synthetic Experiments: Randomly generated knowledge graphs containing atomic and inferred facts

Evaluation Metrics

Accuracy
Performance on 1-hop and 2-hop reasoning tasks

Baseline Methods

GPT-4o (standard, CoT, fine-tuned versions)
GPT-5 (Thinking)
Llama 3.1 8B

Implementation Details

GPT Fine-tuning: 38,754 training tokens, 3 epochs, batch size 1, learning rate multiplier 2
Llama Fine-tuning: 5 epochs, LoRA rank 8, learning rate 0.0001
Synthetic Experiments: 4 NVIDIA A6000 GPUs, total 72 GPU hours

Experimental Results

Main Results

Real-World LLMs Performance

Scenario 1 (Reinforcing Prior): All models perform excellently with accuracy between 90%-100%
Scenario 2 (Adding Information): Non-fine-tuned models achieve 60-75% accuracy, improving to ~90% after fine-tuning
Scenario 3 (Conflicting Prior): Performance collapses to near 50% baseline, with fine-tuning providing only marginal improvements
Scenario 4 (Irrelevant Information): Strong performance with GPT-5 approaching near-perfect accuracy

Synthetic Environment Findings

Fine-tuning Induces Shortcuts: Models quickly learn to repeat entities shown in counterfactual premises rather than perform genuine reasoning
Selective Override Difficulty: Models cannot learn to distinguish when counterfactual premises are relevant
Incorporating Counterfactual Data During Pre-training: Can improve counterfactual reasoning performance but may harm factual task performance

Ablation Studies

Control experiments prove performance degradation is not caused by format changes:

Construct CoT tasks not requiring contextual override
Fine-tuning rapidly adapts to such tasks (100% test accuracy)
Demonstrates that counterfactual reasoning failures stem from task difficulty itself, not general catastrophic forgetting

Key Findings

Two Primary Failure Modes:
- Context neglect: Models default to using stored facts
- Context overfitting: Models blindly follow prompts but forget relevant links
Impact of Alignment: Modern production LLMs trained for factuality and safety alignment tend to rely on pre-trained parametric knowledge
Fine-tuning Limitations: Simple post-hoc fine-tuning struggles to instill robust counterfactual reasoning capabilities

Multi-hop Question Answering

Benchmarks like HotpotQA test multi-hop reasoning capabilities
Existing work primarily focuses on multi-hop reasoning involving only parametric knowledge
This paper uniquely studies scenarios requiring integration of parametric and contextual knowledge

Knowledge Conflicts

RAG methods attempt to merge parametric memory with retrieved information
Existing approaches typically unsuitable for counterfactual reasoning's unique challenges
Requires selective retention and integration of parametric knowledge rather than complete abandonment

Causal Reasoning and Counterfactuals

LLMs' causal reasoning capabilities are an active research area
Existing benchmarks (CLadder, CounterBench, etc.) reveal LLM limitations in formal counterfactual reasoning
This paper fills the gap in understanding how LLMs integrate parametric knowledge with counterfactual premises in multi-hop reasoning

Conclusions and Discussion

Main Conclusions

Fundamental Limitations: Current LLMs lack robust mechanisms to dynamically modify or extend internal knowledge graphs in response to conflicting or novel information
Pervasive Failure Modes: Context neglect and context overfitting issues persist across different prompting strategies and fine-tuning methods
Limited Fine-tuning Effectiveness: Simple fine-tuning methods cannot effectively address counterfactual reasoning problems and may damage existing knowledge

Limitations

Simplified Settings: Counterfactual premises in synthetic environments expressed as static knowledge graph single-edge edits, queries limited to two-hop chains
Insufficient Complexity: Real-world scenarios involve multi-predicate interactions, ambiguous or probabilistic relations, multi-source noisy evidence
Depth Constraints: Not extended to deeper and noisier multi-hop relations

Future Directions

Novel Modeling Paradigms: Need to develop new modeling and training paradigms that dynamically integrate stored and contextual knowledge without compromising either
Mechanism Research: Deeper investigation into mechanisms for implementing selective knowledge override
Complexity Extension: Extend analysis to deeper, more complex multi-hop relations and real-world scenarios

In-Depth Evaluation

Strengths

Problem Importance: Identifies and systematically investigates critical limitations of LLMs in knowledge conflict scenarios
Rigorous Experimental Design: Combines real-world and synthetic environments for comprehensive analytical perspective
Insightful Findings: Reveals two distinct failure modes, providing important insights for understanding LLM behavior
Methodological Contribution: Proposes effective framework for evaluating counterfactual reasoning capabilities

Weaknesses

Solution Absence: Primarily identifies problems without proposing effective solutions
Limited Model Coverage: Tests primarily on few models, lacking broader model evaluation
Task Complexity: Current task settings relatively simple with gaps to real-world applications
Insufficient Theoretical Analysis: Lacks deep theoretical explanation of failure mechanisms

Impact

Academic Value: Provides important foundation for LLM knowledge integration research, potentially inspiring future research directions
Practical Significance: Offers important guidance for RAG systems and applications requiring dynamic knowledge integration
Warning Function: Alerts researchers and practitioners to LLM limitations in knowledge conflict scenarios

Applicable Scenarios

Retrieval-Augmented Systems: Guides RAG system design when handling conflicting information
Interactive AI: Provides reference for dialogue systems needing to handle hypothetical scenarios
Safety-Critical Applications: Requires special caution when applying in domains demanding accurate conditional reasoning

References

The paper cites important works in related fields, including:

Multi-hop QA benchmarks (HotpotQA, NaturalQuestions)
Knowledge conflict handling methods (RAG, REALM, DPR)
Causal reasoning evaluation (CLadder, CounterBench)
LLM mechanism analysis (Grokking transformers, etc.)

Overall Assessment: This is a high-quality research paper that systematically identifies and analyzes important limitations of LLMs in counterfactual reasoning. While not providing complete solutions, it establishes an important foundation for understanding and improving LLMs' knowledge integration capabilities, significantly advancing the field's development.