2025-11-13T14:19:10.992196

Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Yamin, Ghosal, Wilder
Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability -- often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM's abilities to re-purpose parametric knowledge in novel settings.
academic

Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Basic Information

  • Paper ID: 2506.15732
  • Title: Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning?
  • Authors: Khurram Yamin*, Gaurav Ghosal*, Bryan Wilder (Carnegie Mellon University)
  • Classification: cs.AI cs.LG
  • Publication Time/Venue: ICLR 2026
  • Paper Link: https://arxiv.org/abs/2506.15732v2

Abstract

Large Language Models (LLMs) contain rich world knowledge embedded in their parameters and demonstrate strong performance on many knowledge-intensive tasks. However, when deployed in new environments, LLMs frequently encounter situations where they must reconcile parametric knowledge with novel or unfamiliar information. This study investigates whether LLMs can integrate contextual knowledge with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real-world experiments on multi-hop reasoning problems, the research reveals that LLMs face widespread difficulties in counterfactual reasoning, often relying solely on their parametric knowledge. Furthermore, simple post-hoc fine-tuning struggles to instill counterfactual reasoning capabilities and frequently leads to degradation of stored parametric knowledge. Ultimately, this work exposes significant limitations in current LLMs' ability to repurpose parametric knowledge in new settings.

Research Background and Motivation

Core Research Question

The central question this study addresses is: Can modern LLMs selectively integrate parametric knowledge with counterfactual premises provided in context to correctly answer multi-hop questions?

Problem Significance

  1. Practical Application Demands: Many real-world scenarios require LLMs to combine pre-trained knowledge with novel or hypothetical information provided at inference time
  2. Knowledge Conflict Challenges: Retrieval-augmented generation faces difficulties when external documents conflict with internal knowledge
  3. Safety-Critical Applications: Accurate conditional reasoning is crucial in interactive systems, retrieval-augmented pipelines, and safety-critical applications

Limitations of Existing Approaches

  • Existing multi-hop QA benchmarks primarily evaluate models' ability to recall stored facts or compose chains of parametric knowledge, not testing dual requirements
  • Knowledge conflict research lacks systematic exploration of counterfactual multi-hop reasoning
  • RAG methods, while capable of merging external information, cannot handle the unique challenges of counterfactual reasoning

Research Motivation

Through the specific task of counterfactual reasoning, systematically investigate LLMs' performance when facing knowledge conflicts, particularly their ability to simultaneously perform contextual override and selective retrieval.

Core Contributions

  1. Counterfactual QA Benchmark: Introduces tasks based on synthetic graphs and real-world causal reasoning scenarios, isolating (i) reinforcing, (ii) adding, (iii) contradicting, and (iv) irrelevant contextual information relative to pre-trained knowledge graphs
  2. Empirical Analysis: Through experiments with GPT-4o and other SOTA models, identifies two primary failure modes: (a) context neglect (models default to using stored facts) and (b) context overfitting (models blindly follow prompts)
  3. Fine-tuning Pitfall Analysis: Demonstrates that simple post-hoc fine-tuning often yields only marginal gains on counterfactual examples and may degrade performance on standard factual benchmarks by inducing unexpected heuristics
  4. Practical Implications: Discusses research findings' implications for interactive systems, retrieval-augmented pipelines, and safety-critical applications

Methodology Details

Task Definition

The study defines a counterfactual multi-hop reasoning task requiring models to:

  1. Contextual Override: Temporarily suppress default facts and accept hypothetical premises
  2. Selective Retrieval: Retrieve and utilize relevant associations stored in weights, even when some information has been modified

Example: "If Paris were located in Italy, in which country would the Eiffel Tower be?"

  • Requires overriding the parametric knowledge that "Paris is in France"
  • Requires retaining the association "Eiffel Tower is in Paris"

Experimental Design

Real-World LLMs Experiments

Contextual information is categorized into four scenarios:

  1. Scenario 1 (Reinforcing Prior Knowledge): Provides relationships already existing in the parametric knowledge graph
  2. Scenario 2 (Adding New Information): Provides information needed to answer queries but missing from the parametric knowledge graph
  3. Scenario 3 (Contradicting Prior Knowledge): Provides information strongly conflicting with existing parametric knowledge
  4. Scenario 4 (Irrelevant Information): Provides information unrelated to the query

Synthetic Environment Experiments

In controlled synthetic knowledge graph settings:

  • Randomly generate directed graph G with vertices representing entities and edges representing relations
  • Distinguish between atomic facts (single edges) and inferred facts (two-hop combinations)
  • Test three counterfactual types:
    • Hop 1 relevant: counterfactual premise modifies the first hop of inferred facts
    • Hop 2 relevant: counterfactual premise modifies the link between bridge entity and final answer
    • Irrelevant counterfactual: counterfactual premise completely unrelated to multi-hop query

Prompting Strategies

Compare three strategies:

  1. Standard: Direct causal query
  2. CoT: Chain-of-thought prompting
  3. FT: Fine-tuning on counterfactual examples with CoT explanations

Experimental Setup

Datasets

  • Real-world Experiments: Binary classification tasks based on causal relationships with 50% random baseline
  • Synthetic Experiments: Randomly generated knowledge graphs containing atomic and inferred facts

Evaluation Metrics

  • Accuracy
  • Performance on 1-hop and 2-hop reasoning tasks

Baseline Methods

  • GPT-4o (standard, CoT, fine-tuned versions)
  • GPT-5 (Thinking)
  • Llama 3.1 8B

Implementation Details

  • GPT Fine-tuning: 38,754 training tokens, 3 epochs, batch size 1, learning rate multiplier 2
  • Llama Fine-tuning: 5 epochs, LoRA rank 8, learning rate 0.0001
  • Synthetic Experiments: 4 NVIDIA A6000 GPUs, total 72 GPU hours

Experimental Results

Main Results

Real-World LLMs Performance

  1. Scenario 1 (Reinforcing Prior): All models perform excellently with accuracy between 90%-100%
  2. Scenario 2 (Adding Information): Non-fine-tuned models achieve 60-75% accuracy, improving to ~90% after fine-tuning
  3. Scenario 3 (Conflicting Prior): Performance collapses to near 50% baseline, with fine-tuning providing only marginal improvements
  4. Scenario 4 (Irrelevant Information): Strong performance with GPT-5 approaching near-perfect accuracy

Synthetic Environment Findings

  • Fine-tuning Induces Shortcuts: Models quickly learn to repeat entities shown in counterfactual premises rather than perform genuine reasoning
  • Selective Override Difficulty: Models cannot learn to distinguish when counterfactual premises are relevant
  • Incorporating Counterfactual Data During Pre-training: Can improve counterfactual reasoning performance but may harm factual task performance

Ablation Studies

Control experiments prove performance degradation is not caused by format changes:

  • Construct CoT tasks not requiring contextual override
  • Fine-tuning rapidly adapts to such tasks (100% test accuracy)
  • Demonstrates that counterfactual reasoning failures stem from task difficulty itself, not general catastrophic forgetting

Key Findings

  1. Two Primary Failure Modes:
    • Context neglect: Models default to using stored facts
    • Context overfitting: Models blindly follow prompts but forget relevant links
  2. Impact of Alignment: Modern production LLMs trained for factuality and safety alignment tend to rely on pre-trained parametric knowledge
  3. Fine-tuning Limitations: Simple post-hoc fine-tuning struggles to instill robust counterfactual reasoning capabilities

Multi-hop Question Answering

  • Benchmarks like HotpotQA test multi-hop reasoning capabilities
  • Existing work primarily focuses on multi-hop reasoning involving only parametric knowledge
  • This paper uniquely studies scenarios requiring integration of parametric and contextual knowledge

Knowledge Conflicts

  • RAG methods attempt to merge parametric memory with retrieved information
  • Existing approaches typically unsuitable for counterfactual reasoning's unique challenges
  • Requires selective retention and integration of parametric knowledge rather than complete abandonment

Causal Reasoning and Counterfactuals

  • LLMs' causal reasoning capabilities are an active research area
  • Existing benchmarks (CLadder, CounterBench, etc.) reveal LLM limitations in formal counterfactual reasoning
  • This paper fills the gap in understanding how LLMs integrate parametric knowledge with counterfactual premises in multi-hop reasoning

Conclusions and Discussion

Main Conclusions

  1. Fundamental Limitations: Current LLMs lack robust mechanisms to dynamically modify or extend internal knowledge graphs in response to conflicting or novel information
  2. Pervasive Failure Modes: Context neglect and context overfitting issues persist across different prompting strategies and fine-tuning methods
  3. Limited Fine-tuning Effectiveness: Simple fine-tuning methods cannot effectively address counterfactual reasoning problems and may damage existing knowledge

Limitations

  1. Simplified Settings: Counterfactual premises in synthetic environments expressed as static knowledge graph single-edge edits, queries limited to two-hop chains
  2. Insufficient Complexity: Real-world scenarios involve multi-predicate interactions, ambiguous or probabilistic relations, multi-source noisy evidence
  3. Depth Constraints: Not extended to deeper and noisier multi-hop relations

Future Directions

  1. Novel Modeling Paradigms: Need to develop new modeling and training paradigms that dynamically integrate stored and contextual knowledge without compromising either
  2. Mechanism Research: Deeper investigation into mechanisms for implementing selective knowledge override
  3. Complexity Extension: Extend analysis to deeper, more complex multi-hop relations and real-world scenarios

In-Depth Evaluation

Strengths

  1. Problem Importance: Identifies and systematically investigates critical limitations of LLMs in knowledge conflict scenarios
  2. Rigorous Experimental Design: Combines real-world and synthetic environments for comprehensive analytical perspective
  3. Insightful Findings: Reveals two distinct failure modes, providing important insights for understanding LLM behavior
  4. Methodological Contribution: Proposes effective framework for evaluating counterfactual reasoning capabilities

Weaknesses

  1. Solution Absence: Primarily identifies problems without proposing effective solutions
  2. Limited Model Coverage: Tests primarily on few models, lacking broader model evaluation
  3. Task Complexity: Current task settings relatively simple with gaps to real-world applications
  4. Insufficient Theoretical Analysis: Lacks deep theoretical explanation of failure mechanisms

Impact

  1. Academic Value: Provides important foundation for LLM knowledge integration research, potentially inspiring future research directions
  2. Practical Significance: Offers important guidance for RAG systems and applications requiring dynamic knowledge integration
  3. Warning Function: Alerts researchers and practitioners to LLM limitations in knowledge conflict scenarios

Applicable Scenarios

  1. Retrieval-Augmented Systems: Guides RAG system design when handling conflicting information
  2. Interactive AI: Provides reference for dialogue systems needing to handle hypothetical scenarios
  3. Safety-Critical Applications: Requires special caution when applying in domains demanding accurate conditional reasoning

References

The paper cites important works in related fields, including:

  • Multi-hop QA benchmarks (HotpotQA, NaturalQuestions)
  • Knowledge conflict handling methods (RAG, REALM, DPR)
  • Causal reasoning evaluation (CLadder, CounterBench)
  • LLM mechanism analysis (Grokking transformers, etc.)

Overall Assessment: This is a high-quality research paper that systematically identifies and analyzes important limitations of LLMs in counterfactual reasoning. While not providing complete solutions, it establishes an important foundation for understanding and improving LLMs' knowledge integration capabilities, significantly advancing the field's development.