
Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Jarolím, Fajčík, Makaiová

Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task: fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.


Basic Information

Paper ID: 2511.21401
Title: Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
Authors: Antonín Jarolím, Martin Fajčík, Lucia Makaiová (Brno University of Technology, Czech Republic)
Category: cs.CL (Computational Linguistics)
Publication Date: November 26, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.21401

Abstract

This paper investigates the capability of Large Language Models (LLMs) to extract fine-grained evidence in fact-checking scenarios, with particular focus on Czech and Slovak languages. The study constructs a dual-annotated dataset containing 186 samples, with each sample annotated by two independent annotators for fine-grained evidence. Evaluation of 17 LLMs of varying sizes (from 4B to 685B parameters) reveals: (1) LLMs frequently fail to verbatim copy evidence from source text, resulting in invalid outputs; (2) despite its small size, llama3.1:8b achieves high accuracy, while gpt-oss-120b underperforms despite having more parameters; (3) qwen3:14b, deepseek-r1:32b, and gpt-oss:20b achieve effective balance between model size and alignment with human annotations.

Research Background and Motivation

1. Problem Statement

Online news article comment sections are significant venues for misinformation dissemination. To effectively manage online discussions and combat misinformation, automated systems must be capable of:

Extracting verifiable claims from user comments
Retrieving relevant trustworthy documents
Precisely locating text fragments in documents that support or refute claims (fine-grained evidence)

This paper focuses on the final task—fine-grained evidence extraction.

2. Problem Significance

User Demand: Over 3/4 of users desire expert responses to comment section discussions, yet manual responses are impractical
Efficiency and Persuasiveness: Providing entire documents as evidence is too coarse-grained, while fine-grained text fragments enable readers to quickly assess without compromising judgment accuracy
Platform Practice: X platform (formerly Twitter) uses "Community Notes," and Seznam.cz supplements selected comments with fact-checking information

3. Limitations of Existing Approaches

Coarse-grained Evidence: Existing automated fact-checking systems (e.g., FactLens, Loki) only provide paragraph-level evidence
Dataset Gaps: FEVER and SciFact provide sentence-level evidence, but no datasets exist for Czech/Slovak, and existing datasets have sentence-level as their finest granularity, not span-level
Unknown LLM Capabilities: Despite advancing LLM reasoning abilities, systematic evaluation of alignment between LLM fine-grained evidence extraction and human annotation remains absent

4. Research Motivation

The paper tests whether LLMs can identify and extract fine-grained evidence as humans do, which would provide a technical foundation for building automated fact-checking systems.

Core Contributions

Dataset Construction: Creates a dataset of 186 Czech/Slovak claim-text pairs with fine-grained evidence annotated by two independent annotators, filling the gap for this language pair and span-level annotation
Systematic LLM Evaluation: Evaluates 17 LLMs of varying sizes (including 685B DeepSeek-R1, 120B gpt-oss and other reasoning models, as well as open-weight models like Gemma-3 and Phi4) on fine-grained evidence extraction
Error Rate and Alignment Analysis:
- Analyzes error rates of invalid LLM outputs
- Evaluates alignment with human annotation using Hungarian matching algorithm and Token-F1
- Discovers non-linear relationship between model size and performance
Optimal Model Identification: Identifies medium-sized models (14B-32B) as achieving best balance between efficiency and accuracy

Methodology Details

Task Definition

Problem Statement: Given a claim and tokenized text t = (t₁, t₂, ..., tₙ), select a set of spans S = {s₁, s₂, ..., sₘ}, where each span sₖ = (tᵢ, ..., tⱼ) with i ≤ j is a contiguous subsequence of t that supports the claim.

Key Constraints:

Spans must be continuous subsequences in the text
Select minimal text fragments
Multiple spans can be selected
Spans should directly support claim veracity
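
The task definition above can be sketched in a few lines. This is only an illustration: the source does not say which tokenizer is used, so whitespace tokenization is an assumption, and the names Span, span_tokens, and is_valid_span_set are hypothetical.

```python
from typing import List, Tuple

# A span is an inclusive pair of token indices (i, j) with i <= j.
Span = Tuple[int, int]


def span_tokens(tokens: List[str], span: Span) -> List[str]:
    """Return the contiguous token subsequence t_i..t_j (inclusive)."""
    i, j = span
    return tokens[i : j + 1]


def is_valid_span_set(tokens: List[str], spans: List[Span]) -> bool:
    """Check that every span is a well-formed contiguous range inside the text."""
    n = len(tokens)
    return all(0 <= i <= j < n for i, j in spans)


# Assumption: whitespace tokenization, used here only for illustration.
text = "The city council approved the budget on Monday after a long debate"
tokens = text.split()
spans = [(2, 5), (7, 7)]  # several spans may be selected for one claim
```

Representing spans as index pairs makes the contiguity constraint hold by construction, which is the situation human annotators are in when they highlight text directly.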

Data Construction Method

Dual Annotation Process

Sample Collection: 186 claim-text pairs
Annotator Pool: 8 non-expert paid annotators
Independent Annotation: Each sample annotated by two different annotators
Annotation Tools:
- First annotation: Custom annotation tool
- Second annotation: Label Studio
Annotation Guidelines:
"Highlight the minimal text portion supporting or refuting the claim. Highlight the part most convincing you that the statement is true."

Annotation Characteristics

Human annotators directly highlight text, ensuring selected spans are continuous in source text
LLMs must regenerate span text, potentially producing output not in source text

LLM Evidence Extraction Method

Model Selection

The evaluated models fall into two categories:

1. Standard LLMs (9 models):

qwen2.5 (72B, 32B)
llama3.3 (70B)
llama3.1 (8B)
gemma2 (27B)
gemma3 (27B, 12B, 4B)
phi4 (14B)
mixtral (8×7B)

2. Chain-of-Thought (CoT) Reasoning Models (8 models):

deepseek-r1 (685B, 32B)
gpt-oss (120B, 20B)
qwen3 (32B, 14B)

Prompt Engineering

LLM input includes:

Original comment (providing context)
Extracted claim
Text for evidence extraction

Key Instructions:

Identify minimal text portion directly supporting the claim
Select phrases best proving claim veracity
Avoid selecting entire sentences unless absolutely necessary
Multiple spans can be selected
Do not modify, correct, or rewrite text; preserve all grammatical and syntactic errors
Output in JSON format: {"spans": [...]}
Each span must be exact substring of source text (character-for-character identical)
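
The output contract in these instructions (a JSON object with a "spans" list, each span an exact substring of the source) can be checked mechanically. The following is a minimal sketch, not the paper's validation code. The function names are hypothetical, and treating malformed JSON as invalid is an assumption, since the source only defines invalid outputs as spans that are not in the source text.

```python
import json
from typing import List, Optional


def parse_spans(raw_output: str) -> Optional[List[str]]:
    """Parse a response of the form {"spans": [...]}; return None if malformed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    spans = data.get("spans")
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        return None
    return spans


def is_valid_output(raw_output: str, source_text: str) -> bool:
    """Valid only if every span is a character-for-character substring."""
    spans = parse_spans(raw_output)
    if spans is None:
        return False
    return all(s in source_text for s in spans)


# The misspelling "goverment" is deliberate: the source text must be copied as-is.
source = "The goverment approved the budget on Wednesday."
verbatim = '{"spans": ["goverment approved the budget"]}'
corrected = '{"spans": ["government approved the budget"]}'
```

Here the verbatim response passes, while the response that silently fixes the spelling fails, which is the kind of error the instruction about preserving grammatical and syntactic errors is meant to prevent.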

Baseline Methods

1. Claim Baseline:

Tokenize claim as c = (c₁, c₂, ..., cₒ)
Match claim word sequences in text
Construct span set Sᴄ

2. Query Baseline:

Use query terms annotators searched with for evidence
Same matching approach as claim baseline

3. Random Baseline:

Randomly sample continuous spans
Span count and length match randomly selected annotator
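
As an illustration of the claim baseline, the sketch below marks text spans whose tokens also occur as a contiguous run in the claim. The source only says that claim word sequences are matched in the text, so the greedy left-to-right matching, the min_len parameter, and the function name claim_baseline are all assumptions.

```python
from typing import List, Tuple


def claim_baseline(
    claim_tokens: List[str], text_tokens: List[str], min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) text spans whose tokens occur as a
    contiguous run in the claim, scanning left to right without overlap."""
    # All contiguous claim n-grams, stored as tuples for fast lookup.
    claim_ngrams = {
        tuple(claim_tokens[a:b])
        for a in range(len(claim_tokens))
        for b in range(a + 1, len(claim_tokens) + 1)
    }
    spans = []
    i = 0
    while i < len(text_tokens):
        k = 0
        # Extend the match while the text run is still a claim n-gram.
        while (
            i + k < len(text_tokens)
            and tuple(text_tokens[i : i + k + 1]) in claim_ngrams
        ):
            k += 1
        if k > 0 and k >= min_len:
            spans.append((i, i + k - 1))
            i += k
        else:
            i += 1
    return spans


claim = "the council approved the budget".split()
text = "on monday the council finally approved the budget".split()
baseline_spans = claim_baseline(claim, text)
```

The query baseline would reuse the same function with the annotators' search query in place of the claim tokens.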

Evaluation Method

Preprocessing

Remove stopwords from all evidence sets (see Appendix A, including common Czech/Slovak stopwords like "a", "je", "to", etc.)

Token-F1 Calculation

Span Pair F1: Calculate token-level F1 scores for all possible span pairs in two annotation sets
Hungarian Matching: Use Hungarian algorithm to find optimal assignment maximizing total F1
Final Score: The average F1 over the optimally matched span pairs is the token-level F1 for a single data point

Rationale: Annotators and LLMs may select different numbers of spans (different exhaustiveness). Because the Hungarian algorithm pairs spans one-to-one, this difference is not penalized.
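
The scoring procedure can be sketched as follows. This is not the paper's implementation: it replaces the Hungarian algorithm with a brute-force search over assignments, which finds the same optimum but is only practical for small span sets. Whitespace tokenization, lowercasing, and averaging over the matched pairs only are assumptions, and the stopword set contains just the three examples named above.

```python
from collections import Counter
from itertools import permutations
from typing import List, Set


def token_f1(pred: List[str], gold: List[str]) -> float:
    """Token-level F1 between two spans, each treated as a bag of tokens."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def strip_stopwords(span: str, stopwords: Set[str]) -> List[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in span.lower().split() if t not in stopwords]


def matched_f1(set_a: List[str], set_b: List[str], stopwords: Set[str]) -> float:
    """Average F1 over the one-to-one span assignment with maximal total F1."""
    a = [strip_stopwords(s, stopwords) for s in set_a]
    b = [strip_stopwords(s, stopwords) for s in set_b]
    if not a or not b:
        return 0.0
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    best = 0.0
    # Brute force over assignments (a stand-in for the Hungarian algorithm).
    for perm in permutations(range(len(large)), len(small)):
        total = sum(token_f1(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(small)


STOPWORDS = {"a", "je", "to"}
ann_a = ["Praha je hlavní město", "v roce 1993"]
ann_b = ["hlavní město", "roce 1993 vznikla", "to je jiné"]
score = matched_f1(ann_a, ann_b, STOPWORDS)
```

In this example the third span of ann_b stays unmatched and does not lower the score. For larger span sets, scipy.optimize.linear_sum_assignment with maximize=True solves the same assignment problem in polynomial time.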

Evaluation Metrics

Error Rate: Proportion of invalid outputs (generated spans not in source text)
Token-F1: Alignment degree with human annotation
Inter-annotator Agreement: F1 score between two annotators

Experimental Setup

Dataset

Scale: 186 samples
Language: Czech and Slovak
Annotation: 2 independent annotations per sample
Source: Verifiable claims in online news comments
Documents: Highly relevant documents found by annotators using search engines

Evaluation Metrics

Invalid %: Percentage of invalid outputs (generated spans not in source text)
Token-F1: Token-level F1 score based on Hungarian matching (0-100 scale)
Max F1: The higher of a model's two F1 scores, one against each annotator (reflects alignment with at least one annotator)
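
A small sketch of how the first and third metrics might be computed. The function names are hypothetical, and the counts and the score of 41.0 are made-up illustrations rather than figures reported in the paper. The 8 invalid outputs out of 186 were chosen only because they round to 4.3%.

```python
from typing import List


def invalid_percent(is_valid: List[bool]) -> float:
    """Percentage of outputs flagged as invalid."""
    if not is_valid:
        raise ValueError("no outputs to score")
    return 100.0 * sum(not v for v in is_valid) / len(is_valid)


def max_f1(f1_ann1: float, f1_ann2: float) -> float:
    """Higher of the two F1 scores, one against each annotator."""
    return max(f1_ann1, f1_ann2)


# Hypothetical run: 8 invalid outputs among 186 samples.
flags = [False] * 8 + [True] * 178
rate = round(invalid_percent(flags), 1)
best = max_f1(56.0, 41.0)  # hypothetical scores against ann 1 and ann 2
```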

Comparison Methods

Human Annotation: ann 1 (LS) and ann 2
17 LLMs: Different sizes and architectures
3 Baselines: random, claim, query

Implementation Details

Use same prompt template (see Appendix B)
JSON format output
No technical constraints enforced (allowing generation of spans not in source text to observe errors)
Calculate F1 after removing stopwords

Experimental Results

Main Results

1. Error Rate Analysis (Figure 1)

Lowest Error Rates:

qwen2.5:72b: 4.3% (best, 72B parameters)
deepseek-r1: 7.0% (685B parameters)
llama3.1:8b: 13.4% (only 8B parameters, excellent performance)

Highest Error Rates:

mixtral:8x7b: 61.8% (worst; 8×7B mixture-of-experts model)
gemma3:4b: 57.5% (4B parameters)
qwen3:14b: 40.3%

Anomalies:

gpt-oss-120b: 32.8% (120B parameters but high error rate, underperformed expectations)
llama3.3:70b: 27.4% (70B parameters but relatively high error rate)

Overall Trend: Larger models typically have lower error rates, but significant exceptions exist.

2. Extraction Performance Analysis (Figure 2)

Inter-annotator Agreement:

ann 1 (LS) vs ann 2: F1 = 48

Best LLM Performance (with ann 1 (LS)):

qwen3:14b: F1 = 56 (exceeds inter-annotator agreement)
deepseek-r1:32b: F1 = 55 (exceeds inter-annotator agreement)
deepseek-r1 (685B): F1 = 38
qwen2.5:72b: F1 = 43

Alignment with ann 2:

All LLM F1 scores with ann 2 lower than with ann 1 (LS)
Indicates different annotation styles produced by two annotation environments

Baseline Performance:

claim baseline: F1 = 17 (precision ~30, very low recall)
query baseline: F1 = 12
random baseline: F1 = 10

All non-neural baseline methods show weak performance (F1 < 18).

3. Model Size and Performance Relationship (Figure 3)

Key Findings:

Small to Medium Scale: Performance improves with scale
Ultra-large Scale: 685B deepseek-r1 and 120B gpt-oss show no further improvement
Optimal Balance Points:
- qwen3:14b: Max F1 ≈ 56
- deepseek-r1:32b: Max F1 ≈ 55
- gpt-oss:20b: Max F1 ≈ 45

Conclusion: Beyond certain threshold, merely increasing parameters no longer improves extraction performance.

Ablation Studies

While the paper lacks traditional ablation experiments, implicit analysis through model comparisons reveals:

Model Architecture Impact:

Reasoning models (CoT) do not consistently outperform standard models
deepseek-r1:32b performs excellently, but deepseek-r1 (685B) does not improve further

Model Size Impact:

8B llama3.1 outperforms many larger models
Suggests model quality and training data matter more than pure scale

Annotation Tool Impact:

Label Studio annotation (ann 1) differs systematically from custom tool annotation (ann 2)
All LLMs align more closely with Label Studio annotation

Case Analysis

Paper provides no specific cases, but method description suggests:

Human Annotation Examples:

Directly highlight minimal relevant text fragments in interface
May include original text with grammatical errors

LLM Output Examples (inferred):

Correct cases: Verbatim copying of source text fragments
Error cases: Paraphrasing, grammar correction, or generating non-existent text

Experimental Findings

Non-monotonic Model Size Relationship: Medium-sized models may outperform ultra-large models
Instruction Following Capability Differences: Many LLMs fail to strictly follow "verbatim copy" instructions
Annotation Environment Impact: Different annotation tools produce different granularity annotations
Baseline Method Limitations: Simple word matching methods have reasonable precision but extremely low recall
Cross-language Capability: LLMs perform reasonably on Czech/Slovak, demonstrating multilingual capability
Error Rate Does Not Imply Alignment: A low error rate does not guarantee a high F1 (e.g., qwen2.5:72b has the lowest error rate at 4.3% but an F1 of only 43)

Related Work

1. Automated Fact-Checking

FactLens:

Decomposes complex claims into sub-claims
Independently evaluates each sub-claim veracity
Limitation: Only provides paragraph-level evidence

Loki:

Automated pipeline: identify verifiable claims → retrieve evidence → verify
Limitation: Evidence remains at paragraph level

AmbiFC:

Introduces ambiguity, allowing multiple sentence-level annotations
Shows importance of sentence-level evidence selection
But actual annotation still at paragraph level

2. Fact-Checking Datasets

FEVER:

General claims sourced from Wikipedia
Sentence-level evidence
English data

SciFact:

Rationale annotations in scientific paper abstracts
Sentence-level evidence
English data

This Paper's Dataset Uniqueness:

Czech/Slovak languages
Span-level evidence (finer granularity than sentence-level)
Dual annotation

3. LLM Reasoning Capability

Scaling Laws:

Performance improves with model scale, architectural improvements, and reasoning capability
This paper finds diminishing returns

Multilingual Capability:

Prior work shows LLMs have strong reasoning on Czech and Slovak datasets
This paper validates applicability to fine-grained evidence extraction task

Paper's Positioning

First systematic evaluation of LLM performance on span-level fine-grained evidence extraction
First Czech/Slovak fine-grained evidence dataset
Reveals non-linear relationship between model scale and performance

Conclusions and Discussion

Main Conclusions

Dataset Contribution: Constructs first Czech/Slovak span-level fine-grained evidence dataset with inter-annotator F1 of 47
Error Rate and Model Scale:
- Clear relationship: small models (gemma3:4b, mixtral:8x7b) have >50% error rates
- Requires constrained decoding mechanisms
Performance Diminishing Returns:
- Small to medium scale: performance improvement
- Ultra-large scale (685B, 120B): no further improvement
- Optimal Balance: 14B qwen3, 32B deepseek-r1, 20B gpt-oss
Human Alignment Exceeded: Some LLMs (qwen3:14b, deepseek-r1:32b) achieve F1 exceeding inter-annotator agreement (but only on valid samples)

Limitations

Dataset Scale:
- Only 186 samples
- Some models produce up to 116 invalid outputs
- May introduce evaluation bias
Evaluation Bias:
- Excluding invalid outputs may remove harder samples
- Artificially inflates performance metrics for certain models
Single Task:
- Focuses only on supporting evidence
- Does not analyze refuting evidence
Language Limitation:
- Covers only Czech and Slovak
- Generalization to other languages unknown
Annotation Discrepancy:
- Two annotation tools produce systematic differences
- Requires further analysis of causes
Unconstrained Generation:
- No technical enforcement that spans must be in source text
- Results in high error rates
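
The evaluation-bias limitation can be made concrete with a little arithmetic. The sketch below compares two ways of aggregating F1 when some outputs are invalid. The scores are hypothetical, and scoring invalid outputs as zero is shown only as one possible alternative, not as something the paper does.

```python
from typing import List, Optional


def mean_f1_valid_only(scores: List[Optional[float]]) -> float:
    """Mean F1 over valid outputs only; None marks an invalid output."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else 0.0


def mean_f1_invalid_as_zero(scores: List[Optional[float]]) -> float:
    """Mean F1 over all samples, scoring each invalid output as 0."""
    if not scores:
        return 0.0
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)


# Hypothetical per-sample scores on a 0-100 scale; None = invalid output.
scores = [60.0, None, 50.0, None, 40.0]
valid_only = mean_f1_valid_only(scores)
with_zeros = mean_f1_invalid_as_zero(scores)
```

With two of five outputs invalid, the valid-only mean is 50.0 while the all-samples mean is 30.0. The gap grows with the error rate, so models with many invalid outputs benefit most from the valid-only aggregation.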

Future Directions

Constrained Decoding:
- Implement constrained decoding or structured output generation
- Force generation of semantically and structurally valid evidence
- Significantly reduce error outputs
Refuting Evidence:
- Conduct same analysis on refuting evidence
- Improve fact-checking pipeline
Dataset Expansion:
- Increase sample count
- Improve statistical significance
Annotation Discrepancy Analysis:
- Deeply analyze differences between two annotation environments
- Unify annotation standards
End-to-End System:
- Integrate claim extraction, document retrieval, and evidence extraction
- Build complete automated fact-checking system
Multilingual Extension:
- Extend to other languages
- Evaluate cross-language generalization capability

In-Depth Evaluation

Strengths

1. Methodological Innovation

First Span-level Annotation: Finer granularity than existing sentence-level, better aligned with practical needs
Dual Annotation Design: Enables inter-annotator agreement calculation, providing benchmark for LLM evaluation
Hungarian Matching Algorithm: Elegantly solves alignment problem with different exhaustiveness levels, avoiding unfair penalties

2. Experimental Comprehensiveness

Comprehensive Model Coverage: 17 LLMs, parameters from 4B to 685B, covering standard and reasoning models
Multi-dimensional Analysis: Error rates, alignment degree, model size relationships
Baseline Comparisons: Include non-neural baselines and human annotation benchmarks

3. Result Insights

Counter-intuitive Findings: Reveals non-linear relationship between model size and performance
Practical Value: Identifies most cost-effective models (14B-32B)
Honest Reporting: Frankly reports high error rates and evaluation biases

4. Writing Clarity

Clear problem definition (formal definition)
Detailed method description (complete prompts)
Clear result visualization (Figures 1-3)

Weaknesses

1. Methodological Limitations

Unconstrained Generation: No enforcement that spans must be in source text, resulting in invalid-output rates from 4.3% to 61.8% depending on the model
Stopword Handling: Simple removal may lose important information
Single Prompt: Does not explore impact of different prompt strategies

2. Experimental Setup Flaws

Small Sample Size: 186 samples may be insufficient for robust conclusions
Evaluation Bias: Excluding invalid samples may distort performance comparisons
Missing Significance Tests: No statistical significance reported
Single Run: No variance reported across multiple runs

3. Insufficient Analysis

Missing Case Studies: No specific success/failure cases shown
Missing Error Type Analysis: No breakdown of error types (paraphrasing, hallucination, truncation, etc.)
Annotation Discrepancy Unexplained: Finds systematic differences between annotation tools but does not deeply analyze
Cross-language Differences: Does not distinguish Czech vs. Slovak performance

4. Technical Details

Unreported Hyperparameters: LLM temperature, top-p settings not specified
Unreported Inference Costs: Actual computational costs of different-sized models not compared
Unverified Robustness: Does not test robustness to prompt variations, text length, etc.

Impact

1. Contribution to Field

Fills Gap: First Czech/Slovak span-level fine-grained evidence dataset
Methodological Contribution: Hungarian matching for span alignment evaluation
Empirical Insight: Empirical evidence of diminishing returns with model scale

2. Practical Value

Model Selection Guidance: Provides cost-effective model recommendations for deployment
Problem Awareness: Alerts researchers to LLM instruction-following issues
Application Path: Provides technical pathway for online discussion management

3. Reproducibility

Strengths:
- Complete prompts provided (Appendix B)
- Uses mostly open-source models
- Detailed method description
Weaknesses:
- Dataset not publicly released (no release plan mentioned in paper)
- Code not open-sourced
- Specific hyperparameters missing

Applicable Scenarios

Suitable Scenarios

Online Discussion Management: Automatically provide fact-checking evidence for comments
News Platforms: Supplement user comments with contextual information
Educational Applications: Help students learn evidence identification
Research Tools: Assist researchers in literature review

Unsuitable Scenarios

High-Risk Decisions: Medical, legal scenarios requiring 100% accuracy (error rates still high)
Real-time Applications: Ultra-large models (685B) computationally expensive
Low-resource Languages: Method effectiveness on other languages unverified
Long Documents: Handling capability for long text untested

Deployment Recommendations

Recommended Models: qwen3:14b or deepseek-r1:32b (balanced performance and cost)
Necessary Improvements: Implement constrained decoding to reduce error rates
Human Review: Retain human review for high-risk applications
Multilingual Extension: Requires re-evaluation for target languages

Key References

FEVER (Thorne et al., 2018): Large-scale fact extraction and verification dataset, sentence-level evidence
SciFact (Wadden et al., 2020): Scientific claim verification, sentence-level rationale annotation
AmbiFC (Glockner et al., 2024): Introduces ambiguity in fact-checking, emphasizes fine-grained evidence importance
DeepSeek-R1 (Guo et al., 2025): LLM with reasoning incentivized through reinforcement learning
Llama 3 (Grattafiori et al., 2024): Meta's open-source LLM series
Hungarian Algorithm (Kuhn, 1955): Classical algorithm for assignment problems, used for span matching

Summary Evaluation

This paper makes valuable contributions to fine-grained evidence extraction, an important yet understudied task in fact-checking. Its greatest strength is constructing the first span-level annotated Czech/Slovak dataset and revealing LLM capabilities and limitations on this task, particularly the diminishing returns of model scale and the strong cost-effectiveness of medium-sized models.

However, the main limitations include the small sample size (186), high error rates (>50% for some models), and potential evaluation bias from excluding invalid samples. Future work most urgently needs constrained decoding and a larger dataset.

Despite these limitations, the paper provides an important empirical foundation and methodological contributions for building automated fact-checking systems, particularly for resource-limited languages. Recommendation: 4/5. This is valuable exploratory research, but it requires follow-up work to address technical issues before practical deployment.