Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task: fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
- Paper ID: 2511.21401
- Title: Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
- Authors: Antonín Jarolím, Martin Fajčík, Lucia Makaiová (Brno University of Technology, Czech Republic)
- Category: cs.CL (Computation and Language)
- Publication Date: November 26, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2511.21401
This paper investigates the capability of Large Language Models (LLMs) to extract fine-grained evidence in fact-checking scenarios, with a particular focus on the Czech and Slovak languages. The study constructs a dual-annotated dataset of 186 samples, each annotated for fine-grained evidence by two independent annotators. Evaluation of 17 LLMs of varying sizes (from 4B to 685B parameters) reveals: (1) LLMs frequently fail to copy evidence verbatim from the source text, resulting in invalid outputs; (2) despite its small size, llama3.1:8b produces a high proportion of valid outputs, while gpt-oss-120b underperforms despite having more parameters; (3) qwen3:14b, deepseek-r1:32b, and gpt-oss:20b achieve an effective balance between model size and alignment with human annotations.
Online news article comment sections are significant venues for misinformation dissemination. To effectively manage online discussions and combat misinformation, automated systems must be capable of:
- Extracting verifiable claims from user comments
- Retrieving relevant trustworthy documents
- Precisely locating text fragments in documents that support or refute claims (fine-grained evidence)
This paper focuses on the final task—fine-grained evidence extraction.
- User Demand: Over 3/4 of users desire expert responses to comment section discussions, yet manual responses are impractical
- Efficiency and Persuasiveness: Providing entire documents as evidence is too coarse-grained, while fine-grained text fragments enable readers to quickly assess without compromising judgment accuracy
- Platform Practice: X platform (formerly Twitter) uses "Community Notes," and Seznam.cz supplements selected comments with fact-checking information
- Coarse-grained Evidence: Existing automated fact-checking systems (e.g., FactLens, Loki) only provide paragraph-level evidence
- Dataset Gaps: FEVER and SciFact provide sentence-level evidence, but no datasets exist for Czech/Slovak, and existing datasets have sentence-level as their finest granularity, not span-level
- Unknown LLM Capabilities: Despite advancing LLM reasoning abilities, systematic evaluation of alignment between LLM fine-grained evidence extraction and human annotation remains absent
The goal is to verify whether LLMs can identify and extract fine-grained evidence the way humans do, providing a technical foundation for building automated fact-checking systems.
- Dataset Construction: Creates a dataset of 186 Czech/Slovak claim-text pairs with fine-grained evidence annotated by two independent annotators, filling the gap for this language pair and span-level annotation
- Systematic LLM Evaluation: Evaluates 17 LLMs of varying sizes (including 685B DeepSeek-R1, 120B gpt-oss and other reasoning models, as well as open-weight models like Gemma-3 and Phi4) on fine-grained evidence extraction
- Error Rate and Alignment Analysis:
- Analyzes error rates of invalid LLM outputs
- Evaluates alignment with human annotation using Hungarian matching algorithm and Token-F1
- Discovers non-linear relationship between model size and performance
- Optimal Model Identification: Identifies medium-sized models (14B-32B) as achieving best balance between efficiency and accuracy
Problem Statement: Given a claim and tokenized text t = (t₁, t₂, ..., tₙ), select a set of spans S = {s₁, s₂, ..., sₘ}, where each span sₖ = (tᵢ, ..., tⱼ) with 1 ≤ i ≤ j ≤ n is a contiguous subsequence of the text that supports the claim.
Key Constraints:
- Spans must be continuous subsequences in the text
- Select minimal text fragments
- Multiple spans can be selected
- Spans should directly support claim veracity
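The formalization above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and inclusive 0-based token indices; neither choice is stated in the paper, and `Span` and `span_text` are hypothetical names introduced here.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A contiguous token range, inclusive on both ends (0-based indices)."""
    start: int
    end: int


def span_text(tokens: List[str], span: Span) -> str:
    """Return the text of a span; reject ranges that violate 0 <= i <= j < n."""
    if not (0 <= span.start <= span.end < len(tokens)):
        raise ValueError(f"invalid span {span} for {len(tokens)} tokens")
    return " ".join(tokens[span.start : span.end + 1])


# Assumption: whitespace splitting stands in for whatever tokenizer is used.
tokens = "The city council approved the new budget on Monday".split()

# Several short spans may be selected for one claim.
evidence = [Span(1, 3), Span(5, 6)]
selected = [span_text(tokens, s) for s in evidence]
```

Representing spans as index pairs rather than strings makes the contiguity constraint hold by construction, which is the property human annotators get for free when highlighting text.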
- Sample Collection: 186 claim-text pairs
- Annotator Pool: 8 non-expert paid annotators
- Independent Annotation: Each sample annotated by two different annotators
- Annotation Tools:
- First annotation: Custom annotation tool
- Second annotation: Label Studio
- Annotation Guidelines:
"Highlight the minimal text portion supporting or refuting the claim. Highlight the part most convincing you that the statement is true."
- Human annotators directly highlight text, ensuring selected spans are continuous in source text
- LLMs must regenerate span text, potentially producing output not in source text
The evaluated models fall into two categories:
1. Standard LLMs (9 models):
- qwen2.5 (72B, 32B)
- llama3.3 (70B)
- llama3.1 (8B)
- gemma2 (27B)
- gemma3 (27B, 12B, 4B)
- phi4 (14B)
- mixtral (8×7B)
2. Chain-of-Thought (CoT) Reasoning Models (8 models):
- deepseek-r1 (685B, 32B)
- gpt-oss (120B, 20B)
- qwen3 (32B, 14B)
LLM input includes:
- Original comment (providing context)
- Extracted claim
- Text for evidence extraction
Key Instructions:
- Identify minimal text portion directly supporting the claim
- Select phrases best proving claim veracity
- Avoid selecting entire sentences unless absolutely necessary
- Multiple spans can be selected
- Do not modify, correct, or rewrite text; preserve all grammatical and syntactic errors
- Output in JSON format: {"spans": [...]}
- Each span must be an exact substring of the source text (character-for-character identical)
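The output contract in these instructions can be checked mechanically. The sketch below is one possible validator, not the paper's evaluation code: `validate_output` is a hypothetical helper, and treating an empty span list as valid is an assumption made here.

```python
import json
from typing import List, Tuple


def validate_output(raw: str, source: str) -> Tuple[bool, List[str]]:
    """Return (is_valid, spans).

    An output counts as valid only if it parses as {"spans": [...]} with
    string items and every span is a non-empty exact substring of `source`.
    Assumption: an empty list of spans is accepted as valid.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, []
    if not isinstance(payload, dict):
        return False, []
    spans = payload.get("spans")
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        return False, []
    # Character-for-character check: paraphrases and corrections fail here.
    is_valid = all(bool(s) and s in source for s in spans)
    return is_valid, spans


source = "The government approved the budget for 2025."
verbatim = '{"spans": ["approved the budget"]}'
paraphrase = '{"spans": ["approved a budget"]}'

ok_result = validate_output(verbatim, source)
bad_result = validate_output(paraphrase, source)
```

Counting the outputs for which `is_valid` is false, divided by the number of samples, would give an error rate in the sense used in this summary.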
1. Claim Baseline:
- Tokenize claim as c = (c₁, c₂, ..., cₒ)
- Match claim word sequences in text
- Construct span set Sᴄ
2. Query Baseline:
- Use query terms annotators searched with for evidence
- Same matching approach as claim baseline
3. Random Baseline:
- Randomly sample continuous spans
- Span count and length match randomly selected annotator
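The baselines above are described only briefly, so the following is one possible reading rather than the authors' implementation. It assumes that the claim baseline greedily marks the longest run of text tokens that also appears as a contiguous run in the claim, and that the random baseline may produce overlapping spans; `claim_baseline` and `random_baseline` are hypothetical names. The query baseline would reuse `claim_baseline` with the query tokens in place of the claim tokens.

```python
import random
from typing import List, Tuple

TokenSpan = Tuple[int, int]  # inclusive (start, end) token indices


def claim_baseline(claim: List[str], text: List[str]) -> List[TokenSpan]:
    """Greedy left-to-right matching of contiguous claim n-grams in the text."""
    spans, i = [], 0
    while i < len(text):
        best = 0
        for j in range(len(claim)):
            k = 0
            while (i + k < len(text) and j + k < len(claim)
                   and text[i + k] == claim[j + k]):
                k += 1
            best = max(best, k)
        if best:
            spans.append((i, i + best - 1))
            i += best
        else:
            i += 1
    return spans


def random_baseline(n_tokens: int, span_lengths: List[int],
                    rng: random.Random) -> List[TokenSpan]:
    """Sample one contiguous span per requested length (non-empty text assumed).

    `span_lengths` would come from the spans of a randomly chosen annotator.
    """
    spans = []
    for length in span_lengths:
        length = min(length, n_tokens)
        start = rng.randrange(n_tokens - length + 1)
        spans.append((start, start + length - 1))
    return spans


claim = "the council approved the budget".split()
text = "on monday the council approved a new budget".split()
claim_spans = claim_baseline(claim, text)
random_spans = random_baseline(len(text), [3, 1], random.Random(0))
```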
Remove stopwords from all evidence sets (see Appendix A, including common Czech/Slovak stopwords like "a", "je", "to", etc.)
- Span Pair F1: Calculate token-level F1 scores for all possible span pairs in two annotation sets
- Hungarian Matching: Use Hungarian algorithm to find optimal assignment maximizing total F1
- Final Score: Average F1 of optimal matching as token-level F1 for single data point
Rationale: Since annotators and LLMs may select different span counts (different exhaustiveness), Hungarian algorithm avoids penalizing this difference.
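A small sketch of this evaluation may help. It makes three assumptions that are not confirmed by the source: spans are compared as bags of tokens, the final score averages F1 over the matched pairs only, and the optimal assignment is found by exhaustive search over permutations instead of the Hungarian algorithm (both return the same optimum; exhaustive search is only practical for a handful of spans). The names `token_f1` and `matched_f1` and the English stopwords are illustrative.

```python
from collections import Counter
from itertools import permutations
from typing import AbstractSet, List, Sequence


def token_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Token-level F1 between two spans, each treated as a bag of tokens."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def matched_f1(spans_a: List[List[str]], spans_b: List[List[str]],
               stopwords: AbstractSet[str] = frozenset()) -> float:
    """Average F1 of the best one-to-one matching between two span sets."""
    def clean(spans: List[List[str]]) -> List[List[str]]:
        kept = [[t for t in span if t not in stopwords] for span in spans]
        return [span for span in kept if span]  # drop spans emptied by filtering

    a, b = clean(spans_a), clean(spans_b)
    if not a or not b:
        return 0.0
    # Match every span of the smaller set to a distinct span of the larger one.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    best_total = max(
        sum(token_f1(small[i], large[j]) for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best_total / len(small)


ann_a = [["council", "approved", "the", "budget"], ["on", "monday"]]
ann_b = [["approved", "the", "budget"]]
score = matched_f1(ann_a, ann_b, stopwords={"the", "on"})  # 0-1 scale
```

Because the average runs over matched pairs only, the extra span in `ann_a` does not lower the score, which mirrors the stated rationale about differing exhaustiveness. For larger span sets, `scipy.optimize.linear_sum_assignment` with `maximize=True` on the pairwise F1 matrix would return the same optimal total in polynomial time.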
- Error Rate: Proportion of invalid outputs (generated spans not in source text)
- Token-F1: Alignment degree with human annotation
- Inter-annotator Agreement: F1 score between two annotators
- Scale: 186 samples
- Language: Czech and Slovak
- Annotation: 2 independent annotations per sample
- Source: Verifiable claims in online news comments
- Documents: Highly relevant documents found by annotators using search engines
- Invalid %: Percentage of invalid outputs (generated spans not in source text)
- Token-F1: Token-level F1 score based on Hungarian matching (0-100 scale)
- Max F1: The higher of a model's two F1 scores against the two annotators (reflects alignment with at least one annotator)
- Human Annotation: ann 1 (LS) and ann 2
- 17 LLMs: Different sizes and architectures
- 3 Baselines: random, claim, query
- Use same prompt template (see Appendix B)
- JSON format output
- No technical constraints enforced (allowing generation of spans not in source text to observe errors)
- Calculate F1 after removing stopwords
Lowest Error Rates:
- qwen2.5:72b: 4.3% (best, 72B parameters)
- deepseek-r1: 7.0% (685B parameters)
- llama3.1:8b: 13.4% (only 8B parameters, excellent performance)
Highest Error Rates:
- mixtral:8x7b: 61.8% (worst; a mixture-of-experts model with about 47B total and about 13B active parameters)
- gemma3:4b: 57.5% (4B parameters)
- qwen3:14b: 40.3%
Anomalies:
- gpt-oss-120b: 32.8% (120B parameters but high error rate, underperformed expectations)
- llama3.3:70b: 27.4% (70B parameters but relatively high error rate)
Overall Trend: Larger models typically have lower error rates, but significant exceptions exist.
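The reported percentages can be translated back into absolute counts. The arithmetic below assumes that each model produces exactly one output per sample, so that every rate is a fraction of 186; that assumption is not stated explicitly in this summary.

```python
N_SAMPLES = 186

# Error rates (percent) as listed above.
reported_error_rates = {
    "qwen2.5:72b": 4.3,
    "deepseek-r1 (685B)": 7.0,
    "llama3.1:8b": 13.4,
    "llama3.3:70b": 27.4,
    "gpt-oss-120b": 32.8,
    "qwen3:14b": 40.3,
    "gemma3:4b": 57.5,
    "mixtral:8x7b": 61.8,
}

# Nearest whole number of invalid outputs implied by each percentage.
invalid_counts = {
    model: round(rate / 100 * N_SAMPLES)
    for model, rate in reported_error_rates.items()
}

# Sanity check: each count reproduces the reported rate to one decimal place.
consistent = all(
    round(100 * invalid_counts[m] / N_SAMPLES, 1) == rate
    for m, rate in reported_error_rates.items()
)
```

Under this assumption the best model has 8 invalid outputs and the worst has 115, which leaves only 71 valid samples on which the worst model's F1 can be computed.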
Inter-annotator Agreement:
- ann 1 (LS) vs ann 2: F1 = 48
Best LLM Performance (with ann 1 (LS)):
- qwen3:14b: F1 = 56 (exceeds inter-annotator agreement)
- deepseek-r1:32b: F1 = 55 (exceeds inter-annotator agreement)
- deepseek-r1 (685B): F1 = 38
- qwen2.5:72b: F1 = 43
Alignment with ann 2:
- All LLM F1 scores with ann 2 lower than with ann 1 (LS)
- Indicates different annotation styles produced by two annotation environments
Baseline Performance:
- claim baseline: F1 = 17 (precision ~30, very low recall)
- query baseline: F1 = 12
- random baseline: F1 = 10
All non-neural baseline methods show weak performance (F1 < 18).
Key Findings:
- Small to Medium Scale: Performance improves with scale
- Ultra-large Scale: 685B deepseek-r1 and 120B gpt-oss show no further improvement
- Optimal Balance Points:
- qwen3:14b: Max F1 ≈ 56
- deepseek-r1:32b: Max F1 ≈ 55
- gpt-oss:20b: Max F1 ≈ 45
Conclusion: Beyond certain threshold, merely increasing parameters no longer improves extraction performance.
While the paper lacks traditional ablation experiments, implicit analysis through model comparisons reveals:
Model Architecture Impact:
- Reasoning models (CoT) do not consistently outperform standard models
- deepseek-r1:32b performs excellently, but deepseek-r1 (685B) does not improve further
Model Size Impact:
- 8B llama3.1 outperforms many larger models
- Suggests model quality and training data matter more than pure scale
Annotation Tool Impact:
- Label Studio annotation (ann 1) differs systematically from custom tool annotation (ann 2)
- All LLMs align more closely with Label Studio annotation
Paper provides no specific cases, but method description suggests:
Human Annotation Examples:
- Directly highlight minimal relevant text fragments in interface
- May include original text with grammatical errors
LLM Output Examples (inferred):
- Correct cases: Verbatim copying of source text fragments
- Error cases: Paraphrasing, grammar correction, or generating non-existent text
- Non-monotonic Model Size Relationship: Medium-sized models may outperform ultra-large models
- Instruction Following Capability Differences: Many LLMs fail to strictly follow "verbatim copy" instructions
- Annotation Environment Impact: Different annotation tools produce different granularity annotations
- Baseline Method Limitations: Simple word matching methods have reasonable precision but extremely low recall
- Cross-language Capability: LLMs perform reasonably on Czech/Slovak, demonstrating multilingual capability
- Error Rate vs. Alignment: A low error rate does not necessarily mean a high F1 (e.g., qwen2.5:72b has the lowest error rate, 4.3%, but reaches only F1 = 43 with ann 1 (LS))
FactLens:
- Decomposes complex claims into sub-claims
- Independently evaluates each sub-claim veracity
- Limitation: Only provides paragraph-level evidence
Loki:
- Automated pipeline: identify verifiable claims → retrieve evidence → verify
- Limitation: Evidence remains at paragraph level
AmbiFC:
- Introduces ambiguity, allowing multiple sentence-level annotations
- Shows importance of sentence-level evidence selection
- But actual annotation still at paragraph level
FEVER:
- General claims sourced from Wikipedia
- Sentence-level evidence
- English data
SciFact:
- Rationale annotations in scientific paper abstracts
- Sentence-level evidence
- English data
This Paper's Dataset Uniqueness:
- Czech/Slovak languages
- Span-level evidence (finer granularity than sentence-level)
- Dual annotation
Scaling Laws:
- Performance improves with model scale, architectural improvements, and reasoning capability
- This paper finds diminishing returns
Multilingual Capability:
- Prior work shows LLMs have strong reasoning on Czech and Slovak datasets
- This paper validates applicability to fine-grained evidence extraction task
- First systematic evaluation of LLM performance on span-level fine-grained evidence extraction
- First Czech/Slovak fine-grained evidence dataset
- Reveals non-linear relationship between model scale and performance
- Dataset Contribution: Constructs the first Czech/Slovak span-level fine-grained evidence dataset, with an inter-annotator F1 of 48
- Error Rate and Model Scale:
- The weakest models on this measure (gemma3:4b, mixtral:8x7b) have >50% error rates
- Requires constrained decoding mechanisms
- Performance Diminishing Returns:
- Small to medium scale: performance improvement
- Ultra-large scale (685B, 120B): no further improvement
- Optimal Balance: 14B qwen3, 32B deepseek-r1, 20B gpt-oss
- Human Alignment Exceeded: Some LLMs (qwen3:14b, deepseek-r1:32b) achieve F1 exceeding inter-annotator agreement (but only on valid samples)
- Dataset Scale:
- Only 186 samples
- Some models produce up to 115 invalid outputs (61.8% of 186)
- May introduce evaluation bias
- Evaluation Bias:
- Excluding invalid outputs may remove harder samples
- Artificially inflates performance metrics for certain models
- Single Task:
- Focuses only on supporting evidence
- Does not analyze refuting evidence
- Language Limitation:
- Covers only Czech and Slovak
- Generalization to other languages unknown
- Annotation Discrepancy:
- Two annotation tools produce systematic differences
- Requires further analysis of causes
- Unconstrained Generation:
- No technical enforcement that spans must be in source text
- Results in high error rates
- Constrained Decoding:
- Implement constrained decoding or structured output generation
- Force generation of semantically and structurally valid evidence
- Significantly reduce error outputs
- Refuting Evidence:
- Conduct same analysis on refuting evidence
- Improve fact-checking pipeline
- Dataset Expansion:
- Increase sample count
- Improve statistical significance
- Annotation Discrepancy Analysis:
- Deeply analyze differences between two annotation environments
- Unify annotation standards
- End-to-End System:
- Integrate claim extraction, document retrieval, and evidence extraction
- Build complete automated fact-checking system
- Multilingual Extension:
- Extend to other languages
- Evaluate cross-language generalization capability
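Short of true constrained decoding, some invalid outputs could be repaired after generation. The sketch below is a swapped-in post-processing technique, not something the paper proposes or evaluates: it maps a generated span back to the source when the two differ only in letter case or whitespace. `snap_to_source` is a hypothetical helper; paraphrases and hallucinated text still come back as invalid.

```python
import re
from typing import Optional


def snap_to_source(span: str, source: str) -> Optional[str]:
    """Return the source substring matching `span` up to case and whitespace.

    Returns None when no such substring exists, so the output stays invalid.
    Note: matches are not anchored to word boundaries.
    """
    tokens = span.split()
    if not tokens:
        return None
    # Allow any run of whitespace between the generated tokens.
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, source, flags=re.IGNORECASE)
    return match.group(0) if match else None


source = "The government  approved the\nbudget for 2025."
repaired = snap_to_source("Approved the budget", source)
rejected = snap_to_source("approved a budget", source)
```

Returning the matched source text, not the model's text, keeps the exact-substring guarantee intact. Any repair of this kind would need to be reported separately from the raw error rate, since it changes what the metric measures.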
- First Span-level Annotation: Finer granularity than existing sentence-level, better aligned with practical needs
- Dual Annotation Design: Enables inter-annotator agreement calculation, providing benchmark for LLM evaluation
- Hungarian Matching Algorithm: Elegantly solves alignment problem with different exhaustiveness levels, avoiding unfair penalties
- Comprehensive Model Coverage: 17 LLMs, parameters from 4B to 685B, covering standard and reasoning models
- Multi-dimensional Analysis: Error rates, alignment degree, model size relationships
- Baseline Comparisons: Include non-neural baselines and human annotation benchmarks
- Counter-intuitive Findings: Reveals non-linear relationship between model size and performance
- Practical Value: Identifies most cost-effective models (14B-32B)
- Honest Reporting: Frankly reports high error rates and evaluation biases
- Clear problem definition (formal definition)
- Detailed method description (complete prompts)
- Clear result visualization (Figures 1-3)
- Unconstrained Generation: No enforcement that spans must be in source text, resulting in invalid-output rates that range from 4.3% to 61.8% across models
- Stopword Handling: Simple removal may lose important information
- Single Prompt: Does not explore impact of different prompt strategies
- Small Sample Size: 186 samples may be insufficient for robust conclusions
- Evaluation Bias: Excluding invalid samples may distort performance comparisons
- Missing Significance Tests: No statistical significance reported
- Single Run: No variance reported across multiple runs
- Missing Case Studies: No specific success/failure cases shown
- Missing Error Type Analysis: No breakdown of error types (paraphrasing, hallucination, truncation, etc.)
- Annotation Discrepancy Unexplained: Finds systematic differences between annotation tools but does not deeply analyze
- Cross-language Differences: Does not distinguish Czech vs. Slovak performance
- Unreported Hyperparameters: LLM temperature, top-p settings not specified
- Unreported Inference Costs: Actual computational costs of different-sized models not compared
- Unverified Robustness: Does not test robustness to prompt variations, text length, etc.
- Fills Gap: First Czech/Slovak span-level fine-grained evidence dataset
- Methodological Contribution: Hungarian matching for span alignment evaluation
- Empirical Insight: Empirical evidence of diminishing returns with model scale
- Model Selection Guidance: Provides cost-effective model recommendations for deployment
- Problem Awareness: Alerts researchers to LLM instruction-following issues
- Application Path: Provides technical pathway for online discussion management
- Strengths:
- Complete prompts provided (Appendix B)
- Uses mostly open-source models
- Detailed method description
- Weaknesses:
- Dataset not publicly released (no release plan mentioned in paper)
- Code not open-sourced
- Specific hyperparameters missing
- Online Discussion Management: Automatically provide fact-checking evidence for comments
- News Platforms: Supplement user comments with contextual information
- Educational Applications: Help students learn evidence identification
- Research Tools: Assist researchers in literature review
- High-Risk Decisions: Medical, legal scenarios requiring 100% accuracy (error rates still high)
- Real-time Applications: Ultra-large models (685B) computationally expensive
- Low-resource Languages: Method effectiveness on other languages unverified
- Long Documents: Handling capability for long text untested
- Recommended Models: qwen3:14b or deepseek-r1:32b (balanced performance and cost)
- Necessary Improvements: Implement constrained decoding to reduce error rates
- Human Review: Retain human review for high-risk applications
- Multilingual Extension: Requires re-evaluation for target languages
- FEVER (Thorne et al., 2018): Large-scale fact extraction and verification dataset, sentence-level evidence
- SciFact (Wadden et al., 2020): Scientific claim verification, sentence-level rationale annotation
- AmbiFC (Glockner et al., 2024): Introduces ambiguity in fact-checking, emphasizes fine-grained evidence importance
- DeepSeek-R1 (Guo et al., 2025): LLM with reasoning incentivized through reinforcement learning
- Llama 3 (Grattafiori et al., 2024): Meta's open-source LLM series
- Hungarian Algorithm (Kuhn, 1955): Classical algorithm for assignment problems, used for span matching
This paper makes valuable contributions to fine-grained evidence extraction, an important yet understudied task in fact-checking. Its greatest strength is constructing the first span-level annotated Czech/Slovak dataset and revealing LLM capabilities and limitations on this task, particularly the diminishing returns of model scale and the strong cost-effectiveness of medium-sized models.
However, the main limitations include the small sample size (186), high error rates (>50% for some models), and potential evaluation bias from excluding invalid samples. Future work most urgently needs constrained decoding and a larger dataset.
Despite these limitations, the paper provides an important empirical foundation and methodological contributions for building automated fact-checking systems, particularly for resource-limited languages. Recommendation: 4/5. Valuable exploratory research, but it requires follow-up work to address technical issues before practical deployment.