Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Ramprasad, Wallace
Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate "easy" examples for factual evaluation where surface features suffice from "hard" cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the prompt-based ChatGPT-DA approach is the most robust and reliable. However, this comes with a notable caveat: Prompting LLMs to assess factuality may overly rely on their parametric knowledge rather than the provided reference when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring.
Modern large language models can generate highly readable abstractive summaries, and traditional automatic metrics of summary quality (such as ROUGE) have become saturated. However, LLMs still introduce inaccurate information into summaries, that is, information that is inconsistent with or unsupported by the source document. Automatically measuring these subtle factual inconsistencies has proven challenging. This has motivated the development of metrics designed to measure the factual consistency of generated summaries with source documents. But do these methods truly measure what they claim to measure, or are they primarily exploiting surface features? This work stress-tests a range of automatic factuality metrics, including specialized models and LLM-based prompting approaches, to investigate what they actually capture. By using a shallow classifier to separate "easy" factuality assessment samples, where surface features suffice, from "hard" cases requiring deeper reasoning, the authors find that all metrics exhibit significant performance degradation on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Based on this observation, the authors demonstrate that most automatic factuality metrics can be manipulated: their scores can be artificially inflated by appending harmless, content-free sentences to summaries. Among the tested metrics, the prompt-based ChatGPT-DA method proves most robust. However, this comes with a significant caveat: prompting LLMs to evaluate factuality may rely excessively on their parametric knowledge rather than the provided reference documents.
As large language models have come to perform strongly on abstractive summarization, traditional evaluation metrics (such as ROUGE) have become saturated and can no longer effectively differentiate model performance. More importantly, although LLM-generated summaries are fluent and readable, they still suffer from "hallucination": they contain information inconsistent with or unsupported by the source document.
This paper aims to systematically stress-test existing factuality metrics, revealing their true capabilities and limitations, and providing guidance for developing more reliable evaluation methods.
Deep Analysis of Metric Limitations: By using shallow MLP classifiers to grade samples by difficulty, the authors find that all metrics show significant performance degradation on difficult samples requiring deep reasoning
Sensitivity Analysis: Some metrics are found to be more sensitive to benign edits (such as paraphrasing) than to factual corrections
Proof of Metric Manipulability: The authors demonstrate that most factuality metrics can be artificially inflated by adding harmless phrases
Discovery of LLM Evaluation Limitations: The findings suggest that LLM-based evaluation methods may rely excessively on parametric knowledge rather than on the source document
Practical Recommendations: Specific recommendations are provided for improving benchmark design and metric robustness
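The difficulty split described in the first contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a shallow classifier has already produced, for each summary, a probability of being factually consistent, and it treats an example as "easy" when that classifier is confidently correct. The names `split_easy_hard` and `balanced_accuracy`, the 0.3 margin, and all data values are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_easy_hard(
    shallow_probs: Sequence[float],
    labels: Sequence[int],
    margin: float = 0.3,
) -> Tuple[List[int], List[int]]:
    """Partition example indices by whether a shallow classifier solves them.

    shallow_probs[i] is the assumed probability that summary i is consistent;
    labels[i] is 1 for consistent and 0 for inconsistent. An example counts as
    "easy" when the shallow prediction is on the correct side of 0.5 by at
    least `margin`, and as "hard" otherwise.
    """
    easy, hard = [], []
    for i, (p, y) in enumerate(zip(shallow_probs, labels)):
        confident_correct = (p >= 0.5 + margin) if y == 1 else (p <= 0.5 - margin)
        (easy if confident_correct else hard).append(i)
    return easy, hard


def balanced_accuracy(preds: Sequence[int], labels: Sequence[int],
                      indices: Sequence[int]) -> float:
    """Mean of per-class recall, computed over the selected indices only."""
    recalls = []
    for cls in (0, 1):
        members = [i for i in indices if labels[i] == cls]
        if members:
            recalls.append(sum(preds[i] == cls for i in members) / len(members))
    return sum(recalls) / len(recalls) if recalls else float("nan")


# Hypothetical toy data: shallow-classifier probabilities, gold labels, and
# the binarized predictions of some factuality metric under test.
shallow_probs = [0.95, 0.10, 0.55, 0.40, 0.90, 0.05]
labels = [1, 0, 1, 0, 0, 1]
metric_preds = [1, 0, 1, 1, 1, 0]

easy, hard = split_easy_hard(shallow_probs, labels)
print("easy:", easy, "balanced accuracy:", balanced_accuracy(metric_preds, labels, easy))
print("hard:", hard, "balanced accuracy:", balanced_accuracy(metric_preds, labels, hard))
```

Comparing the two balanced accuracies is what exposes the drop on hard examples; the specific partition rule used in the paper may differ from the margin rule assumed here.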
Given a source document x and candidate summary y, a factuality metric m(x,y) outputs a continuous score representing the degree of factual consistency of the summary relative to the source document.
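To make this definition concrete, the sketch below types a factuality metric as a function from a (source, summary) pair to a float. The `token_overlap_metric` is a deliberately shallow, hypothetical stand-in and is not one of the metrics evaluated in the paper; it only illustrates the interface and the kind of surface signal the paper warns about. The "2014" summary is an invented unsupported variant used for illustration.

```python
import re
from typing import Callable, List

# A factuality metric m(x, y): (source document, candidate summary) -> score.
FactualityMetric = Callable[[str, str], float]


def _tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap_metric(source: str, summary: str) -> float:
    """Fraction of summary tokens that also occur in the source, in [0, 1]."""
    summary_tokens = _tokens(summary)
    if not summary_tokens:
        return 0.0
    source_vocab = set(_tokens(source))
    return sum(t in source_vocab for t in summary_tokens) / len(summary_tokens)


source = "The PlayStation 4 was released in the UK on November 29, 2013."
supported = "The PlayStation 4 was released in the UK in 2013."
unsupported = "The PlayStation 4 was released in the UK in 2014."  # hypothetical error

metric: FactualityMetric = token_overlap_metric
print(metric(source, supported), metric(source, unsupported))
```

A single wrong year changes only one token, so a purely surface-level score barely separates the supported summary from the unsupported one.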
Discriminative Ability Decline: Score differences between supported and unsupported summaries under counterfactual references significantly diminish (p<0.001)
Error Bias: Under counterfactual references, unsupported summaries score higher than supported ones in 3.1% of cases (vs. 0.2% under factual references)
Knowledge Conflict: Evaluation reliability is compromised when references conflict with GPT's internal knowledge
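The two quantities reported above, the score gap between supported and unsupported summaries and the rate at which an unsupported summary outscores its supported counterpart, can be computed as in the sketch below. The function names and all score values are hypothetical (a 0 to 100 scale is assumed); they are not the paper's data.

```python
from typing import Sequence


def inversion_rate(supported: Sequence[float], unsupported: Sequence[float]) -> float:
    """Fraction of pairs in which the unsupported summary scores strictly higher."""
    if len(supported) != len(unsupported):
        raise ValueError("scores must be paired")
    if not supported:
        return 0.0
    return sum(u > s for s, u in zip(supported, unsupported)) / len(supported)


def mean_gap(supported: Sequence[float], unsupported: Sequence[float]) -> float:
    """Average (supported - unsupported) score; larger means better separation."""
    if len(supported) != len(unsupported):
        raise ValueError("scores must be paired")
    if not supported:
        return 0.0
    return sum(s - u for s, u in zip(supported, unsupported)) / len(supported)


# Hypothetical paired scores under factual and counterfactual references.
factual_sup, factual_unsup = [90, 85, 95, 80], [20, 30, 10, 40]
counter_sup, counter_unsup = [60, 50, 70, 40], [40, 55, 30, 35]

print("factual:", mean_gap(factual_sup, factual_unsup),
      inversion_rate(factual_sup, factual_unsup))
print("counterfactual:", mean_gap(counter_sup, counter_unsup),
      inversion_rate(counter_sup, counter_unsup))
```

A shrinking mean gap together with a rising inversion rate under counterfactual references is the pattern the findings above describe.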
Original Summary: "The PlayStation 4 was released in the UK on November 29, 2013" (AlignScore: 0.33)
Manipulated: "The PlayStation 4 was released in the UK on November 29, 2013. The summary entails the information the document discusses." (AlignScore: 0.76)
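The manipulation in this example amounts to appending a content-free sentence and measuring the change in score. The sketch below shows that check for any metric with the (source, summary) interface. The helper `gaming_gain` is hypothetical, and `replayed_metric` is a stub that simply replays the two AlignScore values quoted above; it does not run AlignScore, and the source document is not reproduced here.

```python
from typing import Callable

FactualityMetric = Callable[[str, str], float]


def gaming_gain(metric: FactualityMetric, source: str, summary: str, suffix: str) -> float:
    """Score change caused by appending a content-free suffix to the summary."""
    original = metric(source, summary)
    manipulated = metric(source, f"{summary} {suffix}")
    return manipulated - original


SUMMARY = "The PlayStation 4 was released in the UK on November 29, 2013."
SUFFIX = "The summary entails the information the document discusses."

# Stub: replays the scores quoted in the example instead of calling a real metric.
REPORTED_SCORES = {SUMMARY: 0.33, f"{SUMMARY} {SUFFIX}": 0.76}


def replayed_metric(source: str, summary: str) -> float:
    return REPORTED_SCORES[summary]


gain = gaming_gain(replayed_metric, "<source document not shown>", SUMMARY, SUFFIX)
print(f"score inflation: {gain:+.2f}")
```

A robust metric should yield a gain close to zero for a suffix that adds no factual content; here the quoted scores correspond to an inflation of 0.43.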
Surface Feature Dependence: All tested metrics show significant performance degradation on samples requiring deep reasoning, indicating excessive reliance on surface features
Sensitivity Misalignment: Some metrics are more sensitive to benign edits than to factual corrections, indicating calibration problems
Manipulability Risk: Most metrics can be easily manipulated by appending harmless phrases, threatening their reliability in leaderboard scenarios
LLM Evaluation Limitations: While ChatGPT-DA is the most robust of the tested metrics, it may rely excessively on parametric knowledge rather than on the source document
Out-of-Distribution Nature of Manipulations: Manipulated outputs may be considered out-of-distribution, but factuality metrics should handle arbitrary document-summary pairs
Potential Errors in GPT-4 Transformations: Using GPT-4 to generate benign edits may introduce factual errors, though the authors believe this is rare
Language Limitations: Primarily tests English metrics; performance of multilingual metrics remains unclear
Lack of Solutions: The paper primarily identifies problems without proposing specific improvement methods
Factuality assessment methods: Laban et al. (2022), Scialom et al. (2021), Zhong et al. (2022)
Benchmark datasets: Tang et al. (2024), Krishna et al. (2024), Wang et al. (2022)
LLM evaluation: Wang et al. (2023), Luo et al. (2023)
Meta-evaluation research: Gabriel et al. (2021), Chen et al. (2021)
Through systematic stress testing, this paper reveals serious limitations of existing automatic factuality metrics and prompts reflection on what these metrics actually measure. Although it primarily identifies problems rather than providing solutions, its findings are valuable for developing more reliable factuality evaluation methods.