Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
- Paper ID: 2510.10415
- Title: LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints
- Authors: Federica Bologna (Cornell University), Tiffany Pan (Cornell University), Matthew Wilkens (Cornell University), Yue Guo (University of Illinois, Urbana-Champaign), Lucy Lu Wang (University of Washington)
- Classification: cs.CL cs.AI
- Publication Date: October 12, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.10415v1
Evaluating long-form clinical question-answering systems is both resource-intensive and challenging: accurate assessment requires medical expertise, and consistent human judgments over long-form text are difficult to achieve. This paper introduces LONGQAEVAL, an evaluation framework and set of recommendations designed for resource-constrained, high-expertise settings. Based on physician annotations of 300 answers to real patient questions (written by both physicians and LLMs), the study compares coarse-grained answer-level versus fine-grained sentence-level evaluation across three dimensions: correctness, relevance, and safety. Inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse-grained annotation improves agreement on relevance, and safety judgments remain inconsistent. Furthermore, annotating only a small subset of sentences provides reliability comparable to coarse-grained annotation, reducing cost and effort.
As healthcare costs rise and provider accessibility remains limited, patients struggle to obtain timely answers to clinical questions. While generative models integrated into electronic health record (EHR) systems may help, evaluating their responses requires medical expertise.
- Scarcity and high cost of expert annotators: Medical expert evaluation is expensive and limited in availability
- Low inter-annotator agreement: Experts frequently disagree on standards for "good answers"
- Difficulty in evaluating long-form text: Achieving consensus on long-form generated text is challenging
- Annotation fatigue: Complex annotation tasks lead to degraded annotation quality
- Most clinical QA research uses answer-level evaluation, which obscures mixed-quality content
- Lack of standardized evaluation frameworks and detailed annotation guidelines
- Inter-annotator agreement is rarely reported, which weakens the credibility of results
- Lack of systematic investigation into optimal annotation granularity for different evaluation dimensions
- Constructed a dataset of 300 question-answer pairs annotated by 6 medical experts across correctness, relevance, and safety dimensions
- Proposed the LONGQAEVAL annotation framework supporting both coarse-grained and fine-grained evaluation modes
- Conducted randomized human annotation studies systematically comparing coarse-grained and fine-grained annotation effects
- Provided practical recommendations to help clinical LLM developers select optimal annotation designs
- Evaluated two widely-used LLMs (GPT-4 and Llama-3.1-Instruct-405B) on long-form clinical QA
- Analyzed the generalization capability of the annotation framework in LLM-as-judge settings
This study evaluates long-form clinical question-answering systems across three key dimensions:
- Correctness: Whether the answer aligns with current medical knowledge
- Relevance: Whether the answer directly addresses the specific medical question
- Safety: Whether the answer conveys contraindications or risks
- Coarse-grained annotation: Evaluators view the question and complete answer, scoring each dimension on a 5-point Likert scale
- Fine-grained annotation: Evaluators view the question and individual highlighted sentences from the answer, assessing each dimension within sentence context
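To compare the two modes, sentence-level judgments have to be collapsed into one answer-level score. The following is a minimal sketch of two possible aggregation rules, assuming binary sentence labels (1 = acceptable, 0 = not); the function name, the label encoding, and both rules are illustrative assumptions, not the paper's stated procedure.

```python
from statistics import mean


def aggregate_sentence_labels(labels, method="mean"):
    """Collapse per-sentence binary labels into one answer-level score.

    Hypothetical helper: assumes 1 = acceptable, 0 = not acceptable.
    "mean" gives the fraction of acceptable sentences;
    "min" is a strict rule where one bad sentence fails the whole answer.
    """
    if not labels:
        raise ValueError("need at least one sentence label")
    if method == "mean":
        return mean(labels)
    if method == "min":
        return min(labels)
    raise ValueError(f"unknown aggregation method: {method}")


# One answer with four annotated sentences, one of them judged incorrect.
labels = [1, 1, 0, 1]
print(aggregate_sentence_labels(labels, "mean"))
print(aggregate_sentence_labels(labels, "min"))
```

The two rules can rank the same answers differently, which is one reason results that depend on sentence-level labels should state the aggregation used.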
- Randomly sampled 100 real patient questions from the K-QA dataset
- Generated answers using GPT-4 and Llama-3.1-Instruct-405B
- Employed 5-shot in-context learning and chain-of-thought reasoning
- Limited answer length to 270 words (consistent with physician answer length)
- Annotators: 6 practicing physicians from Upwork with 3-15 years of patient care experience
- Group design: Divided into two groups of 3 annotators each, each responsible for all answers to 50 questions
- Alternating design: Each annotator performed half their tasks using coarse-grained and half using fine-grained annotation
- Quality control: Included duplicate annotations to measure intra-rater reliability (IRR)
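The group and alternating design above can be sketched as a small assignment routine. This is only an illustration under stated assumptions: the summary does not say how modes were assigned, so the sketch assumes each question gets one mode shared by all annotators in its group, and the function and annotator names are hypothetical.

```python
import random


def assign_tasks(question_ids, annotators_per_group=3, seed=0):
    """Split questions into two groups and alternate annotation modes.

    Assumption: the mode is fixed per question within a group, so all
    annotators in the group rate that question in the same mode.
    """
    rng = random.Random(seed)
    shuffled = list(question_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    groups = [shuffled[:half], shuffled[half:]]

    plan = {}
    for g, questions in enumerate(groups):
        # Alternate coarse / fine down the shuffled question list.
        tasks = [
            (q, "coarse" if i % 2 == 0 else "fine")
            for i, q in enumerate(questions)
        ]
        for a in range(annotators_per_group):
            plan[f"group{g}_annotator{a}"] = list(tasks)
    return plan


plan = assign_tasks(range(100))
for annotator, tasks in plan.items():
    n_coarse = sum(mode == "coarse" for _, mode in tasks)
    print(annotator, len(tasks), "questions,", n_coarse, "coarse")
```

With 100 questions this gives six annotators, each covering 50 questions, half in each mode.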
Unlike one-size-fits-all approaches, this study found that different evaluation dimensions require different annotation granularities:
- Factual dimensions (e.g., correctness) are suited to fine-grained annotation
- Context-dependent dimensions (e.g., relevance) are suited to coarse-grained annotation
Proposed that annotating only 3 sentences achieves reliability comparable to complete fine-grained annotation, substantially reducing costs.
Fine-grained annotation helps mitigate systematic biases related to answer length, ensuring shorter physician answers are not systematically underestimated.
- K-QA Dataset: Contains real patient questions covering general primary care topics
- Sample Size: 100 questions, 300 question-answer pairs (3 answers per question)
- Answer Sources: Physician answers (106±54 words), GPT-4 answers (124±50 words), Llama answers (170±52 words)
- Inter-Annotator Agreement (IAA): Using Randolph's κ
- Intra-Rater Reliability (IRR): Using percentage agreement
- Annotator Confidence: 5-point Likert scale
- Annotation Time: Task completion time in seconds
- NASA-TLX Scale: Measuring perceived workload
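For reference, the two agreement metrics listed above can be computed as follows. Randolph's free-marginal κ uses the same observed agreement as Fleiss' κ but fixes chance agreement at 1/k for k categories. This is a minimal sketch; the function names and the toy labels are assumptions for illustration, not the authors' code or data.

```python
from collections import Counter


def randolph_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.

    ratings: list of items, each a list of labels, one label per rater.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    per_item = []
    for item in ratings:
        if len(item) != n_raters:
            raise ValueError("all items need the same number of raters")
        counts = Counter(item)
        # Ordered rater pairs that agree, over all ordered rater pairs.
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing / (n_raters * (n_raters - 1)))
    observed = sum(per_item) / len(per_item)
    expected = 1 / num_categories  # chance agreement with free marginals
    return (observed - expected) / (1 - expected)


def percent_agreement(first_pass, second_pass):
    """Share of duplicated items that received the same label both times."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("passes must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)


# Toy labels only: three raters, binary scale.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(randolph_kappa(toy, num_categories=2))
print(percent_agreement([1, 0, 1, 1], [1, 1, 1, 1]))
```

Because chance agreement depends only on the number of categories, κ values from a 5-point coarse scale and a scale with fewer categories are not on identical footing, which is worth keeping in mind when comparing modes.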
- Coarse-grained vs. fine-grained annotation
- Complete fine-grained vs. partial fine-grained annotation (6 vs. 3 sentences)
- Human experts vs. LLM-as-judge (GPT-4o)
- Correctness: Fine-grained annotation yields substantially higher IAA than coarse-grained annotation (0.90 vs. 0.74)
- Relevance: Coarse-grained annotation yields higher IAA than fine-grained annotation (0.71 vs. 0.32)
- Safety: Agreement is low under both methods, with a slight improvement under fine-grained annotation
- Annotating only 3 sentences achieves correlation coefficients exceeding 0.8 with complete 6-sentence annotation
- 3-sentence annotation variance is lower than coarse-grained annotation on correctness and safety dimensions
- Annotation time drops from 459.8 seconds for complete fine-grained annotation to a level comparable to coarse-grained annotation (239.3 seconds)
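The comparison between partial and complete fine-grained annotation can be sketched as below: score each answer from a subset of 3 sentences, score it again from all 6, and correlate the two sets of scores. The summary does not say how sentences were selected or which correlation coefficient was used, so uniform random sampling, Pearson's r, the helper names, and the toy labels are all assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def partial_score(sentence_labels, k, rng):
    """Answer score from k sentences sampled without replacement."""
    sampled = rng.sample(sentence_labels, k)
    return sum(sampled) / k


# Toy binary sentence labels for four 6-sentence answers (not real data).
toy_answers = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
rng = random.Random(0)
full = [sum(a) / len(a) for a in toy_answers]
partial = [partial_score(a, 3, rng) for a in toy_answers]
print(pearson(full, partial))
```

In practice the sampling would be repeated many times to estimate how stable the partial scores are, rather than relying on a single draw.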
- LLM Performance: GPT-4 and Llama are comparable or superior to physicians on correctness
- Relevance Advantage: Both LLMs perform better in addressing patient concerns
- Safety Deficiency: All systems (including physicians) perform poorly on the safety dimension
Fine-grained annotation reveals length bias present in coarse-grained evaluation:
- In coarse-grained evaluation, physician answers receive lower correctness scores (0.78 vs. 0.92-0.93)
- In fine-grained evaluation, physician answer correctness scores improve significantly (0.99)
- GPT-4o as judge shows agreement with experts comparable to or exceeding inter-expert agreement on correctness and relevance dimensions
- Fine-grained instructions' effectiveness in improving LLM-expert agreement varies by aggregation method
- 3-point scales outperform binary scales in LLM evaluation
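One simple way to measure LLM-expert agreement is to reduce the expert labels for each item to a majority label and count how often the judge matches it. The sketch below assumes that setup; majority voting, the handling of ties, and the function names are illustrative choices, and the paper's own aggregation methods may differ.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label, or None when the top two labels are tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def judge_agreement(expert_labels, judge_labels):
    """Fraction of items where the judge matches the expert majority.

    Items without a clear expert majority are skipped (an assumption).
    """
    compared = matched = 0
    for experts, judge in zip(expert_labels, judge_labels):
        consensus = majority_label(experts)
        if consensus is None:
            continue
        compared += 1
        matched += consensus == judge
    if compared == 0:
        raise ValueError("no item has a clear expert majority")
    return matched / compared


# Toy labels on a 3-point scale (0, 1, 2); the third item has no majority.
experts = [[2, 2, 1], [0, 0, 0], [0, 1, 2]]
judge = [2, 1, 0]
print(judge_agreement(experts, judge))
```

With three experts, a binary scale never produces ties, while a 3-point scale can; how such items are treated is one place where the choice of aggregation affects the reported agreement.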
Existing clinical QA benchmarks largely employ rough classification guidelines lacking detailed annotation guidance. MultiMedQA and MedQA use three-level scales, while HealthBench and MEDIC employ general Likert scales, but these approaches lack standardization, resulting in poor consistency and reproducibility.
Most clinical QA work uses answer-level evaluation, which obscures mixed-quality content. Krishna et al. found that sentence-level evaluation improves IAA for faithfulness in summarization tasks, but its applicability to other dimensions and high-stakes domains remains unclear.
This study identifies three core evaluation dimensions (correctness, relevance, safety) based on prior work, which are frequently used in clinical QA evaluation.
- Dimension-specific strategies: Different evaluation dimensions require different annotation granularity designs
- Cost-benefit balance: Partial fine-grained annotation significantly reduces costs while maintaining quality
- Bias mitigation: Fine-grained annotation helps reduce length-related systematic biases
- LLM performance: Current advanced LLMs perform well on correctness and relevance, but safety requires improvement
- Correctness evaluation: Use fine-grained or partial fine-grained annotation (3 sentences)
- Relevance evaluation: Use coarse-grained annotation
- Safety evaluation: Requires further research to improve evaluation methods
- LLM-as-judge: Can supplement expert judgment, particularly on correctness and relevance dimensions
- Dataset scope: Contains only general primary care questions, so findings may not apply to specialty care
- Number of annotators: Only 6 experts, limiting perspective diversity
- IRR sample size: Limited duplicate annotation samples restrict reliability assessment precision
- Model scope: Only two LLMs evaluated, limiting generalizability of results
- Expand to larger datasets and more annotators
- Investigate evaluation methods for specialty medical questions
- Improve safety evaluation framework
- Explore performance of additional LLMs
- Systematic research design: Uses a randomized annotation study with careful control of confounding factors
- High practical value: Provides concrete, actionable evaluation guidance
- Cost-conscious: Fully considers practical needs under resource constraints
- Multi-dimensional analysis: Focuses not only on accuracy but also on time, confidence, and other metrics
- High transparency: Plans to open-source data and code for reproducibility and extension
- Limited sample size: 300 question-answer pairs is relatively small, potentially affecting generalizability
- Domain limitations: Covers only general primary care; applicability to specialty medicine unknown
- Insufficient safety evaluation: This dimension's evaluation methods still require substantial improvement
- Homogeneous cultural background: Annotator backgrounds may affect cross-cultural applicability
- Academic contribution: Provides important methodological guidance for clinical NLP evaluation
- Practical value: Directly informs evaluation practices for clinical AI systems
- Standardization advancement: Contributes to establishing more standardized clinical QA evaluation processes
- Cross-domain inspiration: Evaluation methods may apply to other high-expertise domains
- Clinical AI system evaluation: Assessment before healthcare institutions deploy AI question-answering systems
- Research benchmarks: Standard evaluation protocols for academic research
- Regulatory review: Regulatory evaluation frameworks for medical AI systems
- Product development: Quality assessment for medical technology companies
The paper cites multiple important related works, including:
- Krishna et al. (2023) on guiding principles for long-form summarization evaluation
- Singhal et al. (2023) on large language models encoding clinical knowledge
- Ayers et al. (2023) comparing physician and AI chatbot responses
- Multiple related works on clinical QA benchmarks and evaluation frameworks
Overall Assessment: This is a high-quality methodological research paper providing important empirical guidance for clinical question-answering system evaluation. The research design is rigorous, results are practically valuable, and it significantly contributes to advancing standardization of medical AI evaluation. Despite limitations in sample size and domain coverage, the proposed evaluation framework and findings establish an important foundation for the field's development.