Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.
- Paper ID: 2510.11897
- Title: A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks
- Authors: Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky (IBM)
- Classification: cs.HC (Human-Computer Interaction)
- Publication Date: October 2025 (Submitted to ACM)
- Paper Link: https://arxiv.org/abs/2510.11897
This paper investigates the impact of different annotator feedback loops on data quality in complex Retrieval-Augmented Generation (RAG) tasks. The authors conducted a longitudinal study spanning approximately one year with both internal and external annotator groups, analyzing performance differences in creating multi-turn RAG conversations. The study reveals that tighter feedback loops produce higher-quality conversations but reduce quantity and diversity. The paper provides guidance on optimally leveraging different annotator populations.
- Core Question: How do different annotator feedback loop structures affect data quality in complex multi-turn RAG conversation creation tasks?
- Significance: RAG systems require high-quality benchmark data to evaluate their ability to handle complex questions while avoiding hallucinations and misinformation
- Existing Limitations:
  - Manual creation of conversational RAG data imposes extreme cognitive demands
  - Existing research predominantly assumes direct communication feedback loops, overlooking indirect communication scenarios in practice
  - Lack of systematic research on performance differences across annotator populations in complex tasks
- Explore data annotation quality management strategies under real-world constraints
- Understand the impact of feedback loop structures on complex annotation tasks
- Provide practical guidance for enterprise-level annotation projects
- First systematic study of how different communication feedback loops affect data quality in complex RAG annotation tasks
- Key insights discovered: Annotators with tight feedback loops create higher-quality data, while those with loose feedback loops excel in quantity and diversity
- Practical strategies provided: Specific quality management recommendations for data creation processes under real-world constraints
- Evaluation framework established: Comprehensive assessment of annotator experience and data quality through automated metrics and user research
Multi-turn RAG conversation creation comprises the following core steps:
- Question Creation: Annotators formulate questions relevant to the corpus
- Relevant Passage Retrieval: System automatically retrieves relevant document passages
- Passage Review and Annotation: Annotators assess passage relevance and re-query when necessary
- AI Response Editing: Modify generator output to ensure accuracy and completeness
- Label Addition: Add metadata labels for each conversation turn
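
The five steps above can be read as a small per-turn pipeline. The following is a minimal sketch of that reading, not the authors' tool: the `Passage` and `Turn` classes, the `create_turn` function, and all callback names are hypothetical, and generating the response after passage review is an assumption about ordering that the summary does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Passage:
    passage_id: str
    text: str
    relevant: Optional[bool] = None  # unset until the annotator reviews it


@dataclass
class Turn:
    question: str                                          # step 1
    queries: List[str] = field(default_factory=list)       # step 2, plus re-queries
    passages: List[Passage] = field(default_factory=list)  # step 3
    generated_response: str = ""
    final_response: str = ""                               # step 4
    labels: Dict[str, str] = field(default_factory=dict)   # step 5

    @property
    def was_edited(self) -> bool:
        return self.final_response != self.generated_response


def create_turn(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
    is_relevant: Callable[[Passage], bool],
    edit: Callable[[str], str],
    labels: Dict[str, str],
    requery: Optional[str] = None,
) -> Turn:
    """Walk one turn through the five steps (hypothetical callbacks)."""
    turn = Turn(question=question, queries=[question])

    # Steps 2-3: retrieve passages, then let the annotator judge each one.
    turn.passages = retrieve(question)
    for passage in turn.passages:
        passage.relevant = is_relevant(passage)

    # Re-query once if nothing relevant came back and a new query is supplied.
    if requery is not None and not any(p.relevant for p in turn.passages):
        turn.queries.append(requery)
        extra = retrieve(requery)
        for passage in extra:
            passage.relevant = is_relevant(passage)
        turn.passages.extend(extra)

    # Step 4: generate from the relevant passages, then apply the human edit.
    relevant = [p for p in turn.passages if p.relevant]
    turn.generated_response = generate(question, relevant)
    turn.final_response = edit(turn.generated_response)

    # Step 5: attach metadata labels.
    turn.labels = dict(labels)
    return turn
```

Passing the retriever, generator, and annotator actions as callbacks keeps the sketch independent of any particular retrieval or generation backend.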
- Internal Annotators (7 people): Same organization as research team, direct communication feedback loop, hourly compensation
- External Annotators (40 people): Recruited through external annotation service, indirect communication feedback loop, per-accepted-conversation compensation
| Dimension | Internal Annotators | External Annotators |
|---|---|---|
| Communication Mode | Direct (email, Slack, video conference) | Indirect (through intermediary) |
| Feedback Frequency | Real-time, personalized | Batch, delayed |
| Training Materials | Slides + direct guidance | Comprehensive video tutorials |
| Compensation Model | Hourly | Per accepted conversation |
A specialized annotation tool with the following capabilities:
- Real-time retrieval and generation
- Passage relevance annotation
- Response editing and diff visualization
- Re-query tools
- Quality prompts and checklists
- Average Number of Turns: Conversation length; subsequent turns are typically more challenging
- Average Number of Edits: Number of turns modified by annotators, reflecting complexity
- Average Number of Queries: Including initial questions and re-queries
- Average Number of Unique Passages: Measuring passage diversity
- Acceptance/Rejection Rate: Conversation quality determined through manual review
- Automated Comments: System-generated quality feedback
- User Research: Collecting annotators' subjective experience
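
As an illustration of how the per-conversation averages in this list could be computed, here is a minimal sketch. The record types and field names are hypothetical. It assumes that edits are counted as edited turns and that unique passages are counted per conversation, which is one plausible reading of the definitions above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Set


@dataclass
class TurnRecord:
    queries: int           # initial question plus any re-queries
    edited: bool           # True if the annotator modified the response
    passage_ids: Set[str]  # passages seen in this turn


@dataclass
class ConversationRecord:
    turns: List[TurnRecord]
    accepted: bool         # outcome of manual review


def summarize(conversations: List[ConversationRecord]) -> Dict[str, float]:
    """Average the per-conversation counts over a set of conversations."""
    if not conversations:
        raise ValueError("need at least one conversation")
    return {
        "avg_turns": mean(len(c.turns) for c in conversations),
        "avg_edits": mean(
            sum(int(t.edited) for t in c.turns) for c in conversations
        ),
        "avg_queries": mean(
            sum(t.queries for t in c.turns) for c in conversations
        ),
        # Union the passage ids across turns so repeats count once.
        "avg_unique_passages": mean(
            len(set().union(*(t.passage_ids for t in c.turns)))
            for c in conversations
        ),
        "acceptance_rate": mean(int(c.accepted) for c in conversations),
    }
```

Calling `summarize` separately on each annotator group's conversations gives one row of averages per group, which is the shape of comparison these metrics are meant to support.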
The study spans approximately one year (May 2024 - May 2025) across three phases:
- Pilot Phase: Small-scale experiments to calibrate tasks and instructions
- Creation Phase: Large-scale conversation creation with improvements based on pilot feedback
- Review Phase: Quality review and refinement
- Internal Annotators: Approximately 1,500 conversations
- External Annotators: Approximately 5,000 conversations
- Analysis Subset: 86 from pilot phase, 618 from creation phase, 424 from review phase
| Metric | Internal Annotators | External Annotators |
|---|---|---|
| Average Number of Turns | 7.6 | 4.2 |
| Average Number of Edits | 7.0 | 3.0 |
| Average Number of Queries | 12.7 | 6.2 |
| Average Number of Unique Passages | 17.1 | 7.3 |
| Acceptance Rate | 87% | 69% |
- Creation Time: Internal annotators 60-75 minutes/conversation, external annotators 30-45 minutes/conversation
- Passage Reading Volume: Internal annotators read more passages on average (6-12 per turn)
- Task Understanding: All internal annotators reported the correct operational sequence, whereas some external annotators misunderstood it
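
The quality-versus-quantity trade-off can be made concrete with a rough back-of-the-envelope calculation. The sketch below combines the reported acceptance rates (87% and 69%) with the reported creation times. Using the midpoint of each time range, and treating acceptance rate and creation time as independent, are assumptions made here for illustration and are not part of the paper's analysis.

```python
def accepted_per_hour(
    acceptance_rate: float, minutes_low: float, minutes_high: float
) -> float:
    """Accepted conversations per annotator-hour at the midpoint creation time."""
    midpoint_hours = (minutes_low + minutes_high) / 2 / 60
    return acceptance_rate / midpoint_hours


# Figures taken from the summary above; the midpoint is an assumption.
internal = accepted_per_hour(0.87, 60, 75)
external = accepted_per_hour(0.69, 30, 45)

print(f"internal: {internal:.2f} accepted conversations per annotator-hour")
print(f"external: {external:.2f} accepted conversations per annotator-hour")
```

Under these assumptions the external group yields more accepted conversations per hour, while the internal group's conversations are longer and accepted more often. The estimate ignores conversation length, so it says nothing about accepted turns per hour.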
Significant differences exist between internal and external annotators regarding perceived importance of tool features:
- Prompt Feature: Largest gap between groups (μ difference = 1.41); internal annotators rated it as more important
- Re-query Tool: Rated higher by internal annotators (μ difference = 0.78)
- Passage Marking Feature: Valued more by internal annotators (μ difference = 0.78)
- Response Editing: Rated similarly by both groups (μ difference = 0.04)
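
For readers unfamiliar with the notation, a "μ difference" of this kind is presumably the gap between the two groups' mean importance ratings for a feature. The sketch below shows that calculation. The ratings are made-up placeholder values, not survey data, and the direction of subtraction (internal minus external) is an assumption.

```python
from statistics import mean
from typing import Sequence


def mean_difference(internal: Sequence[float], external: Sequence[float]) -> float:
    """Mean internal rating minus mean external rating for one tool feature."""
    return mean(internal) - mean(external)


# Placeholder ratings for a single feature (hypothetical, not from the paper).
internal_ratings = [5, 4, 5, 4]
external_ratings = [3, 4, 3, 4]

gap = mean_difference(internal_ratings, external_ratings)
print(f"mean difference: {gap:.2f}")
```

A value near zero, as reported for response editing, means the two groups rated the feature about equally on average. It does not by itself show that individual ratings agreed.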
LLM-generated synthetic conversations are inferior to human-created conversations in both diversity and complexity:
- Acceptance Rate: 72% (between the two human annotator groups)
- Obvious lack of passage diversity
- Absence of human editing and re-query processes
- Benchmark Datasets: RAD-Bench, RAGBench, RGB, MTRAG, etc.
- Data Generation Methods: Quality trade-offs between synthetic generation and human annotation
- Complexity Requirements: Cognitive burden and quality demands of multi-turn conversations
- Annotator Types: Quality differences between experts and crowdsourced workers
- Task Complexity: Different management strategies for microtasks vs. macrotasks
- Quality Assurance: Filtering strategies, multi-stage processes, expert review
- Feedback Mechanisms: Impact of direct vs. indirect communication on work quality
- Collaboration Tools: Interface design supporting complex annotation tasks
- Training Materials: Training strategies under different communication structures
- Feedback Loop Impact is Significant: Direct feedback loops substantially improve data quality but reduce output quantity
- Complementary Advantages: Internal annotators excel in quality; external annotators excel in quantity and diversity
- Tool Design Matters: Prompts and automated feedback can partially compensate for communication limitations
- Staged Strategy is Effective: Two-stage creation-review workflow balances quality and efficiency
- Leverage Internal Annotators to rapidly refine guidance materials
- Assign External Annotators targeted, lower-complexity subtasks
- Two-Stage Workflow: External creation + internal review
- Automated Prompts: Compensate for lack of direct feedback
- Fine-Grained Comments: Support specific improvement suggestions
- Quality Checks: Automatic validation before export
- Utilize Direct Feedback to improve training content
- Video Tutorials: Accommodate indirect communication needs
- Iterative Improvement: Update materials based on common issues
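
The "automatic validation before export" recommendation above could look something like the following sketch. The specific checks, the dictionary keys, and the `validate_conversation` name are all hypothetical, since the summary does not list which checks the tool performs. Requiring a relevant passage on every turn is also an assumption that would not suit deliberately unanswerable questions.

```python
from typing import Dict, List


def validate_conversation(turns: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means ready for export."""
    problems: List[str] = []
    if not turns:
        problems.append("conversation has no turns")
    for i, turn in enumerate(turns, start=1):
        # Hypothetical check: the edited response must not be blank.
        if not turn.get("response", "").strip():
            problems.append(f"turn {i}: response is empty")
        # Hypothetical check: at least one passage was marked relevant.
        if not turn.get("relevant_passages"):
            problems.append(f"turn {i}: no passage marked relevant")
        # Hypothetical check: metadata labels were filled in.
        if not turn.get("labels"):
            problems.append(f"turn {i}: labels are missing")
    return problems
```

Returning messages instead of raising on the first failure lets a tool show every issue at once, which fits the fine-grained comments recommended for annotators who lack a direct feedback channel.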
- Sample Size: Limited number of internal annotators constrains statistical analysis
- Incentive Mechanisms: Different compensation methods may influence work quality
- Domain Specificity: Conclusions may not apply to all complex annotation tasks
- Temporal Factors: Learning curves and experience accumulation effects insufficiently considered
- Expand Research Scale: More annotators and task types
- Incentive Mechanism Research: Specific impact of compensation methods on quality
- Automated Assistance: Effectiveness evaluation of AI-assisted annotation
- Cross-Domain Validation: Verify findings in other complex tasks
- High Practical Value: Addresses critical issues in real-world annotation projects
- Rigorous Methodology: Longitudinal study design with multi-dimensional assessment
- Meaningful Findings: Reveals significant impact of feedback loops on complex tasks
- Strong Guidance: Provides specific, actionable recommendations
- Insufficient Variable Control: Cannot completely isolate effects of feedback loops from other factors
- Limited Generalizability: Research concentrated on RAG tasks; applicability to other domains unknown
- Constrained Quantitative Analysis: Small internal annotator sample limits statistical testing power
- Unknown Long-term Effects: Lack of observation over longer time spans
- Academic Contribution: Provides a new perspective at the intersection of HCI and NLP
- Practical Guidance: Reference framework for enterprise-level annotation projects
- Methodological Innovation: Demonstrates systematic research approach for complex task annotation
- Tool Value: RAGAPHENE tool has potential for broader application
- Enterprise-Level Annotation Projects: Large-scale data creation requiring quality-efficiency balance
- Complex NLP Tasks: Annotation work requiring multiple steps and high cognitive load
- Hybrid Annotation Teams: Projects simultaneously using internal and external annotation resources
- Quality-Sensitive Applications: AI system development with extreme data quality requirements
The paper cites 82 references spanning RAG systems, data annotation quality, tool design, and communication structures, providing a solid theoretical foundation for the research.
Summary: Through a rigorous longitudinal design, this HCI study shows that feedback loop structure significantly affects data quality in complex annotation tasks, and it offers actionable guidance for both academia and industry.