A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

Rosenthal, Hanafi, Katsis et al.
Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.

Basic Information

  • Paper ID: 2510.11897
  • Title: A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks
  • Authors: Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky (IBM)
  • Classification: cs.HC (Human-Computer Interaction)
  • Publication Date: October 2025 (Submitted to ACM)
  • Paper Link: https://arxiv.org/abs/2510.11897

Abstract

This paper investigates the impact of different annotator feedback loops on data quality in complex Retrieval-Augmented Generation (RAG) tasks. The authors conducted a longitudinal study spanning approximately one year with both internal and external annotator groups, analyzing performance differences in creating multi-turn RAG conversations. The study reveals that tighter feedback loops produce higher-quality conversations but reduce quantity and diversity. The paper provides guidance on optimally leveraging different annotator populations.

Research Background and Motivation

Problem Definition

  1. Core Question: How do different annotator feedback loop structures affect data quality in complex multi-turn RAG conversation creation tasks?
  2. Significance: RAG systems require high-quality benchmark data to evaluate their ability to handle complex questions while avoiding hallucinations and misinformation
  3. Existing Limitations:
    • Manual creation of conversational RAG data imposes extreme cognitive demands
    • Existing research predominantly assumes direct communication feedback loops, overlooking indirect communication scenarios in practice
    • Lack of systematic research on performance differences across annotator populations in complex tasks

Research Motivation

  • Explore data annotation quality management strategies under real-world constraints
  • Understand the impact of feedback loop structures on complex annotation tasks
  • Provide practical guidance for enterprise-level annotation projects

Core Contributions

  1. First systematic study of how different communication feedback loops affect data quality in complex RAG annotation tasks
  2. Key insights discovered: Annotators with tight feedback loops create higher-quality data, while those with loose feedback loops excel in quantity and diversity
  3. Practical strategies provided: Specific quality management recommendations for data creation processes under real-world constraints
  4. Evaluation framework established: Comprehensive assessment of annotator experience and data quality through automated metrics and user research

Methodology Details

Task Definition

Multi-turn RAG conversation creation comprises the following core steps:

  1. Question Creation: Annotators formulate questions relevant to the corpus
  2. Relevant Passage Retrieval: System automatically retrieves relevant document passages
  3. Passage Review and Annotation: Annotators assess passage relevance and re-query when necessary
  4. AI Response Editing: Modify generator output to ensure accuracy and completeness
  5. Label Addition: Add metadata labels for each conversation turn
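
The five steps above can be pictured as a per-turn record that is filled in as the annotator works. The following is a minimal sketch, not the RAGAPHENE data model: every class name, field name, and the example label key are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Passage:
    passage_id: str
    text: str
    relevant: Optional[bool] = None  # step 3: set once the annotator judges the passage


@dataclass
class Turn:
    question: str                                          # step 1: annotator's question
    queries: list = field(default_factory=list)            # step 2 query plus any step 3 re-queries
    passages: list = field(default_factory=list)           # retrieved in step 2, judged in step 3
    generated_response: str = ""                           # raw generator output
    final_response: str = ""                               # step 4: response after annotator editing
    labels: dict = field(default_factory=dict)             # step 5: metadata labels

    @property
    def edited(self) -> bool:
        # A turn counts as edited when the annotator changed the generator output.
        return self.final_response != self.generated_response


@dataclass
class Conversation:
    conversation_id: str
    turns: list = field(default_factory=list)


# Hypothetical walk-through of one turn.
turn = Turn(question="What is RAG?", queries=["What is RAG?"])
turn.passages.append(Passage("p1", "RAG grounds answers in passages.", relevant=True))
turn.generated_response = "RAG is retrieval."
turn.final_response = "RAG grounds generation in retrieved passages."
turn.labels["answerability"] = "answerable"  # assumed label name

conv = Conversation("c1", [turn])
print(conv.turns[0].edited)
```

Keeping the generated and final responses side by side is one way to make edit counts and diff views derivable from the stored record.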

Experimental Design

Annotator Populations

  • Internal Annotators (7 people): Same organization as research team, direct communication feedback loop, hourly compensation
  • External Annotators (40 people): Recruited through external annotation service, indirect communication feedback loop, per-accepted-conversation compensation

Communication Structure Differences

Dimension | Internal Annotators | External Annotators
Communication Mode | Direct (email, Slack, video conference) | Indirect (through intermediary)
Feedback Frequency | Real-time, personalized | Batch, delayed
Training Materials | Slides + direct guidance | Comprehensive video tutorials
Compensation Model | Hourly | Per accepted conversation

Technical Tool: RAGAPHENE

A specialized annotation tool with the following capabilities:

  • Real-time retrieval and generation
  • Passage relevance annotation
  • Response editing and diff visualization
  • Re-query tools
  • Quality prompts and checklists

Evaluation Metrics

Conversation Quality Metrics

  1. Average Number of Turns: Conversation length; subsequent turns are typically more challenging
  2. Average Number of Edits: Number of turns modified by annotators, reflecting complexity
  3. Average Number of Queries: Including initial questions and re-queries
  4. Average Number of Unique Passages: Measuring passage diversity
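
As a rough illustration of how these four averages could be computed, the sketch below assumes each conversation is a list of turn dictionaries with the hypothetical keys `edited`, `queries`, and `passage_ids`. The paper's actual storage format and counting rules are not specified here, and the sample data is invented.

```python
from statistics import mean


def conversation_metrics(conversations):
    """Average the four per-conversation counts over a list of conversations.

    Each conversation is a list of turn dicts with assumed keys:
    'edited' (bool), 'queries' (list of str), 'passage_ids' (list of str).
    """
    return {
        # Conversation length in turns.
        "avg_turns": mean(len(conv) for conv in conversations),
        # Number of turns whose response the annotator modified.
        "avg_edits": mean(sum(1 for t in conv if t["edited"]) for conv in conversations),
        # Initial questions plus re-queries, summed over the turns.
        "avg_queries": mean(sum(len(t["queries"]) for t in conv) for conv in conversations),
        # Distinct passages seen anywhere in the conversation.
        "avg_unique_passages": mean(
            len({pid for t in conv for pid in t["passage_ids"]}) for conv in conversations
        ),
    }


# Invented toy data: two short conversations.
conv_a = [
    {"edited": True, "queries": ["q1", "q1 rephrased"], "passage_ids": ["p1", "p2"]},
    {"edited": False, "queries": ["q2"], "passage_ids": ["p2", "p3"]},
]
conv_b = [
    {"edited": True, "queries": ["q3"], "passage_ids": ["p4"]},
]

metrics = conversation_metrics([conv_a, conv_b])
print(metrics)
```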

Quality Assessment Methods

  • Acceptance/Rejection Rate: Conversation quality determined through manual review
  • Automated Comments: System-generated quality feedback
  • User Research: Collecting annotators' subjective experience
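
The acceptance rate is simply the share of reviewed conversations that reviewers accepted, computed per annotator group. A minimal sketch, assuming review outcomes are available as `(group, accepted)` pairs; the function name and the toy decisions below are illustrative and are not the study's data.

```python
from collections import Counter


def acceptance_rates(reviews):
    """Map each annotator group to the fraction of its conversations that were accepted.

    reviews: iterable of (group, accepted) pairs, where accepted is a bool.
    """
    total = Counter()
    accepted = Counter()
    for group, ok in reviews:
        total[group] += 1
        if ok:
            accepted[group] += 1
    return {group: accepted[group] / total[group] for group in total}


# Invented review decisions, for illustration only.
reviews = [
    ("internal", True),
    ("internal", True),
    ("external", True),
    ("external", False),
]
rates = acceptance_rates(reviews)
print(rates)
```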

Experimental Setup

Data Collection Phases

The study spans approximately one year (May 2024 - May 2025) across three phases:

  1. Pilot Phase: Small-scale experiments to calibrate tasks and instructions
  2. Creation Phase: Large-scale conversation creation with improvements based on pilot feedback
  3. Review Phase: Quality review and refinement

Data Scale

  • Internal Annotators: Approximately 1,500 conversations
  • External Annotators: Approximately 5,000 conversations
  • Analysis Subset: 86 from pilot phase, 618 from creation phase, 424 from review phase

Experimental Results

Main Findings

Data Quality Differences

Metric | Internal Annotators | External Annotators
Average Number of Turns | 7.6 | 4.2
Average Number of Edits | 7.0 | 3.0
Average Number of Queries | 12.7 | 6.2
Average Number of Unique Passages | 17.1 | 7.3
Acceptance Rate | 87% | 69%

Time and Effort Investment

  • Creation Time: Internal annotators 60-75 minutes/conversation, external annotators 30-45 minutes/conversation
  • Passage Reading Volume: Internal annotators read more passages on average (6-12 per turn)
  • Task Understanding: All internal annotators reported the correct operational sequence, while some external annotators misunderstood it

Tool Feature Perception Differences

Significant differences exist between internal and external annotators regarding perceived importance of tool features:

  • Prompt Feature: Largest gap (mean difference = 1.41); internal annotators rate it as more important
  • Re-query Tool: Rated higher by internal annotators (mean difference = 0.78)
  • Passage Marking Feature: Valued more by internal annotators (mean difference = 0.78)
  • Response Editing: Rated similarly by both groups (mean difference = 0.04)
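
A mean difference of this kind can be read as the internal group's average importance rating minus the external group's average rating for the same feature. The sketch below assumes ratings on a numeric scale (for example a 1 to 5 Likert scale, which is an assumption) and uses invented responses; it does not reproduce the survey's numbers.

```python
from statistics import mean


def mean_rating_gap(internal, external):
    """Per feature, internal mean rating minus external mean rating, rounded to 2 places.

    internal, external: dicts mapping a feature name to a list of numeric ratings.
    Only features rated by both groups are compared.
    """
    return {
        feature: round(mean(internal[feature]) - mean(external[feature]), 2)
        for feature in internal
        if feature in external
    }


# Invented ratings on an assumed 1-5 scale.
internal = {"prompts": [5, 4, 5], "response_editing": [4, 4, 5]}
external = {"prompts": [3, 4, 3, 4], "response_editing": [4, 5, 4, 4]}

gaps = mean_rating_gap(internal, external)
print(gaps)
```

A positive value means the internal group rated the feature as more important; values near zero indicate agreement between the groups.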

Synthetic Data Comparison

LLM-generated synthetic conversations are inferior to human-created conversations in both diversity and complexity:

  • Acceptance Rate: 72% (between the two human annotator groups)
  • Obvious lack of passage diversity
  • Absence of human editing and re-query processes

Related Work

RAG System Research

  • Benchmark Datasets: RAD-Bench, RAGBench, RGB, MTRAG, etc.
  • Data Generation Methods: Quality trade-offs between synthetic generation and human annotation
  • Complexity Requirements: Cognitive burden and quality demands of multi-turn conversations

Data Annotation Quality Management

  • Annotator Types: Quality differences between experts and crowdsourced workers
  • Task Complexity: Different management strategies for microtasks vs. macrotasks
  • Quality Assurance: Filtering strategies, multi-stage processes, expert review

Communication Structure Impact

  • Feedback Mechanisms: Impact of direct vs. indirect communication on work quality
  • Collaboration Tools: Interface design supporting complex annotation tasks
  • Training Materials: Training strategies under different communication structures

Conclusions and Discussion

Main Conclusions

  1. Feedback Loop Impact is Significant: Direct feedback loops substantially improve data quality but reduce output quantity
  2. Complementary Advantages: Internal annotators excel in quality; external annotators excel in quantity and diversity
  3. Tool Design Matters: Prompts and automated feedback can partially compensate for communication limitations
  4. Staged Strategy is Effective: Two-stage creation-review workflow balances quality and efficiency

Practical Recommendations

Task Assignment Strategy

  1. Leverage Internal Annotators to rapidly refine guidance materials
  2. Assign External Annotators targeted, lower-complexity subtasks
  3. Two-Stage Workflow: External creation + internal review

Tool Design Principles

  1. Automated Prompts: Compensate for lack of direct feedback
  2. Fine-Grained Comments: Support specific improvement suggestions
  3. Quality Checks: Automatic validation before export
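
One way to picture automatic validation before export is a function that returns a list of problems and blocks export until the list is empty. The sketch below is hypothetical: the three rules and the dictionary keys are assumptions chosen for illustration, not the checks RAGAPHENE implements.

```python
def export_issues(conversation):
    """Return human-readable problems; an empty list means the conversation can be exported.

    conversation: list of turn dicts with assumed keys 'response' (str),
    'passage_relevance' (passage id -> True, False, or None if unjudged),
    and 'labels' (dict of metadata labels).
    """
    issues = []
    if not conversation:
        issues.append("conversation has no turns")
    for number, turn in enumerate(conversation, start=1):
        # Rule 1 (assumed): every turn needs a non-blank final response.
        if not turn.get("response", "").strip():
            issues.append(f"turn {number}: response is empty")
        # Rule 2 (assumed): every retrieved passage must have a relevance judgment.
        judgments = turn.get("passage_relevance", {})
        if any(judged is None for judged in judgments.values()):
            issues.append(f"turn {number}: some passages have no relevance judgment")
        # Rule 3 (assumed): metadata labels must be present.
        if not turn.get("labels"):
            issues.append(f"turn {number}: metadata labels are missing")
    return issues


good = [{
    "response": "RAG grounds answers in retrieved passages.",
    "passage_relevance": {"p1": True, "p2": False},
    "labels": {"answerability": "answerable"},
}]
bad = [{"response": "  ", "passage_relevance": {"p1": None}, "labels": {}}]

print(export_issues(good))
print(export_issues(bad))
```

Returning specific messages rather than a single pass/fail flag fits the fine-grained comments principle, since each message tells the annotator what to fix without a direct conversation.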

Training Material Optimization

  1. Utilize Direct Feedback to improve training content
  2. Video Tutorials: Accommodate indirect communication needs
  3. Iterative Improvement: Update materials based on common issues

Limitations

  1. Sample Size: Limited number of internal annotators constrains statistical analysis
  2. Incentive Mechanisms: Different compensation methods may influence work quality
  3. Domain Specificity: Conclusions may not apply to all complex annotation tasks
  4. Temporal Factors: Learning curves and experience accumulation effects insufficiently considered

Future Directions

  1. Expand Research Scale: More annotators and task types
  2. Incentive Mechanism Research: Specific impact of compensation methods on quality
  3. Automated Assistance: Effectiveness evaluation of AI-assisted annotation
  4. Cross-Domain Validation: Verify findings in other complex tasks

In-Depth Evaluation

Strengths

  1. High Practical Value: Addresses critical issues in real-world annotation projects
  2. Rigorous Methodology: Longitudinal study design with multi-dimensional assessment
  3. Meaningful Findings: Reveals significant impact of feedback loops on complex tasks
  4. Strong Guidance: Provides specific, actionable recommendations

Limitations

  1. Insufficient Variable Control: Cannot completely isolate effects of feedback loops from other factors
  2. Limited Generalizability: Research concentrated on RAG tasks; applicability to other domains unknown
  3. Constrained Quantitative Analysis: Small internal annotator sample limits statistical testing power
  4. Unknown Long-term Effects: Lack of observation over longer time spans

Impact

  1. Academic Contribution: Provides a new perspective at the intersection of HCI and NLP
  2. Practical Guidance: Reference framework for enterprise-level annotation projects
  3. Methodological Innovation: Demonstrates systematic research approach for complex task annotation
  4. Tool Value: RAGAPHENE tool has potential for broader application

Applicable Scenarios

  1. Enterprise-Level Annotation Projects: Large-scale data creation requiring quality-efficiency balance
  2. Complex NLP Tasks: Annotation work requiring multiple steps and high cognitive load
  3. Hybrid Annotation Teams: Projects simultaneously using internal and external annotation resources
  4. Quality-Sensitive Applications: AI system development with extreme data quality requirements

References

The paper cites 82 references spanning RAG systems, data annotation quality, tool design, and communication structures, which provide a solid theoretical foundation for the research.


Summary: This HCI study uses a longitudinal design to show that feedback loop structure significantly affects the quality of complex annotation work, and it offers practical guidance for both academia and industry.