A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

Rosenthal, Hanafi, Katsis et al.

Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.

Basic Information

Paper ID: 2510.11897
Title: A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks
Authors: Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky (IBM)
Classification: cs.HC (Human-Computer Interaction)
Publication Date: October 2025 (Submitted to ACM)
Paper Link: https://arxiv.org/abs/2510.11897

Abstract

This paper investigates the impact of different annotator feedback loops on data quality in complex Retrieval-Augmented Generation (RAG) tasks. The authors conducted a longitudinal study spanning approximately one year with both internal and external annotator groups, analyzing performance differences in creating multi-turn RAG conversations. The study reveals that tighter feedback loops produce higher-quality conversations but reduce quantity and diversity. The paper provides guidance on optimally leveraging different annotator populations.

Research Background and Motivation

Problem Definition

Core Question: How do different annotator feedback loop structures affect data quality in complex multi-turn RAG conversation creation tasks?
Significance: RAG systems require high-quality benchmark data to evaluate their ability to handle complex questions while avoiding hallucinations and misinformation
Existing Limitations:
- Manual creation of conversational RAG data imposes extreme cognitive demands
- Existing research predominantly assumes direct communication feedback loops, overlooking indirect communication scenarios in practice
- Lack of systematic research on performance differences across annotator populations in complex tasks

Research Motivation

Explore data annotation quality management strategies under real-world constraints
Understand the impact of feedback loop structures on complex annotation tasks
Provide practical guidance for enterprise-level annotation projects

Core Contributions

First systematic study of how different communication feedback loops affect data quality in complex RAG annotation tasks
Key insights discovered: Annotators with tight feedback loops create higher-quality data, while those with loose feedback loops excel in quantity and diversity
Practical strategies provided: Specific quality management recommendations for data creation processes under real-world constraints
Evaluation framework established: Comprehensive assessment of annotator experience and data quality through automated metrics and user research

Methodology Details

Task Definition

Multi-turn RAG conversation creation comprises the following core steps:

Question Creation: Annotators formulate questions relevant to the corpus
Relevant Passage Retrieval: System automatically retrieves relevant document passages
Passage Review and Annotation: Annotators assess passage relevance and re-query when necessary
AI Response Editing: Modify generator output to ensure accuracy and completeness
Label Addition: Add metadata labels for each conversation turn
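
The five steps above suggest a per-turn record that keeps the question, the queries issued, the reviewed passages, the generated and edited responses, and the labels. The following is a minimal sketch of such a record. All class and field names are hypothetical and are not taken from the paper or from RAGAPHENE.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    passage_id: str
    text: str
    relevant: bool = False  # set by the annotator during passage review (step 3)


@dataclass
class Turn:
    question: str                                            # step 1: annotator-written question
    queries: list[str] = field(default_factory=list)         # initial query plus any re-queries
    passages: list[Passage] = field(default_factory=list)    # steps 2-3: retrieved, then reviewed
    generated_response: str = ""                             # raw generator output
    final_response: str = ""                                 # step 4: response after annotator editing
    labels: dict[str, str] = field(default_factory=dict)     # step 5: per-turn metadata labels

    @property
    def was_edited(self) -> bool:
        # A turn counts as edited when the final text differs from the generator output.
        return self.final_response.strip() != self.generated_response.strip()


@dataclass
class Conversation:
    conversation_id: str
    turns: list[Turn] = field(default_factory=list)
```

Keeping the generator output next to the edited response makes it possible to derive edit counts later without storing a separate flag.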

Experimental Design

Annotator Populations

Internal Annotators (7 people): Same organization as research team, direct communication feedback loop, hourly compensation
External Annotators (40 people): Recruited through external annotation service, indirect communication feedback loop, per-accepted-conversation compensation

Communication Structure Differences

Dimension	Internal Annotators	External Annotators
Communication Mode	Direct (email, Slack, video conference)	Indirect (through intermediary)
Feedback Frequency	Real-time, personalized	Batch, delayed
Training Materials	Slides + direct guidance	Comprehensive video tutorials
Compensation Model	Hourly	Per accepted conversation

Technical Tool: RAGAPHENE

A specialized annotation tool with the following capabilities:

Real-time retrieval and generation
Passage relevance annotation
Response editing and diff visualization
Re-query tools
Quality prompts and checklists

Evaluation Metrics

Conversation Quality Metrics

Average Number of Turns: Conversation length; subsequent turns are typically more challenging
Average Number of Edits: Number of turns modified by annotators, reflecting complexity
Average Number of Queries: Including initial questions and re-queries
Average Number of Unique Passages: Measuring passage diversity
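
As an illustration, the four metrics above could be computed from a log of conversations roughly as follows. This is a sketch under assumptions: the input layout (a list of conversations, each a list of turn dictionaries with the keys `queries`, `passage_ids`, and `edited`) is hypothetical, and unique passages are assumed to be counted per conversation.

```python
from statistics import mean


def conversation_metrics(conversations):
    """Average per-conversation metrics.

    Each conversation is a list of turn dicts with hypothetical keys:
    'queries' (list of str), 'passage_ids' (list of str), 'edited' (bool).
    """
    return {
        # Conversation length in turns.
        "avg_turns": mean(len(c) for c in conversations),
        # Number of turns whose response the annotator modified.
        "avg_edits": mean(sum(t["edited"] for t in c) for c in conversations),
        # Initial questions plus re-queries, summed over all turns.
        "avg_queries": mean(sum(len(t["queries"]) for t in c) for c in conversations),
        # Distinct passages seen anywhere in the conversation.
        "avg_unique_passages": mean(
            len({p for t in c for p in t["passage_ids"]}) for c in conversations
        ),
    }


example = [
    [
        {"queries": ["q1", "q1 rephrased"], "passage_ids": ["p1", "p2"], "edited": True},
        {"queries": ["q2"], "passage_ids": ["p2", "p3"], "edited": False},
    ],
    [
        {"queries": ["q"], "passage_ids": ["p9"], "edited": True},
    ],
]
metrics = conversation_metrics(example)
```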

Quality Assessment Methods

Acceptance/Rejection Rate: Conversation quality determined through manual review
Automated Comments: System-generated quality feedback
User Research: Collecting annotators' subjective experience

Experimental Setup

Data Collection Phases

The study spans approximately one year (May 2024 - May 2025) across three phases:

Pilot Phase: Small-scale experiments to calibrate tasks and instructions
Creation Phase: Large-scale conversation creation with improvements based on pilot feedback
Review Phase: Quality review and refinement

Data Scale

Internal Annotators: Approximately 1,500 conversations
External Annotators: Approximately 5,000 conversations
Analysis Subset: 86 from pilot phase, 618 from creation phase, 424 from review phase

Experimental Results

Main Findings

Data Quality Differences

Metric	Internal Annotators	External Annotators
Average Number of Turns	7.6	4.2
Average Number of Edits	7.0	3.0
Average Number of Queries	12.7	6.2
Average Number of Unique Passages	17.1	7.3
Acceptance Rate	87%	69%

Time and Effort Investment

Creation Time: Internal annotators 60-75 minutes/conversation, external annotators 30-45 minutes/conversation
Passage Reading Volume: Internal annotators read more passages on average (6-12 per turn)
Task Understanding: All internal annotators reported the correct sequence of operations; some external annotators reported misunderstandings
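
The creation times and acceptance rates reported above can be combined into a rough estimate of accepted conversations per annotator-hour. This back-of-the-envelope sketch is not an analysis from the paper: it assumes the acceptance rate applies uniformly to every created conversation and ignores review time and training overhead.

```python
def accepted_per_hour(minutes_per_conversation: float, acceptance_rate: float) -> float:
    """Conversations created per hour, scaled by the share that is accepted."""
    return (60.0 / minutes_per_conversation) * acceptance_rate


# Figures taken from the tables and list above; the combination is an assumption.
groups = {
    "internal": {"minutes": (60, 75), "acceptance": 0.87},
    "external": {"minutes": (30, 45), "acceptance": 0.69},
}

for name, g in groups.items():
    fastest, slowest = g["minutes"]
    high = accepted_per_hour(fastest, g["acceptance"])
    low = accepted_per_hour(slowest, g["acceptance"])
    print(f"{name}: {low:.2f} to {high:.2f} accepted conversations per annotator-hour")
```

Under these assumptions the estimate is about 0.70 to 0.87 for internal annotators and 0.92 to 1.38 for external annotators, which is consistent with the quantity advantage attributed to the external group, although the accepted conversations differ in length and depth.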

Tool Feature Perception Differences

Internal and external annotators differ noticeably in how important they consider most tool features (μ difference is the gap between the two groups' mean ratings):

Prompt Feature: Largest gap (μ difference = 1.41); internal annotators rate it as more important
Re-query Tool: Rated higher by internal annotators (μ difference = 0.78)
Passage Marking Feature: Valued more by internal annotators (μ difference = 0.78)
Response Editing: Rated similarly by both groups (μ difference = 0.04)

Synthetic Data Comparison

LLM-generated synthetic conversations are inferior to human-created conversations in both diversity and complexity:

Acceptance Rate: 72%, which falls between the external (69%) and internal (87%) human annotator groups
Obvious lack of passage diversity
Absence of human editing and re-query processes

Related Work

RAG System Research

Benchmark Datasets: RAD-Bench, RAGBench, RGB, MTRAG, etc.
Data Generation Methods: Quality trade-offs between synthetic generation and human annotation
Complexity Requirements: Cognitive burden and quality demands of multi-turn conversations

Data Annotation Quality Management

Annotator Types: Quality differences between experts and crowdsourced workers
Task Complexity: Different management strategies for microtasks vs. macrotasks
Quality Assurance: Filtering strategies, multi-stage processes, expert review

Communication Structure Impact

Feedback Mechanisms: Impact of direct vs. indirect communication on work quality
Collaboration Tools: Interface design supporting complex annotation tasks
Training Materials: Training strategies under different communication structures

Conclusions and Discussion

Main Conclusions

Feedback Loop Impact is Significant: Direct feedback loops substantially improve data quality but reduce output quantity
Complementary Advantages: Internal annotators excel in quality; external annotators excel in quantity and diversity
Tool Design Matters: Prompts and automated feedback can partially compensate for communication limitations
Staged Strategy is Effective: Two-stage creation-review workflow balances quality and efficiency

Practical Recommendations

Task Assignment Strategy

Leverage Internal Annotators to rapidly refine guidance materials
Assign External Annotators targeted, lower-complexity subtasks
Two-Stage Workflow: External creation + internal review
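
One way to picture the two-stage workflow is as a review pass over externally created conversations, where an internal reviewer either accepts a submission or rejects it with comments. The sketch below is illustrative only; the `Submission` record, the status values, and the reviewer callback are hypothetical and do not describe the authors' actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Submission:
    conversation_id: str
    created_by: str                      # hypothetical external annotator ID
    status: str = "pending"              # "pending", "accepted", or "rejected"
    comments: list[str] = field(default_factory=list)


def review_stage(
    submissions: list[Submission],
    reviewer: Callable[[Submission], list[str]],
) -> dict[str, list[Submission]]:
    """Stage 2: an internal reviewer returns comments; no comments means accept."""
    result: dict[str, list[Submission]] = {"accepted": [], "rejected": []}
    for sub in submissions:
        sub.comments = reviewer(sub)
        sub.status = "rejected" if sub.comments else "accepted"
        result[sub.status].append(sub)
    return result


# Stage 1 output (external creation), followed by a toy reviewer.
batch = [Submission("c1", "ext-01"), Submission("c2", "ext-02")]
outcome = review_stage(batch, lambda s: ["too few turns"] if s.conversation_id == "c2" else [])
```

Rejected submissions keep their comments, so they can be returned in a batch through the intermediary, matching the indirect feedback loop described for external annotators.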

Tool Design Principles

Automated Prompts: Compensate for lack of direct feedback
Fine-Grained Comments: Support specific improvement suggestions
Quality Checks: Automatic validation before export
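
A pre-export quality check of the kind listed above might look like the following sketch. The specific rules (minimum turn count, non-empty question and response, at least one label per turn) and the dictionary layout are assumptions chosen for illustration; the source does not list the checks that RAGAPHENE performs.

```python
def validate_before_export(conversation: dict, min_turns: int = 2) -> list[str]:
    """Return fine-grained, per-turn problem messages; an empty list means the export may proceed.

    The rules and the min_turns default are hypothetical examples.
    """
    problems = []
    turns = conversation.get("turns", [])
    if len(turns) < min_turns:
        problems.append(
            f"conversation has {len(turns)} turn(s); expected at least {min_turns}"
        )
    for i, turn in enumerate(turns, start=1):
        if not turn.get("question", "").strip():
            problems.append(f"turn {i}: question is empty")
        if not turn.get("response", "").strip():
            problems.append(f"turn {i}: response is empty")
        if not turn.get("labels"):
            problems.append(f"turn {i}: no metadata labels")
    return problems


draft = {"turns": [{"question": " ", "response": "Some answer.", "labels": {}}]}
issues = validate_before_export(draft)
```

Returning one message per problem, tagged with the turn number, gives annotators specific feedback without requiring a direct conversation with the research team.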

Training Material Optimization

Utilize Direct Feedback to improve training content
Video Tutorials: Accommodate indirect communication needs
Iterative Improvement: Update materials based on common issues

Limitations

Sample Size: Limited number of internal annotators constrains statistical analysis
Incentive Mechanisms: Different compensation methods may influence work quality
Domain Specificity: Conclusions may not apply to all complex annotation tasks
Temporal Factors: Learning curves and experience accumulation effects insufficiently considered

Future Directions

Expand Research Scale: More annotators and task types
Incentive Mechanism Research: Specific impact of compensation methods on quality
Automated Assistance: Effectiveness evaluation of AI-assisted annotation
Cross-Domain Validation: Verify findings in other complex tasks

In-Depth Evaluation

Strengths

High Practical Value: Addresses critical issues in real-world annotation projects
Rigorous Methodology: Longitudinal study design with multi-dimensional assessment
Meaningful Findings: Reveals significant impact of feedback loops on complex tasks
Strong Guidance: Provides specific, actionable recommendations

Limitations

Insufficient Variable Control: Cannot completely isolate effects of feedback loops from other factors
Limited Generalizability: Research concentrated on RAG tasks; applicability to other domains unknown
Constrained Quantitative Analysis: Small internal annotator sample limits statistical testing power
Unknown Long-term Effects: Lack of observation over longer time spans

Impact

Academic Contribution: Provides new perspective for HCI and NLP intersection
Practical Guidance: Reference framework for enterprise-level annotation projects
Methodological Innovation: Demonstrates systematic research approach for complex task annotation
Tool Value: RAGAPHENE tool has potential for broader application

Applicable Scenarios

Enterprise-Level Annotation Projects: Large-scale data creation requiring quality-efficiency balance
Complex NLP Tasks: Annotation work requiring multiple steps and high cognitive load
Hybrid Annotation Teams: Projects simultaneously using internal and external annotation resources
Quality-Sensitive Applications: AI system development with stringent data quality requirements

References

The paper cites 82 references spanning RAG systems, data annotation quality, tool design, and communication structures, which ground the study in prior work.

Summary: This HCI study uses a longitudinal design to show how feedback loop structure shapes the quality of complex annotation work, and it offers practical guidance for both academia and industry.