A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks

Rosenthal, Hanafi, Katsis et al.

Grounding conversations in existing passages, known as Retrieval-Augmented Generation (RAG), is an important aspect of Chat-Based Assistants powered by Large Language Models (LLMs) to ensure they are faithful and don't provide misinformation. Several benchmarks have been created to measure the performance of LLMs on this task. We present a longitudinal study comparing the feedback loop of an internal and external human annotator group for the complex annotation task of creating multi-turn RAG conversations for evaluating LLMs. We analyze the conversations produced by both groups and provide results of a survey comparing their experiences. Our study highlights the advantages of each annotator population and the impact of the different feedback loops; a closer loop creates higher quality conversations with a decrease in quantity and diversity. Further, we present guidance for how to best utilize two different population groups when performing annotation tasks, particularly when the task is complex.

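Basic Information

Paper ID: 2510.11897
Title: A Longitudinal Study on Different Annotator Feedback Loops in Complex RAG Tasks
Authors: Sara Rosenthal, Maeda Hanafi, Yannis Katsis, Lucian Popa, Marina Danilevsky (IBM)
Category: cs.HC (Human-Computer Interaction)
Published: October 2025 (submitted to ACM)
Paper link: https://arxiv.org/abs/2510.11897

Core question: In the complex task of creating multi-turn RAG conversations, how do different annotator feedback loop structures affect data quality?
Importance: RAG systems need high-quality benchmark data to evaluate how well they handle complex questions and avoid hallucination and misinformation.
Limitations of existing work:
- Manually creating conversational RAG data is cognitively very demanding
- Existing studies mostly assume a direct-communication feedback loop and overlook the indirect-communication settings that occur in practice
- Systematic studies of how different annotator groups perform on complex tasks are lacking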

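Research Motivation

Explore strategies for managing data annotation quality under real-world constraints
Understand how feedback loop structure affects complex annotation tasks
Provide practical guidance for enterprise-scale annotation projects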

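Core Contributions

First systematic study of how different communication feedback loops affect data quality in a complex RAG annotation task
Key insight: annotators in a tight feedback loop create higher-quality data, while annotators in a loose feedback loop have the advantage in quantity and diversity
Practical strategies: concrete quality-management recommendations for data creation under real-world constraints
Evaluation framework: a combined assessment of annotator experience and data quality using automated metrics and a user survey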

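Method Details

Task Definition

Creating a multi-turn RAG conversation involves the following core steps:

Create a question: the annotator writes a question related to the corpus
Retrieve relevant passages: the system automatically retrieves relevant document passages
Review and annotate passages: the annotator judges passage relevance and re-queries if necessary
Edit the AI answer: the annotator revises the generator output to ensure accuracy and completeness
Add tags: the annotator attaches metadata tags to each conversation turn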
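
The five steps above can be sketched as a small data structure plus one function that processes a single turn. This is only an illustrative sketch, not the authors' annotation tool: the names `Passage`, `Turn`, and `annotate_turn`, the callback signatures, the toy corpus, and the `answerable` tag are all assumptions made for the example. The annotator's decisions (relevance judgment, answer editing) are modeled as plain callbacks, and the optional re-query in step 3 is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Passage:
    doc_id: str
    text: str
    relevant: bool = False  # filled in by the annotator in step 3


@dataclass
class Turn:
    question: str
    passages: List[Passage] = field(default_factory=list)
    answer: str = ""
    tags: List[str] = field(default_factory=list)


def annotate_turn(
    question: str,
    retrieve: Callable[[str], Sequence[Tuple[str, str]]],
    judge: Callable[[str, Passage], bool],
    generate: Callable[[str, List[Passage]], str],
    edit: Callable[[str, List[Passage]], str],
    tags: Sequence[str],
) -> Turn:
    """Run one conversation turn through the five annotation steps."""
    turn = Turn(question=question)  # step 1: the annotator's question
    # Step 2: the system retrieves candidate passages as (doc_id, text) pairs.
    turn.passages = [Passage(doc_id, text) for doc_id, text in retrieve(question)]
    # Step 3: the annotator marks each passage as relevant or not.
    for passage in turn.passages:
        passage.relevant = judge(question, passage)
    relevant = [p for p in turn.passages if p.relevant]
    # Step 4: the generator drafts an answer, which the annotator then edits.
    draft = generate(question, relevant)
    turn.answer = edit(draft, relevant)
    # Step 5: the annotator attaches metadata tags to the turn.
    turn.tags = list(tags)
    return turn


# Toy stand-ins for the retriever, the generator, and the human annotator.
CORPUS = [
    ("d1", "RAG grounds answers in retrieved passages."),
    ("d2", "Bananas are rich in potassium."),
]


def toy_retrieve(question: str) -> Sequence[Tuple[str, str]]:
    return CORPUS


def toy_judge(question: str, passage: Passage) -> bool:
    return "RAG" in passage.text


def toy_generate(question: str, passages: List[Passage]) -> str:
    return " ".join(p.text for p in passages)


def toy_edit(draft: str, passages: List[Passage]) -> str:
    return draft.strip()


turn = annotate_turn(
    "What does RAG do?",
    toy_retrieve, toy_judge, toy_generate, toy_edit,
    tags=["answerable"],  # hypothetical tag name
)
```

A multi-turn conversation would then simply be a list of such `Turn` objects. Keeping the human decisions as callbacks makes it easy to see which steps depend on annotator judgment and which are automated.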

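Experimental Design

Annotator Groups

Internal annotators (7 people): in the same organization as the research team, direct-communication feedback loop, paid by the hour
External annotators (40 people): recruited through an external annotation service, indirect-communication feedback loop, paid per conversation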

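Differences in Communication Structure

Dimension	Internal annotators	External annotators
Communication channel	Direct (email, Slack, video meetings)	Indirect (through an intermediary)
Feedback frequency	Real-time, individualized	Batched, delayed
Training materials	Slides plus direct guidance	Comprehensive video tutorial
Payment	By the hour	Per accepted conversation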