Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Li, Fu, Wang et al.

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

Basic Information

Paper ID: 2510.07414
Title: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
Affiliations: Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore
Categories: cs.CL, cs.AI, cs.IR
Published: October 2025 (Preprint)
Paper link: https://arxiv.org/abs/2510.07414

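Summary

Modern long-context large language models perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but these tests overlook how noisy contexts arise from biased retrieval and agentic workflows. The paper proposes haystack engineering: constructing noisy long contexts that faithfully capture key real-world factors, namely distraction from heterogeneous biased retrievers and cascading errors in agentic workflows, in order to test models' long-context robustness. The authors instantiate the idea in HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. Experiments show that even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures or struggle to stop early in agentic tests.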

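Research Background and Motivation

Core Problem

Existing long-context evaluation benchmarks exhibit a significant gap between simulation and reality:

Limitations of static synthetic benchmarks: traditional NIAH tests use query-agnostic distractors, whereas long contexts in real applications are built by retrieval strategies such as RAG and therefore depend on the retriever.
Neglect of retrieval heterogeneity: different retrieval strategies (sparse, dense, hybrid, graph-based) introduce different kinds of distractors, but existing benchmarks do not account for how this heterogeneity affects model performance.
Lack of dynamic agentic evaluation: existing benchmarks are static, single-round, and LLM-independent, so they cannot assess cascading errors in agentic context engineering.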
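
The contrast between query-agnostic and retriever-dependent distractors can be illustrated with a minimal sketch. It is not the HaystackCraft pipeline: the function names `overlap_score` and `build_haystack` are hypothetical, the lexical-overlap scorer is only a toy stand-in for a real sparse retriever such as BM25, and whitespace-separated words are used as a rough proxy for tokens.

```python
import random


def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of distinct query terms that occur in doc.

    Assumption: a stand-in for a real retriever (e.g. BM25), for illustration only.
    """
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)


def build_haystack(query, needles, corpus, token_budget, ranked=True, seed=0):
    """Return the needles plus as many distractors as fit in token_budget.

    ranked=True  -> distractors are taken in retriever order (query-dependent).
    ranked=False -> distractors are shuffled at random (query-agnostic,
                    as in a classic synthetic NIAH setup).
    """
    candidates = [d for d in corpus if d not in needles]
    if ranked:
        # Python's sort is stable, so ties keep their corpus order.
        candidates.sort(key=lambda d: overlap_score(query, d), reverse=True)
    else:
        random.Random(seed).shuffle(candidates)

    haystack = list(needles)
    used = sum(len(d.split()) for d in haystack)  # whitespace words as token proxy
    for doc in candidates:
        cost = len(doc.split())
        if used + cost > token_budget:
            break  # stop at the first distractor that no longer fits
        haystack.append(doc)
        used += cost
    return haystack
```

With `ranked=True`, the distractors that enter the context first are the ones most similar to the query, which is the kind of near-miss document a real retriever would surface and a random sampler usually would not.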

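Research Motivation

The authors argue that "haystack engineering" is needed to construct realistic noisy long contexts that faithfully simulate the complexity and failure modes of real applications. This contrasts with "context engineering": context engineering seeks optimal conditions, whereas haystack engineering emphasizes faithful haystack construction.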

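Core Contributions

Proposes the concept of haystack engineering: the first systematic study of how retrieval strategies affect long-context evaluation, reformulating the NIAH problem from a RAG perspective.
Builds the HaystackCraft benchmark:
- Based on the full English Wikipedia hyperlink network (6,954,909 articles, 97,442,472 hyperlinks)
- Includes multi-hop question-answering tasks and supports evaluation of heterogeneous retrieval strategies
- The first dynamic, multi-round, LLM-dependent NIAH test environment
Comprehensive heterogeneous retrieval evaluation: systematically evaluates how sparse (BM25), dense (Qwen3-Embedding), hybrid, and graph-based (PPR) retrieval strategies affect distractor composition and model performance.
Reveals agentic long-context challenges: dynamic NIAH tests show that even advanced models are prone to cascading failures in agentic workflows, and that models are more robust to "width" (long context) than to "depth" (reasoning iterations).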
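
To make graph-based reranking concrete, here is a minimal sketch that assumes PPR refers to Personalized PageRank and that first-stage retriever scores are used as the restart distribution. The function names, the restart probability `alpha=0.15`, the iteration count, and the toy graph in the usage note are all illustrative assumptions, not the paper's configuration.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iters=50):
    """Power-iteration Personalized PageRank.

    graph: dict mapping node -> list of out-linked nodes (hyperlinks).
    seeds: dict mapping node -> non-negative weight (e.g. retriever scores).
    alpha: restart probability (assumed value, for illustration).
    """
    nodes = set(graph) | {v for outs in graph.values() for v in outs} | set(seeds)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                # Spread the walker's mass evenly over out-links.
                share = (1 - alpha) * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for n in nodes:
                    nxt[n] += (1 - alpha) * rank[u] * restart[n]
        rank = nxt
    return rank


def rerank_with_ppr(graph, retriever_scores, alpha=0.15):
    """Reorder first-stage candidates by their PPR score, highest first."""
    ppr = personalized_pagerank(graph, retriever_scores, alpha=alpha)
    return sorted(retriever_scores, key=lambda n: ppr[n], reverse=True)
```

As a usage example, take a toy graph where "Marie Curie" links to "Warsaw" and "Pierre Curie", "Warsaw" and "Poland" link to each other, and "Banana" has no links. A candidate like "Banana" that scores well in the first stage but is disconnected from the other candidates drops to the bottom after reranking, while "Warsaw" rises because mass flows to it along hyperlinks.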