Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Li, Fu, Wang et al.
Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

Basic Information

  • Paper ID: 2510.07414
  • Title: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
  • Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
  • Institutions: Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore
  • Classification: cs.CL, cs.AI, cs.IR
  • Publication Date: October 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.07414

Abstract

Modern long-context large language models perform well on synthetic "Needle in a Haystack" (NIAH) benchmarks, yet these tests overlook how noisy contexts arise from biased retrieval and agentic workflows. This paper introduces the concept of haystack engineering for constructing realistic noisy long contexts that faithfully capture critical real-world factors—interference from heterogeneous biased retrievers and cascading errors in agentic workflows—to test models' long-context robustness. The authors implement this concept through HaystackCraft, a novel NIAH benchmark constructed on the complete English Wikipedia hyperlink network and multi-hop questions. Experimental results demonstrate that even advanced models like Gemini 2.5 Pro and GPT-5 suffer from cascading failures or struggle with early stopping in agentic tests.

Research Background and Motivation

Core Problems

Existing long-context evaluation benchmarks exhibit significant gaps between simulation and reality:

  1. Limitations of Static Synthetic Benchmarks: Traditional NIAH tests use query-irrelevant distractors, whereas real-world long contexts are constructed through retrieval strategies like RAG, exhibiting retriever-dependent characteristics.
  2. Neglect of Retrieval Heterogeneity: Different retrieval strategies (sparse, dense, hybrid, graph-based) introduce different types of distractors, yet existing benchmarks fail to account for this heterogeneity's impact on model performance.
  3. Absence of Dynamic Agentic Evaluation: Existing benchmarks are static, single-turn, and LLM-agnostic, unable to evaluate cascading error problems in agentic context engineering.

Research Motivation

The authors argue for "haystack engineering": constructing realistic noisy long contexts that faithfully reproduce real-world complexity and failure patterns. Whereas "context engineering" seeks optimal conditions for a model, haystack engineering emphasizes faithful construction of the noisy haystack used to test it.

Core Contributions

  1. Proposes Haystack Engineering Concept: Systematically investigates retrieval strategy impacts on long-context evaluation for the first time, reformulating NIAH problems from a RAG perspective.
  2. Constructs HaystackCraft Benchmark:
    • Based on complete English Wikipedia hyperlink network (6,954,909 articles, 97,442,472 hyperlinks)
    • Includes multi-hop QA tasks supporting heterogeneous retrieval strategy evaluation
    • First dynamic, multi-turn, LLM-dependent NIAH testing environment
  3. Comprehensive Heterogeneous Retrieval Evaluation: Systematically evaluates sparse (BM25), dense (Qwen3-Embedding), hybrid, and graph-based (PPR) retrieval strategies' effects on distractor composition and model performance.
  4. Reveals Agentic Long-Context Challenges: Through dynamic NIAH testing, discovers that even advanced models suffer cascading failures in agentic workflows, with models showing greater robustness to "width" (long context) than "depth" (reasoning iterations).

Methodology Details

Task Definition

Reformulates the NIAH problem from a RAG perspective:

  • Given document corpus D and query q
  • True supporting document set N_q ⊂ D (needles)
  • Retrieval strategy R scores and ranks all documents in D
  • Constructs haystack H^R_q(S): contains all needle documents and top-ranked distractors, totaling S tokens
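
The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_haystack`, the precomputed `token_len` map, and the choice to stop at the first distractor that does not fit (so the haystack stays at or under S tokens) are all assumptions made for the example.

```python
def build_haystack(ranked_doc_ids, needles, token_len, budget):
    """Pick all needles plus top-ranked distractors within a token budget.

    ranked_doc_ids: every document id in D, best retrieval score first.
    needles: ids of the supporting documents N_q (always included).
    token_len: dict mapping document id -> token count (assumed precomputed).
    budget: target haystack size S in tokens.
    """
    chosen = set(needles)
    used = sum(token_len[d] for d in needles)
    if used > budget:
        raise ValueError("needles alone exceed the token budget")

    # Fill the remaining budget with the highest-ranked non-needle documents.
    for doc_id in ranked_doc_ids:
        if doc_id in chosen:
            continue
        if used + token_len[doc_id] > budget:
            break  # assumption: stop at the first distractor that does not fit
        chosen.add(doc_id)
        used += token_len[doc_id]

    # Return the selection in retriever order; needles missing from the
    # ranking (should not happen if R scores all of D) go last.
    ordered = [d for d in ranked_doc_ids if d in chosen]
    ordered += [d for d in needles if d not in ranked_doc_ids]
    return ordered
```

Because the needles are reserved first, a stronger retriever changes only which distractors fill the remaining budget, which is the variable the benchmark sets out to study.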

Static NIAH Evaluation

Heterogeneous Retrieval Strategies

  1. Sparse Retrieval (BM25): Classical lexical similarity-based method
  2. Dense Retrieval (Qwen3-Embedding-0.6B): Captures semantic similarity
  3. Hybrid Retrieval: Combines sparse and dense retrieval using Reciprocal Rank Fusion (RRF)
  4. Graph-based Reranking: Integrates structural information using Personalized PageRank (PPR)
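
As an illustration of the hybrid strategy, Reciprocal Rank Fusion can be sketched as below. The constant `k=60` is the commonly used default for RRF and is an assumption here; the summary does not state which value the paper uses, and the function name is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is a common default (an assumption here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_ranking = ["a", "b", "c"]  # e.g. a BM25 ranking (toy data)
dense_ranking = ["b", "c", "a"]   # e.g. an embedding-based ranking (toy data)
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF uses only ranks, not raw scores, so it needs no score normalization between the lexical and embedding retrievers.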

Haystack Ordering Strategies

  • Retriever Ordering: Ranked by retrieval scores (realistic RAG setting)
  • Random Ordering: Randomly shuffled (diagnostic for position bias)
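
The two orderings can be expressed as a small helper. This is a sketch under assumptions: the name `order_haystack`, the `scores` dictionary, and placing the highest-scored document first in the prompt are illustrative choices, not details confirmed by the summary.

```python
import random


def order_haystack(doc_ids, scores, mode, seed=0):
    """Order haystack documents before they are concatenated into a prompt.

    mode="retriever": highest retrieval score first (assumed direction).
    mode="random": seeded shuffle, useful as a position-bias diagnostic.
    """
    if mode == "retriever":
        return sorted(doc_ids, key=lambda d: scores[d], reverse=True)
    if mode == "random":
        shuffled = list(doc_ids)
        random.Random(seed).shuffle(shuffled)  # local RNG, reproducible
        return shuffled
    raise ValueError(f"unknown ordering mode: {mode}")
```

Running the same haystack through both modes isolates how much of a model's score comes from document position rather than document content.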

Dynamic NIAH Evaluation

Agentic Operation Modeling

Extends static NIAH to support multi-turn interactions:

  • Query Refinement: Optimizes queries based on retrieval results
  • Self-Reflection: Summarizes previous analyses
  • Stopping Decision: Determines when to terminate reasoning

Two Dynamic Settings

  1. Forced Multi-turn: Fixed reasoning iterations, tests cascading error robustness
  2. Variable Turns: Model autonomously decides stopping point, tests early stopping capability
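
The two settings differ only in whether the model's stop signal is honored. The loop below is a schematic sketch: `build_context` and `llm_step` are hypothetical callables standing in for haystack construction and a model call, and the dictionary keys are invented for the example.

```python
def run_dynamic_niah(question, build_context, llm_step, max_rounds, forced):
    """Run a multi-round NIAH episode.

    build_context(query) -> str: builds a haystack for the current query.
    llm_step(question, query, notes, context) -> dict with the hypothetical
        keys "answer", "refined_query", "summary", and "stop".
    forced=True ignores "stop" and always runs max_rounds (forced multi-turn);
    forced=False lets the model end the episode early (variable turns).
    """
    query, notes, answer = question, "", None
    rounds_used = 0
    for _ in range(max_rounds):
        context = build_context(query)           # haystack depends on the query
        step = llm_step(question, query, notes, context)
        answer = step["answer"]
        query = step["refined_query"]            # query refinement
        notes = step["summary"]                  # self-reflection carried forward
        rounds_used += 1
        if not forced and step["stop"]:          # stopping decision
            break
    return answer, rounds_used
```

Since the refined query and the summary feed the next round, an early mistake can reshape every later haystack, which is the cascading-error path the forced setting is designed to expose.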

Technical Innovations

  1. Retriever-Distractor Composition Mapping: First systematic study of how different retrieval methods shape distractor characteristics
  2. Graph Structure Utilization: Models multi-hop QA as "needle subgraph" identification problem
  3. Dynamic Context Engineering: Novel evaluation paradigm where LLM serves as both reasoner and distractor source
  4. Width vs. Depth Analysis: Distinguishes effects of long-context "width" and reasoning "depth"

Experimental Setup

Dataset

  • Corpus: English Wikipedia dump from 2025-04-04, using complete articles as retrieval units
  • QA Dataset:
    • Natural Questions (NQ): Single-hop questions
    • MuSiQue: Multi-hop questions (up to 4 supporting documents)
    • Manually filtered to 500 high-quality samples

Model Coverage

Evaluates 15 long-context LLMs:

  • Reasoning Models: Qwen3 series, Gemini 2.5 Flash-Lite, o4-mini
  • General Models: GPT-4.1 mini, Llama-3.1 series, Qwen2.5-1M, Gemma 3 series
  • Top-tier Models: Gemini 2.5 Pro, GPT-5 (dynamic testing)

Evaluation Metrics

  • Retrieval Performance: Recall@N, NDCG@N
  • QA Performance: F1 score
  • Context Sizes: 8K, 16K, 32K, 64K, 128K tokens
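
For reference, the metrics listed above can be computed as follows. This is a simplified sketch: relevance is treated as binary (a document is a needle or not), and the F1 shown is a plain whitespace-token F1 without the answer normalization that QA evaluations often apply. Both simplifications are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def recall_at_n(ranked, needles, n):
    """Fraction of needle documents that appear in the top n results."""
    return len(set(ranked[:n]) & set(needles)) / len(needles)


def ndcg_at_n(ranked, needles, n):
    """NDCG@n with binary relevance (needle = 1, anything else = 0)."""
    needles = set(needles)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:n]) if d in needles)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(n, len(needles))))
    return dcg / ideal if ideal else 0.0


def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Recall@N measures whether needles are retrieved at all, while NDCG@N also rewards ranking them near the top, which matters when the haystack follows retriever order.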

Implementation Details

  • Uses Qwen2.5-1M tokenizer for unified token counting
  • PPR hyperparameters optimized via grid search
  • vLLM employed for inference acceleration
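
To make the PPR step concrete, the sketch below runs Personalized PageRank by power iteration over a hyperlink graph. Several choices are assumptions for illustration: the restart distribution is seeded by retrieval weights in `seed_weights`, the damping factor `alpha=0.85` and `iters=50` are conventional values rather than the grid-searched ones, and how PPR scores are combined with the original retriever scores is not specified in this summary.

```python
def personalized_pagerank(out_links, seed_weights, alpha=0.85, iters=50):
    """Personalized PageRank via power iteration.

    out_links: dict mapping node -> list of nodes it links to.
    seed_weights: dict mapping node -> positive weight; defines where the
        random walk restarts (assumed to come from retrieval scores).
    alpha: probability of following a link instead of restarting.
    """
    nodes = set(out_links) | set(seed_weights)
    for targets in out_links.values():
        nodes.update(targets)

    total = sum(seed_weights.values())
    restart = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)

    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = alpha * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += alpha * rank[u] * restart[n]
        rank = nxt
    return rank
```

Pages linked from several highly scored seeds accumulate probability mass, so a supporting article that matches the query text poorly can still be promoted through the hyperlink structure.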

Experimental Results

Main Findings

1. Retrieval Strategy Significantly Impacts Haystack Difficulty

  • Dense Retrieval More Challenging: In 11/12 cases, dense retrievers introduce more difficult distractors than sparse retrievers
  • Hybrid Retrieval Not Necessarily Harder: Despite better retrieval performance, doesn't necessarily introduce more challenging distractors
  • Graph-based Reranking Dual Benefits: Simultaneously improves retrieval effectiveness and mitigates harmful distractors, with NIAH performance gains up to 44%

2. Model-Dependent Effects of Haystack Ordering

  • Highly Model-Dependent: Models respond very differently to retriever ordering
  • Significant Benefits for Some Models: Gemma-3 and Qwen2.5-1M series gain significant and increasing benefits from retriever ordering
  • Evaluation Necessity: Requires simultaneous evaluation of retriever ordering and random ordering for comprehensive model behavior understanding

3. Dynamic NIAH Reveals Agentic Vulnerability

Forced Multi-turn Results:

  • All models (including GPT-5, Gemini 2.5 Pro) susceptible to cascading errors
  • Performance deteriorates with additional iterations, often amplifying early errors
  • Static NIAH performance cannot predict multi-turn robustness

Variable Turns Results:

  • No model reliably improves on its own single-turn performance
  • GPT-5 performs relatively best but fails to convert multi-turn reasoning into sustained improvement
  • Models universally lack effective early stopping mechanisms

Specific Numerical Results

Retrieval Performance (Recall@160)

  • BM25: 58.73% → BM25+PPR: 66.58% (+7.85 percentage points)
  • Qwen3-Embedding-0.6B: 61.43% → +PPR: 74.28% (+12.85 percentage points)
  • Hybrid: 67.2% → +PPR: 76.55% (+9.35 percentage points)

NIAH Performance Example (128K context, Hybrid+PPR)

  • Llama-3.1-70B: 25.11% → 36.22% (+11.11 percentage points, about +44% relative improvement)
  • GPT-4.1 mini: 58.27% → 62.09%
  • Gemini 2.5 Flash-Lite: 62.78% → 66.07%
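
The "up to 44%" figure is a relative gain, not a difference in percentage points. A short check on the three rows above makes the distinction explicit; the helper name `gains` is only for this example.

```python
def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 2), round(relative, 2)


llama = gains(25.11, 36.22)       # Llama-3.1-70B
gpt41_mini = gains(58.27, 62.09)  # GPT-4.1 mini
flash_lite = gains(62.78, 66.07)  # Gemini 2.5 Flash-Lite
```

For Llama-3.1-70B this gives 11.11 points and 44.25% relative. The two stronger models gain 3.82 and 3.29 points, roughly 6.6% and 5.2% relative, so the largest relative benefit in this example goes to the model with the lowest baseline.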

Failure Mode Analysis

Case studies identify three primary failure modes:

  1. Cascading Error Propagation: Early errors amplified through query refinement and summarization
  2. Query Intent Deviation: Alters the nature or form of original questions
  3. Persistent Long-Context Challenges: Difficulty locating relevant information persists even in multi-turn settings

Related Work

Long-Context Benchmarks

  • Classical NIAH: Kamradt (2023)'s single-needle test
  • Extended Versions: LV-Eval, RULER, BABILong and others extending question types and corpora
  • HELMET: The first to use dense retrieval for distractor construction, but it does not consider retriever heterogeneity
  • Limitations: All existing benchmarks use static, LLM-agnostic contexts

Multi-turn Benchmarks

  • Dialogue Evaluation: MT-bench and subsequent work focus on multi-turn dialogue
  • Agentic Benchmarks: AgentBench and others introduce multi-turn agentic tasks
  • Distinction: Existing work doesn't investigate joint long-context challenges of "width" and "depth"

Conclusions and Discussion

Main Conclusions

  1. Retrieval Strategy Crucial: Different retrieval methods significantly impact long-context evaluation difficulty and realism
  2. Graph Structure Effective: PPR reranking simultaneously improves retrieval effectiveness and model performance
  3. Agentic Challenges Unresolved: Even state-of-the-art models remain vulnerable in dynamic long-context reasoning
  4. Width vs. Depth: Models demonstrate greater robustness to long-context "width" than reasoning "depth"

Limitations

  1. Corpus Constraints: Based solely on English Wikipedia, potentially limiting generalizability
  2. QA Task Focus: Primarily addresses QA tasks with limited coverage of other long-context applications
  3. Retrieval Strategy Selection: While covering main categories, doesn't exhaust all possible retrieval methods
  4. Dynamic Setting Simplification: Agentic operation modeling relatively simple, may not fully reflect complex agentic systems

Future Directions

  1. Corpus Expansion: Support multilingual and multi-domain evaluation
  2. More Complex Agents: Integrate tool usage, external knowledge base access, etc.
  3. Adaptive Strategies: Develop retrieval strategies that dynamically adjust based on context
  4. Theoretical Analysis: Deeper understanding of why certain retrieval strategies introduce more difficult distractors

In-Depth Evaluation

Strengths

  1. Precise Problem Identification: Accurately identifies key deficiencies in existing long-context evaluation
  2. Methodological Innovation: Haystack engineering concept fills important evaluation gap
  3. Comprehensive Experimental Design: Covers 15 models, multiple retrieval strategies, static and dynamic settings
  4. High Practical Value: Provides realistic evaluation for long-context challenges in actual RAG systems
  5. Deep Insights: Reveals fundamental challenges in agentic long-context reasoning

Weaknesses

  1. High Computational Cost: Large-scale Wikipedia corpus and multi-model evaluation require substantial computational resources
  2. Data Contamination Risk: Despite mitigation measures, the Wikipedia-based corpus may still overlap with model training data
  3. Simplified Agent Modeling: Dynamic NIAH may not fully capture complex agentic behaviors
  4. Limited Retriever Selection: Could consider more recent retrieval methods

Impact

  1. Academic Contribution: Establishes new standards and methodology for long-context evaluation
  2. Practical Guidance: Provides important insights for RAG system optimization
  3. Tool Value: HaystackCraft could become an important evaluation tool
  4. Research Inspiration: Opens new research directions in agentic long-context reasoning

Applicable Scenarios

  1. RAG System Evaluation: Assesses different retrieval strategies' impact on long-context performance
  2. Model Selection: Guides selection of appropriate long-context models for specific applications
  3. Agent Development: Evaluates and improves agentic long-context reasoning capabilities
  4. Benchmark Development: Provides methodology for other researchers to construct realistic long-context benchmarks

References

The paper cites extensive related work, primarily including:

  • Long-context models and evaluation benchmark research
  • Retrieval-Augmented Generation (RAG) system studies
  • Multi-turn dialogue and agentic evaluation benchmarks
  • Graph neural networks and information retrieval methods

Overall Assessment: This is a high-quality research paper that identifies important gaps in long-context evaluation, proposes an innovative remedy, and supports its methodology with comprehensive experiments. The HaystackCraft benchmark is likely to influence how long-context LLMs are evaluated and improved.