Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Li, Fu, Wang et al.

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

Basic Information

Paper ID: 2510.07414
Title: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
Institutions: Georgia Institute of Technology, Meta AI, University of Illinois Urbana-Champaign, National University of Singapore
Classification: cs.CL, cs.AI, cs.IR
Publication Date: October 2025 (Preprint)
Paper Link: https://arxiv.org/abs/2510.07414

Abstract

Modern long-context large language models perform well on synthetic "Needle in a Haystack" (NIAH) benchmarks, yet these tests overlook how noisy contexts arise from biased retrieval and agentic workflows. This paper introduces the concept of haystack engineering for constructing realistic noisy long contexts that faithfully capture critical real-world factors—interference from heterogeneous biased retrievers and cascading errors in agentic workflows—to test models' long-context robustness. The authors implement this concept through HaystackCraft, a novel NIAH benchmark constructed on the complete English Wikipedia hyperlink network and multi-hop questions. Experimental results demonstrate that even advanced models like Gemini 2.5 Pro and GPT-5 suffer from cascading failures or struggle with early stopping in agentic tests.

Research Background and Motivation

Core Problems

Existing long-context evaluation benchmarks exhibit significant gaps between simulation and reality:

Limitations of Static Synthetic Benchmarks: Traditional NIAH tests use query-irrelevant distractors, whereas real-world long contexts are constructed through retrieval strategies like RAG, exhibiting retriever-dependent characteristics.
Neglect of Retrieval Heterogeneity: Different retrieval strategies (sparse, dense, hybrid, graph-based) introduce different types of distractors, yet existing benchmarks fail to account for this heterogeneity's impact on model performance.
Absence of Dynamic Agentic Evaluation: Existing benchmarks are static, single-turn, and LLM-agnostic, unable to evaluate cascading error problems in agentic context engineering.

Research Motivation

The authors argue for "haystack engineering": constructing realistic noisy long contexts that faithfully reproduce real-world complexity and failure patterns. This contrasts with "context engineering," which seeks the optimal context for a model; haystack engineering instead aims for a faithful reconstruction of the noisy contexts that models actually encounter.

Core Contributions

Proposes Haystack Engineering Concept: Systematically investigates retrieval strategy impacts on long-context evaluation for the first time, reformulating NIAH problems from a RAG perspective.
Constructs HaystackCraft Benchmark:
- Based on complete English Wikipedia hyperlink network (6,954,909 articles, 97,442,472 hyperlinks)
- Includes multi-hop QA tasks supporting heterogeneous retrieval strategy evaluation
- First dynamic, multi-turn, LLM-dependent NIAH testing environment
Comprehensive Heterogeneous Retrieval Evaluation: Systematically evaluates sparse (BM25), dense (Qwen3-Embedding), hybrid, and graph-based (PPR) retrieval strategies' effects on distractor composition and model performance.
Reveals Agentic Long-Context Challenges: Through dynamic NIAH testing, discovers that even advanced models suffer cascading failures in agentic workflows, with models showing greater robustness to "width" (long context) than "depth" (reasoning iterations).

Methodology Details

Task Definition

Reformulates NIAH problem from RAG perspective:

Given document corpus D and query q
True supporting document set N_q ⊂ D (needles)
Retrieval strategy R scores and ranks all documents in D
Haystack H^R_q(S): all needle documents plus the distractors ranked highest by R, filled up to a total budget of S tokens
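
The construction of H^R_q(S) described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_haystack`, the precomputed `doc_tokens` map, and the greedy rule of stopping at the first distractor that does not fit are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set


def build_haystack(
    ranked_ids: Sequence[str],
    needle_ids: Set[str],
    doc_tokens: Dict[str, int],
    budget: int,
) -> List[str]:
    """Return needles plus top-ranked distractors that fit in `budget` tokens.

    `ranked_ids` is the retriever's ranking over the corpus, best first.
    """
    used = sum(doc_tokens[d] for d in needle_ids)
    if used > budget:
        raise ValueError("needle documents alone exceed the token budget")

    chosen = set(needle_ids)
    for doc_id in ranked_ids:
        if doc_id in chosen:
            continue  # needles are already included
        cost = doc_tokens[doc_id]
        if used + cost > budget:
            break  # assumption: stop at the first distractor that does not fit
        chosen.add(doc_id)
        used += cost

    # Keep the retriever's order; needles absent from the ranking go last.
    ordered = [d for d in ranked_ids if d in chosen]
    ordered += sorted(needle_ids - set(ordered))
    return ordered
```

Because needles are always included first, a weak retriever changes which distractors surround the needles, not whether the needles are present.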

Static NIAH Evaluation

Heterogeneous Retrieval Strategies

Sparse Retrieval (BM25): Classical lexical similarity-based method
Dense Retrieval (Qwen3-Embedding-0.6B): Captures semantic similarity
Hybrid Retrieval: Combines sparse and dense retrieval using Reciprocal Rank Fusion (RRF)
Graph-based Reranking: Integrates structural information using Personalized PageRank (PPR)
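
As an illustration of the hybrid strategy, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the input rankings. The sketch below assumes the conventional constant k = 60 and alphabetical tie-breaking; the summary does not state which settings the paper uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse several rankings (best first) into one list using RRF.

    k = 60 is a commonly used default, assumed here for illustration.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_ranking = ["a", "b", "c"]  # e.g. from BM25
dense_ranking = ["b", "c", "a"]   # e.g. from an embedding model
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF uses only rank positions, so the sparse and dense scores do not need to be on comparable scales.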

Haystack Ordering Strategies

Retriever Ordering: Ranked by retrieval scores (realistic RAG setting)
Random Ordering: Randomly shuffled (diagnostic for position bias)
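
The two ordering strategies differ only in whether the retriever's ranking is preserved. A minimal sketch, in which the function name `order_haystack` and the fixed seed are illustrative choices rather than details from the paper:

```python
import random
from typing import List, Sequence


def order_haystack(
    doc_ids: Sequence[str], mode: str = "retriever", seed: int = 0
) -> List[str]:
    """Order haystack documents by retriever rank or by a seeded shuffle."""
    docs = list(doc_ids)  # assumed to arrive in retriever order, best first
    if mode == "retriever":
        return docs
    if mode == "random":
        random.Random(seed).shuffle(docs)  # seeded so runs are repeatable
        return docs
    raise ValueError(f"unknown ordering mode: {mode}")
```

Comparing a model's score under both modes separates the effect of distractor content from the effect of document position.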

Dynamic NIAH Evaluation

Agentic Operation Modeling

Extends static NIAH to support multi-turn interactions:

Query Refinement: Optimizes queries based on retrieval results
Self-Reflection: Summarizes previous analyses
Stopping Decision: Determines when to terminate reasoning

Two Dynamic Settings

Forced Multi-turn: Fixed reasoning iterations, tests cascading error robustness
Variable Turns: Model autonomously decides stopping point, tests early stopping capability
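
The two dynamic settings can be expressed as one loop with a flag that controls whether the model's stop signal is honored. The sketch below is a hypothetical skeleton: the `Turn` record, the `Retriever` and `Reasoner` callables, and `run_dynamic_niah` are names invented for this example, and the real benchmark's prompts and interfaces are not shown in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    answer: str          # the model's current answer
    refined_query: str   # query to use for the next retrieval round
    summary: str         # self-reflection carried into the next turn
    stop: bool           # the model's own decision to terminate


# Hypothetical interfaces: query -> haystack, and
# (question, query, haystack, previous summary) -> Turn.
Retriever = Callable[[str], List[str]]
Reasoner = Callable[[str, str, List[str], str], Turn]


def run_dynamic_niah(
    question: str,
    retrieve: Retriever,
    reason: Reasoner,
    max_turns: int,
    allow_early_stop: bool,
) -> Tuple[str, int]:
    """Return (final answer, turns used).

    allow_early_stop=False models the forced multi-turn setting;
    allow_early_stop=True models the variable-turns setting.
    """
    query, summary, answer = question, "", ""
    for turn_idx in range(1, max_turns + 1):
        haystack = retrieve(query)  # the haystack depends on the refined query
        turn = reason(question, query, haystack, summary)
        answer, query, summary = turn.answer, turn.refined_query, turn.summary
        if allow_early_stop and turn.stop:
            return answer, turn_idx
    return answer, max_turns
```

Because each turn's query and summary come from the model itself, an early mistake changes every later haystack, which is the mechanism behind the cascading errors discussed in the results.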

Technical Innovations

Retriever-Distractor Composition Mapping: First systematic study of how different retrieval methods shape distractor characteristics
Graph Structure Utilization: Models multi-hop QA as "needle subgraph" identification problem
Dynamic Context Engineering: Novel evaluation paradigm where LLM serves as both reasoner and distractor source
Width vs. Depth Analysis: Distinguishes effects of long-context "width" and reasoning "depth"
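
To make the graph-based component concrete, the following sketch runs Personalized PageRank by power iteration over a hyperlink graph, with the teleport distribution concentrated on seed documents (for instance, the top results of a first-stage retriever). The damping factor, the iteration count, the handling of dangling nodes, and the way seeds are weighted are assumptions for illustration; the paper tunes its PPR hyperparameters by grid search and its exact formulation is not given in this summary.

```python
from typing import Dict, List


def personalized_pagerank(
    graph: Dict[str, List[str]],
    seeds: Dict[str, float],
    alpha: float = 0.85,
    iters: int = 50,
) -> Dict[str, float]:
    """Personalized PageRank by power iteration.

    graph maps each node to its outgoing links; seeds maps seed nodes to
    non-negative weights that define the teleport distribution.
    """
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("seed weights must have positive total mass")

    nodes = set(graph) | {v for outs in graph.values() for v in outs} | set(seeds)
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)

    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = alpha * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the teleport distribution.
                for n in nodes:
                    nxt[n] += alpha * rank[u] * teleport[n]
        rank = nxt
    return rank
```

Documents reachable from the seeds through hyperlinks receive score even when they share little text with the query, which is one plausible reason this kind of reranking helps on multi-hop questions.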

Experimental Setup

Dataset

Corpus: English Wikipedia dump from 2025-04-04, using complete articles as retrieval units
QA Dataset:
- Natural Questions (NQ): Single-hop questions
- MuSiQue: Multi-hop questions (up to 4 supporting documents)
- Manually filtered to 500 high-quality samples

Model Coverage

Evaluates 15 long-context LLMs:

Reasoning Models: Qwen3 series, Gemini 2.5 Flash-Lite, o4-mini
General Models: GPT-4.1 mini, Llama-3.1 series, Qwen2.5-1M, Gemma 3 series
Top-tier Models: Gemini 2.5 Pro, GPT-5 (dynamic testing)

Evaluation Metrics

Retrieval Performance: Recall@N, NDCG@N
QA Performance: F1 score
Context Sizes: 8K, 16K, 32K, 64K, 128K tokens
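
For reference, the three metrics can be computed as follows. This is a sketch under common conventions (binary relevance for NDCG, whitespace tokenization and lowercasing for F1); the paper's exact answer normalization is not specified in this summary, so treat those details as assumptions.

```python
import math
from collections import Counter
from typing import Sequence, Set


def recall_at_n(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of relevant documents found in the top n results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:n]) & relevant) / len(relevant)


def ndcg_at_n(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """NDCG@n with binary relevance (1 for a needle, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:n])
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), n)))
    return dcg / ideal if ideal else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Recall@N measures whether the needles are retrieved at all, while NDCG@N also rewards placing them near the top of the ranking.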

Implementation Details

Uses Qwen2.5-1M tokenizer for unified token counting
PPR hyperparameters optimized via grid search
vLLM employed for inference acceleration

Experimental Results

Main Findings

1. Retrieval Strategy Significantly Impacts Haystack Difficulty

Dense Retrieval More Challenging: In 11/12 cases, dense retrievers introduce more difficult distractors than sparse retrievers
Hybrid Retrieval Not Necessarily Harder: Despite better retrieval performance, doesn't necessarily introduce more challenging distractors
Graph-based Reranking Dual Benefits: Simultaneously improves retrieval effectiveness and mitigates harmful distractors, with relative NIAH performance gains of up to 44%

2. Model-Dependent Effects of Haystack Ordering

Highly Model-Dependent: Models differ widely in how they respond to retriever ordering
Significant Benefits for Some Models: The Gemma-3 and Qwen2.5-1M series gain significant benefits from retriever ordering, and the benefits increase with context size
Both Orderings Needed: Evaluating under both retriever ordering and random ordering is necessary for a comprehensive picture of model behavior

3. Dynamic NIAH Reveals Agentic Vulnerability

Forced Multi-turn Results:

All models (including GPT-5 and Gemini 2.5 Pro) are susceptible to cascading errors
Performance deteriorates with additional iterations, as later turns often amplify early errors
Static NIAH performance cannot predict multi-turn robustness

Variable Turns Results:

No model reliably improves on its own single-turn performance
GPT-5 performs relatively best but fails to convert multi-turn reasoning into sustained improvement
Models universally lack effective early stopping mechanisms

Specific Numerical Results

Retrieval Performance (Recall@160)

BM25: 58.73% → BM25+PPR: 66.58% (+7.85 percentage points)
Qwen3-Embedding-0.6B: 61.43% → Qwen3-Embedding-0.6B+PPR: 74.28% (+12.85 percentage points)
Hybrid: 67.2% → Hybrid+PPR: 76.55% (+9.35 percentage points)

NIAH Performance Example (128K context, Hybrid+PPR)

Llama-3.1-70B: 25.11% → 36.22% (+11.11 percentage points, a 44% relative improvement)
GPT-4.1 mini: 58.27% → 62.09%
Gemini 2.5 Flash-Lite: 62.78% → 66.07%
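
The gains listed for retrieval are differences in percentage points, while the 44% figure for Llama-3.1-70B is a relative change. The short helper below, whose name `gains` is chosen only for this example, computes both from the scores reported above so the two kinds of figures can be compared directly.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Scores taken from the NIAH example above (128K context).
reported = {
    "Llama-3.1-70B": (25.11, 36.22),
    "GPT-4.1 mini": (58.27, 62.09),
    "Gemini 2.5 Flash-Lite": (62.78, 66.07),
}

for model, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{model}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

The relative gain is largest for the model with the lowest starting score, so comparisons across models are easier to read in percentage points.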

Failure Mode Analysis

Case studies identify three primary failure modes:

Cascading Error Propagation: Early errors amplified through query refinement and summarization
Query Intent Deviation: Alters the nature or form of original questions
Persistent Long-Context Challenges: Difficulty locating relevant information persists even in multi-turn settings

Related Work

Long-Context Benchmarks

Classical NIAH: The single-needle test introduced by Kamradt (2023)
Extended Versions: LV-Eval, RULER, BABILong, and others extend the question types and corpora
HELMET: The first to use dense retrieval for distractor construction, but it does not consider retriever heterogeneity
Limitations: All existing benchmarks use static, LLM-agnostic contexts

Multi-turn Benchmarks

Dialogue Evaluation: MT-bench and subsequent work focus on multi-turn dialogue
Agentic Benchmarks: AgentBench and others introduce multi-turn agentic tasks
Distinction: Existing work doesn't investigate joint long-context challenges of "width" and "depth"

Conclusions and Discussion

Main Conclusions

Retrieval Strategy Crucial: Different retrieval methods significantly impact long-context evaluation difficulty and realism
Graph Structure Effective: PPR reranking simultaneously improves retrieval effectiveness and model performance
Agentic Challenges Unresolved: Even state-of-the-art models remain vulnerable in dynamic long-context reasoning
Width vs. Depth: Models demonstrate greater robustness to long-context "width" than reasoning "depth"

Limitations

Corpus Constraints: Based solely on English Wikipedia, potentially limiting generalizability
QA Task Focus: Primarily addresses QA tasks with limited coverage of other long-context applications
Retrieval Strategy Selection: While covering main categories, doesn't exhaust all possible retrieval methods
Dynamic Setting Simplification: The modeling of agentic operations is relatively simple and may not fully reflect complex agentic systems

Future Directions

Corpus Expansion: Support multilingual and multi-domain evaluation
More Complex Agents: Integrate tool usage, external knowledge base access, etc.
Adaptive Strategies: Develop retrieval strategies that dynamically adjust based on context
Theoretical Analysis: Deeper understanding of why certain retrieval strategies introduce more difficult distractors

In-Depth Evaluation

Strengths

Precise Problem Identification: Accurately identifies key deficiencies in existing long-context evaluation
Methodological Innovation: Haystack engineering concept fills important evaluation gap
Comprehensive Experimental Design: Covers 15 models, multiple retrieval strategies, static and dynamic settings
High Practical Value: Provides realistic evaluation for long-context challenges in actual RAG systems
Deep Insights: Reveals fundamental challenges in agentic long-context reasoning

Weaknesses

High Computational Cost: Large-scale Wikipedia corpus and multi-model evaluation require substantial computational resources
Data Contamination Risk: Despite mitigation measures, a Wikipedia-based corpus may overlap with the evaluated models' training data
Simplified Agent Modeling: Dynamic NIAH may not fully capture complex agentic behaviors
Limited Retriever Selection: Could consider more recent retrieval methods

Impact

Academic Contribution: Establishes new standards and methodology for long-context evaluation
Practical Guidance: Provides important insights for RAG system optimization
Tool Value: HaystackCraft has the potential to become an important evaluation tool
Research Inspiration: Opens new research directions in agentic long-context reasoning

Applicable Scenarios

RAG System Evaluation: Assesses different retrieval strategies' impact on long-context performance
Model Selection: Guides selection of appropriate long-context models for specific applications
Agent Development: Evaluates and improves agentic long-context reasoning capabilities
Benchmark Development: Provides methodology for other researchers to construct realistic long-context benchmarks

References

The paper cites extensive related work, primarily including:

Long-context models and evaluation benchmark research
Retrieval-Augmented Generation (RAG) system studies
Multi-turn dialogue and agentic evaluation benchmarks
Graph neural networks and information retrieval methods

Overall Assessment: This is a high-quality research paper that accurately identifies important issues in long-context evaluation, proposes innovative solutions, and validates the effectiveness of its methodology through comprehensive experiments. The HaystackCraft benchmark is well positioned to influence how long-context LLMs are evaluated and improved.