Revisiting Feedback Models for HyDE

Jedidi, Lin
Recent approaches that leverage large language models (LLMs) for pseudo-relevance feedback (PRF) have generally not utilized well-established feedback models like Rocchio and RM3 when expanding queries for sparse retrievers like BM25. Instead, they often opt for a simple string concatenation of the query and LLM-generated expansion content. But is this optimal? To answer this question, we revisit and systematically evaluate traditional feedback models in the context of HyDE, a popular method that enriches query representations with LLM-generated hypothetical answer documents. Our experiments show that HyDE's effectiveness can be substantially improved when leveraging feedback algorithms such as Rocchio to extract and weight expansion terms, providing a simple way to further enhance the accuracy of LLM-based PRF methods.

Abstract

Recent methods utilizing Large Language Models (LLMs) for pseudo-relevance feedback (PRF) typically do not employ mature feedback models (such as Rocchio and RM3) to expand queries for sparse retrievers (e.g., BM25). Instead, they simply concatenate queries with LLM-generated expansion content as strings. This paper systematically revisits the application of traditional feedback models in HyDE, a popular method that uses LLMs to generate hypothetical answer documents to enrich query representations. Experiments demonstrate that leveraging feedback algorithms such as Rocchio to extract and weight expansion terms can significantly improve HyDE's effectiveness, providing a simple yet effective approach for enhancing LLM-based PRF methods.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is: Do current LLM-based query expansion methods (such as HyDE) adequately leverage mature feedback models from traditional information retrieval when updating BM25 query representations?

Problem Significance

  1. Limitations of HyDE: While HyDE effectively uses LLMs to generate hypothetical documents to bridge the vocabulary gap between queries and relevant documents, it adopts a simple string concatenation strategy when integrating generated content into BM25 retrieval.
  2. Neglected Traditional Methods: The information retrieval field has accumulated decades of research on pseudo-relevance feedback, including well-validated feedback models such as Rocchio and RM3. However, these methods have been marginalized in the LLM era.
  3. Unexplored Optimization Space: Although the feedback source has changed (from retrieved documents to LLM-generated documents), whether the feedback mechanism itself requires modification has not been systematically studied.

Limitations of Existing Methods

  1. Simple Concatenation Strategy: Methods like Query2Doc and MuGI directly concatenate queries with LLM-generated text, lacking filtering and weighting of expansion terms.
  2. Ignoring Two-Stage Framework: Traditional PRF comprises two critical stages, term selection and weight assignment, both of which current LLM methods bypass.
  3. Lack of Systematic Comparison: Existing research primarily focuses on improving LLM-generated expansion content, with less attention to better utilizing this content.

Research Motivation

The authors observe that traditional PRF and LLM feedback methods need differ only in the feedback source, yet in practice their query update mechanisms also differ substantially. This prompted the hypothesis: Traditional feedback models may be equally applicable to LLM-generated feedback content and may yield performance improvements.

Core Contributions

  1. First Systematic Evaluation: The first comprehensive comparison of traditional feedback models (Rocchio, RM3) with modern string concatenation methods in the context of LLM-generated feedback.
  2. Demonstrating the Value of Traditional Methods: Experiments show that applying traditional feedback algorithms such as Rocchio to HyDE improves average Recall@20 over the strongest string concatenation baseline by up to 1.4 points (4.2%), and by up to 2.2 points (6%) on low-resource tasks.
  3. Providing Practical Improvement Solutions: Offers a simple yet effective improvement to HyDE without modifying the LLM generation process, only changing the feedback integration mechanism.
  4. Open Source Implementation: Releases complete code implementation to facilitate community reproduction and further research.

Methodology Details

Task Definition

Input: User query $q$
Output: Updated query representation $q_{new}$ for BM25 retrieval
Objective: Improve query representation by integrating LLM-generated hypothetical answer documents to enhance retrieval recall

HyDE Base Process

  1. Given query $q$, prompt LLM to generate hypothetical answer documents
  2. Sample $n$ variants: $d = \{d_1, ..., d_n\}$
  3. Use these hypothetical documents to update query representation
  4. Perform BM25 retrieval with the updated query
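
The sampling step above can be sketched in a few lines. This is only an illustration: the prompt wording, the `hyde_documents` helper, and the `fake_llm` stand-in are hypothetical names introduced here, not the paper's code, and a real setup would call an actual LLM with sampling enabled.

```python
from typing import Callable, List


def hyde_documents(query: str, generate: Callable[[str], str], n: int = 8) -> List[str]:
    """Sample n hypothetical answer documents for a query."""
    # Hypothetical prompt wording, used only for illustration.
    prompt = f"Write a passage that answers the question.\nQuestion: {query}\nPassage:"
    return [generate(prompt) for _ in range(n)]


def fake_llm(prompt: str) -> str:
    """Stand-in generator so the sketch runs without an LLM."""
    return "bm25 is a ranking function used by search engines"


hypothetical_docs = hyde_documents("what is bm25", fake_llm, n=8)
print(len(hypothetical_docs))
```

The returned list plays the role of $d = \{d_1, ..., d_n\}$ in the later stages.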

Feedback Model Framework

The proposed framework comprises two core stages:

Stage 1: Term Selection (Section 2.1)

  1. Generate Term Frequency Vectors: Generate normalized term frequency vectors $f(d_i)$ for each hypothetical document $d_i$
  2. Filter Common Terms: Remove high-frequency terms appearing in more than 10% of corpus documents
  3. Ranking and Truncation:
    • Rank candidate expansion terms by sum of normalized term frequencies
    • Retain top-$k$ terms (this paper sets $k=128$)
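
A minimal sketch of the three term selection steps follows. It assumes whitespace tokenization and term frequencies normalized to sum to 1, neither of which is specified above, and it takes document frequencies from a plain dictionary (`doc_freq`, a hypothetical stand-in for index statistics).

```python
from collections import Counter
from typing import Dict, List


def normalized_tf(text: str) -> Dict[str, float]:
    """Term frequencies scaled to sum to 1 (assumed normalization)."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def select_terms(docs: List[str], doc_freq: Dict[str, int], num_docs: int,
                 k: int = 128, max_df_ratio: float = 0.10) -> List[str]:
    """Return the top-k candidate expansion terms across all feedback documents."""
    scores: Counter = Counter()
    for doc in docs:
        for term, weight in normalized_tf(doc).items():
            # Drop terms that occur in more than 10% of corpus documents.
            if doc_freq.get(term, 0) / num_docs > max_df_ratio:
                continue
            scores[term] += weight
    # Rank by summed normalized term frequency and truncate to k terms.
    return [term for term, _ in scores.most_common(k)]


docs = ["the cat sat", "the cat ran fast"]
doc_freq = {"the": 900, "cat": 5, "sat": 3, "ran": 2, "fast": 1}
selected = select_terms(docs, doc_freq, num_docs=1000, k=2)
print(selected)
```

In this toy corpus "the" occurs in 90% of documents and is filtered out, while "cat" ranks first because it appears in both feedback documents.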

Stage 2: Term Weighting (Section 2.2-2.3)

Method 1: Average Vector

This is an adaptation of HyDE's original method in bag-of-words space:

$$w_{t,q_{new}} = \frac{1}{n+1} \sum_{d_i \in d_{HyDE}} f(d_i)[t]$$

where $d_{HyDE} = \{q, d_1, ..., d_n\}$ (treating the query as an additional feedback document)

Characteristics:

  • Equal-weight averaging of query and feedback documents
  • Equivalent to string concatenation with term selection
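
As a rough illustration of the Average Vector formula, the sketch below averages sparse term vectors stored as dictionaries. The dictionary representation and the function name are assumptions made for this example.

```python
from typing import Dict, List


def average_vector(query_vec: Dict[str, float],
                   doc_vecs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the query vector and the n feedback vectors with equal weight 1/(n+1)."""
    vectors = [query_vec] + list(doc_vecs)  # the query counts as one more document
    weights: Dict[str, float] = {}
    for vec in vectors:
        for term, value in vec.items():
            weights[term] = weights.get(term, 0.0) + value / len(vectors)
    return weights


query_vec = {"bm25": 1.0}
doc_vecs = [{"bm25": 0.5, "ranking": 0.5}, {"ranking": 1.0}]
avg_weights = average_vector(query_vec, doc_vecs)
print(avg_weights)
```

With two feedback documents the query contributes only one third of the total mass, which is the dilution that the Rocchio variant is meant to counter.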

Method 2: Rocchio Algorithm

A classical vector space feedback model introducing parameters to control relative weights of query and feedback documents:

$$w_{t,q_{new}} = \alpha \cdot f(q)[t] + \frac{\beta}{n} \sum_{d_i \in d} f(d_i)[t]$$

Parameter Settings:

  • $\alpha = 1.0$: Query weight
  • $\beta = 0.75$: Feedback document weight
  • Enables differential weighting of query terms and expansion terms
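
The Rocchio update above translates almost directly into code. The sketch below assumes sparse vectors stored as dictionaries and uses the stated defaults $\alpha = 1.0$ and $\beta = 0.75$; the function name is illustrative.

```python
from typing import Dict, List


def rocchio(query_vec: Dict[str, float], doc_vecs: List[Dict[str, float]],
            alpha: float = 1.0, beta: float = 0.75) -> Dict[str, float]:
    """Compute alpha * f(q)[t] + (beta / n) * sum_i f(d_i)[t] for every term t."""
    n = len(doc_vecs)
    weights = {term: alpha * value for term, value in query_vec.items()}
    for vec in doc_vecs:  # loop body never runs when n == 0, so no division by zero
        for term, value in vec.items():
            weights[term] = weights.get(term, 0.0) + beta * value / n
    return weights


query_vec = {"bm25": 1.0}
doc_vecs = [{"bm25": 0.5, "ranking": 0.5}, {"ranking": 1.0}]
rocchio_weights = rocchio(query_vec, doc_vecs)
print(rocchio_weights)
```

On the same toy vectors where equal-weight averaging gives both terms 0.5, the query term "bm25" now receives 1.1875 and the expansion term "ranking" 0.5625.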

Method 3: RM3 (Relevance Model 3)

A language model-based feedback method estimating the observation probability of terms in relevant documents:

$$w_{t,q_{new}} = \lambda P(t|q) + (1-\lambda) \sum_{d_i \in d} P(t|d_i)$$

Parameter Settings:

  • $\lambda = 0.5$: Query-feedback interpolation weight
  • Based on probabilistic framework rather than vector space
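
The interpolation can be sketched as below, following the formula exactly as written here (an unweighted sum over feedback documents). Classical RM3 additionally weights each feedback document by its query likelihood; the optional `doc_weights` argument is a hypothetical hook for that and is not described in this summary.

```python
from typing import Dict, List, Optional


def rm3(query_model: Dict[str, float], doc_models: List[Dict[str, float]],
        lam: float = 0.5, doc_weights: Optional[List[float]] = None) -> Dict[str, float]:
    """Compute lam * P(t|q) + (1 - lam) * sum_i weight_i * P(t|d_i)."""
    if doc_weights is None:
        doc_weights = [1.0] * len(doc_models)  # matches the formula as written
    weights = {term: lam * p for term, p in query_model.items()}
    for doc_weight, model in zip(doc_weights, doc_models):
        for term, p in model.items():
            weights[term] = weights.get(term, 0.0) + (1.0 - lam) * doc_weight * p
    return weights


query_model = {"bm25": 1.0}
doc_models = [{"bm25": 0.5, "ranking": 0.5}, {"ranking": 1.0}]
rm3_weights = rm3(query_model, doc_models)
print(rm3_weights)
```

Passing `doc_weights=[1/n] * n` would turn the feedback part into a proper probability distribution; whether the evaluated implementation does this is not stated here.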

Comparison with Baseline Methods

String Concatenation Methods:

  1. Naive Concat: $q_{new} = \text{Concat}(q, d)$
    • Direct concatenation without processing
  2. Query2Doc: $q_{new} = \text{Concat}(q \times 5, d_1)$
    • Repeat query 5 times + single hypothetical document (128 tokens)
    • Total expansion terms approximately 128
  3. MuGI: Adaptive query repetition $r = \frac{\sum_{i=1}^n \text{len}(d_i)}{\text{len}(q) \cdot \phi}$, giving $q_{new} = \text{Concat}(q \times r, d)$
    • $\phi = 5$: Control parameter
    • Dynamically adjust query repetition based on document length
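
For comparison, the three concatenation baselines can be sketched as plain string operations. Two details are assumptions made for this example: lengths are counted in whitespace tokens, and the MuGI repetition count is floored with a minimum of 1. The summary above does not specify either.

```python
from typing import List


def naive_concat(query: str, docs: List[str]) -> str:
    """Query followed by all hypothetical documents, with no processing."""
    return " ".join([query] + docs)


def query2doc_concat(query: str, doc: str, repeats: int = 5) -> str:
    """Query repeated a fixed number of times, then one hypothetical document."""
    return " ".join([query] * repeats + [doc])


def mugi_concat(query: str, docs: List[str], phi: int = 5) -> str:
    """Query repeated r times, with r scaled by total document length."""
    total_len = sum(len(doc.split()) for doc in docs)  # assumed unit: tokens
    query_len = max(1, len(query.split()))             # guard against empty query
    r = max(1, total_len // (query_len * phi))         # assumed rounding: floor
    return " ".join([query] * r + docs)


query = "what is bm25"
docs = ["a b c d e f g h i j k l m n o", "p q r s t u v w x y z aa bb cc dd"]
print(mugi_concat(query, docs).split().count("bm25"))
```

Here the documents total 30 tokens and the query has 3, so $r = 30 / (3 \cdot 5) = 2$ and the query appears twice in the MuGI string.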

Technical Innovations

  1. Unified Framework: Places traditional PRF and LLM feedback methods within the same framework for comparison, revealing mechanistic differences.
  2. Value of Term Selection: Quantifies the contribution of noise filtering by comparing methods with/without term selection.
  3. Parameterized Weight Control: Rocchio's $\alpha$ and $\beta$ parameters provide more stable weight control than string repetition.
  4. Cross-Feedback-Source Evaluation: Simultaneously evaluates traditional BM25 document feedback and LLM-generated document feedback, demonstrating LLM feedback superiority.

Experimental Setup

Datasets

MS MARCO Dataset (5 Web search tasks):

  • MS MARCO v1: TREC DL19, TREC DL20
  • MS MARCO v2: TREC DL21, TREC DL22, TREC DL23

BEIR Dataset (9 low-resource retrieval tasks):

  • Biomedical IR: TREC-Covid, NFCorpus
  • News Retrieval: TREC-News, Robust04
  • Financial QA: FiQA
  • Entity Retrieval: DBPedia
  • Fact Verification: SciFact
  • Citation Prediction: SciDocs
  • Argument Retrieval: ArguAna

Dataset Characteristics:

  • MS MARCO: Resource-rich, relatively homogeneous queries
  • BEIR: Zero-shot evaluation, high query diversity, broad domain coverage

Evaluation Metrics

Recall@20: Proportion of a query's relevant documents that appear among the top-20 retrieved results

  • Suitable for evaluating first-stage retriever recall capability
  • Focuses on whether relevant documents can be retrieved, not ranking quality
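
A minimal sketch of the metric for a single query is given below. The function name and the toy document IDs are illustrative; standard evaluation tools also average this value over all queries.

```python
from typing import Iterable, List


def recall_at_k(ranked_ids: List[str], relevant_ids: Iterable[str], k: int = 20) -> float:
    """Fraction of the relevant documents that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen here for queries without judged relevant documents
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)


ranked = [f"d{i}" for i in range(30)]       # d0 is ranked first, d29 last
relevant = {"d0", "d5", "d25", "d99"}       # d25 falls outside the top 20, d99 is never retrieved
score = recall_at_k(ranked, relevant, k=20)
print(score)
```

The denominator is the number of relevant documents, not the cutoff $k$, which is what separates recall from precision at the same cutoff.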

Comparison Methods

Non-Expansion Baseline:

  • BM25 (without query expansion)

Traditional PRF (using BM25-retrieved documents):

  • BM25 + Average Vector
  • BM25 + RM3
  • BM25 + Rocchio

LLM Feedback Methods (using HyDE-generated documents):

  • Query2Doc
  • HyDE + Naive Concat
  • HyDE + MuGI Concat
  • HyDE + Average Vector
  • HyDE + RM3
  • HyDE + Rocchio

Implementation Details

LLM Configuration:

  • Models: Qwen2.5-7B-Instruct, Qwen3-14B, gpt-oss-20b
  • Sampling Quantity: $n=8$ hypothetical documents
  • Document Length: Maximum 512 tokens
  • Inference Framework: vLLM

Feedback Model Parameters:

  • Rocchio: $\alpha=1.0$, $\beta=0.75$
  • RM3: $\lambda=0.5$
  • Term Quantity: $k=128$ (aligned with Query2Doc)
  • Feedback Documents: 8 (matching HyDE sampling)

Retrieval System:

  • Implementation: Pyserini (based on Lucene)
  • BM25 Parameters: Default settings
  • Index Statistics: Obtained via IndexReader API
  • Custom Queries: Term weights set using QueryBuilder API

Experimental Results

Main Results (Table 1)

Overall Performance Comparison

Best Method: HyDE + Rocchio achieves the highest average Recall@20 with every LLM

  • Qwen2.5-7B: Average Recall@20 = 34.0 (all datasets)
  • Qwen3-14B: Average Recall@20 = 34.7
  • gpt-oss-20b: Average Recall@20 = 34.7

Improvement over Strongest String Concatenation Baseline (MuGI):

  • Qwen2.5-7B: +1.1 points (3.3% improvement)
  • Qwen3-14B: +1.3 points (3.9% improvement)
  • gpt-oss-20b: +1.4 points (4.2% improvement)
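
The relative improvements can be checked from the reported averages. In the sketch below, the MuGI averages are not quoted in this summary; they are inferred by subtracting the reported gain from the HyDE + Rocchio average, so treat them as derived values.

```python
# (HyDE + Rocchio average Recall@20, reported gain over MuGI in points)
reported = {
    "Qwen2.5-7B": (34.0, 1.1),
    "Qwen3-14B": (34.7, 1.3),
    "gpt-oss-20b": (34.7, 1.4),
}

relative_gain = {}
for model, (rocchio_avg, gain) in reported.items():
    mugi_avg = rocchio_avg - gain  # inferred baseline, not stated in the text
    relative_gain[model] = round(100 * gain / mugi_avg, 1)

print(relative_gain)
```

Dividing each gain by the inferred baseline reproduces the stated 3.3%, 3.9%, and 4.2%.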

Differential Performance: MS MARCO vs BEIR

MS MARCO Dataset:

  • String concatenation methods (MuGI, Query2Doc) are highly competitive
  • For example, with gpt-oss-20b, MuGI Concat outperforms HyDE + RM3 on all 5 MS MARCO datasets

BEIR Dataset (low-resource tasks):

  • Feedback models significantly outperform string concatenation
  • gpt-oss-20b + RM3:
    • Outperforms Query2Doc on all 9 BEIR datasets
    • Outperforms MuGI Concat on 8/9 datasets
  • Average Improvement (Rocchio vs MuGI):
    • Qwen2.5-7B: BEIR average +1.9 points
    • Qwen3-14B: BEIR average +1.9 points
    • gpt-oss-20b: BEIR average +2.2 points

Typical Cases:

  • SciFact (scientific fact verification):
    • gpt-oss-20b + Rocchio: 91.9
    • gpt-oss-20b + MuGI: 90.6
  • ArguAna (argument retrieval):
    • Qwen3-14B + Rocchio: 83.8
    • Qwen3-14B + MuGI: 76.4 (Rocchio leads by 7.4 points)

Ablation Studies and Key Findings

Finding 1: LLM Feedback Outperforms Traditional Document Feedback

Controlling feedback model, comparing feedback sources:

Using gpt-oss-20b as example (average across all datasets):

  • Average Vector: HyDE documents (32.5) vs BM25 documents (29.7) → +2.8 points
  • RM3: HyDE documents (33.2) vs BM25 documents (30.7) → +2.5 points
  • Rocchio: HyDE documents (34.7) vs BM25 documents (30.4) → +4.3 points

Conclusion: Under identical feedback mechanisms, LLM-generated hypothetical documents are more effective feedback sources than retrieved documents.

Interesting Observation:

  • RM3 outperforms Rocchio on BM25 documents (30.7 vs 30.4)
  • But Rocchio is superior on HyDE documents (34.7 vs 33.2)
  • Indicates that feedback source characteristics influence optimal feedback model selection

Finding 2: Critical Role of Term Selection

Comparing Average Vector vs Naive Concat:

  • Only difference: whether term selection and filtering are performed

Performance Gap (average across all datasets):

  • Qwen2.5-7B: 32.2 vs 29.3 → +3.0 points (10.2%)
  • Qwen3-14B: 32.5 vs 30.2 → +2.3 points (7.6%)
  • gpt-oss-20b: 32.5 vs 29.5 → +3.1 points (10.5%)

More Pronounced on BEIR Dataset:

  • Qwen2.5-7B BEIR: 36.6 vs 33.3 → +3.3 points

Conclusion: Filtering noisy terms (such as high-frequency words) is crucial for improving HyDE effectiveness.

Finding 3: Rocchio's Weight Control Advantage

Rocchio vs Average Vector:

  • Core difference: Rocchio assigns higher weights to query terms via $\alpha$ and $\beta$ parameters
  • Average Vector applies equal weights to all documents (including query)

Performance Comparison (average across all datasets):

  • Qwen2.5-7B: 34.0 vs 32.2 → +1.8 points
  • Qwen3-14B: 34.7 vs 32.5 → +2.2 points
  • gpt-oss-20b: 34.7 vs 32.5 → +2.2 points

Explanation:

  • HyDE's equal-weight averaging underestimates original query term importance
  • Rocchio's parameterized weights ($\alpha=1.0$, $\beta=0.75$) provide better balance
  • Compared to MuGI's adaptive repetition, Rocchio's linear parameter control is more stable

Finding 4: Method Robustness Differences

Traditional PRF (without LLM) Competitiveness on BEIR:

  • All-dataset average: BM25 + Rocchio (30.4) vs Query2Doc (32.7)
  • BEIR average: BM25 + Rocchio (36.2) vs Query2Doc (36.7)

Implications:

  • Feedback models themselves are more robust on diverse queries
  • Even without LLMs, Rocchio approaches LLM methods on low-resource tasks
  • Combining LLM with feedback models achieves best results

Cross-LLM Consistency

Consistent Trends Across All LLMs:

  1. Rocchio consistently optimal
  2. Term selection yields significant improvements
  3. Feedback model advantages more pronounced on BEIR

Impact of LLM Quality:

  • Larger LLMs yield slightly better absolute performance: HyDE + Rocchio averages 34.7 with both Qwen3-14B and gpt-oss-20b, versus 34.0 with Qwen2.5-7B
  • Relative advantages of feedback models remain stable across different LLMs

Related Work

Traditional Pseudo-Relevance Feedback

  1. Rocchio Algorithm [14]: Classical feedback method in vector space model, adjusting query vector toward relevant documents
  2. Relevance Model (RM3) [1, 12]: Language model-based feedback estimating term distribution in relevant documents
  3. Feedback Term Selection [3]: Research on selecting high-quality expansion terms from feedback documents

LLM Query Expansion

  1. HyDE [9]: Uses LLMs to generate hypothetical answer documents for zero-shot dense retrieval
  2. Query2Doc [16]: Generates single hypothetical document and repeats query 5 times
  3. MuGI [20]: Explores best practices for LLM query expansion, proposing adaptive query repetition

Relationship to This Paper

  • Inherits HyDE Concept: Utilizes LLM-generated hypothetical documents as feedback source
  • Bridges Traditional and Modern: Introduces Rocchio, RM3 and other traditional methods into LLM feedback scenarios
  • Addresses Missing Systematic Evaluation: First comprehensive comparison of traditional feedback models with string concatenation methods

Conclusions and Discussion

Main Conclusions

  1. Traditional Feedback Models Remain Effective: Classical methods such as Rocchio and RM3 remain applicable and powerful in the LLM era.
  2. Significant Performance Improvements:
    • Average improvement of 1.4 points (4.2%) over strongest string concatenation baseline
    • 2.2 points (6%) improvement on low-resource tasks
  3. Two Sources of Improvement:
    • Term Filtering: Removing noisy terms (high-frequency words, low-weight terms)
    • Weight Control: Stable control of query-feedback weights through parameters (rather than string repetition)
  4. Robustness Advantage: Feedback models demonstrate more stable performance on query-diverse BEIR datasets

Limitations

  1. Parameter Sensitivity Insufficiently Explored:
    • Uses default parameters from literature ($\alpha=1.0$, $\beta=0.75$, $\lambda=0.5$)
    • Lacks systematic study of parameter tuning potential
    • Different datasets may require different parameters
  2. Missing Computational Cost Analysis:
    • Feedback models require index statistics and term filtering
    • Additional overhead compared to simple string concatenation not quantified
  3. Limited LLM Selection:
    • Only 3 LLMs tested (Qwen series and gpt-oss)
    • Does not cover closed-source models like GPT-4, Claude
  4. Dense Retrieval Not Addressed:
    • Experiments focus only on BM25 sparse retrieval
    • Applicability to dense retrievers (e.g., ColBERT) unknown
  5. Interaction Effects Unexplored:
    • Interaction between feedback models and LLM prompting strategies
    • Impact of different sampling quantities ($n$)

Future Directions

  1. Adaptive Parameter Adjustment:
    • Borrow MuGI's adaptive approach, dynamically adjust Rocchio's $\alpha$ and $\beta$
    • Automatically select parameters based on query difficulty or document quality
  2. Hybrid Feedback Sources:
    • Combine LLM-generated and retrieved documents
    • Explore complementarity of two feedback sources
  3. Extension to Dense Retrieval:
    • Study application of feedback models in dense vector space
    • Design feedback mechanisms suitable for Transformer encoders
  4. End-to-End Optimization:
    • Jointly optimize LLM generation and feedback integration
    • Train feedback parameters via reinforcement learning
  5. Multi-Round Feedback:
    • Iteratively apply feedback models
    • Study convergence and stability

In-Depth Evaluation

Strengths

  1. Precise Problem Identification:
    • Identifies overlooked critical component in LLM query expansion research (feedback integration mechanism)
    • Poses simple but important question: "Is string concatenation optimal?"
  2. Rigorous Methodology:
    • Well-designed controlled variable experiments (comparing different models with same feedback source, comparing different feedback sources with same model)
    • Validates conclusion consistency across multiple LLMs
    • Covers 14 datasets including both high-resource and low-resource scenarios
  3. Comprehensive and Insightful Experiments:
    • Reports not only overall results but analyzes MS MARCO vs BEIR differences
    • Quantifies term selection contribution through Average Vector vs Naive Concat comparison
    • Compares traditional PRF and LLM feedback revealing importance of feedback source
  4. High Practical Value:
    • Simple implementation (no LLM modification needed)
    • Open source code promotes reproducibility
    • Provides plug-and-play performance improvement solution
  5. Clear Writing:
    • Clear logical structure (problem → method → experiments → conclusion)
    • Accurate technical detail descriptions
    • Well-designed tables facilitating comparison

Weaknesses

  1. Insufficient Theoretical Analysis:
    • Lacks deep theoretical explanation for "why Rocchio is more effective on HyDE"
    • Missing analysis from term distribution, information theory perspectives
    • Lacks theoretical guidance for parameter selection (e.g., $\alpha=1.0$, $\beta=0.75$)
  2. Missing Parameter Sensitivity Study:
    • Only uses literature default parameters, no parameter sweep
    • Unclear robustness of conclusions to parameter changes
    • No exploration of optimal parameter configurations for different datasets
  3. Computational Cost Not Discussed:
    • Feedback models require index statistics access (IDF, etc.)
    • Time overhead of term filtering and weight computation not quantified
    • Missing efficiency comparison with simple concatenation
  4. Insufficient Case Analysis:
    • No concrete examples of expansion terms
    • Lacks qualitative analysis of "which terms are retained/filtered"
    • Difficult to intuitively understand actual feedback model effects
  5. Limited Applicability Scope:
    • Only evaluates BM25 sparse retrieval
    • Applicability to neural retrievers (ColBERT, ANCE) unknown
    • Does not consider multilingual or cross-lingual scenarios
  6. Missing Statistical Significance Testing:
    • No confidence intervals or p-values reported
    • Unclear whether observed improvements are statistically significant

Impact

Contributions to the Field:

  1. Reactivates Classical Methods: Reminds community not to overlook traditional IR techniques
  2. Establishes Evaluation Benchmark: Provides comparison baselines for future LLM query expansion research
  3. Inspires Hybrid Methods: Encourages combining traditional and modern techniques

Practical Value:

  1. Immediately Applicable: Existing HyDE users can directly apply Rocchio improvements
  2. High Cost-Benefit: Achieves improvements without LLM retraining
  3. Industrial Applicability: BM25 widely used in industry, method easily deployable

Reproducibility:

  1. ✅ Open source code
  2. ✅ Public datasets
  3. ✅ Detailed hyperparameter specifications
  4. ✅ Based on mature tools (Pyserini, vLLM)

Potential Citation Value:

  • Expected to become important reference in LLM query expansion research
  • Provides strong baselines for evaluating new methods
  • May inspire more traditional-modern hybrid methods

Applicable Scenarios

Recommended Use Cases:

  1. Low-Resource Retrieval Tasks: BEIR-type diverse query scenarios
  2. BM25 Sparse Retrieval: First-stage retrieval or hybrid retrieval systems
  3. Computationally Constrained: Lower overhead than training neural retrievers
  4. Interpretability Required: Term weights visualizable and debuggable

Inapplicable Scenarios:

  1. Dense Retrieval Systems: Requires further research for adaptation
  2. Real-Time Retrieval: Index statistics access may increase latency
  3. Extremely Short Queries: Difficult to balance feedback weights with few query terms
  4. End-to-End Optimization Required: Feedback model parameters not jointly trained with LLM

Implementation Recommendations:

  1. Prioritize trying Rocchio ($\alpha=1.0$, $\beta=0.75$)
  2. Adjust parameters based on task characteristics (increase $\alpha$ when query importance high)
  3. Combine with term selection (filter high-frequency words, retain top-128 terms)
  4. Monitor performance across datasets, tune parameters as needed

Key References

[1] Abdul-Jaleel et al., 2004. UMass at TREC 2004: Novelty and HARD

  • Proposes RM3 feedback model

[9] Gao et al., 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels (ACL)

  • Original HyDE method

[14] Rocchio, 1971. Relevance Feedback in Information Retrieval

  • Classical Rocchio algorithm literature

[16] Wang et al., 2023. Query2doc: Query Expansion with Large Language Models (EMNLP)

  • Representative LLM query expansion work

[20] Zhang et al., 2024. Exploring the Best Practices of Query Expansion with Large Language Models (EMNLP)

  • MuGI method, exploring best practices for LLM query expansion

Summary

This is a high-quality IR research paper with clear problem orientation, simple yet effective methodology, and comprehensive rigorous experiments. The authors astutely identify an overlooked but important problem in LLM query expansion research, and through systematic experiments demonstrate the enduring value of traditional feedback models. The paper's main insight is: Technological progress should not come at the cost of abandoning classical methods; combining traditional and modern techniques often yields superior solutions.

While the paper has room for improvement in theoretical depth and parameter optimization, its strong practical value and excellent reproducibility suggest it will positively impact information retrieval research in the LLM era. For practitioners, this represents a low-cost, high-return improvement solution; for researchers, it opens a worthwhile direction for deeper exploration.