
Revisiting Feedback Models for HyDE

Jedidi, Lin

Recent approaches that leverage large language models (LLMs) for pseudo-relevance feedback (PRF) have generally not utilized well-established feedback models like Rocchio and RM3 when expanding queries for sparse retrievers like BM25. Instead, they often opt for a simple string concatenation of the query and LLM-generated expansion content. But is this optimal? To answer this question, we revisit and systematically evaluate traditional feedback models in the context of HyDE, a popular method that enriches query representations with LLM-generated hypothetical answer documents. Our experiments show that HyDE's effectiveness can be substantially improved when leveraging feedback algorithms such as Rocchio to extract and weight expansion terms, providing a simple way to further enhance the accuracy of LLM-based PRF methods.


Revisiting Feedback Models for HyDE

Basic Information

Paper ID: 2511.19349
Title: Revisiting Feedback Models for HyDE
Authors: Nour Jedidi, Jimmy Lin (University of Waterloo)
Category: cs.IR (Information Retrieval)
Submission Date: November 24, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.19349
Open Source Code: https://github.com/nourj98/hyde-feedback

Abstract

Recent methods utilizing Large Language Models (LLMs) for pseudo-relevance feedback (PRF) typically do not employ mature feedback models (such as Rocchio and RM3) to expand queries for sparse retrievers (e.g., BM25). Instead, they simply concatenate queries with LLM-generated expansion content as strings. This paper systematically revisits the application of traditional feedback models in HyDE, a popular method that uses LLMs to generate hypothetical answer documents to enrich query representations. Experiments demonstrate that leveraging feedback algorithms such as Rocchio to extract and weight expansion terms can significantly improve HyDE's effectiveness, providing a simple yet effective approach for enhancing LLM-based PRF methods.

Research Background and Motivation

Problem Definition

The core problem addressed in this paper is: Do current LLM-based query expansion methods (such as HyDE) adequately leverage mature feedback models from traditional information retrieval when updating BM25 query representations?

Problem Significance

Limitations of HyDE: While HyDE effectively uses LLMs to generate hypothetical documents to bridge the vocabulary gap between queries and relevant documents, it adopts a simple string concatenation strategy when integrating generated content into BM25 retrieval.
Neglected Traditional Methods: The information retrieval field has accumulated decades of research on pseudo-relevance feedback, including well-validated feedback models such as Rocchio and RM3. However, these methods have been marginalized in the LLM era.
Unexplored Optimization Space: Although the feedback source has changed (from retrieved documents to LLM-generated documents), whether the feedback mechanism itself requires modification has not been systematically studied.

Limitations of Existing Methods

Simple Concatenation Strategy: Methods like Query2Doc and MuGI directly concatenate queries with LLM-generated text, lacking filtering and weighting of expansion terms.
Ignoring Two-Stage Framework: Traditional PRF comprises two critical stages—term selection and weight assignment—which current LLM methods bypass.
Lack of Systematic Comparison: Existing research primarily focuses on improving LLM-generated expansion content, with less attention to better utilizing this content.

Research Motivation

The authors observe that traditional PRF and LLM-based feedback methods need differ only in the feedback source (retrieved documents versus LLM-generated documents), yet in practice their query update mechanisms also differ substantially. This motivates the hypothesis that traditional feedback models are equally applicable to LLM-generated feedback content and may improve effectiveness.

Core Contributions

First Systematic Evaluation: The first comprehensive comparison of traditional feedback models (Rocchio, RM3) with modern string concatenation methods in the context of LLM-generated feedback.
Demonstrating the Value of Traditional Methods: Experiments show that applying traditional feedback algorithms such as Rocchio to HyDE significantly improves retrieval effectiveness, with an average improvement of 1.4 points (4.2%), and 2.2 points (6%) on low-resource tasks.
Providing Practical Improvement Solutions: Offers a simple yet effective improvement to HyDE without modifying the LLM generation process, only changing the feedback integration mechanism.
Open Source Implementation: Releases complete code implementation to facilitate community reproduction and further research.

Methodology Details

Task Definition

Input: User query $q$
Output: Updated query representation $q_{new}$ for BM25 retrieval
Objective: Improve query representation by integrating LLM-generated hypothetical answer documents to enhance retrieval recall

HyDE Base Process

Given a query $q$, prompt the LLM to generate hypothetical answer documents
Sample $n$ variants: $d = \{d_1, ..., d_n\}$
Use these hypothetical documents to update query representation
Perform BM25 retrieval with the updated query

Feedback Model Framework

The proposed framework comprises two core stages:

Stage 1: Term Selection (Section 2.1)

Generate Term Frequency Vectors: Generate normalized term frequency vectors $f(d_i)$ for each hypothetical document $d_i$
Filter Common Terms: Remove high-frequency terms appearing in more than 10% of corpus documents
Ranking and Truncation:
- Rank candidate expansion terms by sum of normalized term frequencies
- Retain the top-$k$ terms (the paper sets $k=128$)
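
The selection stage above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `doc_freq` and `num_docs` inputs (assumed to come from index statistics), and the toy data are all hypothetical.

```python
from collections import Counter


def normalized_tf(tokens):
    """Map each term to its frequency divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def select_terms(docs_tokens, doc_freq, num_docs, k=128, max_df_ratio=0.10):
    """Choose up to k expansion terms from tokenized feedback documents.

    doc_freq (term -> number of corpus documents containing it) and num_docs
    (corpus size) are assumed to come from index statistics.
    """
    scores = Counter()
    for tokens in docs_tokens:
        for term, weight in normalized_tf(tokens).items():
            # Skip terms that appear in more than max_df_ratio of the corpus.
            if doc_freq.get(term, 0) / num_docs > max_df_ratio:
                continue
            scores[term] += weight
    # Rank by summed normalized term frequency and keep the top k.
    return [term for term, _ in scores.most_common(k)]


# Toy data, invented for illustration only.
docs = [["the", "insulin", "regulates", "glucose"], ["the", "insulin", "hormone"]]
doc_freq = {"the": 900, "insulin": 12, "regulates": 40, "glucose": 15, "hormone": 30}
selected = select_terms(docs, doc_freq, num_docs=1000, k=2)
print(selected)  # ['insulin', 'hormone']
```

In this toy corpus "the" occurs in 90% of documents, so it is dropped before ranking, and "insulin" ranks first because it appears in both hypothetical documents.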

Stage 2: Term Weighting (Section 2.2-2.3)

Method 1: Average Vector

This is an adaptation of HyDE's original method in bag-of-words space:

$w_{t,q_{new}} = \frac{1}{n+1} \sum_{d_i \in d_{HyDE}} f(d_i)[t]$

where $d_{HyDE} = \{q, d_1, ..., d_n\}$ (treating the query as an additional feedback document)

Characteristics:

Equal-weight averaging of query and feedback documents
Equivalent to string concatenation with term selection
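
A minimal sketch of the averaging formula, assuming the feedback documents are already tokenized. The `terms` argument (which terms receive a weight) is a hypothetical input; the summary does not say whether query terms are always retained, so here the caller decides.

```python
from collections import Counter


def normalized_tf(tokens):
    """Map each term to its frequency divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def average_vector(query_tokens, docs_tokens, terms):
    """Weight each term by its mean normalized frequency over {q, d_1, ..., d_n}."""
    # The query is treated as one more feedback document.
    vectors = [normalized_tf(query_tokens)] + [normalized_tf(d) for d in docs_tokens]
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}


# Toy data, invented for illustration only.
query = ["insulin", "function"]
docs = [["the", "insulin", "regulates", "glucose"], ["the", "insulin", "hormone"]]
weights = average_vector(query, docs, terms=["insulin", "function", "hormone"])
print(weights)
```

Because the query counts as just one of $n+1$ vectors, a term that appears only in the query ("function") ends up with a small weight relative to terms repeated across the hypothetical documents.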

Method 2: Rocchio Algorithm

A classical vector space feedback model introducing parameters to control relative weights of query and feedback documents:

$w_{t,q_{new}} = \alpha \cdot f(q)[t] + \frac{\beta}{n} \sum_{d_i \in d} f(d_i)[t]$

Parameter Settings:

$\alpha = 1.0$ : Query weight
$\beta = 0.75$ : Feedback document weight
Enables differential weighting of query terms and expansion terms
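
The Rocchio update can be sketched as below. This is an illustrative version that takes precomputed normalized term-frequency vectors as plain dictionaries; the function name, the `terms` argument, and the toy vectors are assumptions, not the paper's code.

```python
def rocchio(query_vec, doc_vecs, terms, alpha=1.0, beta=0.75):
    """Combine a query vector with the centroid of feedback document vectors.

    query_vec and each entry of doc_vecs map term -> normalized term frequency.
    """
    n = len(doc_vecs)
    weights = {}
    for term in terms:
        # Mean weight of the term over the n feedback documents.
        centroid = sum(vec.get(term, 0.0) for vec in doc_vecs) / n if n else 0.0
        weights[term] = alpha * query_vec.get(term, 0.0) + beta * centroid
    return weights


# Toy normalized term-frequency vectors, invented for illustration only.
query_vec = {"insulin": 0.5, "function": 0.5}
doc_vecs = [
    {"insulin": 0.25, "regulates": 0.25, "glucose": 0.5},
    {"insulin": 0.5, "hormone": 0.5},
]
weights = rocchio(query_vec, doc_vecs, terms=["insulin", "function", "hormone"])
print(weights)  # {'insulin': 0.78125, 'function': 0.5, 'hormone': 0.1875}
```

Note that the query-only term "function" keeps its full weight of 0.5, while the expansion-only term "hormone" is scaled down by $\beta$.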

Method 3: RM3 (Relevance Model 3)

A language model-based feedback method estimating the observation probability of terms in relevant documents:

$w_{t,q_{new}} = \lambda P(t|q) + (1-\lambda) \sum_{d_i \in d} P(t|d_i)$

Parameter Settings:

$\lambda = 0.5$ : Query-feedback interpolation weight
Based on probabilistic framework rather than vector space
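
A hedged sketch of the interpolation follows. The formula as summarized above leaves the weighting of individual feedback documents implicit; this sketch assumes uniform document weights ($1/n$ each) so that the feedback part is itself a probability distribution. That choice, along with the function name and toy language models, is an assumption made here and may differ from the paper's implementation.

```python
def rm3(query_lm, doc_lms, lam=0.5, doc_weights=None):
    """Interpolate a query language model with a feedback relevance model.

    query_lm and each entry of doc_lms map term -> probability. doc_weights
    defaults to uniform (1/n per document), which is an assumption made here.
    """
    n = len(doc_lms)
    if doc_weights is None:
        doc_weights = [1.0 / n] * n if n else []
    # Relevance model: weighted mixture of the feedback document models.
    feedback = {}
    for weight, lm in zip(doc_weights, doc_lms):
        for term, prob in lm.items():
            feedback[term] = feedback.get(term, 0.0) + weight * prob
    vocab = set(query_lm) | set(feedback)
    return {t: lam * query_lm.get(t, 0.0) + (1 - lam) * feedback.get(t, 0.0) for t in vocab}


# Toy maximum-likelihood language models, invented for illustration only.
query_lm = {"insulin": 0.5, "function": 0.5}
doc_lms = [
    {"insulin": 0.25, "regulates": 0.25, "glucose": 0.5},
    {"insulin": 0.5, "hormone": 0.5},
]
weights = rm3(query_lm, doc_lms)
print(sorted(weights.items()))
```

With uniform document weights and no term truncation, the interpolated weights sum to 1, which keeps the result interpretable as a distribution over terms.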

Comparison with Baseline Methods

String Concatenation Methods:

Naive Concat: $q_{new} = \text{Concat}(q, d)$
- Direct concatenation without processing
Query2Doc: $q_{new} = \text{Concat}(q \times 5, d_1)$
- Repeat query 5 times + single hypothetical document (128 tokens)
- Total expansion terms approximately 128
MuGI: Adaptive query repetition with $r = \frac{\sum_{i=1}^n \text{len}(d_i)}{\text{len}(q) \cdot \phi}$ and $q_{new} = \text{Concat}(q \times r, d)$
- $\phi = 5$: Control parameter
- Dynamically adjust query repetition based on document length
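
The three concatenation baselines can be sketched as plain string operations. Several details here are assumptions made for illustration: `len` is measured in whitespace tokens, $r$ is rounded down to an integer, and at least one copy of the query is always kept. The original methods may measure length and round differently.

```python
def naive_concat(query, docs):
    """Concat(q, d): the query followed by every hypothetical document."""
    return " ".join([query] + docs)


def query2doc_concat(query, docs, repeats=5):
    """Concat(q x 5, d_1): repeated query plus the first hypothetical document."""
    return " ".join([query] * repeats + [docs[0]])


def mugi_concat(query, docs, phi=5):
    """Concat(q x r, d) with r = sum(len(d_i)) / (len(q) * phi)."""
    total_len = sum(len(d.split()) for d in docs)
    query_len = max(1, len(query.split()))
    # Floor division and the minimum of 1 are assumptions of this sketch.
    r = max(1, total_len // (query_len * phi))
    return " ".join([query] * r + docs)


# Toy data, invented for illustration only (each document has 10 tokens).
query = "insulin function"
docs = [
    "insulin is a hormone that lowers blood glucose levels quickly",
    "the pancreas releases insulin after meals to regulate blood sugar",
]
print(mugi_concat(query, docs))
```

Here the documents total 20 tokens and the query has 2, so $r = 20 / (2 \cdot 5) = 2$ and the query is repeated twice. Repetition is the only lever these methods have over the query-to-expansion weight ratio, and it moves in integer steps.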

Technical Innovations

Unified Framework: Places traditional PRF and LLM feedback methods within the same framework for comparison, revealing mechanistic differences.
Value of Term Selection: Quantifies the contribution of noise filtering by comparing methods with/without term selection.
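Parameterized Weight Control: Rocchio's $\alpha$ and $\beta$ parameters give explicit, continuous control over the query-to-feedback weight ratio, whereas string repetition can only adjust it in whole-number multiples of the query.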
Cross-Feedback-Source Evaluation: Simultaneously evaluates traditional BM25 document feedback and LLM-generated document feedback, demonstrating LLM feedback superiority.

Experimental Setup

Datasets

MS MARCO Dataset (5 Web search tasks):

MS MARCO v1: TREC DL19, TREC DL20
MS MARCO v2: TREC DL21, TREC DL22, TREC DL23

BEIR Dataset (9 low-resource retrieval tasks):

Biomedical IR: TREC-Covid, NFCorpus
News Retrieval: TREC-News, Robust04
Financial QA: FiQA
Entity Retrieval: DBPedia
Fact Verification: SciFact
Citation Prediction: SciDocs
Argument Retrieval: ArguAna

Dataset Characteristics:

MS MARCO: Resource-rich, relatively homogeneous queries
BEIR: Zero-shot evaluation, high query diversity, broad domain coverage

Evaluation Metrics

Recall@20: Proportion of a query's relevant documents that appear among the top-20 retrieved results

Suitable for evaluating first-stage retriever recall capability
Focuses on whether relevant documents can be retrieved, not ranking quality
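
For reference, a minimal sketch of the metric for a single query. The function name and the toy document IDs are hypothetical; in practice a standard evaluation tool would be used and the per-query values averaged over all queries.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=20):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    # Use a set so a document ID repeated in the ranking is counted once.
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & relevant) / len(relevant)


# Toy ranking and judgments, invented for illustration only.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5
```

The denominator is the number of relevant documents, not $k$, which is what distinguishes recall from precision at the same cutoff.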

Comparison Methods

Non-Expansion Baseline:

BM25 (without query expansion)

Traditional PRF (using BM25-retrieved documents):

BM25 + Average Vector
BM25 + RM3
BM25 + Rocchio

LLM Feedback Methods (using HyDE-generated documents):

Query2Doc
HyDE + Naive Concat
HyDE + MuGI Concat
HyDE + Average Vector
HyDE + RM3
HyDE + Rocchio

Implementation Details

LLM Configuration:

Models: Qwen2.5-7B-Instruct, Qwen3-14B, gpt-oss-20b
Sampling Quantity: $n=8$ hypothetical documents
Document Length: Maximum 512 tokens
Inference Framework: vLLM

Feedback Model Parameters:

Rocchio: $\alpha=1.0$ , $\beta=0.75$
RM3: $\lambda=0.5$
Term Quantity: $k=128$ (aligned with Query2Doc)
Feedback Documents: 8 (matching HyDE sampling)

Retrieval System:

Implementation: Pyserini (based on Lucene)
BM25 Parameters: Default settings
Index Statistics: Obtained via IndexReader API
Custom Queries: Term weights set using QueryBuilder API

Experimental Results

Main Results (Table 1)

Overall Performance Comparison

Best Method: HyDE + Rocchio achieves the highest average Recall@20 for all three LLMs

Qwen2.5-7B: Average Recall@20 = 34.0 (all datasets)
Qwen3-14B: Average Recall@20 = 34.7
gpt-oss-20b: Average Recall@20 = 34.7

Improvement over Strongest String Concatenation Baseline (MuGI):

Qwen2.5-7B: +1.1 points (3.3% improvement)
Qwen3-14B: +1.3 points (3.9% improvement)
gpt-oss-20b: +1.4 points (4.2% improvement)
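
As a consistency check on the figures above, each percentage is the point gain divided by the MuGI average. The MuGI averages are not quoted in this summary; they are inferred here as the Rocchio average minus the reported gain, so treat them as rounded assumptions:

$$
\begin{aligned}
\text{Qwen2.5-7B:}\quad & \frac{1.1}{34.0 - 1.1} = \frac{1.1}{32.9} \approx 3.3\% \\
\text{Qwen3-14B:}\quad & \frac{1.3}{34.7 - 1.3} = \frac{1.3}{33.4} \approx 3.9\% \\
\text{gpt-oss-20b:}\quad & \frac{1.4}{34.7 - 1.4} = \frac{1.4}{33.3} \approx 4.2\%
\end{aligned}
$$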

Differential Performance: MS MARCO vs BEIR

MS MARCO Dataset:

String concatenation methods (MuGI, Query2Doc) are highly competitive
For example, with gpt-oss-20b, HyDE + MuGI Concat outperforms HyDE + RM3 on all 5 MS MARCO datasets

BEIR Dataset (low-resource tasks):

Feedback models significantly outperform string concatenation
gpt-oss-20b + RM3:
- Outperforms Query2Doc on all 9 BEIR datasets
- Outperforms MuGI Concat on 8/9 datasets
Average Improvement (Rocchio vs MuGI):
- Qwen2.5-7B: BEIR average +1.9 points
- Qwen3-14B: BEIR average +1.9 points
- gpt-oss-20b: BEIR average +2.2 points

Typical Cases:

SciFact (scientific fact verification):
- gpt-oss-20b + Rocchio: 91.9
- gpt-oss-20b + MuGI: 90.6
ArguAna (argument retrieval):
- Qwen3-14B + Rocchio: 83.8
- Qwen3-14B + MuGI: 76.4 (+7.4 points)

Ablation Studies and Key Findings

Finding 1: LLM Feedback Outperforms Traditional Document Feedback

Holding the feedback model fixed and varying the feedback source:

Using gpt-oss-20b as example (average across all datasets):

Average Vector: HyDE documents (32.5) vs BM25 documents (29.7) → +2.8 points
RM3: HyDE documents (33.2) vs BM25 documents (30.7) → +2.5 points
Rocchio: HyDE documents (34.7) vs BM25 documents (30.4) → +4.3 points

Conclusion: Under identical feedback mechanisms, LLM-generated hypothetical documents are more effective feedback sources than retrieved documents.

Interesting Observation:

RM3 outperforms Rocchio on BM25 documents (30.7 vs 30.4)
But Rocchio is superior on HyDE documents (34.7 vs 33.2)
Indicates that feedback source characteristics influence optimal feedback model selection

Finding 2: Critical Role of Term Selection

Comparing Average Vector vs Naive Concat:

Only difference: whether term selection and filtering are performed

Performance Gap (average across all datasets):

Qwen2.5-7B: 32.2 vs 29.3 → +3.0 points (10.2%)
Qwen3-14B: 32.5 vs 30.2 → +2.3 points (7.6%)
gpt-oss-20b: 32.5 vs 29.5 → +3.1 points (10.5%)

More Pronounced on BEIR Dataset:

Qwen2.5-7B BEIR: 36.6 vs 33.3 → +3.3 points

Conclusion: Filtering noisy terms (such as high-frequency words) is crucial for improving HyDE effectiveness.

Finding 3: Rocchio's Weight Control Advantage

Rocchio vs Average Vector:

Core difference: Rocchio assigns higher weights to query terms via $\alpha$ and $\beta$ parameters
Average Vector applies equal weights to all documents (including query)

Performance Comparison (average across all datasets):

Qwen2.5-7B: 34.0 vs 32.2 → +1.8 points
Qwen3-14B: 34.7 vs 32.5 → +2.2 points
gpt-oss-20b: 34.7 vs 32.5 → +2.2 points

Explanation:

HyDE's equal-weight averaging underestimates original query term importance
Rocchio's parameterized weights ( $\alpha=1.0, \beta=0.75$ ) provide better balance
Compared to MuGI's adaptive repetition, Rocchio's linear parameter control is more stable
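
To make the weighting difference concrete, here is a small worked calculation. It assumes every vector $f(\cdot)$ is normalized to sum to 1 and ignores the effect of truncating to the top-$k$ terms, so the numbers are illustrative rather than taken from the paper. With $n = 8$ hypothetical documents, the share of the total weight mass contributed by the original query is:

$$
\text{Average Vector: } \frac{1}{n+1} = \frac{1}{9} \approx 0.11
\qquad
\text{Rocchio: } \frac{\alpha}{\alpha + \beta} = \frac{1.0}{1.0 + 0.75} \approx 0.57
$$

Under these assumptions, Rocchio gives the query roughly five times the relative influence it has under equal-weight averaging, and that share does not shrink as more hypothetical documents are sampled.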

Finding 4: Method Robustness Differences

Traditional PRF (without LLM) Competitiveness on BEIR:

BM25 + Rocchio all-dataset average (30.4) vs Query2Doc all-dataset average (32.7)
BM25 + Rocchio BEIR average (36.2) vs Query2Doc BEIR average (36.7)

Implications:

Feedback models themselves are more robust on diverse queries
Even without LLMs, Rocchio approaches LLM methods on low-resource tasks
Combining LLM with feedback models achieves best results

Cross-LLM Consistency

Consistent Trends Across All LLMs:

Rocchio consistently optimal
Term selection yields significant improvements
Feedback model advantages more pronounced on BEIR

Impact of LLM Quality:

Stronger LLMs (Qwen3-14B) provide better absolute performance
Relative advantages of feedback models remain stable across different LLMs

Related Work

Traditional Pseudo-Relevance Feedback

Rocchio Algorithm [14]: Classical feedback method in the vector space model, adjusting the query vector toward relevant documents
Relevance Model (RM3) [1, 12]: Language model-based feedback estimating term distribution in relevant documents
Feedback Term Selection [3]: Research on selecting high-quality expansion terms from feedback documents

LLM Query Expansion

HyDE [9]: Uses LLMs to generate hypothetical answer documents for zero-shot dense retrieval
Query2Doc [16]: Generates a single hypothetical document and repeats the query 5 times
MuGI [20]: Explores best practices for LLM query expansion, proposing adaptive query repetition

Relationship to This Work:

Inherits HyDE Concept: Utilizes LLM-generated hypothetical documents as feedback source
Bridges Traditional and Modern: Introduces Rocchio, RM3 and other traditional methods into LLM feedback scenarios
Addresses Missing Systematic Evaluation: First comprehensive comparison of traditional feedback models with string concatenation methods

Conclusions and Discussion

Main Conclusions

Traditional Feedback Models Remain Effective: Classical methods such as Rocchio and RM3 remain applicable and powerful in the LLM era.
Significant Performance Improvements:
- Average improvement of 1.4 points (4.2%) over strongest string concatenation baseline
- 2.2 points (6%) improvement on low-resource tasks
Two Sources of Improvement:
- Term Filtering: Removing noisy terms (high-frequency words, low-weight terms)
- Weight Control: Stable control of query-feedback weights through parameters (rather than string repetition)
Robustness Advantage: Feedback models demonstrate more stable performance on query-diverse BEIR datasets

Limitations

Parameter Sensitivity Insufficiently Explored:
- Uses default parameters from literature ( $\alpha=1.0, \beta=0.75, \lambda=0.5$ )
- Lacks systematic study of parameter tuning potential
- Different datasets may require different parameters
Missing Computational Cost Analysis:
- Feedback models require index statistics and term filtering
- Additional overhead compared to simple string concatenation not quantified
Limited LLM Selection:
- Only 3 LLMs tested (Qwen series and gpt-oss)
- Does not cover closed-source models like GPT-4, Claude
Dense Retrieval Not Addressed:
- Experiments focus only on BM25 sparse retrieval
- Applicability to dense retrievers (e.g., ColBERT) unknown
Interaction Effects Unexplored:
- Interaction between feedback models and LLM prompting strategies
- Impact of different sampling quantities ( $n$ )

Future Directions

Adaptive Parameter Adjustment:
- Borrow MuGI's adaptive approach, dynamically adjust Rocchio's $\alpha$ and $\beta$
- Automatically select parameters based on query difficulty or document quality
Hybrid Feedback Sources:
- Combine LLM-generated and retrieved documents
- Explore complementarity of two feedback sources
Extension to Dense Retrieval:
- Study application of feedback models in dense vector space
- Design feedback mechanisms suitable for Transformer encoders
End-to-End Optimization:
- Jointly optimize LLM generation and feedback integration
- Train feedback parameters via reinforcement learning
Multi-Round Feedback:
- Iteratively apply feedback models
- Study convergence and stability

In-Depth Evaluation

Strengths

Precise Problem Identification:
- Identifies overlooked critical component in LLM query expansion research (feedback integration mechanism)
- Poses simple but important question: "Is string concatenation optimal?"
Rigorous Methodology:
- Well-designed controlled variable experiments (comparing different models with same feedback source, comparing different feedback sources with same model)
- Validates conclusion consistency across multiple LLMs
- Covers 14 datasets including both high-resource and low-resource scenarios
Comprehensive and Insightful Experiments:
- Reports not only overall results but analyzes MS MARCO vs BEIR differences
- Quantifies term selection contribution through Average Vector vs Naive Concat comparison
- Compares traditional PRF and LLM feedback revealing importance of feedback source
High Practical Value:
- Simple implementation (no LLM modification needed)
- Open source code promotes reproducibility
- Provides plug-and-play performance improvement solution
Clear Writing:
- Clear logical structure (problem → method → experiments → conclusion)
- Accurate technical detail descriptions
- Well-designed tables facilitating comparison

Weaknesses

Insufficient Theoretical Analysis:
- Lacks deep theoretical explanation for "why Rocchio is more effective on HyDE"
- Missing analysis from term distribution, information theory perspectives
- Lacks theoretical guidance for parameter selection (e.g., $\alpha=1.0, \beta=0.75$ )
Missing Parameter Sensitivity Study:
- Only uses literature default parameters, no parameter sweep
- Unclear robustness of conclusions to parameter changes
- No exploration of optimal parameter configurations for different datasets
Computational Cost Not Discussed:
- Feedback models require index statistics access (IDF, etc.)
- Time overhead of term filtering and weight computation not quantified
- Missing efficiency comparison with simple concatenation
Insufficient Case Analysis:
- No concrete examples of expansion terms
- Lacks qualitative analysis of "which terms are retained/filtered"
- Difficult to intuitively understand actual feedback model effects
Limited Applicability Scope:
- Only evaluates BM25 sparse retrieval
- Applicability to neural retrievers (ColBERT, ANCE) unknown
- Does not consider multilingual or cross-lingual scenarios
Missing Statistical Significance Testing:
- No confidence intervals or p-values reported
- Unclear whether observed improvements are statistically significant

Impact

Contributions to the Field:

Reactivates Classical Methods: Reminds community not to overlook traditional IR techniques
Establishes Evaluation Benchmark: Provides comparison baselines for future LLM query expansion research
Inspires Hybrid Methods: Encourages combining traditional and modern techniques

Practical Value:

Immediately Applicable: Existing HyDE users can directly apply Rocchio improvements
High Cost-Benefit: Achieves improvements without LLM retraining
Industrial Applicability: BM25 widely used in industry, method easily deployable

Reproducibility:

✅ Open source code
✅ Public datasets
✅ Detailed hyperparameter specifications
✅ Based on mature tools (Pyserini, vLLM)

Potential Citation Value:

Expected to become important reference in LLM query expansion research
Provides strong baselines for evaluating new methods
May inspire more traditional-modern hybrid methods

Applicable Scenarios

Recommended Use Cases:

Low-Resource Retrieval Tasks: BEIR-type diverse query scenarios
BM25 Sparse Retrieval: First-stage retrieval or hybrid retrieval systems
Computationally Constrained: Lower overhead than training neural retrievers
Interpretability Required: Term weights visualizable and debuggable

Inapplicable Scenarios:

Dense Retrieval Systems: Requires further research for adaptation
Real-Time Retrieval: Index statistics access may increase latency
Extremely Short Queries: Difficult to balance feedback weights with few query terms
End-to-End Optimization Required: Feedback model parameters not jointly trained with LLM

Implementation Recommendations:

Prioritize trying Rocchio ( $\alpha=1.0, \beta=0.75$ )
Adjust parameters based on task characteristics (increase $\alpha$ when query importance high)
Combine with term selection (filter high-frequency words, retain top-128 terms)
Monitor performance across datasets, tune parameters as needed

Key References

[1] Abdul-Jaleel et al., 2004. UMass at TREC 2004: Novelty and HARD

Proposes RM3 feedback model

[9] Gao et al., 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels (ACL)

Original HyDE method

[14] Rocchio, 1971. Relevance Feedback in Information Retrieval

Classical Rocchio algorithm literature

[16] Wang et al., 2023. Query2doc: Query Expansion with Large Language Models (EMNLP)

Representative LLM query expansion work

[20] Zhang et al., 2024. Exploring the Best Practices of Query Expansion with Large Language Models (EMNLP)

MuGI method, exploring best practices for LLM query expansion

Summary

This is a high-quality IR research paper with a clear problem focus, a simple yet effective methodology, and comprehensive, rigorous experiments. The authors identify an overlooked but important problem in LLM query expansion research and show through systematic experiments that traditional feedback models retain their value. The paper's main insight is that technological progress should not come at the cost of abandoning classical methods; combining traditional and modern techniques often yields better solutions.

While the paper has room for improvement in theoretical depth and parameter optimization, its strong practical value and excellent reproducibility suggest it will positively impact information retrieval research in the LLM era. For practitioners, this represents a low-cost, high-return improvement solution; for researchers, it opens a worthwhile direction for deeper exploration.