Recent approaches that leverage large language models (LLMs) for pseudo-relevance feedback (PRF) have generally not utilized well-established feedback models like Rocchio and RM3 when expanding queries for sparse retrievers like BM25. Instead, they often opt for a simple string concatenation of the query and LLM-generated expansion content. But is this optimal? To answer this question, we revisit and systematically evaluate traditional feedback models in the context of HyDE, a popular method that enriches query representations with LLM-generated hypothetical answer documents. Our experiments show that HyDE's effectiveness can be substantially improved when leveraging feedback algorithms such as Rocchio to extract and weight expansion terms, providing a simple way to further enhance the accuracy of LLM-based PRF methods.
- Paper ID: 2511.19349
- Title: Revisiting Feedback Models for HyDE
- Authors: Nour Jedidi, Jimmy Lin (University of Waterloo)
- Category: cs.IR (Information Retrieval)
- Submission Date: November 24, 2025 (arXiv)
- Paper Link: https://arxiv.org/abs/2511.19349
- Open Source Code: https://github.com/nourj98/hyde-feedback
Recent methods that use large language models (LLMs) for pseudo-relevance feedback (PRF) typically do not employ mature feedback models (such as Rocchio and RM3) to expand queries for sparse retrievers (e.g., BM25). Instead, they simply concatenate the query with LLM-generated expansion content as strings. This paper systematically revisits traditional feedback models in the context of HyDE, a popular method that uses LLMs to generate hypothetical answer documents to enrich query representations. Experiments show that using feedback algorithms such as Rocchio to extract and weight expansion terms can substantially improve HyDE's effectiveness, providing a simple way to strengthen LLM-based PRF methods.
The core problem addressed in this paper is: Do current LLM-based query expansion methods (such as HyDE) adequately leverage mature feedback models from traditional information retrieval when updating BM25 query representations?
- Limitations of HyDE: While HyDE effectively uses LLMs to generate hypothetical documents to bridge the vocabulary gap between queries and relevant documents, it adopts a simple string concatenation strategy when integrating generated content into BM25 retrieval.
- Neglected Traditional Methods: The information retrieval field has accumulated decades of research on pseudo-relevance feedback, including well-validated feedback models such as Rocchio and RM3. However, these methods have been marginalized in the LLM era.
- Unexplored Optimization Space: Although the feedback source has changed (from retrieved documents to LLM-generated documents), whether the feedback mechanism itself requires modification has not been systematically studied.
- Simple Concatenation Strategy: Methods like Query2Doc and MuGI directly concatenate queries with LLM-generated text, lacking filtering and weighting of expansion terms.
- Ignoring Two-Stage Framework: Traditional PRF comprises two critical stages—term selection and weight assignment—which current LLM methods bypass.
- Lack of Systematic Comparison: Existing research primarily focuses on improving LLM-generated expansion content, with less attention to better utilizing this content.
The authors observe that traditional PRF and LLM-based feedback methods should, in principle, differ only in their feedback source (retrieved documents vs. LLM-generated documents), yet in practice their query update mechanisms also differ substantially. This motivates the hypothesis that traditional feedback models may apply equally well to LLM-generated feedback content and may yield performance improvements.
- First Systematic Evaluation: The first comprehensive comparison of traditional feedback models (Rocchio, RM3) with modern string concatenation methods in the context of LLM-generated feedback.
- Demonstrating the Value of Traditional Methods: Experiments show that applying traditional feedback algorithms such as Rocchio to HyDE substantially improves retrieval effectiveness over the strongest string concatenation baseline, by up to 1.4 points (4.2%) on average across all datasets and by up to 2.2 points (6%) on low-resource tasks.
- Providing Practical Improvement Solutions: Offers a simple yet effective improvement to HyDE without modifying the LLM generation process, only changing the feedback integration mechanism.
- Open Source Implementation: Releases complete code implementation to facilitate community reproduction and further research.
Input: User query q
Output: Updated query representation $q_{new}$ for BM25 retrieval
Objective: Improve query representation by integrating LLM-generated hypothetical answer documents to enhance retrieval recall
- Given query q, prompt LLM to generate hypothetical answer documents
- Sample $n$ variants: $d = \{d_1, \ldots, d_n\}$
- Use these hypothetical documents to update query representation
- Perform BM25 retrieval with the updated query
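The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' code: `generate`, `update`, and `search` are hypothetical callables standing in for an LLM wrapper, a query update rule, and a BM25 searcher.

```python
from typing import Callable, List


def hyde_retrieve(
    query: str,
    generate: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    update: Callable[[str, List[str]], str],    # query update rule (e.g. concatenation)
    search: Callable[[str], List[str]],         # hypothetical BM25 wrapper
    n: int = 8,
) -> List[str]:
    """Generate n hypothetical documents, update the query, then retrieve."""
    hypothetical_docs = generate(query, n)          # steps 1-2: prompt and sample
    new_query = update(query, hypothetical_docs)    # step 3: update representation
    return search(new_query)                        # step 4: BM25 retrieval
```

Keeping `update` as a parameter mirrors the question the paper asks: the feedback source stays fixed while the update rule is swapped.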
The proposed framework comprises two core stages:
- Generate Term Frequency Vectors: Generate normalized term frequency vectors $f(d_i)$ for each hypothetical document $d_i$
- Filter Common Terms: Remove high-frequency terms appearing in more than 10% of corpus documents
- Ranking and Truncation:
- Rank candidate expansion terms by sum of normalized term frequencies
- Retain top-k terms (this paper sets k=128)
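The term selection stage above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, L1 normalization of term frequencies (the summary does not specify the normalization), and a `doc_freq` dictionary standing in for document frequencies read from the index.

```python
from collections import Counter
from typing import Dict, List


def normalized_tf(text: str) -> Dict[str, float]:
    """Term frequencies divided by document length (assumed L1 normalization)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def select_terms(docs: List[str], doc_freq: Dict[str, int], num_docs: int,
                 max_df_ratio: float = 0.10, k: int = 128) -> List[str]:
    """Return up to k candidate expansion terms from the hypothetical documents."""
    scores: Counter = Counter()
    for doc in docs:
        for term, weight in normalized_tf(doc).items():
            # Drop terms that occur in more than max_df_ratio of corpus documents.
            if doc_freq.get(term, 0) / num_docs > max_df_ratio:
                continue
            scores[term] += weight
    # Rank by summed normalized TF; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

For example, with a 1,000-document corpus in which "the" appears in 900 documents, "the" is filtered out regardless of how often the hypothetical documents use it.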
Method 1: Average Vector
This is an adaptation of HyDE's original method in bag-of-words space:
$$w_{t,q_{new}} = \frac{1}{n+1} \sum_{d_i \in d_{HyDE}} f(d_i)[t]$$
where $d_{HyDE} = \{q, d_1, \ldots, d_n\}$ (treating the query as an additional feedback document)
Characteristics:
- Equal-weight averaging of query and feedback documents
- Equivalent to string concatenation with term selection
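The average-vector update described above can be sketched in a few lines. The example assumes whitespace tokenization and L1-normalized term frequencies, and it omits the term selection step for brevity; these are simplifications of this illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def normalized_tf(text: str) -> Dict[str, float]:
    """Term frequencies divided by document length (assumed L1 normalization)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def average_vector(query: str, docs: List[str]) -> Dict[str, float]:
    """Average the TF vectors of the query and all feedback documents."""
    feedback = [query] + list(docs)  # the query counts as one more document
    weights: Dict[str, float] = defaultdict(float)
    for text in feedback:
        for term, w in normalized_tf(text).items():
            weights[term] += w / len(feedback)
    return dict(weights)
```

With the query "bm25" and the two documents "bm25 ranking" and "ranking function", the term "bm25" receives weight (1 + 0.5 + 0) / 3 = 0.5.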
Method 2: Rocchio Algorithm
A classical vector space feedback model introducing parameters to control relative weights of query and feedback documents:
$$w_{t,q_{new}} = \alpha \cdot f(q)[t] + \frac{\beta}{n} \sum_{d_i \in d} f(d_i)[t]$$
Parameter Settings:
- α=1.0: Query weight
- β=0.75: Feedback document weight
- Enables differential weighting of query terms and expansion terms
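A minimal sketch of the Rocchio update with the parameter values listed above. Tokenization and the L1 normalization of the term frequency vectors are assumptions of this example, and term selection is left out.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def normalized_tf(text: str) -> Dict[str, float]:
    """Term frequencies divided by document length (assumed L1 normalization)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def rocchio(query: str, docs: List[str],
            alpha: float = 1.0, beta: float = 0.75) -> Dict[str, float]:
    """alpha * f(q) plus beta times the mean of the feedback document vectors."""
    weights: Dict[str, float] = defaultdict(float)
    for term, w in normalized_tf(query).items():
        weights[term] += alpha * w
    for doc in docs:
        for term, w in normalized_tf(doc).items():
            weights[term] += beta * w / len(docs)
    return dict(weights)
```

With the query "bm25" and the documents "bm25 ranking" and "ranking function", "bm25" gets 1.0 + 0.75 · 0.5 / 2 = 1.1875, while "ranking" gets 0.375: the query term clearly dominates, unlike in equal-weight averaging.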
Method 3: RM3 (Relevance Model 3)
A language model-based feedback method estimating the observation probability of terms in relevant documents:
$$w_{t,q_{new}} = \lambda P(t \mid q) + (1-\lambda) \sum_{d_i \in d} P(t \mid d_i)$$
Parameter Settings:
- λ=0.5: Query-feedback interpolation weight
- Based on probabilistic framework rather than vector space
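The RM3-style interpolation can be sketched as below. Two assumptions of this example should be read carefully: $P(t \mid \cdot)$ is estimated by maximum likelihood over whitespace tokens, and each feedback document receives a uniform weight of $1/n$ so that the feedback distribution sums to one. The formula as written in this summary has no such per-document weight, so the `doc_weights` parameter is a hypothetical addition.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def mle(text: str) -> Dict[str, float]:
    """Maximum-likelihood unigram distribution P(t | text)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def rm3(query: str, docs: List[str], lam: float = 0.5,
        doc_weights: Optional[List[float]] = None) -> Dict[str, float]:
    """Interpolate the query model with a feedback model built from docs."""
    if doc_weights is None:
        # Assumption: uniform document weights (no retrieval scores available).
        doc_weights = [1.0 / len(docs)] * len(docs) if docs else []
    weights: Dict[str, float] = defaultdict(float)
    for term, p in mle(query).items():
        weights[term] += lam * p
    for doc, dw in zip(docs, doc_weights):
        for term, p in mle(doc).items():
            weights[term] += (1.0 - lam) * dw * p
    return dict(weights)
```

Under these assumptions the output is itself a probability distribution over terms, which is the main practical difference from the vector-space Rocchio update.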
String Concatenation Methods:
- Naive Concat: $q_{new} = \mathrm{Concat}(q, d)$
- Direct concatenation without processing
- Query2Doc: $q_{new} = \mathrm{Concat}(q \times 5, d_1)$
- Repeat query 5 times + single hypothetical document (128 tokens)
- Total expansion terms approximately 128
- MuGI: Adaptive query repetition
$$r = \frac{\sum_{i=1}^{n} \mathrm{len}(d_i)}{\mathrm{len}(q) \cdot \phi}, \qquad q_{new} = \mathrm{Concat}(q \times r, d)$$
- $\phi = 5$: Control parameter
- Dynamically adjust query repetition based on document length
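The three concatenation baselines can be sketched as plain string operations. Measuring length in whitespace tokens and rounding the repetition count down (with a minimum of one) are assumptions of this example; the original implementations may measure length and round differently.

```python
from typing import List


def naive_concat(query: str, docs: List[str]) -> str:
    """Query followed by all hypothetical documents, unweighted."""
    return " ".join([query, *docs])


def query2doc(query: str, doc: str, repeats: int = 5) -> str:
    """Query repeated a fixed number of times, then one hypothetical document."""
    return " ".join([query] * repeats + [doc])


def mugi_concat(query: str, docs: List[str], phi: int = 5) -> str:
    """Repeat the query r times, where r grows with total document length."""
    q_len = len(query.split())
    d_len = sum(len(d.split()) for d in docs)
    if q_len == 0:
        return " ".join(docs)
    r = max(1, d_len // (q_len * phi))  # assumed: floor division, at least one copy
    return " ".join([query] * r + list(docs))
```

For a 3-token query and 30 tokens of generated text, this gives r = 30 // (3 · 5) = 2 copies of the query. The query weight is therefore controlled only indirectly, through repetition.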
- Unified Framework: Places traditional PRF and LLM feedback methods within the same framework for comparison, revealing mechanistic differences.
- Value of Term Selection: Quantifies the contribution of noise filtering by comparing methods with/without term selection.
- Parameterized Weight Control: Rocchio's α and β parameters provide more stable weight control than string repetition.
- Cross-Feedback-Source Evaluation: Simultaneously evaluates traditional BM25 document feedback and LLM-generated document feedback, demonstrating LLM feedback superiority.
MS MARCO Dataset (5 Web search tasks):
- MS MARCO v1: TREC DL19, TREC DL20
- MS MARCO v2: TREC DL21, TREC DL22, TREC DL23
BEIR Dataset (9 low-resource retrieval tasks):
- Biomedical IR: TREC-Covid, NFCorpus
- News Retrieval: TREC-News, Robust04
- Financial QA: FiQA
- Entity Retrieval: DBPedia
- Fact Verification: SciFact
- Citation Prediction: SciDocs
- Argument Retrieval: ArguAna
Dataset Characteristics:
- MS MARCO: Resource-rich, relatively homogeneous queries
- BEIR: Zero-shot evaluation, high query diversity, broad domain coverage
Recall@20: Fraction of a query's relevant documents that appear in the top-20 retrieved results
- Suitable for evaluating first-stage retriever recall capability
- Focuses on whether relevant documents can be retrieved, not ranking quality
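For reference, Recall@k for a single query can be computed as below. This is a generic sketch of the metric, not the evaluation script used in the paper; returning 0.0 for queries without relevant documents is a convention chosen for this example.

```python
from typing import Iterable, List


def recall_at_k(ranked_ids: List[str], relevant_ids: Iterable[str], k: int = 20) -> float:
    """Fraction of the relevant documents that appear in the top-k ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)
```

The position of a relevant document inside the top 20 does not affect the score, which is why the metric reflects first-stage recall rather than ranking quality.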
Non-Expansion Baseline:
- BM25 (without query expansion)
Traditional PRF (using BM25-retrieved documents):
- BM25 + Average Vector
- BM25 + RM3
- BM25 + Rocchio
LLM Feedback Methods (using HyDE-generated documents):
- Query2Doc
- HyDE + Naive Concat
- HyDE + MuGI Concat
- HyDE + Average Vector
- HyDE + RM3
- HyDE + Rocchio
LLM Configuration:
- Models: Qwen2.5-7B-Instruct, Qwen3-14B, gpt-oss-20b
- Sampling Quantity: n=8 hypothetical documents
- Document Length: Maximum 512 tokens
- Inference Framework: vLLM
Feedback Model Parameters:
- Rocchio: α=1.0, β=0.75
- RM3: λ=0.5
- Term Quantity: k=128 (aligned with Query2Doc)
- Feedback Documents: 8 (matching HyDE sampling)
Retrieval System:
- Implementation: Pyserini (based on Lucene)
- BM25 Parameters: Default settings
- Index Statistics: Obtained via IndexReader API
- Custom Queries: Term weights set using QueryBuilder API
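A weighted term query of this kind might be assembled with Pyserini's query builder roughly as follows. This is a hedged sketch rather than the paper's code: `to_boosts` and `build_weighted_query` are hypothetical helper names, the `querybuilder` calls should be checked against the installed Pyserini version, and each term is assumed to be a single token that the default analyzer maps to an indexed term.

```python
from typing import Dict, List, Tuple


def to_boosts(weights: Dict[str, float], k: int = 128) -> List[Tuple[str, float]]:
    """Keep the k highest-weighted terms with positive weight."""
    positive = [(t, w) for t, w in weights.items() if w > 0]
    return sorted(positive, key=lambda tw: (-tw[1], tw[0]))[:k]


def build_weighted_query(weights: Dict[str, float], k: int = 128):
    """Build a Lucene boolean query whose SHOULD clauses carry term boosts."""
    # Imported lazily so to_boosts can be used without Pyserini (and Java) installed.
    from pyserini.search.lucene import querybuilder

    should = querybuilder.JBooleanClauseOccur['should'].value
    builder = querybuilder.get_boolean_query_builder()
    for term, weight in to_boosts(weights, k):
        term_query = querybuilder.get_term_query(term)
        builder.add(querybuilder.get_boost_query(term_query, float(weight)), should)
    return builder.build()
```

The returned query object can then be passed to a `LuceneSearcher` in place of a query string, so BM25 scores each term in proportion to its feedback weight.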
Best Method: HyDE + Rocchio achieves the highest average Recall@20 with all three LLMs
- Qwen2.5-7B: Average Recall@20 = 34.0 (all datasets)
- Qwen3-14B: Average Recall@20 = 34.7
- gpt-oss-20b: Average Recall@20 = 34.7
Improvement over Strongest String Concatenation Baseline (MuGI):
- Qwen2.5-7B: +1.1 points (3.3% improvement)
- Qwen3-14B: +1.3 points (3.9% improvement)
- gpt-oss-20b: +1.4 points (4.2% improvement)
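The relative improvements follow from the absolute gains and the implied MuGI averages. The baselines below are not quoted from the paper; they are back-computed from the rounded numbers above, so they may be off by 0.1.

```latex
\begin{align*}
\text{Qwen2.5-7B:}  \quad & 34.0 - 1.1 = 32.9, \quad 1.1 / 32.9 \approx 3.3\% \\
\text{Qwen3-14B:}   \quad & 34.7 - 1.3 = 33.4, \quad 1.3 / 33.4 \approx 3.9\% \\
\text{gpt-oss-20b:} \quad & 34.7 - 1.4 = 33.3, \quad 1.4 / 33.3 \approx 4.2\%
\end{align*}
```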
MS MARCO Dataset:
- String concatenation methods (MuGI, Query2Doc) are highly competitive
- For example, gpt-oss-20b MuGI outperforms RM3 on all 5 MS MARCO datasets
BEIR Dataset (low-resource tasks):
- Feedback models significantly outperform string concatenation
- gpt-oss-20b + RM3:
- Outperforms Query2Doc on all 9 BEIR datasets
- Outperforms MuGI Concat on 8/9 datasets
- Average Improvement (Rocchio vs MuGI):
- Qwen2.5-7B: BEIR average +1.9 points
- Qwen3-14B: BEIR average +1.9 points
- gpt-oss-20b: BEIR average +2.2 points
Typical Cases:
- SciFact (scientific fact verification):
- gpt-oss-20b + Rocchio: 91.9
- gpt-oss-20b + MuGI: 90.6
- ArguAna (argument retrieval):
- Qwen3-14B + Rocchio: 83.8
- Qwen3-14B + MuGI: 76.4 (+7.4 points)
Controlling feedback model, comparing feedback sources:
Using gpt-oss-20b as example (average across all datasets):
- Average Vector: HyDE documents (32.5) vs BM25 documents (29.7) → +2.8 points
- RM3: HyDE documents (33.2) vs BM25 documents (30.7) → +2.5 points
- Rocchio: HyDE documents (34.7) vs BM25 documents (30.4) → +4.3 points
Conclusion: Under identical feedback mechanisms, LLM-generated hypothetical documents are more effective feedback sources than retrieved documents.
Interesting Observation:
- RM3 outperforms Rocchio on BM25 documents (30.7 vs 30.4)
- But Rocchio is superior on HyDE documents (34.7 vs 33.2)
- Indicates that feedback source characteristics influence optimal feedback model selection
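The comparison above can be checked directly from the reported gpt-oss-20b averages. The snippet only re-derives the gains and the best feedback model per source from the numbers listed in this section.

```python
# Average Recall@20 across all datasets (gpt-oss-20b), as listed above.
reported = {
    "Average Vector": {"hyde": 32.5, "bm25": 29.7},
    "RM3": {"hyde": 33.2, "bm25": 30.7},
    "Rocchio": {"hyde": 34.7, "bm25": 30.4},
}

# Gain from switching the feedback source while holding the feedback model fixed.
gains = {model: round(s["hyde"] - s["bm25"], 1) for model, s in reported.items()}

# Best feedback model for each feedback source.
best_by_source = {
    source: max(reported, key=lambda model: reported[model][source])
    for source in ("hyde", "bm25")
}

print(gains)
print(best_by_source)
```

The best model differs by source (Rocchio for HyDE documents, RM3 for BM25-retrieved documents), which is the observation made above.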
Comparing Average Vector vs Naive Concat:
- Only difference: whether term selection and filtering are performed
Performance Gap (average across all datasets; reported gains can differ by 0.1 from the difference of the rounded scores shown):
- Qwen2.5-7B: 32.2 vs 29.3 → +3.0 points (10.2%)
- Qwen3-14B: 32.5 vs 30.2 → +2.3 points (7.6%)
- gpt-oss-20b: 32.5 vs 29.5 → +3.1 points (10.5%)
More Pronounced on BEIR Dataset:
- Qwen2.5-7B BEIR: 36.6 vs 33.3 → +3.3 points
Conclusion: Filtering noisy terms (such as high-frequency words) is crucial for improving HyDE effectiveness.
Rocchio vs Average Vector:
- Core difference: Rocchio assigns higher weights to query terms via α and β parameters
- Average Vector applies equal weights to all documents (including query)
Performance Comparison (average across all datasets):
- Qwen2.5-7B: 34.0 vs 32.2 → +1.8 points
- Qwen3-14B: 34.7 vs 32.5 → +2.2 points
- gpt-oss-20b: 34.7 vs 32.5 → +2.2 points
Explanation:
- HyDE's equal-weight averaging underestimates original query term importance
- Rocchio's parameterized weights (α=1.0,β=0.75) provide better balance
- Compared to MuGI's adaptive repetition, Rocchio's linear parameter control is more stable
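One way to see why equal-weight averaging underweights the query is to compare the share of total term weight contributed by the query vector under each rule. The derivation assumes every vector $f(\cdot)$ sums to 1 (L1 normalization, an assumption of this illustration) and uses $n = 8$, $\alpha = 1.0$, $\beta = 0.75$ from the experimental setup.

```latex
\begin{align*}
\text{Average Vector:} \quad & \frac{\frac{1}{n+1}}{(n+1) \cdot \frac{1}{n+1}} = \frac{1}{n+1} = \frac{1}{9} \approx 11\% \\
\text{Rocchio:} \quad & \frac{\alpha}{\alpha + \frac{\beta}{n} \cdot n} = \frac{\alpha}{\alpha + \beta} = \frac{1.0}{1.75} \approx 57\%
\end{align*}
```

Under these assumptions the query's share shrinks as more hypothetical documents are sampled with averaging, while with Rocchio it is independent of $n$.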
Traditional PRF (without LLM) Competitiveness on BEIR:
- Average across all datasets: BM25 + Rocchio (30.4) vs Query2Doc (32.7)
- BEIR average: BM25 + Rocchio (36.2) vs Query2Doc (36.7)
Implications:
- Feedback models themselves are more robust on diverse queries
- Even without LLMs, Rocchio approaches LLM methods on low-resource tasks
- Combining LLM with feedback models achieves best results
Consistent Trends Across All LLMs:
- Rocchio consistently optimal
- Term selection yields significant improvements
- Feedback model advantages more pronounced on BEIR
Impact of LLM Quality:
- Stronger LLMs (Qwen3-14B) provide better absolute performance
- Relative advantages of feedback models remain stable across different LLMs
- Rocchio Algorithm [14]: Classical feedback method in vector space model, adjusting query vector toward relevant documents
- Relevance Model (RM3) [1, 12]: Language model-based feedback estimating term distribution in relevant documents
- Feedback Term Selection [3]: Research on selecting high-quality expansion terms from feedback documents
- HyDE [9]: Uses LLMs to generate hypothetical answer documents for zero-shot dense retrieval
- Query2Doc [16]: Generates single hypothetical document and repeats query 5 times
- MuGI [20]: Explores best practices for LLM query expansion, proposing adaptive query repetition
- Inherits HyDE Concept: Utilizes LLM-generated hypothetical documents as feedback source
- Bridges Traditional and Modern: Introduces Rocchio, RM3 and other traditional methods into LLM feedback scenarios
- Addresses Missing Systematic Evaluation: First comprehensive comparison of traditional feedback models with string concatenation methods
- Traditional Feedback Models Remain Effective: Classical methods such as Rocchio and RM3 remain applicable and powerful in the LLM era.
- Substantial Performance Improvements:
- Up to 1.4 points (4.2%) average improvement over the strongest string concatenation baseline (gpt-oss-20b)
- Up to 2.2 points (6%) improvement on low-resource tasks (gpt-oss-20b, BEIR average)
- Two Sources of Improvement:
- Term Filtering: Removing noisy terms (high-frequency words, low-weight terms)
- Weight Control: Stable control of query-feedback weights through parameters (rather than string repetition)
- Robustness Advantage: Feedback models demonstrate more stable performance on query-diverse BEIR datasets
- Parameter Sensitivity Insufficiently Explored:
- Uses default parameters from literature (α=1.0,β=0.75,λ=0.5)
- Lacks systematic study of parameter tuning potential
- Different datasets may require different parameters
- Missing Computational Cost Analysis:
- Feedback models require index statistics and term filtering
- Additional overhead compared to simple string concatenation not quantified
- Limited LLM Selection:
- Only 3 LLMs tested (Qwen series and gpt-oss)
- Does not cover closed-source models like GPT-4, Claude
- Dense Retrieval Not Addressed:
- Experiments focus only on BM25 sparse retrieval
- Applicability to dense retrievers (e.g., ColBERT) unknown
- Interaction Effects Unexplored:
- Interaction between feedback models and LLM prompting strategies
- Impact of different sampling quantities (n)
- Adaptive Parameter Adjustment:
- Borrow MuGI's adaptive approach, dynamically adjust Rocchio's α and β
- Automatically select parameters based on query difficulty or document quality
- Hybrid Feedback Sources:
- Combine LLM-generated and retrieved documents
- Explore complementarity of two feedback sources
- Extension to Dense Retrieval:
- Study application of feedback models in dense vector space
- Design feedback mechanisms suitable for Transformer encoders
- End-to-End Optimization:
- Jointly optimize LLM generation and feedback integration
- Train feedback parameters via reinforcement learning
- Multi-Round Feedback:
- Iteratively apply feedback models
- Study convergence and stability
- Precise Problem Identification:
- Identifies overlooked critical component in LLM query expansion research (feedback integration mechanism)
- Poses simple but important question: "Is string concatenation optimal?"
- Rigorous Methodology:
- Well-designed controlled variable experiments (comparing different models with same feedback source, comparing different feedback sources with same model)
- Validates conclusion consistency across multiple LLMs
- Covers 14 datasets including both high-resource and low-resource scenarios
- Comprehensive and Insightful Experiments:
- Reports not only overall results but analyzes MS MARCO vs BEIR differences
- Quantifies term selection contribution through Average Vector vs Naive Concat comparison
- Compares traditional PRF and LLM feedback revealing importance of feedback source
- High Practical Value:
- Simple implementation (no LLM modification needed)
- Open source code promotes reproducibility
- Provides plug-and-play performance improvement solution
- Clear Writing:
- Clear logical structure (problem → method → experiments → conclusion)
- Accurate technical detail descriptions
- Well-designed tables facilitating comparison
- Insufficient Theoretical Analysis:
- Lacks deep theoretical explanation for "why Rocchio is more effective on HyDE"
- Missing analysis from term distribution, information theory perspectives
- Lacks theoretical guidance for parameter selection (e.g., α=1.0,β=0.75)
- Missing Parameter Sensitivity Study:
- Only uses literature default parameters, no parameter sweep
- Unclear robustness of conclusions to parameter changes
- No exploration of optimal parameter configurations for different datasets
- Computational Cost Not Discussed:
- Feedback models require index statistics access (IDF, etc.)
- Time overhead of term filtering and weight computation not quantified
- Missing efficiency comparison with simple concatenation
- Insufficient Case Analysis:
- No concrete examples of expansion terms
- Lacks qualitative analysis of "which terms are retained/filtered"
- Difficult to intuitively understand actual feedback model effects
- Limited Applicability Scope:
- Only evaluates BM25 sparse retrieval
- Applicability to neural retrievers (ColBERT, ANCE) unknown
- Does not consider multilingual or cross-lingual scenarios
- Missing Statistical Significance Testing:
- No confidence intervals or p-values reported
- Unclear whether observed improvements are statistically significant
Contributions to the Field:
- Reactivates Classical Methods: Reminds community not to overlook traditional IR techniques
- Establishes Evaluation Benchmark: Provides comparison baselines for future LLM query expansion research
- Inspires Hybrid Methods: Encourages combining traditional and modern techniques
Practical Value:
- Immediately Applicable: Existing HyDE users can directly apply Rocchio improvements
- High Cost-Benefit: Achieves improvements without LLM retraining
- Industrial Applicability: BM25 widely used in industry, method easily deployable
Reproducibility:
- ✅ Open source code
- ✅ Public datasets
- ✅ Detailed hyperparameter specifications
- ✅ Based on mature tools (Pyserini, vLLM)
Potential Citation Value:
- Expected to become important reference in LLM query expansion research
- Provides strong baselines for evaluating new methods
- May inspire more traditional-modern hybrid methods
Recommended Use Cases:
- Low-Resource Retrieval Tasks: BEIR-type diverse query scenarios
- BM25 Sparse Retrieval: First-stage retrieval or hybrid retrieval systems
- Computationally Constrained: Lower overhead than training neural retrievers
- Interpretability Required: Term weights visualizable and debuggable
Inapplicable Scenarios:
- Dense Retrieval Systems: Requires further research for adaptation
- Real-Time Retrieval: Index statistics access may increase latency
- Extremely Short Queries: Difficult to balance feedback weights with few query terms
- End-to-End Optimization Required: Feedback model parameters not jointly trained with LLM
Implementation Recommendations:
- Prioritize trying Rocchio (α=1.0,β=0.75)
- Adjust parameters based on task characteristics (increase α when query importance high)
- Combine with term selection (filter high-frequency words, retain top-128 terms)
- Monitor performance across datasets, tune parameters as needed
[1] Abdul-Jaleel et al., 2004. UMass at TREC 2004: Novelty and HARD
- Proposes RM3 feedback model
[9] Gao et al., 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels (ACL)
[14] Rocchio, 1971. Relevance Feedback in Information Retrieval
- Classical Rocchio algorithm literature
[16] Wang et al., 2023. Query2doc: Query Expansion with Large Language Models (EMNLP)
- Representative LLM query expansion work
[20] Zhang et al., 2024. Exploring the Best Practices of Query Expansion with Large Language Models (EMNLP)
- MuGI method, exploring best practices for LLM query expansion
This is a high-quality IR research paper with a clear problem orientation, a simple yet effective methodology, and comprehensive, rigorous experiments. The authors identify an overlooked but important problem in LLM query expansion research and, through systematic experiments, demonstrate the enduring value of traditional feedback models. The paper's main insight is that technological progress should not come at the cost of abandoning classical methods; combining traditional and modern techniques often yields superior solutions.
While the paper has room for improvement in theoretical depth and parameter optimization, its strong practical value and excellent reproducibility suggest it will positively impact information retrieval research in the LLM era. For practitioners, this represents a low-cost, high-return improvement solution; for researchers, it opens a worthwhile direction for deeper exploration.