Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Zhao, Yu, Xu

Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.

Basic Information

Paper ID: 2510.10009
Title: Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
Authors: Shu Zhao (NVIDIA & Pennsylvania State University), Tan Yu (NVIDIA), Anbang Xu (NVIDIA)
Classification: cs.CL cs.AI cs.IR
Publication Date: 2025-10-14 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10009

Abstract

Reasoning-augmented search agents (such as Search-R1) are trained to iteratively reason, search, and generate final answers. However, because of their limited reasoning and search capabilities, their performance on multi-hop question-answering benchmarks remains far from satisfactory. To handle complex or compound queries, the authors train an LLM-based search agent with native query expansion capability through reinforcement learning. In each turn, the search agent proposes several query variants, which are searched simultaneously to cover more relevant information. With limited post-training data and computing resources, it is very hard for a single search agent to master query generation, comprehension of retrieved information, and answer generation all at once. The authors therefore incorporate a pre-trained compressor model (called the squeezer in the paper) that helps the search agent understand the retrieved documents, so that the agent can focus on generating queries for high retrieval recall. With the compressor's assistance, even a small 3B LLM shows strong query expansion capability and achieves state-of-the-art accuracy on multi-hop question-answering benchmarks. Specifically, experiments on seven question-answering benchmarks show that the method, named ExpandSearch, achieves an average improvement of 4.4% over state-of-the-art baselines, with strong gains on multi-hop reasoning tasks that require aggregating diverse evidence.

Research Background and Motivation

Problem Definition

Existing reasoning-augmented search agents face two core challenges:

Semantic Incompleteness: Generated queries are semantically impoverished, failing to capture the complete range of relevant information, particularly when facing multi-faceted questions requiring diverse evidence
Information Overload: Retrieved content contains substantial irrelevant information, obscuring critical facts and degrading reasoning quality

Research Significance

Multi-hop question-answering tasks require aggregating evidence from multiple perspectives. The semantic limitations of single queries and theoretical constraints of vector embedding-based retrieval severely restrict system performance. This problem is particularly acute in complex reasoning scenarios where agents must navigate extensive search results to identify sparse yet critical evidence.

Limitations of Existing Methods

Methods like Search-R1 generate only a single query per round, easily missing critical semantic information
Lengthy retrieved content leads to high computational costs, substantial GPU memory consumption, and significantly reduced training speed
Signal-to-noise ratio problems are particularly severe in multi-hop reasoning tasks

Research Motivation

The authors' core insight is that effective information retrieval requires a dual strategy: first expand the query space to maximize coverage of relevant information, then selectively distill the retrieved content so that only the facts critical for reasoning remain. This "expand-compress" paradigm mirrors how humans seek information.

Core Contributions

Identified and formalized the dual problem: Semantic incompleteness and information overload in reasoning-augmented search agents, with empirical analysis demonstrating that both problems significantly reduce performance on complex reasoning tasks
Proposed the ExpandSearch framework: An "expand-compress" framework combining reinforcement learning-based query expansion and prompt-based selective information refinement, achieving high recall in multi-step reasoning scenarios while maintaining precision
Achieved significant performance improvements: Substantial improvements over state-of-the-art baselines on seven benchmarks, with particularly strong performance on multi-hop reasoning tasks requiring diverse evidence aggregation

Methodology Details

Task Definition

Given an input query x, the search agent must generate a final answer y through an iterative reasoning-search process. In each round, the agent may invoke a search engine R to retrieve relevant document chunks and then reason over the retrieved information.

Model Architecture

Expand-then-Squeeze Strategy

Expand Phase:

The LLM generates <search></search> blocks containing n diverse queries {q_1, ..., q_n}
Each query q_i retrieves the k most relevant chunks through search engine R: C_i = {c_i^1, ..., c_i^k} ← R(q_i)
Effectively overcomes single-query retrieval limitations and improves retrieval recall
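
The expand phase above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the one-query-per-line layout inside the <search> block, the helper names parse_queries and expand_retrieve, and the retriever callable standing in for the search engine R are all assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

SEARCH_BLOCK = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def parse_queries(generation: str) -> List[str]:
    """Extract q_1..q_n from the last <search> block.

    Assumption: the block holds one query per line.
    """
    blocks = SEARCH_BLOCK.findall(generation)
    if not blocks:
        return []
    return [q.strip() for q in blocks[-1].splitlines() if q.strip()]


def expand_retrieve(
    queries: List[str],
    retriever: Callable[[str, int], List[str]],  # hypothetical stand-in for R
    k: int = 5,
) -> List[List[str]]:
    """Run all n queries concurrently and return C_1..C_n in query order."""
    if not queries:
        return []
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(lambda q: retriever(q, k), queries))
```

Issuing the n retrievals concurrently keeps the latency of a turn close to that of a single search, which is one plausible way to realize the "searched simultaneously" behavior the paper describes.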

Squeeze Phase:

Input the generated queries q_1, ..., q_n and the retrieved chunks C_1, ..., C_n into a frozen LLM compressor π_s
Generate summary: s = π_s(q_1, ..., q_n, C_1, ..., C_n)
Compressed information s is encapsulated in <information></information> blocks inserted into the ongoing generation sequence
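
A hedged sketch of the squeeze step, under stated assumptions: the compressor is modeled as any callable that maps a prompt string to a summary string (for example a wrapper around an LLM API), the prompt wording is invented for illustration, and the de-duplication of chunks shared by several queries is an added convenience that the summary does not describe.

```python
from typing import Callable, List


def squeeze(
    queries: List[str],
    chunk_sets: List[List[str]],
    compressor: Callable[[str], str],  # hypothetical wrapper around the frozen LLM pi_s
) -> str:
    """Compute s = pi_s(q_1..q_n, C_1..C_n) and wrap it in <information> tags."""
    seen = set()
    parts = []
    for i, (query, chunks) in enumerate(zip(queries, chunk_sets), start=1):
        parts.append(f"Query {i}: {query}")
        for chunk in chunks:
            # Assumption: a chunk already listed under an earlier query is skipped.
            if chunk not in seen:
                seen.add(chunk)
                parts.append(f"- {chunk}")
    # Illustrative prompt wording, not taken from the paper.
    prompt = (
        "Summarize only the facts relevant to the queries below.\n"
        + "\n".join(parts)
    )
    summary = compressor(prompt)
    return f"<information>{summary.strip()}</information>"
```

The returned string is what would be appended to the agent's ongoing generation, so the policy model only ever sees the short summary rather than all n × k raw chunks.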

Technical Innovations

1. Query Expansion Types

Two complementary expansion types naturally discovered through reinforcement learning:

Syntactic Expansion: Handles surface form variations, e.g., "where did he die" → "his death place"
Semantic Expansion: Broadens information scope, e.g., "Alex's father" → "Alex's family"

2. Modular Architecture Design

Search Agent: Focuses on query generation to achieve high retrieval recall
Compressor Model: Handles comprehension of the retrieved documents independently; it is accessed through API calls, which decouples it from the search agent

3. Reward Function Design

Employs a weighted combination reward function: r = r_EM + λ · r_f

r_EM: Exact match reward, equal to 1 when the predicted answer exactly matches the ground truth and 0 otherwise
r_f: Format reward, equal to 1 when the response strictly follows the required format and 0 otherwise
λ: Weight on the format reward, defaults to 0.2
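
As a rough illustration of the reward above, the following sketch computes r = r_EM + λ · r_f for a single response. Two things here are assumptions rather than details from the summary: the <answer></answer> tag used to locate the prediction, and the format rule (exactly one answer block). The string comparison is also simplified to a case-insensitive match.

```python
import re

ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def reward(response: str, gold: str, lam: float = 0.2) -> float:
    """Return r = r_EM + lam * r_f for one rollout."""
    answers = ANSWER_BLOCK.findall(response)
    # Assumed format rule: the response contains exactly one <answer> block.
    r_f = 1.0 if len(answers) == 1 else 0.0
    prediction = answers[-1].strip().lower() if answers else ""
    # Simplified exact match: case-insensitive string equality.
    r_em = 1.0 if prediction == gold.strip().lower() else 0.0
    return r_em + lam * r_f
```

With λ = 0.2, a well-formatted but wrong answer earns 0.2 while a well-formatted correct answer earns 1.2, so the format term nudges the policy toward parseable output without outweighing correctness.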

Experimental Setup

Datasets

Covers seven benchmarks, divided into two categories:

General Question-Answering: NQ, TriviaQA, PopQA
Multi-hop Question-Answering: HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle

Following the setup of Jin et al., the NQ and HotpotQA training sets are combined for training, and evaluation is run on the benchmarks' validation/test sets to measure both in-domain and out-of-domain generalization.

Evaluation Metrics

Uses Exact Match (EM) as the primary evaluation metric.
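
For readers unfamiliar with the metric, here is a minimal sketch of Exact Match as it is commonly computed for open-domain QA. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow widespread practice and are an assumption; the summary does not state the paper's exact normalization.

```python
import re
import string
from typing import Iterable, List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: Iterable[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in golds))


def mean_em(predictions: Sequence[str], gold_lists: Sequence[List[str]]) -> float:
    """Average EM over a dataset; 0.0 for an empty dataset."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores) if scores else 0.0
```

Because EM gives no partial credit, a prediction such as "Paris, France" scores 0 against the gold answer "Paris", which is worth keeping in mind when reading the averages reported below.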

Baseline Methods

R1 without search engine
Search-R1
ZeroSearch
StepSearch
Router-R1
ParallelSearch

Implementation Details

Backbone Model: Qwen-2.5-Base/Instruct (3B/7B)
Embedding Model: E5
Corpus: 2018 Wikipedia dump
Hardware: 8×NVIDIA H100 GPUs
Algorithm: PPO (Proximal Policy Optimization)
Batching: Total batch size 512, mini-batch 256, micro-batch 64
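
One common reading of these three sizes, offered as an assumption because the summary does not say whether they are global or per-GPU values: each rollout batch of 512 samples is split into mini-batches of 256 for optimizer updates, and each mini-batch is processed in micro-batches of 64 with gradient accumulation. The small helper below (a hypothetical name, not part of any training framework) just works out the resulting counts.

```python
def ppo_batch_schedule(total: int, mini: int, micro: int) -> dict:
    """Derive update and accumulation counts from the three batch sizes.

    Assumes all sizes are global and divide evenly.
    """
    if total % mini or mini % micro:
        raise ValueError("batch sizes must divide evenly")
    return {
        "updates_per_rollout_batch": total // mini,
        "accumulation_steps_per_update": mini // micro,
    }


print(ppo_batch_schedule(512, 256, 64))
# {'updates_per_rollout_batch': 2, 'accumulation_steps_per_update': 4}
```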

Experimental Results

Main Results

Consistent and significant improvements across all configurations:

Average improvement of 4.4%: Absolute improvement over strongest baseline
Small model advantage: 3B-Instruct model (0.457 average EM) surpasses 7B baseline methods
Model variant impact: The instruct variant outperforms the base variant by 2.2% at 3B, while the base variant outperforms the instruct variant by 3.1% at 7B

Ablation Studies

Impact of Query Expansion Count

Significant performance gains from increasing the number of queries per round from 1 to 3:

n=1 to n=2: Average improvement of 6.7%
Continued improvement at n=3, but with diminishing returns

Importance of End-to-End Training

ExpandSearch (n=3, k=5) improves by 34.3% over Search-R1 (k=15)
Adding expansion prompts without RL training even reduces performance
Demonstrates the critical role of end-to-end training in learning effective query expansion strategies

Expansion Type Analysis

Syntactic expansion comprises 63.35%, semantic expansion comprises 36.65%
Removing either type results in performance degradation, confirming their complementarity

Compressor Behavior Analysis

Retrieval Depth: Consistent but diminishing gains from increasing k from 3 to 10
Model Selection: LLaMA-3.1-70B performs better on general question-answering, LLaMA-4-17B better on multi-hop reasoning
Generalization Ability: Comparable performance when using different compressor models during training and inference

Training Dynamics

Reward, response length, and search frequency increase synchronously
Model autonomously learns to increase search frequency as a strategy for improving answer quality
Smooth training curves indicate stable optimization process

Related Work

Deep Search Agents

RAG Systems: Two-stage pipelines with retrieval followed by generation, but often contain irrelevant information
Search Tool Frameworks: Prompting-guided approaches such as IRCoT and ReAct; Toolformer, which learns tool use through supervised fine-tuning
Reinforcement Learning Methods: Search-R1 pioneered RL training for search agents; subsequent developments include ZeroSearch, MaskSearch, etc.

Reinforcement Learning

RLHF: Training reward models through human preference annotations
Efficiency Optimization: DPO, SimPO, ORPO methods bypass reward model training
Emerging Techniques: GRPO, RLOO offer promising alternatives through group-wise policy evaluation

Conclusions and Discussion

Main Conclusions

ExpandSearch effectively addresses single-query retrieval limitations through learned query expansion and selective information refinement
The "expand-compress" paradigm successfully tackles the dual challenges of semantic incompleteness and information overload
Even 3B-scale models can exhibit strong query expansion capabilities and achieve state-of-the-art performance

Limitations

Computational Cost: Multiple query retrieval and compressor invocations increase inference time
Dependency: Performance depends on compressor model quality
Expansion Saturation: Diminishing returns from increasing query count

Future Directions

Adaptive Retrieval Strategies: Dynamically adjust expansion count based on query complexity
More Efficient Training Methods: Reduce dependence on large-scale computational resources
End-to-End Optimization: Jointly train search agent and compressor model

In-Depth Evaluation

Strengths

Methodological Innovation: First combination of query expansion with reinforcement learning; ingenious "expand-compress" paradigm design
Experimental Comprehensiveness: Seven benchmarks, multiple model scales, detailed ablation studies
Technical Insights: Discovery of syntactic and semantic expansion complementarity provides valuable technical insights
Practical Value: Small models achieve excellent performance, demonstrating practical deployment value

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why this approach works
Computational Efficiency: Insufficient analysis of computational overhead from multiple query retrieval
Generalization Ability: Primarily validated on question-answering tasks; applicability to other tasks unknown
Compressor Dependency: Reliance on external compressor models may limit application scenarios

Impact

Academic Contribution: Provides new research directions for retrieval-augmented generation field
Practical Value: Modular design facilitates practical application and deployment
Reproducibility: Provides detailed implementation information and an open-source commitment

Applicable Scenarios

Multi-hop Question-Answering Systems: Particularly suitable for question-answering tasks requiring complex reasoning
Information Retrieval Systems: Applicable to retrieval scenarios requiring high recall
Dialogue Systems: Can be integrated into dialogue agents requiring external knowledge

References

The paper cites multiple important works, including:

Search-R1 (Jin et al., 2025b): Pioneering RL search agent work
RLHF-related works (Ouyang et al., 2022): Foundation for RL-based LLM training
Multiple question-answering datasets: Standard benchmarks including NQ, HotpotQA, TriviaQA

This paper proposes an innovative solution to address core challenges in current search agents, achieving significant performance improvements through ingenious "expand-compress" design. While there is room for improvement in theoretical analysis and computational efficiency, its technical innovation and experimental validation reach a high level, making important contributions to the retrieval-augmented generation field.