Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
- Paper ID: 2510.10009
- Title: Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
- Authors: Shu Zhao (NVIDIA & Pennsylvania State University), Tan Yu (NVIDIA), Anbang Xu (NVIDIA)
- Classification: cs.CL cs.AI cs.IR
- Publication Date: 2025-10-14 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.10009
Reasoning-augmented search agents (such as Search-R1) are trained to iteratively reason, search, and generate final answers. However, due to their limited capabilities in reasoning and searching, their performance on multi-hop question-answering benchmarks remains suboptimal. To handle complex or composite queries, the authors train an LLM-based search agent with native query expansion capabilities through reinforcement learning. In each round, the search agent proposes multiple query variants, which are searched simultaneously to cover more relevant information. With limited post-training data and computational resources, it is difficult for a single search agent to master several tasks at once, including query generation, comprehension of retrieved information, and answer generation. Therefore, the authors propose incorporating a pre-trained compressor model (called the squeezer in the paper) to help the search agent understand retrieved documents, enabling the search agent to focus on query generation for high retrieval recall. With the assistance of the compressor model, the authors demonstrate that even a small-scale 3B LLM can exhibit strong query expansion capabilities and achieve state-of-the-art accuracy on multi-hop question-answering benchmarks. Specifically, experiments on seven question-answering benchmarks show that the ExpandSearch method achieves an average improvement of 4.4% over state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
Existing reasoning-augmented search agents face two core challenges:
- Semantic Incompleteness: Generated queries are semantically impoverished, failing to capture the complete range of relevant information, particularly when facing multi-faceted questions requiring diverse evidence
- Information Overload: Retrieved content contains substantial irrelevant information, obscuring critical facts and degrading reasoning quality
Multi-hop question-answering tasks require aggregating evidence from multiple perspectives. The semantic limitations of single queries and theoretical constraints of vector embedding-based retrieval severely restrict system performance. This problem is particularly acute in complex reasoning scenarios where agents must navigate extensive search results to identify sparse yet critical evidence.
- Methods like Search-R1 generate only a single query per round, easily missing critical semantic information
- Lengthy retrieved content leads to high computational costs, substantial GPU memory consumption, and significantly reduced training speed
- Signal-to-noise ratio problems are particularly severe in multi-hop reasoning tasks
The authors' core insight is that effective information retrieval requires a dual strategy: first expand the query space to maximize coverage of relevant information, then selectively distill the retrieved content so that only the facts critical for reasoning are retained. This "expand-compress" paradigm reflects human information-seeking behavior.
- Identified and formalized the dual problem: Semantic incompleteness and information overload in reasoning-augmented search agents, with empirical analysis demonstrating that both problems significantly reduce performance on complex reasoning tasks
- Proposed the ExpandSearch framework: An "expand-compress" framework combining reinforcement learning-based query expansion and prompt-based selective information refinement, achieving high recall in multi-step reasoning scenarios while maintaining precision
- Achieved significant performance improvements: Substantial improvements over state-of-the-art baselines on seven benchmarks, with particularly strong performance on multi-hop reasoning tasks requiring diverse evidence aggregation
Given an input query x, the search agent must generate a final answer y through an iterative reasoning-search process, where each round can invoke a search engine R to retrieve relevant document chunks and perform reasoning based on retrieved information.
Expand Phase:
- The LLM generates `<search></search>` blocks containing n diverse queries {qi}
- Each query qi retrieves the k most relevant chunks through search engine R: Ci = {c_i^1, ..., c_i^k} ← R(qi)
- Effectively overcomes single-query retrieval limitations and improves retrieval recall
Squeeze Phase:
- Input generated queries q1, ..., qn and retrieved chunks C1, ..., Cn into a frozen LLM compressor πs
- Generate summary: s = πs(q1, ..., qn, C1, ..., Cn)
- The compressed information s is encapsulated in `<information></information>` blocks and inserted into the ongoing generation sequence
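The two phases above can be sketched as a single round in Python. This is a minimal illustration, not the authors' implementation: the one-query-per-line layout inside the `<search>` block, the toy corpus, the word-overlap retriever `toy_retrieve`, and the concatenating `toy_squeeze` are all assumptions standing in for the real search engine R and the frozen LLM compressor πs.

```python
import re
from typing import Callable, List

# Hypothetical toy corpus standing in for the real retrieval index.
CORPUS = [
    "Alex Smith was born in Leeds.",
    "Alex Smith's father is John Smith.",
    "John Smith died in York.",
    "Leeds is a city in England.",
]


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def parse_queries(generation: str) -> List[str]:
    """Extract queries from a <search> block (assumed: one query per line)."""
    match = re.search(r"<search>(.*?)</search>", generation, flags=re.DOTALL)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


def toy_retrieve(query: str, k: int) -> List[str]:
    """Stand-in for R(q_i): rank chunks by word overlap and keep the top k."""
    q = _tokens(query)
    scored = [(len(q & _tokens(doc)), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked[:k] if score > 0]


def toy_squeeze(queries: List[str], chunk_sets: List[List[str]]) -> str:
    """Stand-in for the compressor: deduplicate chunks and join them."""
    seen: List[str] = []
    for chunks in chunk_sets:
        for chunk in chunks:
            if chunk not in seen:
                seen.append(chunk)
    return " ".join(seen)


def expand_and_squeeze(
    generation: str,
    retrieve: Callable[[str, int], List[str]],
    squeeze: Callable[[List[str], List[List[str]]], str],
    k: int = 5,
) -> str:
    queries = parse_queries(generation)              # expand: n query variants
    chunk_sets = [retrieve(q, k) for q in queries]   # C_i <- R(q_i)
    summary = squeeze(queries, chunk_sets)           # s = pi_s(q_1..q_n, C_1..C_n)
    return f"<information>{summary}</information>"


if __name__ == "__main__":
    turn = "<search>\nAlex father\nwhere John Smith died\n</search>"
    print(expand_and_squeeze(turn, toy_retrieve, toy_squeeze, k=1))
```

In an actual agent loop, the returned `<information>` block would be appended to the generation sequence before the policy continues reasoning or issues another search.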
Two complementary expansion types naturally discovered through reinforcement learning:
- Syntactic Expansion: Handles surface form variations, e.g., "where did he die" → "his death place"
- Semantic Expansion: Broadens information scope, e.g., "Alex's father" → "Alex's family"
- Search Agent: Focuses on query generation to achieve high retrieval recall
- Compressor Model: Handles comprehension of the retrieved documents independently; it is invoked through API calls so that it stays decoupled from the search agent
Employs a weighted combination reward function: r = r_EM + λ · r_f
- r_EM: Exact match reward, equals 1 when the predicted answer exactly matches the ground truth
- r_f: Format reward, equals 1 when the predicted answer strictly follows the required format
- λ defaults to 0.2
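As a small worked example, the combined reward can be written out in Python. This sketch assumes that both component rewards are 0 when their condition is not met, which the summary implies but does not state; the function and argument names are illustrative.

```python
def combined_reward(answer_is_exact_match: bool,
                    follows_format: bool,
                    lam: float = 0.2) -> float:
    """Weighted sum of the exact-match and format rewards (lam is the weight)."""
    r_em = 1.0 if answer_is_exact_match else 0.0  # assumed 0 otherwise
    r_f = 1.0 if follows_format else 0.0          # assumed 0 otherwise
    return r_em + lam * r_f


if __name__ == "__main__":
    for em in (True, False):
        for fmt in (True, False):
            print(f"exact_match={em!s:5} format={fmt!s:5} "
                  f"reward={combined_reward(em, fmt):.1f}")
```

With the default weight of 0.2, a well-formatted but wrong answer earns 0.2 and a correct, well-formatted answer earns 1.2, so the format term acts as a small shaping signal next to the correctness term.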
Covers seven benchmarks, divided into two categories:
- General Question-Answering: NQ, TriviaQA, PopQA
- Multi-hop Question-Answering: HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle
Following the setup of Jin et al., the NQ and HotpotQA training sets are combined for training, and the model is evaluated on the validation or test sets of the benchmarks to measure both in-domain and out-of-domain generalization.
Uses Exact Match (EM) as the primary evaluation metric.
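For readers unfamiliar with the metric, the sketch below shows one common way to compute Exact Match. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the widely used SQuAD-style convention and are an assumption here; the summary does not specify the exact normalization used in the paper.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


def mean_exact_match(predictions: List[str], golds: List[List[str]]) -> float:
    """Average EM over a dataset of (prediction, gold answers) pairs."""
    scores = [exact_match(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
    print(exact_match("Paris, France", ["Paris"]))
```

Because EM gives no partial credit, a prediction that contains the gold answer plus extra words still scores 0.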
- R1 without search engine
- Search-R1
- ZeroSearch
- StepSearch
- Router-R1
- ParallelSearch
- Backbone Model: Qwen-2.5-Base/Instruct (3B/7B)
- Embedding Model: E5
- Corpus: 2018 Wikipedia dump
- Hardware: 8×NVIDIA H100 GPUs
- Algorithm: PPO (Proximal Policy Optimization)
- Batching: Total batch size 512, mini-batch 256, micro-batch 64
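How the three batch sizes relate depends on the training framework, which the summary does not name. The sketch below assumes a common PPO arrangement: each rollout batch is split into mini-batches (one optimizer step each), and each mini-batch is processed in micro-batches with gradient accumulation. Whether the micro-batch size is global or per GPU is not stated, so the figures are illustrative only.

```python
def ppo_batch_plan(total_batch: int = 512,
                   mini_batch: int = 256,
                   micro_batch: int = 64) -> dict:
    """Derive update counts under the assumed nesting total > mini > micro."""
    if total_batch % mini_batch or mini_batch % micro_batch:
        raise ValueError("batch sizes must divide evenly")
    return {
        # Optimizer steps per rollout batch, per PPO epoch.
        "optimizer_steps_per_rollout_batch": total_batch // mini_batch,
        # Forward/backward passes accumulated before each optimizer step.
        "accumulation_steps_per_optimizer_step": mini_batch // micro_batch,
    }


if __name__ == "__main__":
    print(ppo_batch_plan())
```

Under these assumptions, the reported settings correspond to 2 optimizer steps per rollout batch, each accumulating gradients over 4 micro-batches.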
Consistent and significant improvements across all configurations:
- Average improvement of 4.4%: Absolute improvement over strongest baseline
- Small model advantage: 3B-Instruct model (0.457 average EM) surpasses 7B baseline methods
- Base vs. Instruct variants: At 3B, the Instruct variant outperforms the Base variant by 2.2%; at 7B, the Base variant outperforms the Instruct variant by 3.1%
Increasing the number of queries per round from 1 to 3 yields significant performance gains:
- n=1 to n=2: Average improvement of 6.7%
- Continued improvement at n=3, but with diminishing returns
- ExpandSearch (n=3, k=5) improves 34.3% over Search-R1 (k=15)
- Adding expansion prompts without RL training even reduces performance
- Demonstrates the critical role of end-to-end training in learning effective query expansion strategies
- Syntactic expansion comprises 63.35%, semantic expansion comprises 36.65%
- Removing either type results in performance degradation, confirming their complementarity
- Retrieval Depth: Consistent but diminishing gains from increasing k from 3 to 10
- Compressor Model Selection: As the compressor, LLaMA-3.1-70B performs better on general question-answering, while LLaMA-4-17B performs better on multi-hop reasoning
- Generalization Ability: Comparable performance when using different compressor models during training and inference
- Reward, response length, and search frequency increase synchronously
- Model autonomously learns to increase search frequency as a strategy for improving answer quality
- Smooth training curves indicate stable optimization process
- RAG Systems: Two-stage pipelines with retrieval followed by generation, but often contain irrelevant information
- Search Tool Frameworks: IRCoT and ReAct guide search-tool use through prompting; Toolformer learns it through supervised fine-tuning
- Reinforcement Learning Methods: Search-R1 pioneered RL training for search agents; subsequent developments include ZeroSearch, MaskSearch, etc.
- RLHF: Training reward models through human preference annotations
- Efficiency Optimization: DPO, SimPO, ORPO methods bypass reward model training
- Emerging Techniques: GRPO, RLOO offer promising alternatives through group-wise policy evaluation
- ExpandSearch effectively addresses single-query retrieval limitations through learned query expansion and selective information refinement
- The "expand-compress" paradigm successfully tackles the dual challenges of semantic incompleteness and information overload
- Even 3B-scale models can exhibit strong query expansion capabilities and achieve state-of-the-art performance
- Computational Cost: Multiple query retrieval and compressor invocations increase inference time
- Dependency: Performance depends on compressor model quality
- Expansion Saturation: Diminishing returns from increasing query count
- Adaptive Retrieval Strategies: Dynamically adjust expansion count based on query complexity
- More Efficient Training Methods: Reduce dependence on large-scale computational resources
- End-to-End Optimization: Jointly train search agent and compressor model
- Methodological Innovation: First combination of query expansion with reinforcement learning; ingenious "expand-compress" paradigm design
- Experimental Comprehensiveness: Seven benchmarks, multiple model scales, detailed ablation studies
- Technical Insights: The finding that syntactic and semantic expansion are complementary is a useful result for designing query expansion strategies
- Practical Value: Small models achieve excellent performance, demonstrating practical deployment value
- Insufficient Theoretical Analysis: Lacks theoretical explanation for why this approach works
- Computational Efficiency: Insufficient analysis of computational overhead from multiple query retrieval
- Generalization Ability: Primarily validated on question-answering tasks; applicability to other tasks unknown
- Compressor Dependency: Reliance on external compressor models may limit application scenarios
- Academic Contribution: Provides new research directions for retrieval-augmented generation field
- Practical Value: Modular design facilitates practical application and deployment
- Reproducibility: Provides implementation details and commits to open-sourcing the work
- Multi-hop Question-Answering Systems: Particularly suitable for question-answering tasks requiring complex reasoning
- Information Retrieval Systems: Applicable to retrieval scenarios requiring high recall
- Dialogue Systems: Can be integrated into dialogue agents requiring external knowledge
The paper cites multiple important works, including:
- Search-R1 (Jin et al., 2025b): Pioneering RL search agent work
- RLHF-related works (Ouyang et al., 2022): Foundation for RL-based LLM training
- Multiple question-answering datasets: Standard benchmarks including NQ, HotpotQA, TriviaQA
This paper proposes an innovative solution to address core challenges in current search agents, achieving significant performance improvements through ingenious "expand-compress" design. While there is room for improvement in theoretical analysis and computational efficiency, its technical innovation and experimental validation reach a high level, making important contributions to the retrieval-augmented generation field.