2025-11-13T01:58:10.933950

MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity

Tang, Gao, Li et al.

Retrieval Augmented Generation (RAG) has proven to be highly effective in boosting the generative performance of language model in knowledge-intensive tasks. However, existing RAG framework either indiscriminately perform retrieval or rely on rigid single-class classifiers to select retrieval methods, leading to inefficiencies and suboptimal performance across queries of varying complexity. To address these challenges, we propose a reinforcement learning-based framework that dynamically selects the most suitable retrieval strategy based on query complexity. % our solution Our approach leverages a multi-armed bandit algorithm, which treats each retrieval method as a distinct ``arm'' and adapts the selection process by balancing exploration and exploitation. Additionally, we introduce a dynamic reward function that balances accuracy and efficiency, penalizing methods that require more retrieval steps, even if they lead to a correct result. Our method achieves new state of the art results on multiple single-hop and multi-hop datasets while reducing retrieval costs. Our code are available at https://github.com/FUTUREEEEEE/MBA .

academic

MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity

Basic Information

Paper ID: 2412.01572
Title: MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity
Authors: Xiaqiang Tang, Qiang Gao, Jian Li, Nan Du, Qi Li, Sihong Xie
Affiliations: Hong Kong University of Science and Technology (Guangzhou), Tencent Hunyuan, Wuhan University, Iowa State University
Classification: cs.AI
Publication Date: January 1, 2025 (arXiv v4)
Paper Link: https://arxiv.org/abs/2412.01572
Code Link: https://github.com/FUTUREEEEEE/MBA

Abstract

Retrieval-Augmented Generation (RAG) significantly enhances the generation performance of language models in knowledge-intensive tasks. However, existing RAG frameworks either indiscriminately perform retrieval or rely on rigid single-class classifiers to select retrieval methods, resulting in inefficiency and suboptimal performance across queries of varying complexity. To address these challenges, this paper proposes a reinforcement learning-based framework that dynamically selects the most appropriate retrieval strategy based on query complexity. The method leverages multi-armed bandit algorithms, treating each retrieval method as a different "arm," and balances exploration and exploitation to adapt the selection process. Additionally, a dynamic reward function is introduced that balances accuracy and efficiency, penalizing methods requiring more retrieval steps even when obtaining correct results. The method achieves new state-of-the-art results on multiple single-hop and multi-hop datasets while reducing retrieval costs.

Research Background and Motivation

Problem Definition

Existing RAG systems face the following core issues:

Inappropriate Retrieval Strategy Selection: Most RAG frameworks perform retrieval indiscriminately across all queries, potentially introducing unnecessary or off-topic passages
Single Method Limitations: Using a single retrieval method for all queries is inefficient—simple queries incur unnecessary computational overhead, while complex queries may not receive sufficient processing
Inaccurate Supervision Signals: Existing adaptive methods such as AdaptiveRAG use heuristic supervision, assuming each query has a single optimal strategy and tending to select the path with minimal retrieval cost

Research Motivation

The core motivation of this paper is to develop a framework that can:

Dynamically Adapt to Query Complexity: Intelligently select retrieval strategies based on problem complexity
Balance Accuracy and Efficiency: Minimize computational costs while ensuring answer quality
Support Multi-Strategy Exploration: Allow multiple strategies to potentially produce correct answers rather than forcing selection of a single "optimal" path

Core Contributions

Proposes MBA-RAG Framework: First application of multi-armed bandit algorithms to RAG system retrieval strategy selection, enabling dynamic adaptive retrieval
Designs Dynamic Reward Function: Innovatively combines accuracy and computational efficiency by penalizing high-cost methods to optimize resource utilization
Achieves State-of-the-Art Performance: Obtains best results on 6 datasets while reducing retrieval costs by 20%
Provides Flexible Supervision Mechanism: Uses partial information supervision instead of strict single-label supervision, allowing the model to explore multiple effective strategies

Methodology Details

Task Definition

Given a query x, the RAG system needs to:

Retrieval Phase: Module R retrieves relevant documents D for query x
Generation Phase: LLM generates response ā = LLM(yt|x,D) using x and D

This paper redefines this as a multi-armed bandit problem, where each retrieval method (no retrieval, single retrieval, multiple retrievals) serves as an "arm."

Model Architecture

1. Query Encoding and Arm Selection

Encoder: Uses DistilBERT to encode user queries, generating action distribution z = fθ(x)
Selection Strategy: Employs ε-greedy strategy to balance exploration and exploitation:
- Select a = argmax(z) with probability (1-ε)
- Randomly select generation method with probability ε

2. Learning Algorithm

The objective function minimizes squared error between actual reward ra and predicted reward fθ(x)a:

min_θ (ra - fθ(x)a)²

Parameter update rule:

θt+1 = θt - α∇θ((ra - fθ(x)a)²)

3. Dynamic Reward Function

ra = A(y, ŷa) - λC(a)

Where:

A(y, ŷa): Generation quality metric (e.g., exact match)
C(a): Computational cost of method a (e.g., number of retrieval steps)
λ: Scaling factor balancing accuracy and efficiency

Technical Innovations

Multi-Armed Bandit Adaptation: Models retrieval strategy selection as a multi-armed bandit problem, with each retrieval method corresponding to an "arm"
Partial Information Supervision: Provides feedback only for selected strategies without penalizing unselected ones
Cost-Aware Rewards: Dynamic reward function simultaneously considers accuracy and computational efficiency
Exploration-Exploitation Balance: Uses ε-greedy strategy to avoid premature convergence to suboptimal solutions

Experimental Setup

Datasets

Single-Hop QA Datasets:

SQuAD v1.1: Reading comprehension task
Natural Questions: Open-domain question answering
TriviaQA: Knowledge-based question answering

Multi-Hop QA Datasets:

MuSiQue: Multi-step reasoning question answering
HotpotQA: Multi-hop reasoning question answering
2WikiMultiHopQA: Wikipedia-based multi-hop question answering

Evaluation Metrics

Performance Metrics:

EM (Exact Match): Predicted result exactly matches ground truth
F1: Lexical overlap between predicted and ground truth answers
Acc (Accuracy): Whether predicted answer contains ground truth

Efficiency Metrics:

Step: Number of retrieval steps required by selected strategy

Baseline Methods

No-Retrieval: Direct answer generation without retrieval
Adaptive-Retrieval: Dynamically determines whether retrieval is needed
Self-RAG: Dynamically decides retrieval needs through self-reflection
DRAGIN: Activates retrieval based on token uncertainty
SEAKR: Determines retrieval based on self-aware uncertainty
Adaptive-RAG: Uses classifier to select retrieval strategies based on query complexity

Implementation Details

Query Encoding Model: DistilBERT
Retrieval Model: BM25
Generation Model: FLAN-T5-XL (3B)
Learning Rate: 5e-5
Exploration Strategy: ε-greedy algorithm

Experimental Results

Main Results

Method	EM	F1	Acc	Step
No Retrieval	14.87	21.12	15.97	0.00
Adaptive Retrieval	23.87	32.24	26.73	0.50
Self-RAG	9.90	20.79	31.57	0.72
Adaptive-RAG	37.17	46.94	42.10	2.17
MBA-RAG (Ours)	38.80	48.61	43.57	1.80

Key Findings

Performance Improvement: MBA-RAG surpasses baseline methods on all performance metrics
Efficiency Optimization: Compared to Adaptive-RAG, retrieval steps reduced by approximately 17% (from 2.17 to 1.80)
Single-Hop Dataset Performance: Achieves significant improvements on SQuAD and TriviaQA with substantially reduced retrieval costs
Multi-Hop Dataset Performance: Achieves outstanding improvements on 2WikiMultiHopQA with retrieval cost reduction exceeding 20%

Classification Accuracy Analysis

MBA-RAG achieves classification accuracy of 56.1%, significantly higher than:

Adaptive Retrieval: 42.0%
Self-RAG: 41.5%
Adaptive-RAG: 54.0%

Ablation Study

Comparison with multi-label classifier results shows that while traditional multi-label methods achieve good performance, retrieval costs are excessive (Step reaches 4.514), whereas MBA-RAG achieves optimal balance between performance and efficiency.

RAG System Development

Traditional RAG: Retrieval-generation framework proposed by Lewis et al. (2020)
Adaptive Retrieval: Methods such as SEAKR and FLARE implement on-demand retrieval
Complexity-Aware Approaches: AdaptiveRAG selects strategies based on query complexity

Multi-Armed Bandit Applications

This paper is the first to apply multi-armed bandit algorithms to RAG systems, providing a new theoretical framework for retrieval strategy selection.

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: MBA-RAG achieves state-of-the-art performance on multiple datasets
Efficiency Improvement: Significantly reduces retrieval costs with average reduction of 20%
Strong Adaptability: Dynamically adjusts strategies based on query complexity

Limitations

Algorithm Dependency: Framework relies on specific multi-armed bandit algorithm structure
Scalability Challenges: May face adaptability issues when encountering new unseen query types
Computational Requirements: Reinforcement learning approach may introduce additional computational overhead

Future Directions

Algorithm Optimization: Explore more efficient algorithms to reduce computational requirements
Generalization Capability: Improve adaptability to new query types
Application Extension: Apply method to broader range of NLP tasks

In-Depth Evaluation

Strengths

Strong Innovation: First to introduce multi-armed bandits to RAG systems with solid theoretical foundation
High Practical Value: Simultaneously optimizes accuracy and efficiency with important application value
Comprehensive Experiments: Thorough evaluation across 6 different types of datasets
Reasonable Methodology: Cleverly designed dynamic reward function balancing multiple objectives

Weaknesses

Increased Complexity: Introduces additional algorithmic complexity compared to simple classification methods
Parameter Sensitivity: Balance parameter λ in reward function requires adjustment for different datasets
Insufficient Theoretical Analysis: Lacks theoretical guarantees for convergence and optimality

Impact

Academic Contribution: Provides new research direction for RAG system optimization
Practical Application: Method demonstrates strong practicality applicable to real systems
Reproducibility: Provides complete code implementation facilitating reproduction and extension

Applicable Scenarios

Knowledge-Intensive Question Answering: Particularly suitable for scenarios requiring balance between accuracy and efficiency
Multi-Complexity Query Processing: Handles queries ranging from simple to complex
Resource-Constrained Environments: Optimizes retrieval costs when computational resources are limited

References

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS.
Jeong, S., et al. (2024). Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint.
Katehakis, M. N., & Veinott Jr, A. F. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research.

Overall Assessment: This paper proposes an innovative and practical RAG optimization framework that achieves dynamic retrieval strategy selection through multi-armed bandit algorithms, significantly reducing computational costs while maintaining high accuracy. The method has solid theoretical foundations, convincing experimental results, and provides valuable insights for further development of RAG systems.