REFRAG: Rethinking RAG based Decoding

Lin, Ghosh, Low et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

Basic Information

Paper ID: 2509.01092
Title: REFRAG: Rethinking RAG based Decoding
Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
Institutions: Meta Superintelligence Labs, National University of Singapore, Rice University
Classification: cs.CL cs.AI cs.LG
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2509.01092

Abstract

Large Language Models (LLMs) have demonstrated exceptional capability in leveraging external knowledge to enhance responses in multi-turn conversations and agentic applications such as Retrieval-Augmented Generation (RAG). However, processing long-context inputs introduces significant system latency and requires substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge richness and system efficiency. This paper proposes REFRAG, an efficient decoding framework that improves latency in RAG applications through compression, sensing, and expansion. By exploiting the sparsity structure of attention, it achieves a 30.85× acceleration in time-to-first-token (TTFT) latency (a 3.75× improvement over prior work) without perplexity loss. Furthermore, the optimization framework enables REFRAG to extend the context size of LLMs by 16×.

Research Background and Motivation

Core Problems

Efficiency Bottleneck in Long Context Processing: RAG systems face significant computational and memory overhead when processing long contexts. Time-to-first-token (TTFT) latency grows quadratically with context length, which severely degrades the user experience.
Specificity of RAG Scenarios: Context in RAG primarily consists of concatenated retrieved passages, with only a small portion directly relevant to the query. Due to diversity and deduplication operations, these passages exhibit low semantic similarity, resulting in block-diagonal attention patterns.
Computational Redundancy: Existing methods treat RAG as a generic long-context problem, overlooking the sparse attention structures inherent to RAG, leading to substantial unnecessary computation.

Research Motivation

Efficiency Requirements: Urgent need for high throughput and low latency in web-scale applications
Resource Optimization: Reducing memory footprint and computational overhead to improve system scalability
Performance Preservation: Maintaining model performance while significantly improving efficiency

Core Contributions

Proposes REFRAG Framework: The first efficient decoding framework specifically designed for RAG applications, supporting context compression and expansion at arbitrary positions
Chunk Embedding Compression Technique: Uses precomputed compressed chunk embeddings to replace original tokens, achieving significant latency and memory optimization
Selective Compression Strategy: A reinforcement learning-based policy network that dynamically determines which chunks should maintain their original form
Significant Performance Gains: Achieves 30.85× TTFT acceleration, 16× context window expansion, with no performance degradation
Comprehensive Validation: Effectiveness verified across multiple tasks including RAG, multi-turn dialogue, and long document summarization

Methodology Details

Task Definition

The input is a sequence $x_1, x_2, \ldots, x_T$ of $T$ tokens, where the first $q$ tokens are the primary input (e.g., the query) and the remaining $s$ tokens are the context (e.g., retrieved passages), so that $q + s = T$. The objective is to generate responses efficiently while minimizing TTFT latency and memory usage.

Model Architecture

Overall Design

REFRAG pairs a lightweight encoder with a decoder-only LLM:

Decoder: Decoder-only base model based on LLaMA
Encoder: Lightweight RoBERTa model for processing context chunks
Projection Layer: Maps chunk embeddings to decoder token space

Core Components

Chunk Embedding Generation

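Context Chunking: the $s$ context tokens are split into chunks $\{C_1, C_2, \ldots, C_L\}$ of $k$ tokens each, where $L = s/k$
Chunk Embedding: $c_i = M_{\mathrm{enc}}(C_i)$
Projected Embedding: $e_i^{\mathrm{cnk}} = \phi(c_i)$

Hybrid Input Processing

Decoder Input: $\{e_1, \ldots, e_q, e_1^{\mathrm{cnk}}, \ldots, e_L^{\mathrm{cnk}}\}$
Compression Ratio: the context shrinks from $s$ token embeddings to $L = s/k$ chunk embeddings, an approximately $k$-fold reduction

Selective Compression Mechanism

- RL policy network $\pi_\theta$ determines which chunks remain uncompressed
- Sequential selection based on chunk embeddings and masking
- Reward Function: Negative log perplexity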
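
The chunking, projection, and hybrid-input steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `build_decoder_input`, `mean_pool_encoder`, `linear_projection`, and the random matrix `W` are hypothetical stand-ins, not the paper's RoBERTa encoder or its learned projection layer, and the `expand` argument simply takes the chunk indices that a selection policy would have chosen.

```python
import numpy as np

def build_decoder_input(query_emb, context_emb, k, encode_chunk, project, expand=()):
    """Concatenate query token embeddings with one projected embedding per
    context chunk; chunks whose index is in `expand` keep their k token rows."""
    s = context_emb.shape[0]
    if s % k != 0:
        raise ValueError("this sketch assumes the context length is a multiple of k")
    expand = set(expand)
    rows = [query_emb]
    for i in range(s // k):                      # L = s / k chunks
        chunk = context_emb[i * k:(i + 1) * k]   # C_i, shape (k, d)
        if i in expand:
            rows.append(chunk)                   # uncompressed: k rows
        else:
            rows.append(project(encode_chunk(chunk))[None, :])  # compressed: 1 row
    return np.concatenate(rows, axis=0)

# Toy stand-ins (assumptions, not the paper's encoder or projection).
rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))

def mean_pool_encoder(chunk):
    return chunk.mean(axis=0)

def linear_projection(c):
    return c @ W

q, s, k = 4, 32, 8
query_emb = rng.normal(size=(q, d))
context_emb = rng.normal(size=(s, d))

compressed = build_decoder_input(query_emb, context_emb, k,
                                 mean_pool_encoder, linear_projection)
hybrid = build_decoder_input(query_emb, context_emb, k,
                             mean_pool_encoder, linear_projection, expand={1})

print(compressed.shape)  # (8, 16): q + s/k = 4 + 4 rows instead of q + s = 36
print(hybrid.shape)      # (15, 16): chunk 1 keeps 8 rows, the other three use 1 each
```

With every chunk compressed, the decoder sees q + s/k positions rather than q + s. Expanding a chunk restores its k token rows in place, so the order of the context is unchanged.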

Technical Innovations

Arbitrary Position Compression: Overcomes the limitation of existing methods that only support prefix compression, enabling compression and expansion at any context position
Precomputation Reuse: Chunk embeddings can be precomputed and cached, avoiding repeated computational overhead
Adaptive Compression Rate: Dynamically adjusts compression rate through RL policy without requiring chunk embedding recalculation
Preserves Autoregressive Properties: Maintains the causal structure of the decoder, supporting multi-turn dialogue and summarization tasks
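
Precomputation reuse can be illustrated with a small cache keyed by chunk content. This is a minimal sketch, assuming chunks are identified by their token ids; `ChunkEmbeddingCache` and `toy_encoder` are hypothetical names introduced here, and the paper's actual caching layer may be organized differently.

```python
import hashlib
import numpy as np

class ChunkEmbeddingCache:
    """Store one embedding per distinct chunk so the encoder runs once per chunk."""

    def __init__(self, encode_chunk):
        self._encode = encode_chunk
        self._store = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, token_ids):
        # Hash the token ids so identical chunks map to the same key.
        raw = np.asarray(token_ids, dtype=np.int64).tobytes()
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._store:
            self._store[key] = self._encode(token_ids)
            self.misses += 1
        return self._store[key]

def toy_encoder(token_ids):
    # Hypothetical stand-in for a real chunk encoder.
    return np.array([np.mean(token_ids), np.max(token_ids)], dtype=np.float64)

cache = ChunkEmbeddingCache(toy_encoder)
first = cache.get([5, 9, 2, 7])    # encoder runs
second = cache.get([5, 9, 2, 7])   # served from the cache
third = cache.get([1, 2, 3, 4])    # new chunk, encoder runs again
print(cache.misses)  # 2
```

Because each chunk is encoded independently of the query and of its position, the same cached embedding can be reused whenever a passage is retrieved again.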

Experimental Setup

Datasets

Pretraining: SlimPajama dataset (20B tokens), comprising 50% ArXiv + 50% Book data
Evaluation: Book, ArXiv, PG19, Proof-pile datasets
Downstream Tasks:
- RAG: 1.1M samples covering QA datasets from 5 domains
- Multi-turn Dialogue: TopiOCQA, ORConvQA, QReCC
- Summarization: ArXiv and PubMed long document summarization

Evaluation Metrics

Efficiency Metrics: TTFT (time-to-first-token), TTIT (time-to-iterative-token, the latency of each subsequent token), throughput
Performance Metrics: Perplexity, accuracy, F1 score, ROUGE score
Memory Metrics: KV cache memory usage

Baseline Methods

LLaMA Variants: LLaMA-Full Context, LLaMA-No Context, LLaMA-32K
Existing Methods: CEPE, REPLUG
Different Compression Rates: REFRAG8, REFRAG16, REFRAG32

Implementation Details

Base Model: LLaMA-2-7B
Encoder: RoBERTa-Large (355M parameters)
Training Strategy: Curriculum learning + reconstruction task prewarming
Optimizer: AdamW, peak learning rate 5e-5
Hardware: 8 nodes × 8 H100 GPUs

Experimental Results

Main Results

Latency Performance

At 16K context length:

TTFT Acceleration: 16.53× (with cache), 8.59× (without cache)
Compared to CEPE: 2.01× TTFT improvement (with cache), 1.04× (without cache)
At k=32: Achieves 30.85× TTFT acceleration, 3.75× faster than CEPE

Model Performance

Model	ArXiv P2048	Book P2048	PG19 P2048	ProofPile P2048
REFRAG8	1.062	1.844	1.927	0.916
REFRAG16	1.076	1.853	1.938	0.931
CEPE	1.107	1.864	1.964	0.968

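The paper reports that REFRAG16 improves average log-perplexity over CEPE by 9.3% while also running substantially faster. In the P2048 results above, both REFRAG8 and REFRAG16 score lower (better) than CEPE on all four datasets.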

Ablation Studies

Necessity of Curriculum Learning

Method	P16	P32	P128	P2048
Without Curriculum Learning	3.719	3.098	2.272	1.599
With Curriculum Learning	0.669	0.451	0.230	0.135

Curriculum learning is critical for the reconstruction task to succeed: without it, the reported values are several times higher in every column.

Role of Reconstruction Task

Method	P16	P32	P128	P2048
Without Reconstruction Prewarming	3.272	2.789	2.119	1.544
With Reconstruction Prewarming	2.017	1.837	1.632	1.453

Warming up with the reconstruction task improves the subsequent continual pre-training in every column.

RL Selective Compression

At the same effective compression rate of 8, REFRAG16+RL consistently outperforms REFRAG8, demonstrating the effectiveness of a dynamic compression strategy.
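
One way to read this comparison is as a short calculation. It assumes that a fraction $p$ of the $L = s/k$ chunks is expanded back to its $k$ original tokens, while the rest stay compressed to one embedding each. The symbol $p$ and this definition of the effective rate are assumptions introduced here for illustration.

```latex
\[
\text{context positions seen by the decoder} = (1-p)\,L + p\,L\,k,
\qquad
k_{\mathrm{eff}} = \frac{s}{(1-p)\,L + p\,L\,k} = \frac{k}{1 - p + p\,k}.
\]
\[
k = 16,\; k_{\mathrm{eff}} = 8
\;\Longrightarrow\;
1 + 15p = 2
\;\Longrightarrow\;
p = \tfrac{1}{15} \approx 6.7\%.
\]
```

Under this reading, REFRAG16 with roughly one chunk in fifteen left uncompressed gives the decoder the same number of context positions as REFRAG8 with every chunk compressed.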

Downstream Task Performance

RAG Tasks

Under strong retriever settings, at equivalent latency constraints:

8-passage REFRAG vs 1-passage LLaMA: 1.22% average improvement
Weak retriever settings show more pronounced improvements: 1.93%

Multi-turn Dialogue

In the 10-passage setting, REFRAG outperforms LLaMA_FT (the fine-tuned LLaMA baseline) on all three datasets, with particularly notable advantages when the dialogue history is long.

Case Analysis

The paper presents attention visualization results confirming that in RAG scenarios, attention values between different passages are significantly lower than intra-passage attention, validating the block-diagonal sparsity assumption.
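
The block-diagonal observation can be made measurable by asking how much attention weight stays inside the passage that issued the query. The sketch below uses synthetic attention matrices rather than weights from a real model; `intra_passage_mass` and `causal_softmax` are hypothetical helpers, and the logit bonus of 3.0 is an arbitrary choice made for illustration.

```python
import numpy as np

def intra_passage_mass(attn, passage_ids):
    """Fraction of total attention weight whose query and key share a passage."""
    ids = np.asarray(passage_ids)
    same = ids[:, None] == ids[None, :]
    return float(attn[same].sum() / attn.sum())

def causal_softmax(scores):
    """Row-wise softmax restricted to the lower triangle (causal attention)."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(causal, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Synthetic example: three passages of four tokens each.
passage_ids = np.repeat([0, 1, 2], 4)
same = passage_ids[:, None] == passage_ids[None, :]

uniform_attn = causal_softmax(np.zeros((12, 12)))          # no passage structure
peaked_attn = causal_softmax(np.where(same, 3.0, 0.0))     # higher logits within a passage

uniform_mass = intra_passage_mass(uniform_attn, passage_ids)
peaked_mass = intra_passage_mass(peaked_attn, passage_ids)
print(round(uniform_mass, 3), round(peaked_mass, 3))
```

A value close to 1 means that almost no weight crosses passage boundaries, which is the pattern the visualizations are described as showing.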

Related Work

Retrieval-Augmented Language Modeling

REALM: First to propose retrieval-augmented masked language model pretraining
RETRO: Uses cross-attention and end-to-end pretraining
FiD: Processes passages in parallel and concatenates hidden states

Efficient Long-Context LLMs

Sparse Attention: Reduces attention complexity but doesn't address memory issues
StreamingLLM: Uses attention sinks to reduce the KV cache
CEPE: Cross-attention method, but limited to prefix application

Compressive Transformers

Compressive Transformer: Compresses KV cache but doesn't improve TTFT
Recursive Compression: Cannot precompute and reuse embeddings

Conclusions and Discussion

Main Conclusions

RAG-Specific Sparsity: Block-diagonal attention patterns in RAG scenarios provide opportunities for specialized optimization
Significant Efficiency Gains: 30.85× TTFT acceleration without performance loss demonstrates method effectiveness
Broad Applicability: Demonstrates superior performance across multiple long-context tasks

Limitations

Compression Rate Constraints: Experiments show significant performance degradation at k=64, indicating compression limits
Encoder Overhead: Despite being lightweight, still requires additional encoding computation
Training Complexity: Requires curriculum learning and multi-stage training strategies

Future Directions

Higher Compression Rates: Explore more effective compression techniques to overcome current limitations
End-to-End Optimization: Integrate compression strategies into the pretraining phase
Multimodal Extension: Extend the method to multimodal scenarios such as vision-language tasks

In-Depth Evaluation

Strengths

Precise Problem Identification: Accurately identifies RAG-specific characteristics and optimization opportunities
Reasonable Method Design: Chunk embedding compression and selective strategy design are well-conceived
Comprehensive Experimental Validation: Covers multiple tasks with thorough ablation studies
High Practical Value: Significant performance improvements make it highly applicable
Strong Technical Innovation: Notable innovations including arbitrary position compression and precomputation reuse

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical analysis of compression rate limits
Encoder Selection: Insufficient exploration of different encoder architecture impacts
Long-Term Dependencies: Handling capability for extremely long contexts remains to be verified
Computational Complexity: RL training adds system complexity

Impact

Academic Contribution: Opens new research directions for RAG system optimization
Industrial Value: Directly applicable to large-scale RAG deployment
Reproducibility: Authors commit to open-sourcing code, facilitating method adoption

Applicable Scenarios

Web Search: Latency optimization in large-scale retrieval scenarios
Knowledge QA: Complex question answering requiring integration of multiple document fragments
Intelligent Assistants: Context management in multi-turn dialogue
Document Analysis: Summarization and analysis of long documents

References

The paper cites extensive related work, primarily including:

Guu et al. (2020) - REALM retrieval-augmented pretraining
Borgeaud et al. (2022) - RETRO large-scale retrieval-augmented generation
Yen et al. (2024) - CEPE parallel context encoding
Touvron et al. (2023) - LLaMA base models

Overall Assessment: This is a high-quality research paper that proposes an innovative solution to efficiency bottlenecks in RAG systems. The method design is sound, experimental validation is comprehensive, and practical value is significant, making important contributions to the field's development.