REFRAG: Rethinking RAG based Decoding

Lin, Ghosh, Low et al.
Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

Basic Information

  • Paper ID: 2509.01092
  • Title: REFRAG: Rethinking RAG based Decoding
  • Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
  • Institutions: Meta Superintelligence Labs, National University of Singapore, Rice University
  • Classification: cs.CL cs.AI cs.LG
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2509.01092

Abstract

Large Language Models (LLMs) have demonstrated exceptional capability in leveraging external knowledge to enhance responses in multi-turn conversations and agentic applications such as Retrieval-Augmented Generation (RAG). However, processing long-context inputs introduces significant system latency and requires substantial memory for key-value caching, resulting in reduced throughput and a fundamental trade-off between knowledge richness and system efficiency. This paper proposes REFRAG, an efficient decoding framework that improves latency in RAG applications through compression, sensing, and expansion. By exploiting the sparse attention structure of RAG contexts, it achieves a 30.85× acceleration in time-to-first-token (TTFT) latency (a 3.75× improvement over prior work) without loss in perplexity. Furthermore, the optimization framework enables REFRAG to extend the context size of LLMs by 16×.

Research Background and Motivation

Core Problems

  1. Efficiency Bottleneck in Long Context Processing: RAG systems face significant computational and memory overhead when processing long contexts, with time-to-first-token (TTFT) latency growing quadratically with context length, severely impacting user experience.
  2. Specificity of RAG Scenarios: Context in RAG primarily consists of concatenated retrieved passages, with only a small portion directly relevant to the query. Due to diversity and deduplication operations, these passages exhibit low semantic similarity, resulting in block-diagonal attention patterns.
  3. Computational Redundancy: Existing methods treat RAG as a generic long-context problem, overlooking the sparse attention structures inherent to RAG, leading to substantial unnecessary computation.

Research Motivation

  • Efficiency Requirements: Urgent need for high throughput and low latency in web-scale applications
  • Resource Optimization: Reducing memory footprint and computational overhead to improve system scalability
  • Performance Preservation: Maintaining model performance while significantly improving efficiency

Core Contributions

  1. Proposes REFRAG Framework: The first efficient decoding framework specifically designed for RAG applications, supporting context compression and expansion at arbitrary positions
  2. Chunk Embedding Compression Technique: Uses precomputed compressed chunk embeddings to replace original tokens, achieving significant latency and memory optimization
  3. Selective Compression Strategy: A reinforcement learning-based policy network that dynamically determines which chunks should maintain their original form
  4. Significant Performance Gains: Achieves 30.85× TTFT acceleration, 16× context window expansion, with no performance degradation
  5. Comprehensive Validation: Effectiveness verified across multiple tasks including RAG, multi-turn dialogue, and long document summarization

Methodology Details

Task Definition

Given an input sequence x_1, x_2, ..., x_T of T tokens, the first q tokens are the primary input (e.g., the query) and the remaining s tokens are the context (e.g., retrieved passages), so that q + s = T. The objective is to generate responses efficiently while minimizing TTFT latency and memory usage.

Model Architecture

Overall Design

REFRAG adopts an encoder-decoder architecture:

  • Decoder: Decoder-only base model based on LLaMA
  • Encoder: Lightweight RoBERTa model for processing context chunks
  • Projection Layer: Maps chunk embeddings to decoder token space

Core Components

  1. Chunk Embedding Generation
    • Context Chunking: {C_1, C_2, ..., C_L}, where L = s/k and k is the chunk size in tokens
    • Chunk Embedding: c_i = M_enc(C_i)
    • Projected Embedding: e_i^cnk = φ(c_i)
  2. Hybrid Input Processing
    • Decoder Input: {e_1, ..., e_q, e_1^cnk, ..., e_L^cnk}
    • Compression Ratio: ≈ k-fold reduction in the number of context positions
  3. Selective Compression Mechanism
    • RL policy network πθ determines which chunks remain uncompressed
    • Sequential selection based on chunk embeddings and masking
    • Reward Function: Negative log perplexity
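The chunking, encoding, and projection steps listed above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the mean-pooling `toy_encode` stands in for the RoBERTa encoder M_enc, the single linear map `project` stands in for φ, and all names, dimensions, and random weights are assumptions chosen only to show how the decoder input shrinks from q + s positions to q + s/k.

```python
import random

def chunk_context(context_tokens, k):
    """Split the s context tokens into chunks of k tokens (the last may be shorter)."""
    return [context_tokens[i:i + k] for i in range(0, len(context_tokens), k)]

def toy_encode(chunk, enc_table):
    """Stand-in for M_enc: mean-pool toy embeddings (assumption; the paper uses RoBERTa)."""
    dim = len(enc_table[0])
    return [sum(enc_table[t][j] for t in chunk) / len(chunk) for j in range(dim)]

def project(c, weight):
    """Stand-in for phi: one linear map from encoder width to decoder width."""
    return [sum(w * x for w, x in zip(row, c)) for row in weight]

def build_decoder_input(query_tokens, context_tokens, k, dec_table, enc_table, weight):
    """Return [e_1..e_q, e_1^cnk..e_L^cnk]: query token embeddings, then chunk embeddings."""
    query_embs = [dec_table[t] for t in query_tokens]
    chunks = chunk_context(context_tokens, k)
    chunk_embs = [project(toy_encode(c, enc_table), weight) for c in chunks]
    return query_embs + chunk_embs

# Toy sizes and random weights, purely illustrative.
rng = random.Random(0)
VOCAB, ENC_DIM, DEC_DIM = 100, 4, 6
enc_table = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(VOCAB)]
dec_table = [[rng.gauss(0, 1) for _ in range(DEC_DIM)] for _ in range(VOCAB)]
weight = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(DEC_DIM)]

q, s, k = 5, 64, 16
query_tokens = [rng.randrange(VOCAB) for _ in range(q)]
context_tokens = [rng.randrange(VOCAB) for _ in range(s)]
decoder_input = build_decoder_input(query_tokens, context_tokens, k,
                                    dec_table, enc_table, weight)
print(len(decoder_input), "decoder positions instead of", q + s)
```

With q = 5, s = 64, and k = 16, the decoder sees 5 + 4 = 9 positions rather than 69. Because each chunk embedding depends only on its own chunk, it could be computed once and cached across queries.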

Technical Innovations

  1. Arbitrary Position Compression: Overcomes the limitation of existing methods that only support prefix compression, enabling compression and expansion at any context position
  2. Precomputation Reuse: Chunk embeddings can be precomputed and cached, avoiding repeated computational overhead
  3. Adaptive Compression Rate: Dynamically adjusts compression rate through RL policy without requiring chunk embedding recalculation
  4. Preserves Autoregressive Properties: Maintains the causal structure of the decoder, supporting multi-turn dialogue and summarization tasks
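Compression at arbitrary positions with selective expansion can be pictured as building a mixed sequence in which some chunks stay compressed (one position each) and others are expanded back to their original tokens, in their original order. The sketch below is illustrative only: the top-scoring heuristic in `select_chunks_to_expand` is a hypothetical stand-in for the RL policy πθ, and the scores, tuple tags, and function names are assumptions.

```python
def select_chunks_to_expand(scores, p):
    """Pick the top fraction p of chunks by score.

    Hypothetical stand-in for the learned policy: the paper trains an RL policy
    network, whereas this simply ranks externally supplied scores.
    """
    n_expand = round(p * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_expand])

def build_mixed_sequence(chunks, expand):
    """Keep chunk order; expanded chunks contribute tokens, the rest one embedding slot."""
    seq = []
    for i, chunk in enumerate(chunks):
        if i in expand:
            seq.extend(("tok", t) for t in chunk)   # original tokens, uncompressed
        else:
            seq.append(("cnk", i))                  # placeholder for chunk embedding i
    return seq

# Four chunks of k = 4 tokens each; scores are made up for illustration.
chunks = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
scores = [0.1, 0.9, 0.3, 0.2]
expand = select_chunks_to_expand(scores, p=0.25)
mixed = build_mixed_sequence(chunks, expand)
print(mixed)
```

Here only the second chunk is expanded, so the sequence has 3 compressed slots plus 4 tokens. Changing p changes the sequence length without touching the chunk embeddings themselves, which is why cached embeddings can be reused at different compression rates.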

Experimental Setup

Datasets

  • Pretraining: SlimPajama dataset (20B tokens), comprising 50% ArXiv + 50% Book data
  • Evaluation: Book, ArXiv, PG19, Proof-pile datasets
  • Downstream Tasks:
    • RAG: 1.1M samples covering QA datasets from 5 domains
    • Multi-turn Dialogue: TopiOCQA, ORConvQA, QReCC
    • Summarization: ArXiv and PubMed long document summarization

Evaluation Metrics

  • Efficiency Metrics: TTFT (time-to-first-token), TTIT (time-to-iterative-token, i.e., latency per subsequent generated token), throughput
  • Performance Metrics: Perplexity, accuracy, F1 score, ROUGE score
  • Memory Metrics: KV cache memory usage
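To make the KV cache metric concrete, the following back-of-the-envelope sketch estimates cache size as a function of sequence length. It assumes a LLaMA-2-7B-like shape (32 layers, 32 key-value heads, head dimension 128) with 16-bit values, and it assumes that each compressed chunk occupies exactly one cache position; the function names and the example sizes q = 128 and s = 16384 are illustrative, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: one key and one value vector per layer per position."""
    per_position = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_position * seq_len * batch_size

def compressed_len(q, s, k):
    """Decoder positions when s context tokens become ceil(s / k) chunk embeddings."""
    return q + -(-s // k)

q, s, k = 128, 16384, 16          # illustrative sizes (assumption)
full = kv_cache_bytes(q + s)
small = kv_cache_bytes(compressed_len(q, s, k))
print(f"full context: {full / 2**20:.0f} MiB")
print(f"compressed:   {small / 2**20:.0f} MiB")
print(f"reduction:    {full / small:.2f}x")
```

Under these assumptions each position costs 0.5 MiB, so the reduction approaches k as the context dominates the query.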

Baseline Methods

  • LLaMA Variants: LLaMA-Full Context, LLaMA-No Context, LLaMA-32K
  • Existing Methods: CEPE, REPLUG
  • Different Compression Rates: REFRAG8, REFRAG16, REFRAG32

Implementation Details

  • Base Model: LLaMA-2-7B
  • Encoder: RoBERTa-Large (355M parameters)
  • Training Strategy: Curriculum learning + reconstruction task prewarming
  • Optimizer: AdamW, peak learning rate 5e-5
  • Hardware: 8 nodes × 8 H100 GPUs

Experimental Results

Main Results

Latency Performance

At 16K context length:

  • TTFT Acceleration: 16.53× (with cache), 8.59× (without cache)
  • CEPE, for comparison: 2.01× TTFT acceleration (with cache), 1.04× (without cache)
  • At k=32: Achieves 30.85× TTFT acceleration, 3.75× faster than CEPE

Model Performance

Model       ArXiv P2048   Book P2048   PG19 P2048   ProofPile P2048
REFRAG8     1.062         1.844        1.927        0.916
REFRAG16    1.076         1.853        1.938        0.931
CEPE        1.107         1.864        1.964        0.968

REFRAG16 achieves 9.3% average perplexity improvement over CEPE while enabling significant acceleration.

Ablation Studies

Necessity of Curriculum Learning

Method                        P16     P32     P128    P2048
Without Curriculum Learning   3.719   3.098   2.272   1.599
With Curriculum Learning      0.669   0.451   0.230   0.135

Curriculum learning is critical for the reconstruction task to succeed.

Role of Reconstruction Task

Method                              P16     P32     P128    P2048
Without Reconstruction Prewarming   3.272   2.789   2.119   1.544
With Reconstruction Prewarming      2.017   1.837   1.632   1.453

Prewarming with the reconstruction task significantly improves the effectiveness of continual pretraining.

RL Selective Compression

At the same effective compression rate of 8, REFRAG16 with RL-based selective expansion (REFRAG16+RL) consistently outperforms REFRAG8, demonstrating the effectiveness of dynamic compression strategies.
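A short worked calculation shows how a rate-16 model can be compared at an effective rate of 8. It assumes that "compression rate" means context tokens divided by decoder positions, and that a fraction p of chunks is expanded back to k tokens each; under that assumption the effective rate is k / (1 - p + k·p). The function names are illustrative.

```python
from fractions import Fraction

def effective_rate(k, p):
    """Context tokens per decoder position when a fraction p of chunks is expanded.

    Each compressed chunk costs 1 position and each expanded chunk costs k,
    so L chunks (k * L tokens) cost L * (1 - p + k * p) positions.
    """
    return k / (1 - p + k * p)

def expand_fraction_for(k, target):
    """Solve k / (1 - p + k * p) = target for p."""
    return Fraction(k - target, target * (k - 1))

p = expand_fraction_for(16, 8)
print("fraction of chunks to expand:", p)
print("effective rate:", effective_rate(16, p))
```

Under this definition, expanding 1 in 15 chunks of a k = 16 model yields the same decoder input length as compressing everything at k = 8, which is what makes the two configurations comparable at equal cost.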

Downstream Task Performance

RAG Tasks

Under strong retriever settings, at equivalent latency constraints:

  • 8-passage REFRAG vs 1-passage LLaMA: 1.22% average improvement
  • Weak retriever settings show more pronounced improvements: 1.93%

Multi-turn Dialogue

In the 10-passage setting, REFRAG outperforms the fine-tuned LLaMA baseline (LLaMA-FT) on all three datasets, with particularly notable advantages when the dialogue history is long.

Case Analysis

The paper presents attention visualization results confirming that in RAG scenarios, attention values between different passages are significantly lower than intra-passage attention, validating the block-diagonal sparsity assumption.
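The block-diagonal claim can be checked numerically by comparing the average attention weight between positions in the same passage against the average between positions in different passages. The sketch below runs on a synthetic attention matrix built to be block-diagonal; it is not data from the paper, and the helper names are assumptions. With a real model, `attn` would be one head's attention matrix over the concatenated passages.

```python
def passage_ids(passage_lengths):
    """Map each token position to the index of the passage it belongs to."""
    ids = []
    for p, n in enumerate(passage_lengths):
        ids.extend([p] * n)
    return ids

def intra_inter_attention(attn, passage_lengths):
    """Mean attention weight within passages vs. across passages."""
    ids = passage_ids(passage_lengths)
    intra, inter = [], []
    for i, row in enumerate(attn):
        for j, a in enumerate(row):
            (intra if ids[i] == ids[j] else inter).append(a)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Synthetic example: 3 passages of 4 tokens; each row sums to 1
# (4 * 0.2 within the passage + 8 * 0.025 across passages).
lengths = [4, 4, 4]
ids = passage_ids(lengths)
n = len(ids)
attn = [[0.2 if ids[i] == ids[j] else 0.025 for j in range(n)] for i in range(n)]

intra, inter = intra_inter_attention(attn, lengths)
print(f"intra-passage mean: {intra:.3f}, inter-passage mean: {inter:.3f}")
```

A large gap between the two means is the kind of evidence the visualization is described as providing.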

Related Work

Retrieval-Augmented Language Modeling

  • REALM: First to propose retrieval-augmented masked language model pretraining
  • RETRO: Uses cross-attention and end-to-end pretraining
  • FiD: Processes passages in parallel and concatenates hidden states

Efficient Long-Context LLMs

  • Sparse Attention: Reduces attention complexity but doesn't address memory issues
  • StreamingLLM: Uses attention sinks to reduce the KV cache
  • CEPE: Cross-attention method, but limited to prefix application

Compressive Transformers

  • Compressive Transformer: Compresses KV cache but doesn't improve TTFT
  • Recursive Compression: Cannot precompute and reuse embeddings

Conclusions and Discussion

Main Conclusions

  1. RAG-Specific Sparsity: Block-diagonal attention patterns in RAG scenarios provide opportunities for specialized optimization
  2. Significant Efficiency Gains: 30.85× TTFT acceleration without performance loss demonstrates method effectiveness
  3. Broad Applicability: Demonstrates superior performance across multiple long-context tasks

Limitations

  1. Compression Rate Constraints: Experiments show significant performance degradation at k=64, indicating compression limits
  2. Encoder Overhead: Despite being lightweight, still requires additional encoding computation
  3. Training Complexity: Requires curriculum learning and multi-stage training strategies

Future Directions

  1. Higher Compression Rates: Explore more effective compression techniques to overcome current limitations
  2. End-to-End Optimization: Integrate compression strategies into the pretraining phase
  3. Multimodal Extension: Extend the method to multimodal scenarios such as vision-language tasks

In-Depth Evaluation

Strengths

  1. Precise Problem Identification: Accurately identifies RAG-specific characteristics and optimization opportunities
  2. Reasonable Method Design: Chunk embedding compression and selective strategy design are well-conceived
  3. Comprehensive Experimental Validation: Covers multiple tasks with thorough ablation studies
  4. High Practical Value: Significant performance improvements make it highly applicable
  5. Strong Technical Innovation: Notable innovations including arbitrary position compression and precomputation reuse

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical analysis of compression rate limits
  2. Encoder Selection: Insufficient exploration of different encoder architecture impacts
  3. Long-Term Dependencies: The ability to handle extremely long contexts remains to be verified
  4. Computational Complexity: RL training adds system complexity

Impact

  1. Academic Contribution: Opens new research directions for RAG system optimization
  2. Industrial Value: Directly applicable to large-scale RAG deployment
  3. Reproducibility: Authors commit to open-sourcing code, facilitating method adoption

Applicable Scenarios

  1. Web Search: Latency optimization in large-scale retrieval scenarios
  2. Knowledge QA: Complex question answering requiring integration of multiple document fragments
  3. Intelligent Assistants: Context management in multi-turn dialogue
  4. Document Analysis: Summarization and analysis of long documents

References

The paper cites extensive related work, primarily including:

  • Guu et al. (2020) - REALM retrieval-augmented pretraining
  • Borgeaud et al. (2022) - RETRO large-scale retrieval-augmented generation
  • Yen et al. (2024) - CEPE parallel context encoding
  • Touvron et al. (2023) - LLaMA base models

Overall Assessment: This is a high-quality research paper that proposes an innovative solution to efficiency bottlenecks in RAG systems. The method design is sound, experimental validation is comprehensive, and practical value is significant, making important contributions to the field's development.