Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token (TTFT) acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.
Large Language Models (LLMs) have demonstrated exceptional capability in leveraging external knowledge to enhance responses in multi-turn conversations and agentic applications such as Retrieval-Augmented Generation (RAG). However, processing long-context inputs introduces significant system latency and requires substantial memory for key-value caching, resulting in reduced throughput and a fundamental trade-off between knowledge richness and system efficiency. This paper proposes REFRAG, an efficient decoding framework that improves latency in RAG applications through compression, sensing, and expansion. By exploiting the sparsity structure of attention, it achieves a 30.85× acceleration in time-to-first-token (TTFT) latency (a 3.75× improvement over prior work) without perplexity loss. Furthermore, the optimization framework enables REFRAG to extend the context size of LLMs by 16×.
Efficiency Bottleneck in Long-Context Processing: RAG systems face significant computational and memory overhead when processing long contexts, with time-to-first-token (TTFT) latency growing quadratically with context length, severely impacting user experience.
Specificity of RAG Scenarios: Context in RAG primarily consists of concatenated retrieved passages, with only a small portion directly relevant to the query. Due to diversity and deduplication operations, these passages exhibit low semantic similarity, resulting in block-diagonal attention patterns.
Computational Redundancy: Existing methods treat RAG as a generic long-context problem, overlooking the sparse attention structures inherent to RAG, leading to substantial unnecessary computation.
Proposes REFRAG Framework: The first efficient decoding framework specifically designed for RAG applications, supporting context compression and expansion at arbitrary positions
Chunk Embedding Compression Technique: Uses precomputed compressed chunk embeddings to replace original tokens, achieving significant latency and memory optimization
Selective Compression Strategy: A reinforcement learning-based policy network that dynamically determines which chunks should maintain their original form
Significant Performance Gains: Achieves 30.85× TTFT acceleration, 16× context window expansion, with no performance degradation
Comprehensive Validation: Effectiveness verified across multiple tasks including RAG, multi-turn dialogue, and long document summarization
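The chunk embedding compression idea listed above can be illustrated with a minimal sketch. It is not the paper's implementation: the chunk size `k`, the function names, and the use of mean pooling as the chunk encoder are all assumptions made for illustration (the summary does not say which encoder produces the chunk embeddings). The sketch only shows the shape of the technique: each run of `k` context token embeddings is replaced by a single vector, so the decoder sees far fewer positions.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Stand-in chunk encoder (assumption): average the token embeddings of one chunk."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def compress_context(context_emb: List[Vector], k: int) -> List[Vector]:
    """Replace every run of up to k context token embeddings with one chunk embedding."""
    if k <= 0:
        raise ValueError("chunk size k must be positive")
    return [mean_pool(context_emb[i:i + k]) for i in range(0, len(context_emb), k)]


def build_decoder_input(query_emb: List[Vector],
                        context_emb: List[Vector],
                        k: int) -> List[Vector]:
    """Query token embeddings stay uncompressed; the context is fed as chunk embeddings."""
    return query_emb + compress_context(context_emb, k)
```

For example, with 2 query tokens, 8 context tokens, and `k = 4`, the decoder input has 2 + 2 = 4 positions instead of 10. Because `compress_context` depends only on the passage, its output could be computed once and stored, which is what makes precomputation possible.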
Given an input sequence x₁, x₂, ..., x_T of T tokens, the first q tokens are the primary input (e.g., the query) and the remaining s tokens are the context (e.g., retrieved passages), so that q + s = T. The objective is to generate responses efficiently while minimizing TTFT latency and memory usage.
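To make the effect of compression on this formulation concrete, the following sketch computes the decoder input length and an idealized attention-cost ratio. The chunk size `k` and both function names are assumptions introduced here, and the cost model counts only pairwise attention interactions (proportional to the squared sequence length). It is a back-of-the-envelope estimate and does not reproduce the measured 30.85× TTFT figure.

```python
import math


def compressed_length(q: int, s: int, k: int) -> int:
    """Decoder positions when s context tokens are grouped into chunks of size k."""
    if k <= 0:
        raise ValueError("chunk size k must be positive")
    return q + math.ceil(s / k)


def attention_cost_ratio(q: int, s: int, k: int) -> float:
    """Idealized ratio of quadratic attention cost: uncompressed over compressed."""
    full = (q + s) ** 2
    compressed = compressed_length(q, s, k) ** 2
    return full / compressed
```

With q = 64, s = 2048, and k = 16, the decoder input shrinks from 2112 positions to 64 + 128 = 192, and the idealized attention cost drops by a factor of (2112 / 192)² = 121. Real speedups depend on many other factors, such as feed-forward cost, encoder overhead, and hardware.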
Arbitrary Position Compression: Overcomes the limitation of existing methods that only support prefix compression, enabling compression and expansion at any context position
Precomputation Reuse: Chunk embeddings can be precomputed and cached, avoiding repeated computational overhead
Adaptive Compression Rate: Dynamically adjusts compression rate through RL policy without requiring chunk embedding recalculation
Preserves Autoregressive Properties: Maintains the causal structure of the decoder, supporting multi-turn dialogue and summarization tasks
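The precomputation and selective-expansion points above can be sketched together. Everything named here is hypothetical: `ChunkEmbeddingCache`, `select_expanded`, and `assemble_context` are illustrative names, the per-chunk `scores` are assumed to come from some policy, and a simple top-fraction rule stands in for the paper's reinforcement-learning policy. The sketch shows two properties: chunks at any position can be left as raw tokens while the rest are compressed, and changing the expansion fraction does not require re-encoding any chunk.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple, Union

Embedding = List[float]


class ChunkEmbeddingCache:
    """Caches chunk embeddings by content hash so each chunk is encoded only once."""

    def __init__(self, encoder: Callable[[str], Embedding]):
        self._encoder = encoder  # hypothetical chunk encoder supplied by the caller
        self._store: Dict[str, Embedding] = {}
        self.misses = 0  # number of times the encoder was actually invoked

    def get(self, chunk_text: str) -> Embedding:
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encoder(chunk_text)
        return self._store[key]


def select_expanded(scores: Sequence[float], fraction: float) -> List[bool]:
    """Top-fraction heuristic (a stand-in for a learned policy): True = keep raw tokens."""
    n_keep = round(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


def assemble_context(chunks: Sequence[str],
                     scores: Sequence[float],
                     fraction: float,
                     cache: ChunkEmbeddingCache
                     ) -> List[Tuple[str, Union[str, Embedding]]]:
    """Build the context in original order, mixing raw chunks and cached embeddings."""
    mask = select_expanded(scores, fraction)
    context: List[Tuple[str, Union[str, Embedding]]] = []
    for chunk, expand in zip(chunks, mask):
        if expand:
            context.append(("tokens", chunk))
        else:
            context.append(("embedding", cache.get(chunk)))
    return context
```

Keeping the chunks in their original order, whether expanded or compressed, is what lets a causal decoder consume the mixed sequence unchanged.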
The paper presents attention visualization results confirming that in RAG scenarios, attention values between different passages are significantly lower than intra-passage attention, validating the block-diagonal sparsity assumption.
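One simple way to quantify this kind of observation is to compare the average attention weight within passages against the average across passages. The sketch below is an illustrative check and not the paper's analysis procedure. The function name, the `passage_ids` labeling, and the toy matrix in the usage note are assumptions, and the attention matrix is assumed to be causal (row i attends only to columns j ≤ i).

```python
from typing import List, Sequence, Tuple


def mean_attention_by_block(attn: Sequence[Sequence[float]],
                            passage_ids: Sequence[int]) -> Tuple[float, float]:
    """Return (mean intra-passage weight, mean inter-passage weight) over causal entries."""
    intra_sum = inter_sum = 0.0
    intra_n = inter_n = 0
    for i, row in enumerate(attn):
        for j in range(i + 1):  # causal mask: only positions at or before i
            if passage_ids[i] == passage_ids[j]:
                intra_sum += row[j]
                intra_n += 1
            else:
                inter_sum += row[j]
                inter_n += 1
    intra_mean = intra_sum / intra_n if intra_n else 0.0
    inter_mean = inter_sum / inter_n if inter_n else 0.0
    return intra_mean, inter_mean


# Toy 4-token example with two passages of two tokens each (made-up values).
toy_attn: List[List[float]] = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.05, 0.05, 0.90, 0.00],
    [0.05, 0.05, 0.40, 0.50],
]
toy_passages = [0, 0, 1, 1]
intra, inter = mean_attention_by_block(toy_attn, toy_passages)
```

A large gap between the two means, as in the toy matrix, is the signature of a block-diagonal pattern. On real attention maps, the same statistic would be computed per layer and head.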
The paper cites extensive related work, primarily including:
Guu et al. (2020) - REALM retrieval-augmented pretraining
Borgeaud et al. (2022) - RETRO large-scale retrieval-augmented generation
Yen et al. (2024) - CEPE parallel context encoding
Touvron et al. (2023) - LLaMA base models
Overall Assessment: This is a high-quality research paper that proposes an innovative solution to efficiency bottlenecks in RAG systems. The method design is sound, experimental validation is comprehensive, and practical value is significant, making important contributions to the field's development.