2025-11-20T11:28:15.008705

REFRAG: Rethinking RAG based Decoding

Lin, Ghosh, Low et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.


Basic Information

Paper ID: 2509.01092
Title: REFRAG: Rethinking RAG based Decoding
Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
Affiliations: Meta Superintelligence Labs, National University of Singapore, Rice University
Categories: cs.CL cs.AI cs.LG
Published: October 14, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2509.01092

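Abstract

Large language models (LLMs) have shown a remarkable ability to use external knowledge to improve responses in multi-turn and agentic applications such as retrieval-augmented generation (RAG). However, processing long-context inputs adds significant system latency and requires substantial memory for the key-value cache, which lowers throughput and creates a fundamental trade-off between knowledge enrichment and system efficiency. The paper proposes REFRAG, an efficient decoding framework that compresses, senses, and expands to reduce latency in RAG applications. By exploiting the sparsity structure of attention, it achieves a 30.85× time-to-first-token speedup (a 3.75× improvement over prior work) with no loss in perplexity. The optimization framework also allows REFRAG to extend the context size of LLMs by 16×.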

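Background and Motivation

Core Problems

Efficiency bottleneck in long-context processing: RAG systems incur significant compute and memory overhead on long contexts, and time-to-first-token (TTFT) latency grows quadratically with context length, which seriously degrades the user experience.
Special nature of the RAG setting: the context in RAG consists mainly of concatenated retrieved passages, only a small part of which is directly relevant to the query. Because of diversity and deduplication, these passages have low semantic similarity to one another, which produces a block-diagonal attention pattern.
Computational redundancy: existing methods treat RAG as a generic long-context problem and ignore the sparse attention structure specific to RAG, so a large amount of computation is unnecessary.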

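Research Motivation

Efficiency: web-scale applications urgently need high throughput and low latency
Resource optimization: reduce memory footprint and compute overhead to improve system scalability
Performance preservation: improve efficiency substantially without degrading model performance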

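Core Contributions

The REFRAG framework: the first efficient decoding framework designed specifically for RAG applications, supporting context compression and expansion at arbitrary positions
Chunk-embedding compression: precomputed compressed chunk embeddings replace the original tokens, yielding significant latency and memory savings
Selective compression policy: an RL-based policy network decides dynamically which chunks should stay in their original form
Significant performance gains: 30.85× TTFT acceleration and a 16× larger context window, with no loss in performance
Broad validation: effectiveness verified on RAG, multi-turn conversation, long-document summarization, and other tasks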

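Decoder: a LLaMA-based decoder-only foundation model
Encoder: a lightweight RoBERTa model that processes the context chunks
Projection layer: maps chunk embeddings into the decoder's token space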

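Core Components

Chunk Embedding Generation

Context chunking: {C_1, C_2, ..., C_L}, where L = s/k (s context tokens, chunk size k)
Chunk embedding: c_i = M_enc(C_i)
Projected embedding: e^cnk_i = φ(c_i)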
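
The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_encoder` (mean pooling over a lookup table) stands in for the RoBERTa encoder M_enc, `project` (a single linear map) stands in for φ, and the tables, weights, and the choice to leave a shorter final chunk when s is not divisible by k are all hypothetical.

```python
from typing import List

Vector = List[float]


def split_into_chunks(tokens: List[int], k: int) -> List[List[int]]:
    """Split s context tokens into chunks of at most k tokens (L = s/k when k divides s)."""
    return [tokens[i:i + k] for i in range(0, len(tokens), k)]


def toy_encoder(chunk: List[int], table: List[Vector]) -> Vector:
    """Stand-in for M_enc: mean-pool per-token vectors (NOT the real RoBERTa encoder)."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for tok in chunk:
        for d in range(dim):
            pooled[d] += table[tok][d]
    return [v / len(chunk) for v in pooled]


def project(c: Vector, weight: List[Vector]) -> Vector:
    """Stand-in for phi: one linear map from encoder space to decoder token space."""
    return [sum(w * x for w, x in zip(row, c)) for row in weight]


# Hypothetical toy setup: 4-token vocabulary, 2-dim encoder space, 3-dim decoder space.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

context = [0, 1, 2, 3, 0, 1]                 # s = 6 context tokens
chunks = split_into_chunks(context, k=2)     # L = 3 chunks
chunk_embeddings = [project(toy_encoder(c, table), weight) for c in chunks]
```

Each chunk ends up as a single vector in the decoder's embedding space, so the decoder sees one position per chunk instead of k.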

Mixed Input Processing

Decoder input: {e_1, ..., e_q, e^cnk_1, ..., e^cnk_L}
Compression ratio: ≈ k-fold reduction in input length

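
A short worked calculation shows where the roughly k-fold reduction comes from. It assumes q is the number of query tokens and s the number of context tokens; the specific values of q, s, and k below are illustrative and not taken from the paper.

```python
def decoder_input_length(q: int, s: int, k: int) -> int:
    """Length of the mixed decoder input: q query tokens plus ceil(s / k) chunk embeddings."""
    num_chunks = -(-s // k)  # ceiling division
    return q + num_chunks


# Illustrative values only.
q, s, k = 32, 2048, 16
compressed = decoder_input_length(q, s, k)   # 32 + 128 = 160 positions
uncompressed = q + s                          # 2080 positions
ratio = uncompressed / compressed             # 13.0
```

The ratio is below k here because the q query tokens are not compressed; it approaches k as s grows relative to q.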
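Selective Compression Mechanism
- An RL policy network π_θ decides which chunks stay uncompressed
- Chunks are selected sequentially, conditioned on the chunk embeddings and a mask
- Reward function: negative log-perplexity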
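
The selection loop and reward can be sketched as follows. This is a simplified sketch, not the paper's training procedure: the fixed `scores` list stands in for the policy network's per-chunk outputs, selection is greedy rather than sampled, and all function names are hypothetical.

```python
import math
from typing import List


def select_chunks_to_expand(scores: List[float], budget: int) -> List[int]:
    """Pick `budget` chunk indices one at a time, masking out chunks already chosen.

    `scores` stands in for the policy's per-chunk logits; a trained policy would
    recompute them from the chunk embeddings and the mask at every step.
    """
    mask = [False] * len(scores)
    chosen: List[int] = []
    for _ in range(min(budget, len(scores))):
        best = max((i for i in range(len(scores)) if not mask[i]),
                   key=lambda i: scores[i])
        mask[best] = True
        chosen.append(best)
    return chosen


def build_mixed_input(chunks, chunk_embeddings, expanded):
    """Keep expanded chunks as raw tokens in place; replace the rest with one embedding each."""
    keep = set(expanded)
    return [("tokens", chunks[i]) if i in keep else ("embedding", chunk_embeddings[i])
            for i in range(len(chunks))]


def negative_log_perplexity(token_log_probs: List[float]) -> float:
    """Reward: -log(PPL), which equals the mean token log-probability (higher is better)."""
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical example: 4 chunks, expand the 2 highest-scoring ones.
chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
chunk_embeddings = ["e0", "e1", "e2", "e3"]   # placeholders for real vectors
expanded = select_chunks_to_expand([0.1, 0.9, 0.3, 0.7], budget=2)
mixed = build_mixed_input(chunks, chunk_embeddings, expanded)
reward = negative_log_perplexity([math.log(0.5), math.log(0.5)])
```

Expanded chunks stay at their original positions in the sequence, so the mixed input keeps the order of the retrieved context.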

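Technical Innovations

Compression at arbitrary positions: removes the prefix-only restriction of existing methods and supports compression and expansion anywhere in the context
Precomputation and reuse: chunk embeddings can be precomputed and cached, avoiding repeated computation
Adaptive compression rate: the RL policy adjusts the compression rate dynamically without recomputing chunk embeddings
Preserved autoregressive property: maintains the decoder's causal structure, supporting multi-turn conversation and summarization tasks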
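
The precomputation-and-reuse idea can be illustrated with a small content-addressed cache. The class name, the hashing scheme, and the toy `encode` function are assumptions made for this sketch; the source does not describe how the cache is keyed or stored.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class ChunkEmbeddingCache:
    """Memoize chunk embeddings by content hash so repeated passages are encoded once."""

    def __init__(self, encode: Callable[[List[int]], Vector]) -> None:
        self._encode = encode
        self._store: Dict[str, Vector] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, chunk: List[int]) -> Vector:
        key = hashlib.sha256(",".join(map(str, chunk)).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(chunk)
        return self._store[key]


# Toy encoder standing in for the real chunk encoder.
cache = ChunkEmbeddingCache(encode=lambda chunk: [float(sum(chunk)), float(len(chunk))])
first = cache.get([1, 2, 3])
second = cache.get([1, 2, 3])   # served from the cache, encoder not called again
third = cache.get([4, 5])
```

Because the cached embeddings do not depend on which chunks are later expanded, the same cache entries can serve different compression rates.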

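Experimental Setup

Datasets

Pretraining: SlimPajama dataset (20B tokens), made up of 50% ArXiv and 50% Book data
Evaluation: Book, ArXiv, PG19, and Proof-pile datasets
Downstream tasks:
- RAG: 1.1M samples from QA datasets covering 5 domains
- Multi-turn conversation: TopiOCQA, ORConvQA, QReCC
- Summarization: long-document summarization on ArXiv and PubMed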