2025-11-19T17:22:13.046982

CacheClip: Accelerating RAG with Effective KV Cache Reuse

Yang, Leng, Zeng et al.

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with a recomputation ratio of 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.


Basic Information

Paper ID: 2510.10129
Title: CacheClip: Accelerating RAG with Effective KV Cache Reuse
Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu (Intel Corporation)
Categories: cs.LG cs.AI
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2510.10129v1

Abstract

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes, which rarely occur in RAG scenarios, while direct precomputation sacrifices quality because inter-chunk attention is missing and attention sinks are repeated at the start of every chunk. This paper proposes the CacheClip framework, which achieves fast TTFT and high generation quality through auxiliary-model-guided token selection, a shared prefix that eliminates redundant attention sinks, and a grouping strategy that maintains local coherence. Experiments show that CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench respectively, with up to 1.92× faster prefill.

Research Background and Motivation

Problem Definition

The core problem faced by RAG systems is the time-to-first-token (TTFT) bottleneck. Because the model must process many retrieved document chunks (typically 4K-16K tokens), and attention computation in the prefill phase scales quadratically with input length, the first token arrives late and user experience suffers. For example, processing 200K input tokens on an A100 GPU requires over 20 seconds of TTFT.

Limitations of Existing Methods

Prefix Caching: Requires exactly identical prefixes, but the retrieved chunks change from query to query in RAG scenarios, so the actual reuse rate is low
Direct Precomputation: Computes the KV cache of each chunk independently and concatenates the results, which causes two critical issues:
- Missing inter-chunk attention, which hurts cross-document reasoning
- Repeated attention sinks (one at the start of every chunk), which do not match the attention distribution seen during training
Existing Improvements:
- APE: Addresses only the repeated attention sinks; it cannot recover inter-chunk attention
- CacheBlend: Selects tokens using early-layer signals, so it can miss tokens that become critical in deeper layers

Research Motivation

There is a need for a method that can significantly accelerate inference while maintaining generation quality, particularly in complex RAG tasks requiring cross-document reasoning.

Core Contributions

Key Observation: The last-layer attention distribution of a small auxiliary LLM is highly similar to that of the large primary model, which makes it possible to identify important tokens cheaply
CacheClip Framework: A novel framework integrating three techniques:
- Auxiliary-model-guided token selection for selective KV cache recomputation
- A shared prefix that eliminates redundant attention sinks
- A grouping strategy that maintains local coherence
Performance Gains: Retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench respectively, while delivering up to 1.92× prefill acceleration
Practical System Design: The auxiliary model runs on the CPU, avoiding additional GPU overhead

Method Details

Task Definition

Given a user query and a set of retrieved document chunks, the objective is to minimize prefill latency while maintaining generation quality. The input consists of query q and document chunk set {D₁, D₂, ..., Dₙ}, with the output being a high-quality response.

Core Technical Components

1. Attention Sink Handling

Problem: Each independently processed chunk develops its own attention sink at its first tokens
Solution: Prepend a shared prefix (e.g., the system prompt) to every chunk and keep only the first chunk's prefix during concatenation
Effect: Restores a global attention distribution consistent with the one seen during training
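
The concatenation step can be illustrated with a minimal sketch. It assumes each chunk's cache is a plain list with one entry per token and the shared prefix entries first; the function name and the toy string entries are illustrative stand-ins for real key/value tensors, not the paper's implementation.

```python
def concat_chunk_caches(chunk_caches, prefix_len):
    """Concatenate per-chunk caches, keeping the shared prefix only once.

    chunk_caches: list of lists, one cache entry per token; the first
    `prefix_len` entries of every list belong to the shared prefix.
    """
    if not chunk_caches:
        return []
    merged = list(chunk_caches[0])         # the first chunk keeps its prefix
    for cache in chunk_caches[1:]:
        merged.extend(cache[prefix_len:])  # later chunks drop the duplicate prefix
    return merged


# Toy example: strings stand in for per-token key/value tensors.
prefix = ["sys0", "sys1"]
caches = [prefix + ["a0", "a1", "a2"], prefix + ["b0", "b1"]]
merged = concat_chunk_caches(caches, prefix_len=len(prefix))
print(merged)
```

After merging, the prefix entries appear exactly once at the front, so the sequence has a single sink region instead of one per chunk.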

2. Position ID Reordering

Problem: After concatenation the position IDs repeat, because every chunk was encoded starting right after the shared prefix
Solution: Reassign contiguous, strictly increasing position IDs
Implementation: Reorder from [0, 1, 2, ..., sink_size, sink_size+1, ..., sink_size+chunk1_size, sink_size+1, ...] to [0, 1, 2, ..., sink_size, sink_size+1, ..., sink_size+chunk1_size, sink_size+chunk1_size+1, ...]
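
A small sketch of the reordering, assuming 0-based positions where the shared prefix occupies positions 0 to sink_size - 1 (the exact indexing convention is an assumption; function names are illustrative):

```python
def concatenated_position_ids(sink_size, chunk_sizes):
    """Position IDs before reordering: every chunk restarts after the prefix."""
    ids = list(range(sink_size))
    for size in chunk_sizes:
        ids.extend(range(sink_size, sink_size + size))
    return ids


def reordered_position_ids(sink_size, chunk_sizes):
    """Position IDs after reordering: one contiguous, increasing range."""
    return list(range(sink_size + sum(chunk_sizes)))


before = concatenated_position_ids(sink_size=2, chunk_sizes=[3, 2])
after = reordered_position_ids(sink_size=2, chunk_sizes=[3, 2])
print(before)  # the second chunk repeats positions of the first
print(after)   # every token gets a unique position
```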

3. Auxiliary Model-Guided Token Selection

Core Insight: The last-layer attention of a small auxiliary model (e.g., SmolLM2-135M) is highly similar to that of a large primary model (e.g., Qwen2.5-14B)
Quantitative Verification:
- KL Divergence: The KL divergence between the last layers of the auxiliary and primary models is smaller than the KL divergence between the primary model's own first and last layers
- Jaccard Index: Higher overlap among the top-20% most important tokens
Selection Strategy:
1. Precompute the KV cache of each chunk with the auxiliary model
2. Concatenate the chunks with the query and process them in one batch
3. Extract the last-layer attention matrix and take the attention weights of the query tokens over the chunk tokens
4. Average over the query dimension to obtain an importance score for each chunk token
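
Steps 3 and 4 can be sketched as follows. This is a simplified illustration, assuming the last-layer attention has already been averaged over heads into a matrix with one row per query token and one column per key position (chunk tokens first); the function names and the top-ratio selection rule are assumptions for illustration.

```python
def token_importance(attn, num_chunk_tokens):
    """Mean attention that the query tokens pay to each chunk token.

    attn: list of rows, one per query token; the first `num_chunk_tokens`
    columns correspond to chunk tokens.
    """
    n_query = len(attn)
    return [sum(row[j] for row in attn) / n_query
            for j in range(num_chunk_tokens)]


def select_top_tokens(scores, ratio):
    """Indices of the highest-scoring tokens, in sequence order."""
    k = max(1, round(len(scores) * ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Two query tokens attending over five chunk tokens plus one query position.
attn = [
    [0.05, 0.50, 0.05, 0.20, 0.10, 0.10],
    [0.05, 0.30, 0.05, 0.40, 0.10, 0.10],
]
scores = token_importance(attn, num_chunk_tokens=5)
selected = select_top_tokens(scores, ratio=0.4)
print(scores)
print(selected)
```

In this toy input, chunk tokens 1 and 3 receive most of the query's attention, so they are the ones chosen for recomputation.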

4. Grouping Strategy

Motivation: Prevent sparse, scattered KV cache updates from breaking up local context
Implementation:
- Partition the sequence into small windows (default 8 tokens)
- If the number of selected tokens in a window exceeds the threshold (default 5), recompute the whole window
- Otherwise skip the window entirely, keeping its local context consistent
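
The windowing rule can be sketched in a few lines. The defaults follow the values quoted above; treating "exceed" as a strict comparison and the function name are assumptions.

```python
def windows_to_recompute(selected, seq_len, window=8, threshold=5):
    """Return (start, end) spans of windows that should be fully recomputed.

    A window is recomputed when it contains more than `threshold`
    selected tokens; otherwise it is skipped as a whole.
    """
    selected = set(selected)
    spans = []
    for start in range(0, seq_len, window):
        end = min(start + window, seq_len)
        hits = sum(1 for i in range(start, end) if i in selected)
        if hits > threshold:
            spans.append((start, end))
    return spans


# Window 0 has 6 selected tokens, window 1 has 1, window 2 has 7.
selected = [0, 1, 2, 3, 4, 5, 9, 16, 17, 18, 19, 20, 21, 22]
spans = windows_to_recompute(selected, seq_len=24)
print(spans)
```

The isolated token at position 9 does not trigger a recomputation, while the two dense windows are recomputed in full, including their unselected tokens.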

5. Token Mapping and KV Cache Update

Handle tokenizer differences between auxiliary and primary models
Recompute KV caches for selected segments, maintaining position ID consistency
Selectively overwrite original KV cache entries
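
The summary does not say how tokens are mapped between the two tokenizers. One plausible approach, shown here purely as an assumption, is to align tokens through their character offsets in the shared text: a primary-model token is selected if its character span overlaps a selected auxiliary-model token. The offsets below are made up for illustration.

```python
def map_tokens_by_offsets(aux_offsets, primary_offsets, aux_selected):
    """Map selected auxiliary-token indices to primary-token indices.

    aux_offsets / primary_offsets: (start, end) character spans per token,
    end exclusive. Returns primary indices overlapping any selected span.
    """
    spans = [aux_offsets[i] for i in aux_selected]
    mapped = []
    for j, (p_start, p_end) in enumerate(primary_offsets):
        if any(p_start < a_end and a_start < p_end for a_start, a_end in spans):
            mapped.append(j)
    return mapped


# Hypothetical tokenizations of the 14-character text "needle 5663623".
aux_offsets = [(0, 6), (6, 10), (10, 14)]
primary_offsets = [(0, 3), (3, 6), (6, 7), (7, 14)]
mapped = map_tokens_by_offsets(aux_offsets, primary_offsets, aux_selected=[2])
print(mapped)
```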

6. Auxiliary Model Fine-tuning

Fine-tune small auxiliary model to improve token selection accuracy
Significantly lower cost compared to fine-tuning primary model
Enhances overall CacheClip performance

System Architecture Design

The auxiliary model runs on the CPU (using otherwise idle head-node CPU resources)
Uses Intel AMX (Advanced Matrix Extensions) to accelerate matrix operations
Token selection runs in parallel with loading the primary model's KV cache, which hides its latency
The recomputation ratio can be adjusted dynamically at runtime

Experimental Setup

Datasets

RULER: Retrieval category, an extended version of the Needle-In-A-Haystack (NIAH) test
- Contains 8 challenging variants (excluding niah_multikey2/3)
- Test sequence length: 8K tokens
- Evaluation metric: Average Reference Coverage (ARC)
LongBench: Long-context understanding benchmark
- Uses multifieldqa_zh, 2wikimqa, hotpotqa datasets
- Evaluation metrics: ROUGE-L and F1 scores
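
For readers unfamiliar with the F1 metric used on QA tasks, a simplified token-level version is sketched below. It only lowercases and splits on whitespace; benchmark scorers typically apply further normalization (punctuation, articles), so this is an illustration rather than the benchmark's exact scorer.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the answer is Paris", "Paris")
print(round(score, 2))
```

Here the prediction contains the reference token but three extra words, so recall is 1.0 while precision is 0.25.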

Experimental Configuration

Primary Model: Qwen2.5-14B
Auxiliary Model: SmolLM2-135M (fine-tuned)
Hardware: NVIDIA L20 GPU + Intel Xeon EMR (Emerald Rapids) CPU
Document Chunking: 1000 tokens with 50 tokens overlap
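
The chunking configuration can be illustrated with a short sketch, assuming a simple sliding window over token IDs (the function name and the handling of the final short chunk are assumptions):

```python
def chunk_tokens(token_ids, chunk_size=1000, overlap=50):
    """Split a token sequence into fixed-size chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end of the input
    return chunks


chunks = chunk_tokens(list(range(2500)))
print([len(c) for c in chunks])
print(chunks[1][0])  # the second chunk starts 50 tokens before the first ends
```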

Comparison Methods

Full Attention: Complete attention computation (upper bound)
Direct Reuse: Direct KV cache concatenation
APE: Shared prefix + attention temperature adjustment
CacheBlend: Selective recomputation based on early layers

Experimental Results

Main Performance Comparison

RULER Dataset Results

CacheClip vs CacheBlend (20% recomputation ratio):
- Average performance: 94.50% vs 69.94%, a relative improvement of 35.1%
- On multivalue tasks: 96% vs 42.97%
CacheClip vs APE:
- Average performance: 94.50% vs 75.5%, a relative improvement of 25.2%
Compared to Full Attention: Retains 94.8% of full-attention performance
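
The improvement percentages above are relative gains over each baseline's score, not percentage-point differences. A quick check with the listed averages:

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


gain_vs_cacheblend = relative_gain(94.50, 69.94)
gain_vs_ape = relative_gain(94.50, 75.5)
print(f"vs CacheBlend: {gain_vs_cacheblend:.1f}%")
print(f"vs APE: {gain_vs_ape:.1f}%")
```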

LongBench Dataset Results

Method	multifieldqa_zh	2wikimqa	hotpotqa
Full Attention	64.93	54.36	59.71
CacheClip	58.05	42.77	51.32
CacheBlend	57.34	41.08	44.11
APE	59.70	38.34	45.29
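
The 85.0% LongBench retention figure quoted earlier can be reproduced from this table if it is read as the ratio of summed (equivalently, mean) scores across the three tasks. That reading is an assumption; the per-task ratios are shown alongside for comparison.

```python
full = {"multifieldqa_zh": 64.93, "2wikimqa": 54.36, "hotpotqa": 59.71}
clip = {"multifieldqa_zh": 58.05, "2wikimqa": 42.77, "hotpotqa": 51.32}

# Retention per task, and retention of the overall mean score.
per_task = {task: clip[task] / full[task] * 100 for task in full}
overall = sum(clip.values()) / sum(full.values()) * 100

for task, pct in per_task.items():
    print(f"{task}: {pct:.1f}% of full attention")
print(f"ratio of mean scores: {overall:.1f}%")
```

Per-task retention varies noticeably, from roughly 79% on 2wikimqa to roughly 89% on multifieldqa_zh.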

Efficiency Gains

Prefill Acceleration: Up to 1.92× (20% recomputation ratio)
Latency Breakdown:
- Token selection: 0.238s
- Recomputation: 2.643s
- Other overhead: 0.070s
- Total time: 2.961s vs baseline 5.641s
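
A short arithmetic check of the breakdown, using only the numbers listed above:

```python
components = {"token_selection": 0.238, "recomputation": 2.643, "other": 0.070}
reported_total = 2.961  # seconds, as listed
baseline = 5.641        # seconds, as listed

component_sum = sum(components.values())
speedup = baseline / reported_total
recompute_share = components["recomputation"] / reported_total * 100

print(f"sum of listed components: {component_sum:.3f}s")
print(f"speedup from listed totals: {speedup:.2f}x")
print(f"recomputation share of total: {recompute_share:.1f}%")
```

With these figures the listed components sum to 2.951s and the ratio of the two totals is about 1.91×, close to the reported up-to-1.92× speedup. Recomputation accounts for roughly 89% of the total, so token selection itself adds little latency.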

Ablation Study Analysis

Impact of Recomputation Ratio

RULER-multivalue: Performance increases monotonically with the recomputation ratio, supporting the effectiveness of selective recomputation
RULER-single2/3: CacheBlend degrades at moderate recomputation ratios, while CacheClip avoids this through its grouping strategy

Auxiliary Model Effectiveness Verification

Attention-distribution similarity analysis (KL divergence, Jaccard index) indicates that small auxiliary models can effectively approximate the attention patterns of large models.
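
The two similarity measures can be sketched on toy attention distributions. The direction of the KL divergence, the smoothing constant, and the example numbers are assumptions for illustration; the values are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def top_fraction(scores, fraction=0.2):
    """Indices of the top `fraction` of tokens by score."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard index of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy per-token attention from a primary and an auxiliary model.
primary = [0.40, 0.05, 0.30, 0.05, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02]
auxiliary = [0.35, 0.05, 0.33, 0.06, 0.05, 0.04, 0.03, 0.03, 0.03, 0.03]

kl = kl_divergence(primary, auxiliary)
overlap = jaccard(top_fraction(primary), top_fraction(auxiliary))
print(f"KL divergence: {kl:.4f}")
print(f"Jaccard index of top-20% tokens: {overlap:.2f}")
```

A low KL divergence together with a high Jaccard index means the auxiliary model ranks nearly the same tokens as important, which is the property the token selection relies on.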

Case Analysis

In the RULER-single2 task, CacheBlend outputs "566362" instead of the correct answer "5663623" because only some of the answer's tokens are recomputed. CacheClip's grouping strategy ensures that all digits of the number are recomputed together, avoiding such errors.

Related Work

KV Cache Management

Fine-tuning Methods: Block Attention, TurboRAG, KVLink adapt to local attention through fine-tuning, but incur high costs and require high-quality datasets
Cache Calibration: APE and Zhang et al. improve attention consistency through shared prefixes
Selective Recomputation: CacheBlend selects tokens based on early-layer signals, Cache-Craft stores multiple cache versions

Conclusions

CacheClip offers a practical solution to the efficiency-quality trade-off in RAG systems
The auxiliary-model-guided token selection strategy is both effective and efficient
The grouping strategy is crucial for maintaining context completeness
The system design avoids additional GPU overhead, demonstrating practical value

Limitations

Current experiments primarily validate on 8K sequence lengths; performance on longer sequences requires further verification
Optimal matching strategies between auxiliary and primary models remain to be explored
Generalization capability across different domains and task types needs validation

Future Directions

Extend to longer sequences and more model architectures
Optimize auxiliary model selection and fine-tuning strategies
Explore dynamic recomputation ratio adjustment algorithms
Investigate system optimization in multi-GPU environments

In-Depth Evaluation

Strengths

Strong Technical Innovation: Novel approach of auxiliary-model-guided token selection, supported by attention-similarity analysis (KL divergence, Jaccard index)
Comprehensive Experimental Design: Covers multiple datasets with detailed ablation studies and case analyses
High Practical Value: Provides a complete system design that considers real deployment constraints
Significant Performance Gains: Achieves up to 1.92× prefill acceleration while retaining most of full-attention quality

Weaknesses

Limited Evaluation Scope: Primarily tested on 8K-token sequences, with no validation on much longer sequences
Auxiliary Model Overhead: Although it runs on the CPU, the auxiliary model still increases system complexity
Insufficient Generalization Verification: Validated mainly on one model pairing (Qwen2.5-14B with SmolLM2-135M); cross-architecture generalization is unclear

Impact

Academic Contribution: Provides new technical pathway for RAG system optimization
Practical Value: Directly applicable to production environments, addressing real pain points
Reproducibility: Clear method description with sufficient implementation details

Applicable Scenarios

Interactive RAG applications requiring fast response
High-concurrency RAG service systems
Resource-constrained deployments requiring quality maintenance
Complex query scenarios requiring cross-document reasoning

References

The paper cites 44 related works covering multiple domains including LLM inference optimization, attention mechanisms, and RAG systems, providing solid theoretical foundation for this work.