Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.
- Paper ID: 2510.10129
- Title: CacheClip: Accelerating RAG with Effective KV Cache Reuse
- Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu (Intel Corporation)
- Categories: cs.LG cs.AI
- Publication Date: October 14, 2025
- Paper Link: https://arxiv.org/abs/2510.10129v1
Retrieval-Augmented Generation (RAG) systems suffer from severe Time-To-First-Token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes, which rarely occur in RAG scenarios, while direct precomputation sacrifices quality because inter-chunk attention is missing and attention sinks are repeated. This paper proposes the CacheClip framework, which achieves fast TTFT and high generation quality through auxiliary-model-guided token selection, shared prefixes that eliminate redundant attention sinks, and a grouping strategy that maintains local coherence. Experiments show that CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench respectively, with up to 1.92× acceleration in prefill time.
The core problem faced by RAG systems is the Time-To-First-Token (TTFT) bottleneck. Due to the need to process large numbers of retrieved document chunks (typically 4K-16K tokens), the attention computation in the prefill phase exhibits quadratic complexity, resulting in poor user experience. For example, processing 200K input tokens on an A100 GPU requires over 20 seconds of TTFT.
- Prefix Caching: Requires exactly identical prefixes, but retrieved chunks change frequently in RAG scenarios, so actual reuse rates are low
- Direct Precomputation: Computes the KV cache for each chunk independently and concatenates the results, which raises two critical issues:
  - Missing inter-chunk attention, which hurts cross-document reasoning
  - Repeated attention sinks (one per chunk), which do not match the attention distributions seen during training
- Existing Improvements:
  - APE: Addresses only the attention sinks and cannot recover inter-chunk attention
  - CacheBlend: Selects tokens based on early layers, potentially missing tokens that are critical in deeper layers
There is a need for a method that can significantly accelerate inference while maintaining generation quality, particularly in complex RAG tasks requiring cross-document reasoning.
- Key Observation: The last-layer attention distribution of a small auxiliary LLM is highly similar to that of the large primary model, enabling efficient identification of important tokens
- CacheClip Framework: A novel framework integrating three techniques:
  - Auxiliary-model-guided token selection for selective KV cache recomputation
  - Shared prefixes that eliminate redundant attention sinks
  - A grouping strategy that maintains local coherence
- Performance Gains: Retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench respectively, while delivering up to 1.92× prefill acceleration
- Practical System Design: The auxiliary model runs on the CPU, avoiding additional GPU overhead
Given a user query and a set of retrieved document chunks, the objective is to minimize prefill latency while maintaining generation quality. The input consists of query q and document chunk set {D₁, D₂, ..., Dₙ}, with the output being a high-quality response.
- Problem: Each independently processed document chunk develops its own attention sink at its first tokens
- Solution: Add a shared prefix (e.g., the system prompt) to each chunk, and keep only the first chunk's prefix during concatenation
- Effect: Restores a global attention distribution consistent with the one seen during training
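The shared-prefix concatenation described above can be illustrated with a minimal sketch. It assumes each chunk's cache is a plain list of per-token entries computed over [shared prefix + chunk]; the function name `concat_with_shared_prefix` and the string placeholders are hypothetical and stand in for real key/value tensors.

```python
def concat_with_shared_prefix(chunk_caches, prefix_len):
    """Concatenate per-chunk caches, keeping the shared prefix entries only once.

    chunk_caches: list of lists; each inner list holds the cache entries for
    [shared prefix + chunk i], computed independently of the other chunks.
    """
    merged = list(chunk_caches[0])          # first chunk keeps its prefix entries
    for cache in chunk_caches[1:]:
        merged.extend(cache[prefix_len:])   # later chunks drop their duplicate prefix
    return merged


# Placeholder entries: "p*" are prefix tokens, the rest are chunk tokens.
prefix = ["p0", "p1"]
chunk_caches = [
    prefix + ["a0", "a1", "a2"],
    prefix + ["b0", "b1"],
    prefix + ["c0"],
]
merged = concat_with_shared_prefix(chunk_caches, len(prefix))
print(merged)
```

After merging, only one copy of the prefix remains at the front of the sequence, so the concatenated cache contains a single sink region rather than one per chunk.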
- Problem: After concatenation, position IDs repeat because every chunk restarts at the same offset
- Solution: Reassign continuously increasing position IDs
- Implementation: Renumber from
[0,1,2,...,sink_size,sink_size+1,...,sink_size+chunk1_size,sink_size+1,...]
to [0,1,2,...,sink_size,sink_size+1,...,sink_size+chunk1_size,sink_size+chunk1_size+1,...]
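As a small illustration of the renumbering, the sketch below builds both the repeated and the contiguous position IDs. It assumes zero-based IDs with the shared prefix (sink) occupying positions 0 to sink_size - 1, which is a simplification of the notation above; both function names are hypothetical.

```python
def repeated_position_ids(sink_size, chunk_sizes):
    """Position IDs as they appear when each chunk was encoded right after the sink."""
    ids = list(range(sink_size))
    for size in chunk_sizes:
        # Every chunk restarts at the first position after the sink.
        ids.extend(range(sink_size, sink_size + size))
    return ids


def contiguous_position_ids(sink_size, chunk_sizes):
    """Position IDs after renumbering: one strictly increasing sequence."""
    return list(range(sink_size + sum(chunk_sizes)))


before = repeated_position_ids(sink_size=2, chunk_sizes=[3, 2])
after = contiguous_position_ids(sink_size=2, chunk_sizes=[3, 2])
print(before)
print(after)
```

Both lists have the same length; only the IDs assigned to the second and later chunks change.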
- Core Insight: The last-layer attention of a small auxiliary model (e.g., SmolLM2-135M) is highly similar to that of a large primary model (e.g., Qwen2.5-14B)
- Quantitative Verification:
  - KL Divergence: KL divergence between auxiliary and primary models' last layers < KL divergence between primary model's first and last layers
  - Jaccard Index: Higher overlap in top-20% important tokens
- Selection Strategy:
  - Precompute KV caches for each chunk in the auxiliary model
  - Concatenate the chunks with the query and process them as one batch
  - Extract the last-layer attention matrix and take the attention weights of query tokens over chunk tokens
  - Average over the query dimension to obtain an importance score for each chunk token
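The scoring steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the attention matrix has already been averaged over heads, that chunk tokens occupy the first columns, and that the top-scoring fraction of tokens (the recomputation ratio) is selected. The names `importance_scores` and `select_tokens` are hypothetical.

```python
def importance_scores(attn, num_chunk_tokens):
    """Average attention over query rows for each chunk-token column.

    attn: list of rows, one per query token; each row holds attention weights
    over all key positions, with chunk tokens in the first columns (assumption).
    """
    num_queries = len(attn)
    return [
        sum(row[j] for row in attn) / num_queries
        for j in range(num_chunk_tokens)
    ]


def select_tokens(scores, recompute_ratio):
    """Return the sorted positions of the highest-scoring tokens."""
    k = max(1, round(len(scores) * recompute_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Two query tokens attending over five chunk tokens plus one query position.
attn = [
    [0.05, 0.40, 0.05, 0.10, 0.30, 0.10],
    [0.10, 0.30, 0.05, 0.05, 0.40, 0.10],
]
scores = importance_scores(attn, num_chunk_tokens=5)
selected = select_tokens(scores, recompute_ratio=0.4)
print(selected)
```

In this toy input, chunk tokens 1 and 4 receive most of the query's attention, so they are the ones marked for recomputation.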
- Motivation: Avoid sparse, scattered KV cache updates that break up local context
- Implementation:
  - Partition the sequence into small windows (default 8 tokens)
  - If the number of selected tokens in a window exceeds the threshold (default 5), recompute the whole window
  - Otherwise skip the window, maintaining local context consistency
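A minimal sketch of the window grouping follows. It reads "exceed the threshold" literally as a strict comparison, which is an assumption; the function name `windows_to_recompute` is hypothetical.

```python
def windows_to_recompute(selected, seq_len, window=8, threshold=5):
    """Return [start, end) spans of windows whose selected-token count exceeds threshold."""
    selected = set(selected)
    spans = []
    for start in range(0, seq_len, window):
        end = min(start + window, seq_len)
        hits = sum(1 for pos in range(start, end) if pos in selected)
        if hits > threshold:      # strict comparison: an assumption about "exceed"
            spans.append((start, end))
    return spans


# Window 0-7 has 6 selected tokens, window 8-15 has 2, window 16-23 has 7.
selected = [0, 1, 2, 3, 4, 5, 9, 12, 16, 17, 18, 19, 20, 21, 22]
spans = windows_to_recompute(selected, seq_len=24)
print(spans)
```

Whole windows are either recomputed or left untouched, so isolated selected tokens (such as positions 9 and 12 here) do not trigger scattered single-token updates.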
- Handle tokenizer differences between auxiliary and primary models
- Recompute KV caches for selected segments, maintaining position ID consistency
- Selectively overwrite original KV cache entries
- Fine-tune small auxiliary model to improve token selection accuracy
- Significantly lower cost compared to fine-tuning primary model
- Enhances overall CacheClip performance
- Auxiliary model runs on CPU (utilizing idle head node CPU resources)
- Supports Intel AMX accelerators for matrix operation acceleration
- Token selection parallelized with primary model KV cache loading, hiding latency
- Supports runtime dynamic adjustment of recomputation ratio
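The overlap between CPU-side token selection and KV cache loading can be sketched with two concurrent tasks. Both worker functions below are hypothetical stand-ins that only sleep and return placeholder values; they are not part of any real serving stack.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def select_tokens_on_cpu(query):
    """Hypothetical stand-in for auxiliary-model token selection on the CPU."""
    time.sleep(0.05)
    return [1, 4]


def load_kv_cache(chunk_ids):
    """Hypothetical stand-in for loading precomputed primary-model KV caches."""
    time.sleep(0.05)
    return {cid: f"kv-{cid}" for cid in chunk_ids}


with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    selection_future = pool.submit(select_tokens_on_cpu, "example query")
    cache_future = pool.submit(load_kv_cache, [0, 1])
    selected = selection_future.result()
    kv_cache = cache_future.result()
    elapsed = time.perf_counter() - start

print(selected, sorted(kv_cache), f"{elapsed:.3f}s")
```

Because the two tasks run side by side, the wall-clock time is close to that of the slower task rather than the sum of both, which is the sense in which selection latency is hidden.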
- RULER: Retrieval category, an extended Needle-In-A-Haystack (NIAH) suite
  - Contains 8 challenging variants (excluding niah_multikey2/3)
  - Test sequence length: 8K tokens
  - Evaluation metric: Average Reference Coverage (ARC)
- LongBench: Long-context understanding benchmark
  - Uses the multifieldqa_zh, 2wikimqa, and hotpotqa datasets
  - Evaluation metrics: ROUGE-L and F1 scores
- Primary Model: Qwen2.5-14B
- Auxiliary Model: SmolLM2-135M (fine-tuned)
- Hardware: NVIDIA L20 GPU + Intel Xeon EMR CPU
- Document Chunking: 1000 tokens with 50 tokens overlap
- Full Attention: Complete attention computation (upper bound)
- Direct Reuse: Direct KV cache concatenation
- APE: Shared prefix + attention temperature adjustment
- CacheBlend: Selective recomputation based on early layers
- CacheClip vs CacheBlend (20% recomputation ratio):
  - Average performance: 94.50% vs 69.94%, a relative improvement of 35.1%
  - On multivalue tasks: 96% vs 42.97%, a gap of about 53 percentage points
- CacheClip vs APE:
  - Average performance: 94.50% vs 75.5%, a relative improvement of 25.2%
- Compared to Full Attention: Retains up to 94.8% of full-attention performance
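The 35.1% and 25.2% figures are consistent with relative gains over each baseline's average score rather than percentage-point differences. A quick check, using only the averages listed above:

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100


vs_cacheblend = relative_improvement(94.50, 69.94)
vs_ape = relative_improvement(94.50, 75.5)
print(f"vs CacheBlend: {vs_cacheblend:.1f}%")
print(f"vs APE: {vs_ape:.1f}%")
```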
| Method | multifieldqa_zh | 2wikimqa | hotpotqa |
|---|---|---|---|
| Full Attention | 64.93 | 54.36 | 59.71 |
| CacheClip | 58.05 | 42.77 | 51.32 |
| CacheBlend | 57.34 | 41.08 | 44.11 |
| APE | 59.70 | 38.34 | 45.29 |
- Prefill Acceleration: 1.92× (20% recomputation ratio)
- Latency Breakdown:
  - Token selection: 0.238s
  - Recomputation: 2.643s
  - Other overhead: 0.070s
  - Total time: 2.961s vs baseline 5.641s
- RULER-multivalue: Performance monotonically increases with recomputation ratio, validating effectiveness of selective recomputation
- RULER-single2/3: CacheBlend shows performance degradation at moderate recomputation ratios, while CacheClip avoids this through grouping strategy
Attention distribution similarity analysis (KL divergence, Jaccard index) indicates that small auxiliary models can effectively approximate large models' attention patterns.
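The two similarity measures can be computed as in the sketch below. The attention distributions are made-up toy values, not data from the paper, and the helper names are hypothetical; the top-20% cutoff mirrors the overlap measure mentioned earlier.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def top_fraction(scores, fraction=0.2):
    """Indices of the highest-scoring `fraction` of tokens."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard index of two index sets."""
    return len(a & b) / len(a | b)


# Toy attention distributions over 10 tokens (illustrative values only).
primary = [0.02, 0.30, 0.03, 0.05, 0.25, 0.05, 0.10, 0.05, 0.10, 0.05]
auxiliary = [0.03, 0.28, 0.04, 0.04, 0.26, 0.05, 0.09, 0.06, 0.10, 0.05]

kl = kl_divergence(primary, auxiliary)
overlap = jaccard(top_fraction(primary), top_fraction(auxiliary))
print(f"KL = {kl:.4f}, Jaccard(top 20%) = {overlap:.2f}")
```

A low KL divergence together with a high Jaccard index on the top-scoring tokens is the pattern that would justify using the auxiliary model's attention as a proxy for the primary model's.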
In the RULER-single2 task, CacheBlend outputs "566362" instead of the correct answer "5663623" because only some of the answer's tokens are recomputed. CacheClip's grouping strategy ensures that the tokens of a complete number are recomputed together, avoiding such errors.
- Fine-tuning Methods: Block Attention, TurboRAG, KVLink adapt to local attention through fine-tuning, but incur high costs and require high-quality datasets
- Cache Calibration: APE and Zhang et al. improve attention consistency through shared prefixes
- Selective Recomputation: CacheBlend selects tokens based on early-layer signals; Cache-Craft stores multiple cache versions
H2O, Quest, PyramidKV and other methods identify important tokens during decoding phase, providing inspiration for token selection in the prefill phase.
- CacheClip successfully resolves the efficiency-quality trade-off in RAG systems
- Auxiliary model-guided token selection strategy is effective and efficient
- Grouping strategy is crucial for maintaining context completeness
- System design avoids additional GPU overhead, demonstrating practical value
- Current experiments primarily validate on 8K sequence lengths; performance on longer sequences requires further verification
- Optimal matching strategies between auxiliary and primary models remain to be explored
- Generalization capability across different domains and task types needs validation
- Extend to longer sequences and more model architectures
- Optimize auxiliary model selection and fine-tuning strategies
- Explore dynamic recomputation ratio adjustment algorithms
- Investigate system optimization in multi-GPU environments
- Strong Technical Innovation: Novel approach of auxiliary model-guided token selection with solid theoretical foundation
- Comprehensive Experimental Design: Covers multiple datasets with detailed ablation studies and case analyses
- High Practical Value: Provides complete system design considering real deployment constraints
- Significant Performance Gains: Achieves up to 1.92× prefill acceleration while maintaining high quality
- Limited Evaluation Scope: Primarily tested on 8K sequences, lacking validation on ultra-long sequences
- Auxiliary Model Overhead: Although it runs on the CPU, the auxiliary model still increases system complexity
- Insufficient Generalization Verification: Primarily validated on specific model combinations; cross-architecture generalization unclear
- Academic Contribution: Provides new technical pathway for RAG system optimization
- Practical Value: Directly applicable to production environments, addressing real pain points
- Reproducibility: Clear method description with sufficient implementation details
- Interactive RAG applications requiring fast response
- High-concurrency RAG service systems
- Resource-constrained deployments requiring quality maintenance
- Complex query scenarios requiring cross-document reasoning
The paper cites 44 related works covering multiple domains including LLM inference optimization, attention mechanisms, and RAG systems, providing solid theoretical foundation for this work.