Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Feng, Lv, Cao et al.
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose {\it Ada-KV}, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.
academic
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Large Language Models (LLMs) demonstrate exceptional performance across various domains but face efficiency challenges due to the growing Key-Value (KV) cache requirements in long-sequence inference. Recent research has reduced KV cache size by evicting large quantities of non-critical cache elements at runtime while maintaining generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, overlooking the unique attention patterns of individual heads. This paper establishes a theoretical upper bound on the loss between attention outputs before and after eviction, explaining the optimization objectives of previous cache eviction methods while guiding adaptive budget allocation optimization. Based on this foundation, the authors propose Ada-KV, the first head-level adaptive budget allocation strategy. This method offers plug-and-play advantages, enabling seamless integration with existing cache eviction methods.
As Large Language Models process increasingly longer sequences (e.g., GPT supporting 128K, Claude3 supporting 200K, Gemini-Pro-1.5 supporting 2M tokens), the memory requirements for KV caches grow exponentially. For an 8B-parameter LLM, processing a single 2M token sequence may require up to 256GB of cache, severely impacting GPU memory efficiency and computational runtime performance.
Existing cache eviction methods fall into two main categories:
Sliding Window Eviction Methods: Simply retain initial and recent cache elements but significantly degrade generation quality
Top-k Eviction Methods: Select critical cache elements based on attention weights but allocate budgets uniformly across all attention heads
The key issue is that existing methods overlook the unique characteristics of different attention heads: some heads exhibit sparse, concentrated attention patterns, while others have more dispersed attention distributions.
Through analysis of the Llama-3.1-8B-Instruct model, the authors discovered that most attention heads can retain nearly all attention weights using only a small cache proportion (e.g., top 5%), while dispersed heads require larger cache proportions. This non-uniform attention concentration pattern provides a theoretical foundation for adaptive budget allocation.
Adaptive Budget Allocation Strategy: Proposes Ada-KV, the first head-level adaptive budget allocation strategy that dynamically adjusts budget allocation according to the unique attention patterns of individual attention heads
Theoretical Framework Establishment: Establishes a theoretical framework for cache eviction, defines eviction loss and derives its upper bound, explaining the optimization objectives of existing methods and guiding Ada-KV's design
Plug-and-Play Compatibility: Ada-KV possesses plug-and-play characteristics, enabling seamless integration into existing cache eviction methods while maintaining computational efficiency through efficient CUDA kernel implementation
Comprehensive Experimental Validation: Conducts comprehensive evaluation on 29 datasets from Ruler and LongBench, demonstrating significant improvements in both question-aware and question-agnostic scenarios
Given a multi-head self-attention layer, select KV cache elements to retain under budget constraints such that the loss between attention outputs after and before eviction is minimized.
Input: Total budget B, attention weights of each head {A_i}
Output: Allocated budgets {B_i^*}
1. Concatenate attention weights of all heads: A = Cat({A_i})
2. Select top B weights from A: Top-k(A, k=B)
3. Count the number of selected weights for each head: {f_i}
4. Set allocated budgets: {B_i^* = f_i}
Ada-KV is the first to propose a head-level adaptive budget allocation strategy, significantly improving the performance of existing cache eviction methods
Theoretical analysis establishes a rigorous framework for cache eviction, guiding algorithm design
The question-agnostic compression scenario reveals limitations of existing methods and deserves more attention
Overall Assessment: This is a high-quality research paper that achieves good balance between theoretical contribution and practical value. The Ada-KV method is simple yet effective, with rigorous theoretical analysis and comprehensive experimental validation. The paper not only addresses important limitations of existing methods but also provides valuable frameworks and directions for future research.