2025-11-16T09:58:12.370377

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Feng, Lv, Cao et al.
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose {\it Ada-KV}, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.
academic

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Basic Information

  • Paper ID: 2407.11550
  • Title: Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
  • Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
  • Classification: cs.CL cs.AI
  • Publication Time/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2407.11550

Abstract

Large Language Models (LLMs) demonstrate exceptional performance across various domains but face efficiency challenges due to the growing Key-Value (KV) cache requirements in long-sequence inference. Recent research has reduced KV cache size by evicting large quantities of non-critical cache elements at runtime while maintaining generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, overlooking the unique attention patterns of individual heads. This paper establishes a theoretical upper bound on the loss between attention outputs before and after eviction, explaining the optimization objectives of previous cache eviction methods while guiding adaptive budget allocation optimization. Based on this foundation, the authors propose Ada-KV, the first head-level adaptive budget allocation strategy. This method offers plug-and-play advantages, enabling seamless integration with existing cache eviction methods.

Research Background and Motivation

Problem Description

As Large Language Models process increasingly longer sequences (e.g., GPT supporting 128K, Claude3 supporting 200K, Gemini-Pro-1.5 supporting 2M tokens), the memory requirements for KV caches grow exponentially. For an 8B-parameter LLM, processing a single 2M token sequence may require up to 256GB of cache, severely impacting GPU memory efficiency and computational runtime performance.

Limitations of Existing Methods

Existing cache eviction methods fall into two main categories:

  1. Sliding Window Eviction Methods: Simply retain initial and recent cache elements but significantly degrade generation quality
  2. Top-k Eviction Methods: Select critical cache elements based on attention weights but allocate budgets uniformly across all attention heads

The key issue is that existing methods overlook the unique characteristics of different attention heads: some heads exhibit sparse, concentrated attention patterns, while others have more dispersed attention distributions.

Research Motivation

Through analysis of the Llama-3.1-8B-Instruct model, the authors discovered that most attention heads can retain nearly all attention weights using only a small cache proportion (e.g., top 5%), while dispersed heads require larger cache proportions. This non-uniform attention concentration pattern provides a theoretical foundation for adaptive budget allocation.

Core Contributions

  1. Adaptive Budget Allocation Strategy: Proposes Ada-KV, the first head-level adaptive budget allocation strategy that dynamically adjusts budget allocation according to the unique attention patterns of individual attention heads
  2. Theoretical Framework Establishment: Establishes a theoretical framework for cache eviction, defines eviction loss and derives its upper bound, explaining the optimization objectives of existing methods and guiding Ada-KV's design
  3. Plug-and-Play Compatibility: Ada-KV possesses plug-and-play characteristics, enabling seamless integration into existing cache eviction methods while maintaining computational efficiency through efficient CUDA kernel implementation
  4. Comprehensive Experimental Validation: Conducts comprehensive evaluation on 29 datasets from Ruler and LongBench, demonstrating significant improvements in both question-aware and question-agnostic scenarios

Methodology Details

Task Definition

Given a multi-head self-attention layer, select KV cache elements to retain under budget constraints such that the loss between attention outputs after and before eviction is minimized.

Theoretical Foundation

L1 Eviction Loss Definition

The authors quantify eviction loss as the L1 distance between self-attention mechanism outputs before and after eviction:

L1 Eviction Loss=yy^1\text{L1 Eviction Loss} = ||y - \hat{y}||_1

where yy and y^\hat{y} are the attention outputs before and after eviction, respectively.

Loss Upper Bound Derivation

Theorem 3.1: The L1 eviction loss can be bounded by ϵ\epsilon:

L1 Eviction Lossϵ=2hC2Ci[1,h]j[1,n]IijAij\text{L1 Eviction Loss} \leq \epsilon = 2hC - 2C\sum_{i \in [1,h]}\sum_{j \in [1,n]} I_i^j A_i^j

where C=max{ViWiO}C = \max\{\|V_iW_i^O\|_\infty\} is a constant, IijI_i^j is the eviction decision indicator variable, and AijA_i^j is the attention weight.

Theorem 3.2: The Top-k cache eviction method minimizes the loss upper bound under a given budget allocation:

ϵ=2hC2Ci[1,h]AijTop-k(Ai,k=Bi)Aij\epsilon^* = 2hC - 2C\sum_{i \in [1,h]}\sum_{A_i^j \in \text{Top-k}(A_i, k=B_i)} A_i^j

Ada-KV Algorithm

Algorithm 1: Adaptive Budget Allocation

Input: Total budget B, attention weights of each head {A_i}
Output: Allocated budgets {B_i^*}
1. Concatenate attention weights of all heads: A = Cat({A_i})
2. Select top B weights from A: Top-k(A, k=B)
3. Count the number of selected weights for each head: {f_i}
4. Set allocated budgets: {B_i^* = f_i}

Theoretical Advantages

Theorem 3.3: Adaptive budget allocation achieves the minimum loss upper bound:

ϵ=min{Bi}ϵ\epsilon^{**} = \min_{\{B_i\}} \epsilon^*

Integration with Existing Methods

The authors demonstrate Ada-KV's integration with two SOTA methods:

Ada-SnapKV and Ada-Pyramid

Through Algorithm 2, Ada-KV can be seamlessly integrated into SnapKV and Pyramid:

  1. Compute attention weights within the observation window
  2. Use Ada-KV algorithm to allocate budgets
  3. Apply safety protection parameter α = 0.2 to prevent over-sparse allocation
  4. Execute Top-k eviction decisions

Technical Innovations

  1. Global Optimization Perspective: Treats head-level budget allocation as a global optimization problem rather than local optimization
  2. Theory-Guided Design: Guides algorithm design based on rigorous theoretical analysis
  3. Computational Efficiency Guarantee: Maintains computational efficiency through variable-length FlashAttention and flattened cache layout
  4. GQA Compatibility: Supports Group Query Attention for additional cache compression

Experimental Setup

Datasets

  • Ruler Benchmark: 13 long-sequence tasks, primarily variants of Needle-in-a-Haystack tests, evaluating 16K length
  • LongBench Benchmark: 16 datasets covering single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code generation

Base Models

  • Llama-3.1-8B-Instruct
  • Mistral-7B-instruct-v0.2

Evaluation Metrics

Use appropriate metrics based on task type: F1 score (QA tasks), Rouge-L (summarization tasks), accuracy (classification tasks), edit similarity (code tasks)

Comparison Methods

  • Baseline Methods: SnapKV, Pyramid, StreamingLLM
  • Enhanced Versions: Ada-SnapKV, Ada-Pyramid

Experimental Scenarios

  • Question-Aware Compression: Standard scenario with known questions
  • Question-Agnostic Compression: More challenging real-world application scenario

Experimental Results

Main Results

Ruler Benchmark Testing

In the question-agnostic scenario using Llama-3.1-8B-Instruct:

  • 80% cache budget: Ada-SnapKV improves SnapKV's score from 87.59 to 92.67
  • 20% cache budget: Ada-SnapKV improves SnapKV's score from 44.02 to 53.29

LongBench Benchmark Testing

In the question-agnostic scenario:

  • Ada-SnapKV and Ada-Pyramid consistently improve generation quality under all fixed budget settings
  • Approach near-lossless performance at 2048 budget

Sub-task Analysis

In the challenging Needle-in-a-Haystack task:

  • S-NIAH-3 task (80% budget): Ada-SnapKV improves SnapKV from 62.4 to 97.6
  • MK-NIAH-2 task (80% budget): Ada-SnapKV improves SnapKV from 85.2 to 99.6

Computational Efficiency

Ada-SnapKV under fixed 1024 budget:

  • Peak memory usage comparable to original SnapKV
  • Decoding latency comparable to original SnapKV
  • Both significantly outperform full cache scenarios

Broad Application Verification

Ada-KV strategy has been adopted by multiple subsequent works:

  • CriticalKV + Ada-KV: Improves from 42.99 to 43.77 at 20% cache
  • DefensiveKV + Ada-KV: Improves from 43.78 to 46.68 at 20% cache

Cache Eviction Methods

  • Sliding Window Methods: StreamingLLM, etc., simple but with large quality loss
  • Top-k Methods: H2O, SnapKV, Pyramid, etc., select critical elements based on attention weights

Sparse Attention Methods

Conceptually related to cache eviction but methodologically different:

  • Cache Eviction: Retain only a subset of KV cache
  • Sparse Attention: Retain all entries but selectively use them
  • KV Cache Quantization: Reduce precision of individual elements
  • Speculative Decoding: Generate drafts using models with reduced cache
  • Paged Attention: Efficient memory management

Conclusions and Discussion

Main Conclusions

  1. Ada-KV is the first to propose a head-level adaptive budget allocation strategy, significantly improving the performance of existing cache eviction methods
  2. Theoretical analysis establishes a rigorous framework for cache eviction, guiding algorithm design
  3. The question-agnostic compression scenario reveals limitations of existing methods and deserves more attention

Limitations

  1. Current head-level allocation is limited to within-layer scenarios, not extended to cross-layer allocation
  2. Safety protection parameter α requires trade-offs between performance under different budgets
  3. Theoretical analysis is based on L1 distance, which may not fully reflect actual generation quality

Future Directions

  1. Extend head-level allocation mechanisms to cross-layer scenarios
  2. Develop corresponding cross-layer theoretical analysis
  3. Incorporate training-time head importance analysis
  4. Joint optimization with other techniques (e.g., quantization, sparse attention)

In-Depth Evaluation

Strengths

  1. Solid Theoretical Contribution: Establishes a complete theoretical framework with clear logic from loss upper bound derivation to algorithm design
  2. Simple and Effective Method: The algorithm is concise and understandable, with plug-and-play characteristics enabling easy adoption
  3. Comprehensive Experiments: Thorough evaluation on 29 datasets, including the previously overlooked question-agnostic scenario
  4. High Practical Value: Already adopted by multiple subsequent works, demonstrating the method's value and impact

Weaknesses

  1. Gap Between Theory and Practice: While theoretically minimizing the loss upper bound, it cannot guarantee actual loss minimization
  2. Hyperparameter Sensitivity: Selection of safety protection parameter α requires empirical tuning
  3. Scalability Limitations: Currently only considers budget allocation within a single layer
  4. Evaluation Limitations: Primarily evaluated on medium-scale models; effectiveness on large-scale models remains to be verified

Impact

  1. Academic Contribution: Provides new research directions for the KV cache optimization field
  2. Practical Value: Plug-and-play characteristics enable easy deployment in real systems
  3. Reproducibility: Provides open-source code and detailed implementation details
  4. Inspirational Value: Provides theoretical framework and methodological guidance for subsequent research

Applicable Scenarios

  1. Long-Sequence Inference: Particularly suitable for applications requiring long context processing
  2. Resource-Constrained Environments: Optimizes inference efficiency when GPU memory is limited
  3. Real-Time Systems: Balances quality and efficiency in online services
  4. Multi-Turn Dialogue: Question-agnostic compression scenario particularly suits dialogue systems

References

The paper cites 64 related references, primarily including:

  • Foundational Large Language Model works (GPT-4, Claude, Gemini, etc.)
  • KV Cache Optimization Methods (H2O, SnapKV, Pyramid, etc.)
  • Attention Mechanism Optimization (FlashAttention, sparse attention, etc.)
  • Long-Sequence Processing Benchmarks (Ruler, LongBench, etc.)

Overall Assessment: This is a high-quality research paper that achieves good balance between theoretical contribution and practical value. The Ada-KV method is simple yet effective, with rigorous theoretical analysis and comprehensive experimental validation. The paper not only addresses important limitations of existing methods but also provides valuable frameworks and directions for future research.