2025-11-16T09:58:12.370377

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Feng, Lv, Cao et al.

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose {\it Ada-KV}, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.

academic

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Basic Information

Paper ID: 2407.11550
Title: Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
Classification: cs.CL cs.AI
Publication Time/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2407.11550

Abstract

Large Language Models (LLMs) demonstrate exceptional performance across various domains but face efficiency challenges due to the growing Key-Value (KV) cache requirements in long-sequence inference. Recent research has reduced KV cache size by evicting large quantities of non-critical cache elements at runtime while maintaining generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, overlooking the unique attention patterns of individual heads. This paper establishes a theoretical upper bound on the loss between attention outputs before and after eviction, explaining the optimization objectives of previous cache eviction methods while guiding adaptive budget allocation optimization. Based on this foundation, the authors propose Ada-KV, the first head-level adaptive budget allocation strategy. This method offers plug-and-play advantages, enabling seamless integration with existing cache eviction methods.

Research Background and Motivation

Problem Description

As Large Language Models process increasingly longer sequences (e.g., GPT supporting 128K, Claude3 supporting 200K, Gemini-Pro-1.5 supporting 2M tokens), the memory requirements for KV caches grow exponentially. For an 8B-parameter LLM, processing a single 2M token sequence may require up to 256GB of cache, severely impacting GPU memory efficiency and computational runtime performance.

Limitations of Existing Methods

Existing cache eviction methods fall into two main categories:

Sliding Window Eviction Methods: Simply retain initial and recent cache elements but significantly degrade generation quality
Top-k Eviction Methods: Select critical cache elements based on attention weights but allocate budgets uniformly across all attention heads

The key issue is that existing methods overlook the unique characteristics of different attention heads: some heads exhibit sparse, concentrated attention patterns, while others have more dispersed attention distributions.

Research Motivation

Through analysis of the Llama-3.1-8B-Instruct model, the authors discovered that most attention heads can retain nearly all attention weights using only a small cache proportion (e.g., top 5%), while dispersed heads require larger cache proportions. This non-uniform attention concentration pattern provides a theoretical foundation for adaptive budget allocation.

Core Contributions

Adaptive Budget Allocation Strategy: Proposes Ada-KV, the first head-level adaptive budget allocation strategy that dynamically adjusts budget allocation according to the unique attention patterns of individual attention heads
Theoretical Framework Establishment: Establishes a theoretical framework for cache eviction, defines eviction loss and derives its upper bound, explaining the optimization objectives of existing methods and guiding Ada-KV's design
Plug-and-Play Compatibility: Ada-KV possesses plug-and-play characteristics, enabling seamless integration into existing cache eviction methods while maintaining computational efficiency through efficient CUDA kernel implementation
Comprehensive Experimental Validation: Conducts comprehensive evaluation on 29 datasets from Ruler and LongBench, demonstrating significant improvements in both question-aware and question-agnostic scenarios

Methodology Details

Task Definition

Given a multi-head self-attention layer, select KV cache elements to retain under budget constraints such that the loss between attention outputs after and before eviction is minimized.

Theoretical Foundation

L1 Eviction Loss Definition

The authors quantify eviction loss as the L1 distance between self-attention mechanism outputs before and after eviction:

$\text{L1 Eviction Loss} = ||y - \hat{y}||_1$

where $y$ and $\hat{y}$ are the attention outputs before and after eviction, respectively.

Loss Upper Bound Derivation

Theorem 3.1: The L1 eviction loss can be bounded by $\epsilon$ :

$\text{L1 Eviction Loss} \leq \epsilon = 2hC - 2C\sum_{i \in [1,h]}\sum_{j \in [1,n]} I_i^j A_i^j$

where $C = \max\{\|V_iW_i^O\|_\infty\}$ is a constant, $I_i^j$ is the eviction decision indicator variable, and $A_i^j$ is the attention weight.

Theorem 3.2: The Top-k cache eviction method minimizes the loss upper bound under a given budget allocation:

$\epsilon^* = 2hC - 2C\sum_{i \in [1,h]}\sum_{A_i^j \in \text{Top-k}(A_i, k=B_i)} A_i^j$

Ada-KV Algorithm

Algorithm 1: Adaptive Budget Allocation

Input: Total budget B, attention weights of each head {A_i}
Output: Allocated budgets {B_i^*}
1. Concatenate attention weights of all heads: A = Cat({A_i})
2. Select top B weights from A: Top-k(A, k=B)
3. Count the number of selected weights for each head: {f_i}
4. Set allocated budgets: {B_i^* = f_i}

Theoretical Advantages

Theorem 3.3: Adaptive budget allocation achieves the minimum loss upper bound:

$\epsilon^{**} = \min_{\{B_i\}} \epsilon^*$

Integration with Existing Methods

The authors demonstrate Ada-KV's integration with two SOTA methods:

Ada-SnapKV and Ada-Pyramid

Through Algorithm 2, Ada-KV can be seamlessly integrated into SnapKV and Pyramid:

Compute attention weights within the observation window
Use Ada-KV algorithm to allocate budgets
Apply safety protection parameter α = 0.2 to prevent over-sparse allocation
Execute Top-k eviction decisions

Technical Innovations

Global Optimization Perspective: Treats head-level budget allocation as a global optimization problem rather than local optimization
Theory-Guided Design: Guides algorithm design based on rigorous theoretical analysis
Computational Efficiency Guarantee: Maintains computational efficiency through variable-length FlashAttention and flattened cache layout
GQA Compatibility: Supports Group Query Attention for additional cache compression

Experimental Setup

Datasets

Ruler Benchmark: 13 long-sequence tasks, primarily variants of Needle-in-a-Haystack tests, evaluating 16K length
LongBench Benchmark: 16 datasets covering single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code generation

Base Models

Llama-3.1-8B-Instruct
Mistral-7B-instruct-v0.2

Evaluation Metrics

Use appropriate metrics based on task type: F1 score (QA tasks), Rouge-L (summarization tasks), accuracy (classification tasks), edit similarity (code tasks)

Comparison Methods

Baseline Methods: SnapKV, Pyramid, StreamingLLM
Enhanced Versions: Ada-SnapKV, Ada-Pyramid

Experimental Scenarios

Question-Aware Compression: Standard scenario with known questions
Question-Agnostic Compression: More challenging real-world application scenario

Experimental Results

Main Results

Ruler Benchmark Testing

In the question-agnostic scenario using Llama-3.1-8B-Instruct:

80% cache budget: Ada-SnapKV improves SnapKV's score from 87.59 to 92.67
20% cache budget: Ada-SnapKV improves SnapKV's score from 44.02 to 53.29

LongBench Benchmark Testing

In the question-agnostic scenario:

Ada-SnapKV and Ada-Pyramid consistently improve generation quality under all fixed budget settings
Approach near-lossless performance at 2048 budget

Sub-task Analysis

In the challenging Needle-in-a-Haystack task:

S-NIAH-3 task (80% budget): Ada-SnapKV improves SnapKV from 62.4 to 97.6
MK-NIAH-2 task (80% budget): Ada-SnapKV improves SnapKV from 85.2 to 99.6

Computational Efficiency

Ada-SnapKV under fixed 1024 budget:

Peak memory usage comparable to original SnapKV
Decoding latency comparable to original SnapKV
Both significantly outperform full cache scenarios

Broad Application Verification

Ada-KV strategy has been adopted by multiple subsequent works:

CriticalKV + Ada-KV: Improves from 42.99 to 43.77 at 20% cache
DefensiveKV + Ada-KV: Improves from 43.78 to 46.68 at 20% cache

Cache Eviction Methods

Sliding Window Methods: StreamingLLM, etc., simple but with large quality loss
Top-k Methods: H2O, SnapKV, Pyramid, etc., select critical elements based on attention weights

Sparse Attention Methods

Conceptually related to cache eviction but methodologically different:

Cache Eviction: Retain only a subset of KV cache
Sparse Attention: Retain all entries but selectively use them

KV Cache Quantization: Reduce precision of individual elements
Speculative Decoding: Generate drafts using models with reduced cache
Paged Attention: Efficient memory management

Conclusions and Discussion

Main Conclusions

Ada-KV is the first to propose a head-level adaptive budget allocation strategy, significantly improving the performance of existing cache eviction methods
Theoretical analysis establishes a rigorous framework for cache eviction, guiding algorithm design
The question-agnostic compression scenario reveals limitations of existing methods and deserves more attention

Limitations

Current head-level allocation is limited to within-layer scenarios, not extended to cross-layer allocation
Safety protection parameter α requires trade-offs between performance under different budgets
Theoretical analysis is based on L1 distance, which may not fully reflect actual generation quality

Future Directions

Extend head-level allocation mechanisms to cross-layer scenarios
Develop corresponding cross-layer theoretical analysis
Incorporate training-time head importance analysis
Joint optimization with other techniques (e.g., quantization, sparse attention)

In-Depth Evaluation

Strengths

Solid Theoretical Contribution: Establishes a complete theoretical framework with clear logic from loss upper bound derivation to algorithm design
Simple and Effective Method: The algorithm is concise and understandable, with plug-and-play characteristics enabling easy adoption
Comprehensive Experiments: Thorough evaluation on 29 datasets, including the previously overlooked question-agnostic scenario
High Practical Value: Already adopted by multiple subsequent works, demonstrating the method's value and impact

Weaknesses

Gap Between Theory and Practice: While theoretically minimizing the loss upper bound, it cannot guarantee actual loss minimization
Hyperparameter Sensitivity: Selection of safety protection parameter α requires empirical tuning
Scalability Limitations: Currently only considers budget allocation within a single layer
Evaluation Limitations: Primarily evaluated on medium-scale models; effectiveness on large-scale models remains to be verified

Impact

Academic Contribution: Provides new research directions for the KV cache optimization field
Practical Value: Plug-and-play characteristics enable easy deployment in real systems
Reproducibility: Provides open-source code and detailed implementation details
Inspirational Value: Provides theoretical framework and methodological guidance for subsequent research

Applicable Scenarios

Long-Sequence Inference: Particularly suitable for applications requiring long context processing
Resource-Constrained Environments: Optimizes inference efficiency when GPU memory is limited
Real-Time Systems: Balances quality and efficiency in online services
Multi-Turn Dialogue: Question-agnostic compression scenario particularly suits dialogue systems

References

The paper cites 64 related references, primarily including:

Foundational Large Language Model works (GPT-4, Claude, Gemini, etc.)
KV Cache Optimization Methods (H2O, SnapKV, Pyramid, etc.)
Attention Mechanism Optimization (FlashAttention, sparse attention, etc.)
Long-Sequence Processing Benchmarks (Ruler, LongBench, etc.)

Overall Assessment: This is a high-quality research paper that achieves good balance between theoretical contribution and practical value. The Ada-KV method is simple yet effective, with rigorous theoretical analysis and comprehensive experimental validation. The paper not only addresses important limitations of existing methods but also provides valuable frameworks and directions for future research.