FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Lu, Chen, Chang et al.
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
Although large language models have achieved exceptional performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce memory usage and computational requirements; however, applying a uniform compression ratio across all layers often leads to significant performance degradation, and existing methods perform poorly during the decoding phase. To address these issues, the paper proposes the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines a rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Experiments on diverse benchmarks show that FLRC achieves up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods.
The core challenges faced by large language models (LLMs) are:
Deployment Difficulty: Enormous parameter counts and high computational requirements make deployment in resource-constrained environments, such as mobile devices and edge servers, challenging
Suboptimal Compression: Existing low-rank compression methods employ uniform compression ratios, neglecting the varying tolerance for compression across different layers
Decoding Performance Degradation: Existing methods primarily focus on the prefilling phase, and their performance declines significantly on tasks that decode many tokens, such as text summarization
Practical Deployment Needs: With the proliferation of LLM applications, the demand for efficient deployment on resource-constrained devices is increasingly urgent
Limitations of Existing Methods: Uniform compression strategies fail to fully exploit the heterogeneity of model structure
Generation Quality Assurance: Text generation tasks require high continuous decoding quality, necessitating specialized optimization strategies
Proposes a Fisher-based Layer-wise Rank Allocation Algorithm: Uses an importance measure built from gradients and weights to determine the rank of each projection layer, with a 49× speedup in search time compared to ASVD
Introduces Progressive Low-Rank Decoding Mechanism: Dynamically adjusts rank allocation during decoding, allocating more parameters to early tokens and gradually reducing for later tokens, improving compression rate while maintaining generation quality
Establishes Fine-grained Compression Framework: Combines layer-wise rank allocation with progressive decoding, forming a comprehensive LLM compression solution
Achieves Significant Performance Improvements: Improves ROUGE-L by up to 17.35% on summarization tasks compared to existing methods, while maintaining excellent performance on understanding tasks
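The progressive low-rank decoding idea listed above (more parameters for early tokens, fewer for later ones) can be illustrated with a simple schedule. This summary does not specify the schedule FLRC actually uses, so the linear decay below, the function name `progressive_keep_ratio`, and the default ratios are all assumptions made purely for illustration.

```python
def progressive_keep_ratio(step: int, total_steps: int,
                           start_ratio: float = 0.9,
                           end_ratio: float = 0.7) -> float:
    """Hypothetical linear schedule (not taken from the paper).

    Returns the fraction of parameters kept when decoding token `step`,
    decaying from `start_ratio` at the first token to `end_ratio` at the last.
    """
    if total_steps <= 1:
        return start_ratio
    # Clamp the step into the valid range, then interpolate linearly.
    clamped = min(max(step, 0), total_steps - 1)
    frac = clamped / (total_steps - 1)
    return start_ratio + (end_ratio - start_ratio) * frac


# Keep ratios for a 5-token generation: early tokens keep more parameters.
schedule = [progressive_keep_ratio(t, 5) for t in range(5)]
print(schedule)
```

Under this assumed schedule the keep ratio never increases as decoding proceeds, which matches the qualitative behavior described above; any monotonically decreasing schedule would serve the same illustrative purpose.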
Input: Pre-trained large language model M, target compression ratio
Output: Compressed model that reduces parameters and computational overhead while maintaining generation quality
Constraints: Maximize model performance within given parameter budget
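To make the parameter budget concrete: replacing an m×n weight matrix with two rank-r factors of shapes m×r and r×n stores r(m+n) values instead of mn. The sketch below computes the largest rank that fits a given budget. The function names and the `keep_ratio` parameter (the fraction of original parameters retained) are illustrative assumptions, not notation from the paper.

```python
def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters stored by a rank-r factorization W ~ A @ B (A: m x r, B: r x n)."""
    return r * (m + n)


def max_rank_for_budget(m: int, n: int, keep_ratio: float) -> int:
    """Largest rank r such that r * (m + n) <= keep_ratio * m * n."""
    budget = keep_ratio * m * n
    return int(budget // (m + n))


# Example: a 4096 x 4096 projection kept at 80% of its original size.
m, n = 4096, 4096
r = max_rank_for_budget(m, n, keep_ratio=0.8)
print(r)                         # 1638
print(low_rank_params(m, n, r))  # 13418496
print(m * n)                     # 16777216
```

For a square 4096×4096 matrix, the factorization only saves parameters when r is below 2048, which is why the rank assigned to each projection directly controls the achieved compression.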
The core idea of this algorithm is to assign a different rank to each projection layer in the model based on its importance, enabling differentiated compression.
Importance Calculation:
For each projection $p$ in layer $l$, the importance measure is defined as:
$\alpha_{l,p} = \sum_i \left( G_{l,p}[i] \times W_{l,p}[i] \right)^2$
where $G_{l,p}$ is the gradient and $W_{l,p}$ is the weight parameter.
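As a worked instance of the importance measure, the sketch below evaluates the sum of squared gradient-weight products for one projection whose gradient and weight have been flattened to lists. The function name and the toy values are made up for illustration, and the sketch assumes the gradient has already been computed on calibration data; how gradients are aggregated over calibration samples is not stated in this summary.

```python
def importance_score(grad, weight):
    """Sum over all entries i of (G[i] * W[i]) ** 2 for one projection."""
    if len(grad) != len(weight):
        raise ValueError("gradient and weight must have the same number of entries")
    return sum((g * w) ** 2 for g, w in zip(grad, weight))


# Toy flattened gradient and weight (illustrative values only).
G = [0.5, -1.0, 2.0]
W = [2.0, 0.5, -0.5]

alpha = importance_score(G, W)
print(alpha)  # 2.25  (1.0 + 0.25 + 1.0)
```

Because each term is squared, the sign of the gradient or weight does not matter; only the magnitude of their product contributes to the score.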
Rank Allocation Strategy:
$r_{l,p} = \mathrm{round}\left( \frac{\alpha_{l,p}}{S} \times R_{\text{budget}} \right)$
where $S = \sum_{l,p} \alpha_{l,p}$ is the total importance score and $R_{\text{budget}}$ is the total rank budget.
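The allocation rule can be sketched in a few lines: each projection receives a share of the total rank budget proportional to its importance score. The projection names, scores, and budget below are hypothetical values chosen so the arithmetic is easy to follow.

```python
def allocate_ranks(scores: dict, rank_budget: int) -> dict:
    """Assign each projection round(alpha / S * rank_budget), with S the sum of scores."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    return {name: round(alpha / total * rank_budget)
            for name, alpha in scores.items()}


# Hypothetical importance scores for three projections.
scores = {"layer0.q_proj": 1.0, "layer0.k_proj": 3.0, "layer1.q_proj": 4.0}

ranks = allocate_ranks(scores, rank_budget=64)
print(ranks)  # {'layer0.q_proj': 8, 'layer0.k_proj': 24, 'layer1.q_proj': 32}
```

In general, rounding can make the assigned ranks sum to slightly more or less than the budget, and a practical implementation would likely also clamp each rank to a valid range for its matrix; neither detail is covered in this summary, so both are left out of the sketch.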
Algorithm Flow:
Compute gradients for each projection layer using calibration dataset
Calculate importance scores based on gradients and weights
Application of Fisher Information Criterion: Combines gradient and weight information to assess projection importance, which is more accurate than relying solely on weight magnitude or on gradients
Dynamic Compression Paradigm: Transcends static compression limitations by dynamically adjusting compression rate according to generation process characteristics
Fine-grained Optimization: Performs optimization at projection level rather than layer level, enabling more precise resource allocation
End-to-end Framework: Unifies rank allocation and dynamic decoding in a single framework for coordinated optimization
Calibration Data Dependency: Method performance is influenced by calibration dataset selection, with different datasets potentially leading to performance variations
The paper cites extensive related work, primarily including:
Yuan et al., 2023 - ASVD method
Wang et al., 2024 - SVD-LLM method
Touvron et al., 2023 - LLaMA model series
Multiple references for benchmark datasets and evaluation tools
Overall Assessment: This is a high-quality research paper that proposes innovative solutions to key problems in the LLM compression field. The method design is sound, experimental validation is comprehensive, results are significant, and it possesses important academic and practical value. While there is room for improvement in theoretical analysis and hardware optimization, overall it represents an important contribution to the field.