
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Lu, Chen, Chang et al.

Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, which achieves up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.



Basic Information

Paper ID: 2510.09332
Title: FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Authors: Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
Institutions: National Yang Ming Chiao Tung University, Macronix International Co., Ltd., Cornell University
Classification: cs.CL cs.AI
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09332

Abstract

Although large language models have achieved exceptional performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce memory usage and computational requirements; however, applying uniform compression ratios across all layers often leads to significant performance degradation, and existing methods perform poorly during the decoding phase. To address these issues, this paper proposes Fine-grained Low-Rank Compressor (FLRC), which efficiently determines optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate FLRC's superiority, achieving up to 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods.

Research Background and Motivation

Problem Definition

The core challenges faced by large language models (LLMs) are:

Deployment Difficulty: Enormous parameter counts and high computational requirements make deployment on resource-constrained environments such as mobile devices and edge servers challenging
Suboptimal Compression: Existing low-rank compression methods employ uniform compression ratios, neglecting the varying tolerance for compression across different layers
Decoding Performance Degradation: Existing methods primarily focus on the prefilling phase, with significant performance decline in multi-turn decoding tasks such as text summarization

Research Motivation

Practical Deployment Needs: With the proliferation of LLM applications, the demand for efficient deployment on resource-constrained devices is increasingly urgent
Limitations of Existing Methods: Uniform compression strategies fail to fully exploit the heterogeneity of model structure
Generation Quality Assurance: Text generation tasks require high continuous decoding quality, necessitating specialized optimization strategies

Core Contributions

Proposes Fisher-based Layer-wise Rank Allocation (FLRA) Algorithm: Based on importance measures of gradients and weights, determines the rank allocation for each projection layer, achieving a 49× speedup in search time compared to ASVD
Introduces Progressive Low-Rank Decoding (PLRD) Mechanism: Dynamically adjusts rank allocation during decoding, allocating more parameters to early tokens and gradually reducing them for later tokens, improving compression rate while maintaining generation quality
Establishes Fine-grained Compression Framework: Combines layer-wise rank allocation with progressive decoding, forming a comprehensive LLM compression solution
Achieves Significant Performance Improvements: Reaches 17.35% ROUGE-L on DialogSum (LLaMA-3-8B-Instruct, 20% compression), roughly 17 points above ASVD and SVD-LLM in the same setting, while maintaining strong performance on understanding tasks

Methodology Details

Task Definition

Input: Pre-trained large language model M, target compression ratio
Output: Compressed model that reduces parameters and computational overhead while maintaining generation quality
Constraints: Maximize model performance within given parameter budget
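
To make the parameter budget concrete, the following is a minimal sketch of how a rank relates to a compression ratio for a single weight matrix. It assumes the standard low-rank factorization in which an m × n matrix is replaced by an m × r factor and an r × n factor, and it assumes that a compression ratio of 20% means 20% of the parameters are removed. The helper names and the 4096 × 4096 shape are illustrative and do not come from the paper.

```python
def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters stored when an m x n weight is replaced by (m x r) @ (r x n)."""
    return r * (m + n)


def rank_for_compression(m: int, n: int, ratio: float) -> int:
    """Largest rank whose two factors keep at most (1 - ratio) of the dense parameters.

    Assumption: `ratio` is the fraction of parameters removed.
    """
    budget = (1.0 - ratio) * m * n
    return int(budget // (m + n))


# Hypothetical square projection; the shape is chosen only for illustration.
m, n = 4096, 4096
r = rank_for_compression(m, n, 0.20)
kept_fraction = low_rank_params(m, n, r) / (m * n)
print(r, kept_fraction)
```

Under these assumptions, a uniform scheme would give every projection the rank implied by the same ratio, whereas a fine-grained scheme distributes the same total budget unevenly.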

Model Architecture

1. Fisher-based Layer-wise Rank Allocation

The core idea of this algorithm is to assign different ranks to each projection layer in the model based on their importance, enabling differentiated compression.

Importance Calculation: For each projection p in layer l, the importance measure is defined as:

$\alpha_{l,p} = \sum_i \left(G_{l,p}[i] \times W_{l,p}[i]\right)^2$

where $G_{l,p}$ is the gradient and $W_{l,p}$ is the weight parameter, with $i$ indexing their elements.
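
As a rough illustration of the importance measure above, the sketch below sums the squared element-wise product of a gradient matrix and a weight matrix. It uses plain nested lists so that it runs without dependencies; the function name and the toy values are hypothetical, and a real implementation would presumably operate on framework tensors with gradients obtained from a calibration set.

```python
def projection_importance(grad, weight):
    """Sum over all elements of (g * w) ** 2 for one projection matrix."""
    return sum(
        (g * w) ** 2
        for g_row, w_row in zip(grad, weight)
        for g, w in zip(g_row, w_row)
    )


# Toy 2 x 2 gradient and weight; the values are made up for illustration.
G = [[0.5, -1.0], [2.0, 0.0]]
W = [[2.0, 1.0], [0.5, 3.0]]
alpha = projection_importance(G, W)
print(alpha)
```

Note that an element with a large weight but zero gradient (the last entry here) contributes nothing, which is how the measure differs from a pure weight-magnitude criterion.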

Rank Allocation Strategy:

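$r_{l,p} = \mathrm{round}\left(\frac{\alpha_{l,p}}{S} \times R_{\text{budget}}\right)$

where $S$ is the total importance score (the sum of $\alpha_{l,p}$ over all projections) and $R_{\text{budget}}$ is the total rank budget.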
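
The proportional allocation can be sketched as follows. The projection names, importance scores, and budget are hypothetical values chosen so the arithmetic is easy to follow; only the formula itself is taken from the text.

```python
def allocate_ranks(importance: dict, rank_budget: int) -> dict:
    """Give each projection a share of the rank budget proportional to its importance."""
    total = sum(importance.values())
    return {name: round(a / total * rank_budget) for name, a in importance.items()}


# Hypothetical importance scores for four projections.
importance = {
    "layer0.q_proj": 1.0,
    "layer0.down_proj": 3.0,
    "layer1.q_proj": 2.0,
    "layer1.down_proj": 4.0,
}
ranks = allocate_ranks(importance, 1000)
print(ranks)
```

Because each rank is rounded independently, the allocated ranks are not guaranteed to sum exactly to the budget in general; how the paper handles any such residue is not stated in this summary.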

Algorithm Flow:

Compute gradients for each projection layer using calibration dataset
Calculate importance scores based on gradients and weights
Allocate rank budget proportionally to importance
Generate layer-wise rank allocation scheme

2. Progressive Low-Rank Decoding

This mechanism is based on the observation that early tokens have greater impact on overall coherence and quality in text generation.

Dynamic Rank Adjustment:

$r_{l,p}(t) = \mathrm{round}\left(\frac{\alpha_{l,p}}{S} \times R_{\text{budget}}(t)\right)$

where $R_{\text{budget}}(t)$ is the rank budget for the $t$-th token and is non-increasing in $t$.

Scheduling Strategy:

Early tokens: Use larger parameter sets to ensure generation quality
Later tokens: Gradually reduce rank configuration to improve overall compression rate
Determine optimal scheduling scheme through calibration dataset
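
The dynamic rank adjustment can be sketched by feeding a per-token budget into the same proportional formula. The linear decay used here is only a hypothetical stand-in: according to the text, the actual schedule is determined from a calibration dataset, and its shape is not given in this summary. All names and numbers below are illustrative.

```python
def linear_budget_schedule(start: int, end: int, num_tokens: int) -> list:
    """Non-increasing rank budget per decoding step (hypothetical linear decay)."""
    if num_tokens == 1:
        return [start]
    step = (start - end) / (num_tokens - 1)
    return [round(start - step * t) for t in range(num_tokens)]


def ranks_at_step(importance: dict, budget_t: int) -> dict:
    """Proportional rank allocation for one decoding step, given that step's budget."""
    total = sum(importance.values())
    return {name: round(a / total * budget_t) for name, a in importance.items()}


# Hypothetical importance scores and budget range.
importance = {"down_proj": 3.0, "q_proj": 1.0}
schedule = linear_budget_schedule(start=400, end=200, num_tokens=5)
per_step = [ranks_at_step(importance, b) for b in schedule]
print(schedule)
print(per_step[0], per_step[-1])
```

In this sketch the relative shares between projections stay fixed while the total shrinks, so early tokens are decoded with higher ranks than later ones.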

Technical Innovations

Application of Fisher Information Criterion: Combines gradient and weight information to assess projection importance, more accurate than methods based solely on weight magnitude or gradients
Dynamic Compression Paradigm: Transcends static compression limitations by dynamically adjusting compression rate according to generation process characteristics
Fine-grained Optimization: Performs optimization at projection level rather than layer level, enabling more precise resource allocation
End-to-end Framework: Unifies rank allocation and dynamic decoding in a single framework for coordinated optimization

Experimental Setup

Datasets

Summarization Tasks: DialogSum, CNN/DM
Understanding Tasks: Wikitext2 (perplexity), 7 zero-shot tasks from LM-Evaluation-Harness
Calibration Data:
- Rank allocation: 256 sequences from Wikitext2 training set (length 2048)
- Scheduler: 500 samples from DialogSum training set

Evaluation Metrics

Generation Tasks: ROUGE-L, BERTScore
Understanding Tasks: Perplexity, zero-shot accuracy
Efficiency Metrics: Search time, inference speed

Baseline Methods

ASVD: Activation-aware singular value decomposition
SVD-LLM: Truncation-aware data whitening method
Ablation Studies: Test the contributions of the Fisher-based Layer-wise Rank Allocation (FLRA) and Progressive Low-Rank Decoding (PLRD) components separately

Implementation Details

Models: LLaMA-2-7B-Chat, LLaMA-3-8B-Instruct, etc.
Compression Rates: 10%, 20%, 30%, and other levels
Hardware: A100 GPU
Based on SVD-LLM pipeline with FLRC's rank allocation and progressive decoding modules

Experimental Results

Main Results

Generation Task Performance

On LLaMA-3-8B-Instruct at 20% compression rate:

DialogSum ROUGE-L: FLRC 17.35% vs ASVD 0.10% vs SVD-LLM 0.24%
CNN/DM ROUGE-L: FLRC 17.72% vs ASVD 0.54% vs SVD-LLM 6.29%
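
The margins behind the headline improvement can be checked directly from the scores listed above. The short calculation below only subtracts the reported baseline scores from the reported FLRC scores; the differences are in ROUGE-L percentage points.

```python
# ROUGE-L scores as listed above (LLaMA-3-8B-Instruct, 20% compression).
scores = {
    "DialogSum": {"FLRC": 17.35, "ASVD": 0.10, "SVD-LLM": 0.24},
    "CNN/DM": {"FLRC": 17.72, "ASVD": 0.54, "SVD-LLM": 6.29},
}

# Absolute margin of FLRC over each baseline, per task.
margins = {
    task: {base: round(s["FLRC"] - s[base], 2) for base in ("ASVD", "SVD-LLM")}
    for task, s in scores.items()
}
largest = max(m for per_task in margins.values() for m in per_task.values())
print(margins)
print(largest)
```

The largest margin is the one over ASVD on DialogSum, which is consistent with the "up to 17%" figure quoted in the abstract if that figure is read as an absolute difference.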

Understanding Task Performance

On LLaMA-3-8B at 20% compression rate:

Wikitext2 Perplexity: FLRC 12.53 vs ASVD 3206.80 vs SVD-LLM 14.72
Average Zero-shot Accuracy: FLRC 43.66% vs ASVD 31.58% vs SVD-LLM 41.63%

Efficiency Improvements

Search Time: FLRC 3 minutes vs ASVD 147 minutes (49× speedup)
Inference Acceleration: Up to 2.12× speedup in offloading scenarios

Ablation Studies

On LLaMA-3-8B-Instruct at 20% compression rate for DialogSum task:

SVD-LLM only: 0.24% ROUGE-L
SVD-LLM + FLRA: 13.28% ROUGE-L
SVD-LLM + FLRA + PLRD: 17.35% ROUGE-L

Results demonstrate significant contributions from both components.

Case Analysis

The paper's importance analysis shows that:

Projection importance varies dramatically across different layers
down_proj typically has the highest importance scores
Later layers are more sensitive to compression than earlier layers

Experimental Findings

Layer-wise Heterogeneity: Significant differences exist in compression tolerance across different model layers
Decoding Sensitivity: Generation tasks are more sensitive to compression than understanding tasks
Scale Effects: FLRC's advantages become more pronounced in larger models
Generalizability: The method remains effective across different model architectures and precisions

Main Research Directions

Model Compression Techniques: Including pruning, quantization, knowledge distillation, etc.
Low-rank Decomposition Methods: SVD-based parameter matrix factorization techniques
Dynamic Inference: Adjusting model configuration based on input or computational stage

Compared to ASVD: FLRC proposes a more efficient rank allocation algorithm with significantly reduced search time
Compared to SVD-LLM: FLRC introduces a dynamic decoding mechanism that substantially improves generation task performance
Compared to Other Allocation Methods: The Fisher-based approach is more efficient and accurate than Hessian-based and Bayesian optimization methods

Comparative Advantages

Efficiency Advantage: Completes rank allocation in single iteration, avoiding iterative optimization overhead
Accuracy Advantage: Projection-level fine-grained optimization is more precise than layer-level or block-level optimization
Adaptability Advantage: Dynamic adjustment mechanism better accommodates characteristics of generation tasks

Conclusions and Discussion

Main Conclusions

Effectiveness of Fine-grained Compression: Projection-level differentiated compression significantly outperforms uniform compression strategies
Necessity of Dynamic Decoding: Progressive rank adjustment is crucial for maintaining generation quality
Method Generalizability: FLRC demonstrates excellent performance across different model scales and task types
Practical Value: Substantially improved search efficiency makes the method practically deployable

Limitations

Calibration Data Dependency: Method performance is influenced by calibration dataset selection, with different datasets potentially leading to performance variations
Scheduler Overhead: Dynamic rank allocation introduces additional computational overhead requiring further engineering optimization
Memory-bound Scenarios: More effective in memory-constrained environments, but advantages may be less pronounced in compute-constrained scenarios

Future Directions

Engineering Optimization: Focus on reducing dynamic rank allocation overhead and designing specialized kernels
Adaptive Scheduling: Develop more intelligent scheduling algorithms to reduce calibration data dependency
Multimodal Extension: Extend the method to compression of multimodal large models

In-depth Evaluation

Strengths

Strong Novelty: First application of Fisher information criterion to fine-grained rank allocation in LLMs, proposing new dynamic decoding paradigm
Comprehensive Experiments: Covers multiple models, tasks, and compression rates with well-designed ablation studies
Significant Results: Achieves breakthrough improvements on generation tasks, addressing key limitations of existing methods
High Practical Value: Substantially reduced search time and good acceleration effects enable practical deployment
In-depth Analysis: Provides rich analytical experiments including importance visualization and sensitivity analysis

Weaknesses

Theoretical Foundation: Lacks theoretical analysis of why Fisher-based importance measure is optimal
Scheduling Strategy: Progressive decoding scheduling strategy is primarily empirical, lacking theoretical guidance
Hardware Optimization: Implementation details of dynamic rank allocation on hardware are insufficiently detailed
Comparison Scope: Primarily compares with SVD-based methods, with limited comparison to other compression techniques

Impact

Academic Contribution: Provides new research directions and technical pathways for LLM compression field
Practical Value: Significant performance improvements and efficiency gains have important industrial application value
Reproducibility: Clear method description and detailed experimental setup ensure good reproducibility
Inspirational Significance: Dynamic compression concepts may inspire further related research

Applicable Scenarios

Edge Deployment: Particularly suitable for resource-constrained environments like mobile devices and edge servers
Memory-constrained Scenarios: Especially effective when model offloading is required
Generation Tasks: Particularly valuable for text summarization, dialogue generation, and similar tasks
Large-scale Models: Advantages become more pronounced in larger models

References

The paper cites abundant related work, primarily including:

Yuan et al., 2023 - ASVD method
Wang et al., 2024 - SVD-LLM method
Touvron et al., 2023 - LLaMA model series
Multiple references for benchmark datasets and evaluation tools

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to key problems in the LLM compression field. The method design is sound, experimental validation is comprehensive, results are significant, and it possesses important academic and practical value. While there is room for improvement in theoretical analysis and hardware optimization, overall it represents an important contribution to the field.