FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Lu, Chen, Chang et al.
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.

Basic Information

  • Paper ID: 2510.09332
  • Title: FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
  • Authors: Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
  • Institutions: National Yang Ming Chiao Tung University, Macronix International Co., Ltd., Cornell University
  • Classification: cs.CL cs.AI
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09332

Abstract

Although large language models have achieved exceptional performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce memory usage and computational requirements; however, applying uniform compression ratios across all layers often leads to significant performance degradation, and existing methods perform poorly during the decoding phase. To address these issues, this paper proposes Fine-grained Low-Rank Compressor (FLRC), which efficiently determines optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate FLRC's superiority, achieving up to 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods.

Research Background and Motivation

Problem Definition

The core challenges faced by large language models (LLMs) are:

  1. Deployment Difficulty: Enormous parameter counts and high computational requirements make deployment on resource-constrained environments such as mobile devices and edge servers challenging
  2. Suboptimal Compression: Existing low-rank compression methods employ uniform compression ratios, neglecting the varying tolerance for compression across different layers
  3. Decoding Performance Degradation: Existing methods primarily focus on the prefilling phase, and their quality declines sharply on tasks that decode many tokens in sequence, such as text summarization

Research Motivation

  1. Practical Deployment Needs: With the proliferation of LLM applications, the demand for efficient deployment on resource-constrained devices is increasingly urgent
  2. Limitations of Existing Methods: Uniform compression strategies fail to fully exploit the heterogeneity of model structure
  3. Generation Quality Assurance: Text generation tasks require high continuous decoding quality, necessitating specialized optimization strategies

Core Contributions

  1. Proposes Fisher-based Layer-wise Rank Allocation (FLRA): Based on importance measures of gradients and weights, determines the rank allocation for each projection layer, achieving a 49× speedup in search time compared to ASVD
  2. Introduces Progressive Low-Rank Decoding (PLRD): Dynamically adjusts rank allocation during decoding, allocating more parameters to early tokens and gradually fewer to later tokens, improving the compression rate while maintaining generation quality
  3. Establishes Fine-grained Compression Framework: Combines layer-wise rank allocation with progressive decoding, forming a comprehensive LLM compression solution
  4. Achieves Significant Performance Improvements: Raises ROUGE-L on summarization tasks by up to about 17 points over existing methods (reaching 17.35% on DialogSum), while maintaining strong performance on understanding tasks

Methodology Details

Task Definition

  • Input: Pre-trained large language model M, target compression ratio
  • Output: Compressed model that reduces parameters and computational overhead while maintaining generation quality
  • Constraints: Maximize model performance within given parameter budget
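
To make the parameter budget concrete, the following minimal sketch relates a target compression ratio to the rank of a single projection. It assumes the usual low-rank factorization of an m × n weight into an m × r and an r × n factor, and it assumes that "compression ratio" means the fraction of parameters removed; both the function names and that definition are illustrative assumptions, not taken from the paper.

```python
def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameter count of the two factors (m x r) and (r x n)."""
    return r * (m + n)


def rank_for_ratio(m: int, n: int, compression_ratio: float) -> int:
    """Largest rank r with r * (m + n) <= (1 - compression_ratio) * m * n.

    Assumption: compression_ratio is the fraction of parameters removed.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    kept_params = (1.0 - compression_ratio) * m * n
    # Keep at least rank 1 so the projection is never dropped entirely.
    return max(1, int(kept_params // (m + n)))


# Example: a 4096 x 4096 projection at a 20% compression ratio.
r = rank_for_ratio(4096, 4096, 0.20)
kept_fraction = low_rank_params(4096, 4096, r) / (4096 * 4096)
print(r, kept_fraction)  # prints: 1638 0.7998046875
```

Under these assumptions, a uniform scheme would give every projection of the same shape the same rank; the rank allocation described below redistributes that budget across projections.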

Model Architecture

1. Fisher-based Layer-wise Rank Allocation

The core idea of this algorithm is to assign a different rank to each projection layer in the model based on its importance, enabling differentiated compression.

Importance Calculation: For each projection p in layer l, the importance measure is defined as:

$$\alpha_{l,p} = \sum_i \left( G_{l,p}[i] \cdot W_{l,p}[i] \right)^2$$

where $G_{l,p}$ is the gradient and $W_{l,p}$ is the weight parameter.

Rank Allocation Strategy:

$$r_{l,p} = \mathrm{round}\left( \frac{\alpha_{l,p}}{S} \cdot R_{\mathrm{budget}} \right)$$

where $S$ is the total importance score and $R_{\mathrm{budget}}$ is the total rank budget.
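
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the nested-list matrix representation, and the projection names in the example are hypothetical, and the gradients are assumed to have already been computed on the calibration data.

```python
from typing import Dict, List

Matrix = List[List[float]]


def fisher_importance(grad: Matrix, weight: Matrix) -> float:
    """alpha = sum over all entries i of (G[i] * W[i])^2 for one projection."""
    return sum(
        (g * w) ** 2
        for g_row, w_row in zip(grad, weight)
        for g, w in zip(g_row, w_row)
    )


def allocate_ranks(scores: Dict[str, float], rank_budget: int) -> Dict[str, int]:
    """r = round(alpha / S * R_budget), where S is the sum of all scores."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    return {
        name: round(score / total * rank_budget)
        for name, score in scores.items()
    }


# Toy example with hypothetical projection names and made-up scores.
alpha = fisher_importance([[1.0, 2.0], [0.5, 0.0]], [[2.0, 1.0], [4.0, 3.0]])
print(alpha)  # prints: 12.0
print(allocate_ranks({"layer0.q_proj": 1.0, "layer0.down_proj": 3.0}, 1000))
```

Because each rank is rounded independently, the allocated ranks may not sum exactly to the budget, and Python's `round` rounds exact halves to the nearest even integer. The source does not say how such cases are handled, so this sketch leaves them as they are.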

Algorithm Flow:

  1. Compute gradients for each projection layer using calibration dataset
  2. Calculate importance scores based on gradients and weights
  3. Allocate rank budget proportionally to importance
  4. Generate layer-wise rank allocation scheme

2. Progressive Low-Rank Decoding

This mechanism is based on the observation that early tokens have greater impact on overall coherence and quality in text generation.

Dynamic Rank Adjustment:

$$r_{l,p}(t) = \mathrm{round}\left( \frac{\alpha_{l,p}}{S} \cdot R_{\mathrm{budget}}(t) \right)$$

where $R_{\mathrm{budget}}(t)$ is the rank budget for the t-th token and is non-increasing in t.

Scheduling Strategy:

  • Early tokens: Use larger parameter sets to ensure generation quality
  • Later tokens: Gradually reduce rank configuration to improve overall compression rate
  • Determine optimal scheduling scheme through calibration dataset
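
The scheduling idea can be sketched as follows. The paper determines its schedule from calibration data; the linear decay used here is only a stand-in chosen for illustration, and the function names, budgets, and projection names are hypothetical.

```python
from typing import Dict, List


def linear_budget_schedule(start: int, end: int, num_tokens: int) -> List[int]:
    """Non-increasing rank budget R_budget(t), decaying linearly from start to end.

    Assumption: a linear decay stands in for the calibrated schedule.
    """
    if start < end:
        raise ValueError("start budget must be >= end budget")
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    if num_tokens == 1:
        return [start]
    step = (start - end) / (num_tokens - 1)
    return [round(start - step * t) for t in range(num_tokens)]


def ranks_at_step(scores: Dict[str, float], budget_t: int) -> Dict[str, int]:
    """r(t) = round(alpha / S * R_budget(t)) for every projection."""
    total = sum(scores.values())
    return {name: round(s / total * budget_t) for name, s in scores.items()}


scores = {"layer0.q_proj": 1.0, "layer0.down_proj": 3.0}
schedule = linear_budget_schedule(start=1000, end=600, num_tokens=5)
print(schedule)  # prints: [1000, 900, 800, 700, 600]
for t, budget_t in enumerate(schedule):
    print(t, ranks_at_step(scores, budget_t))
```

Early tokens receive the full budget and later tokens a smaller one, while the importance scores keep the relative split between projections fixed at every step.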

Technical Innovations

  1. Application of Fisher Information Criterion: Combines gradient and weight information to assess projection importance, more accurate than methods based solely on weight magnitude or gradients
  2. Dynamic Compression Paradigm: Transcends static compression limitations by dynamically adjusting compression rate according to generation process characteristics
  3. Fine-grained Optimization: Performs optimization at projection level rather than layer level, enabling more precise resource allocation
  4. End-to-end Framework: Unifies rank allocation and dynamic decoding in a single framework for coordinated optimization

Experimental Setup

Datasets

  1. Summarization Tasks: DialogSum, CNN/DM
  2. Understanding Tasks: Wikitext2 (perplexity), 7 zero-shot tasks from LM-Evaluation-Harness
  3. Calibration Data:
    • Rank allocation: 256 sequences from Wikitext2 training set (length 2048)
    • Scheduler: 500 samples from DialogSum training set

Evaluation Metrics

  1. Generation Tasks: ROUGE-L, BERTScore
  2. Understanding Tasks: Perplexity, zero-shot accuracy
  3. Efficiency Metrics: Search time, inference speed
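
For readers unfamiliar with the two headline metrics, here is a simplified sketch of how they are defined. It assumes whitespace tokenization and a plain F1 over the longest common subsequence for ROUGE-L, and token-level natural-log probabilities for perplexity. Standard evaluation tools apply their own tokenization and normalization, so this will not reproduce the paper's numbers exactly.

```python
import math
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
print(perplexity([math.log(0.25)] * 4))
```

Higher ROUGE-L is better and lower perplexity is better, which is why the two metrics move in opposite directions in the results below.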

Baseline Methods

  1. ASVD: Activation-aware singular value decomposition
  2. SVD-LLM: Truncation-aware data whitening method
  3. Ablation Studies: Test the contributions of the rank allocation (FLRA) and progressive decoding (PLRD) components separately

Implementation Details

  • Models: LLaMA-2-7B-Chat, LLaMA-3-8B-Instruct, etc.
  • Compression Rates: 10%, 20%, 30%, and other levels
  • Hardware: A100 GPU
  • Based on SVD-LLM pipeline with FLRC's rank allocation and progressive decoding modules

Experimental Results

Main Results

Generation Task Performance

On LLaMA-3-8B-Instruct at 20% compression rate:

  • DialogSum ROUGE-L: FLRC 17.35% vs ASVD 0.10% vs SVD-LLM 0.24%
  • CNN/DM ROUGE-L: FLRC 17.72% vs ASVD 0.54% vs SVD-LLM 6.29%

Understanding Task Performance

On LLaMA-3-8B at 20% compression rate:

  • Wikitext2 Perplexity: FLRC 12.53 vs ASVD 3206.80 vs SVD-LLM 14.72
  • Average Zero-shot Accuracy: FLRC 43.66% vs ASVD 31.58% vs SVD-LLM 41.63%

Efficiency Improvements

  • Search Time: FLRC 3 minutes vs ASVD 147 minutes (49× speedup)
  • Inference Acceleration: Up to 2.12× speedup in offloading scenarios

Ablation Studies

On LLaMA-3-8B-Instruct at 20% compression rate for DialogSum task:

  • SVD-LLM only: 0.24% ROUGE-L
  • SVD-LLM + FLRA: 13.28% ROUGE-L
  • SVD-LLM + FLRA + PLRD: 17.35% ROUGE-L

Both components contribute: adding FLRA raises ROUGE-L by about 13.0 points over SVD-LLM alone, and adding PLRD yields a further 4.1 points.

Case Analysis

The paper's importance analysis shows:

  • Projection importance varies dramatically across different layers
  • down_proj typically has the highest importance scores
  • Later layers are more sensitive to compression than earlier layers

Experimental Findings

  1. Layer-wise Heterogeneity: Significant differences exist in compression tolerance across different model layers
  2. Decoding Sensitivity: Generation tasks are more sensitive to compression than understanding tasks
  3. Scale Effects: FLRC's advantages become more pronounced in larger models
  4. Generalizability: The method remains effective across different model architectures and precisions

Main Research Directions

  1. Model Compression Techniques: Including pruning, quantization, knowledge distillation, etc.
  2. Low-rank Decomposition Methods: SVD-based parameter matrix factorization techniques
  3. Dynamic Inference: Adjusting model configuration based on input or computational stage

Comparison with Prior Work

  1. Compared to ASVD: Proposes more efficient rank allocation algorithm with significantly reduced search time
  2. Compared to SVD-LLM: Introduces dynamic decoding mechanism with substantially improved generation task performance
  3. Compared to Other Allocation Methods: Fisher-based approach is more efficient and accurate than Hessian-based and Bayesian optimization methods

Comparative Advantages

  1. Efficiency Advantage: Completes rank allocation in single iteration, avoiding iterative optimization overhead
  2. Accuracy Advantage: Projection-level fine-grained optimization is more precise than layer-level or block-level optimization
  3. Adaptability Advantage: Dynamic adjustment mechanism better accommodates characteristics of generation tasks

Conclusions and Discussion

Main Conclusions

  1. Effectiveness of Fine-grained Compression: Projection-level differentiated compression significantly outperforms uniform compression strategies
  2. Necessity of Dynamic Decoding: Progressive rank adjustment is crucial for maintaining generation quality
  3. Method Generalizability: FLRC demonstrates excellent performance across different model scales and task types
  4. Practical Value: Substantially improved search efficiency makes the method practically deployable

Limitations

  1. Calibration Data Dependency: Method performance is influenced by calibration dataset selection, with different datasets potentially leading to performance variations
  2. Scheduler Overhead: Dynamic rank allocation introduces additional computational overhead requiring further engineering optimization
  3. Memory-bound Scenarios: More effective in memory-constrained environments, but advantages may be less pronounced in compute-constrained scenarios

Future Directions

  1. Engineering Optimization: Focus on reducing dynamic rank allocation overhead and designing specialized kernels
  2. Adaptive Scheduling: Develop more intelligent scheduling algorithms to reduce calibration data dependency
  3. Multimodal Extension: Extend the method to compression of multimodal large models

In-depth Evaluation

Strengths

  1. Strong Novelty: First application of Fisher information criterion to fine-grained rank allocation in LLMs, proposing new dynamic decoding paradigm
  2. Comprehensive Experiments: Covers multiple models, tasks, and compression rates with well-designed ablation studies
  3. Significant Results: Achieves breakthrough improvements on generation tasks, addressing key limitations of existing methods
  4. High Practical Value: Substantially reduced search time and good acceleration effects enable practical deployment
  5. In-depth Analysis: Provides rich analytical experiments including importance visualization and sensitivity analysis

Weaknesses

  1. Theoretical Foundation: Lacks theoretical analysis of why Fisher-based importance measure is optimal
  2. Scheduling Strategy: Progressive decoding scheduling strategy is primarily empirical, lacking theoretical guidance
  3. Hardware Optimization: Implementation details of dynamic rank allocation on hardware are insufficiently detailed
  4. Comparison Scope: Primarily compares with SVD-based methods, with limited comparison to other compression techniques

Impact

  1. Academic Contribution: Provides new research directions and technical pathways for LLM compression field
  2. Practical Value: Significant performance improvements and efficiency gains have important industrial application value
  3. Reproducibility: Clear method description and detailed experimental setup ensure good reproducibility
  4. Inspirational Significance: Dynamic compression concepts may inspire further related research

Applicable Scenarios

  1. Edge Deployment: Particularly suitable for resource-constrained environments like mobile devices and edge servers
  2. Memory-constrained Scenarios: Especially effective when model offloading is required
  3. Generation Tasks: Particularly valuable for text summarization, dialogue generation, and similar tasks
  4. Large-scale Models: Advantages become more pronounced in larger models

References

The paper cites abundant related work, primarily including:

  1. Yuan et al., 2023 - ASVD method
  2. Wang et al., 2024 - SVD-LLM method
  3. Touvron et al., 2023 - LLaMA model series
  4. Multiple references for benchmark datasets and evaluation tools

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to key problems in the LLM compression field. The method design is sound, experimental validation is comprehensive, results are significant, and it possesses important academic and practical value. While there is room for improvement in theoretical analysis and hardware optimization, overall it represents an important contribution to the field.