CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Gou, Byun, Malpeddi et al.
Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.

Basic Information

  • Paper ID: 2510.12721
  • Title: CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
  • Authors: Dayin Gou*, Sanghyun Byun*, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung†
  • Institution: LG Electronics USA
  • Classification: cs.LG
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12721v1

Abstract

Large Language Models (LLMs) typically rely on substantial parameters for token embeddings, resulting in enormous storage requirements and memory consumption. LLMs deployed on edge devices are particularly constrained by memory limitations; compressing the embedding layer to reduce memory footprint not only frees up memory bandwidth but also accelerates inference. To address this, we propose CARVQ, a novel post-training method combining a corrective adaptor with group residual vector quantization. CARVQ leverages a combination of linear and nonlinear mappings to mimic original model embeddings, achieving compression to approximately 1.6 bits per parameter without requiring specialized hardware support for low-bit storage. The method is evaluated on multiple pre-trained LLMs across generative, discriminative, mathematical, and reasoning tasks, demonstrating that CARVQ achieves lower average bits-per-parameter while maintaining reasonable perplexity and accuracy.

Research Background and Motivation

Problem Definition

  1. Core Problem: The embedding layer of a large language model consumes substantial memory and becomes a bottleneck when the model is deployed on edge devices
  2. Practical Requirement: Efficient deployment of LLMs on memory-constrained edge devices
  3. Technical Challenge: Existing quantization methods suffer severe performance degradation at ultra-low bit-widths and require specialized hardware support

Problem Significance

  • Memory Proportion Issue: When transformer layers are quantized, the relative memory proportion of the embedding layer increases significantly (e.g., 52.06% in LLaMA-3.2-1B INT4 models)
  • Edge Computing Demand: Edge device memory is typically limited to several GB; saving 0.5GB enables support for an additional 2B 4-bit parameters or longer context
  • Hardware Compatibility: Existing low-bit quantization methods require specialized hardware support, limiting deployment flexibility

Limitations of Existing Methods

  1. Scalar Quantization: Performance degrades sharply at 2 bits and below, and sub-4-bit storage requires special hardware support
  2. Quantization-Aware Training (QAT): Requires original training data and substantial computational resources for retraining
  3. Existing Embedding Compression Methods: Linear methods like TensorGPT suffer severe accuracy loss at high compression ratios

Core Contributions

  1. Proposes CARVQ Method: A novel post-training compression technique combining corrective adaptor and group residual vector quantization, requiring no specialized hardware support
  2. Achieves Ultra-Low Bit-Width Compression: Maintains reasonable performance at average 1.6 bits per parameter, while scalar quantization fails below 3 bits
  3. Hardware Compatibility: Compatible with existing transformer layer quantization methods, using only 4-bit and 16-bit data types
  4. Extensive Validation: Verified on 7 pre-trained models of different scales, covering four task categories: generative, discriminative, mathematical, and reasoning

Method Details

Task Definition

Input: Embedding matrix $M \in \mathbb{R}^{V \times n}$ of a pre-trained LLM, where $V$ is the vocabulary size and $n$ is the embedding dimension

Output: Compressed embedding representation, including a quantized lookup table and a corrective adaptor

Objective: Minimize reconstruction error while achieving the maximum compression ratio

Model Architecture

1. Group Residual Vector Quantization (Group RVQ)

  • Matrix Reshaping: Reshape the embedding matrix to $M' \in \mathbb{R}^{nV/h \times h}$, where $h$ is the sub-vector dimension
  • Grouping Operation: Partition $M'$ into $nV/(gh)$ groups, each of size $g \times h$
  • Iterative Quantization: Apply $L$ iterations of RVQ to each group, each using a codebook with $2^{\kappa}$ centroids
  • Storage Format: Codebooks are stored at the original precision of $p$ bits; indices are stored at $\kappa$ bits
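The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how the centroids are fit, so plain Lloyd's k-means is assumed here, and the names `kmeans`, `group_rvq`, and `reconstruct` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed codebook learner)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows; requires len(x) >= k.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for c in range(k):
            members = x[assign == c]
            if len(members):  # empty clusters keep their old centroid
                centroids[c] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def group_rvq(M, h=8, g=1024, L=3, kappa=4):
    """Quantize M with L residual stages per group of g sub-vectors."""
    V, n = M.shape
    assert (V * n) % (g * h) == 0, "nV must be divisible by g*h"
    groups = M.reshape(-1, h).reshape(-1, g, h)  # nV/(gh) groups of g x h
    k = 2 ** kappa
    codebooks = np.zeros((len(groups), L, k, h), dtype=M.dtype)
    indices = np.zeros((len(groups), L, g), dtype=np.uint8)  # kappa <= 8
    for gi, block in enumerate(groups):
        residual = block.copy()
        for stage in range(L):
            cb, idx = kmeans(residual, k, seed=stage)
            codebooks[gi, stage] = cb
            indices[gi, stage] = idx
            residual = residual - cb[idx]  # next stage quantizes the leftover
    return codebooks, indices

def reconstruct(codebooks, indices, shape):
    """Sum the selected centroids of every stage and restore the shape."""
    G, L, _, h = codebooks.shape
    out = np.zeros((G, indices.shape[2], h), dtype=codebooks.dtype)
    for gi in range(G):
        for stage in range(L):
            out[gi] += codebooks[gi, stage][indices[gi, stage]]
    return out.reshape(shape)
```

Calling `group_rvq(M)` on a `(V, n)` array returns the per-group codebooks and the small integer indices; `reconstruct(codebooks, indices, M.shape)` rebuilds the approximate embedding table. Each extra stage quantizes what the previous stages missed, which is why the squared error shrinks as `L` grows.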

2. Corrective Adaptor

Design Philosophy: Employs shrinkage-expansion strategy to reduce parameter count

  • Shrinkage Mapping: $\sigma_0: W \rightarrow \mathbb{R}^m$, mapping tokens to low-dimensional vectors ($m \ll n$)
  • Expansion Mapping: $\sigma_1: \mathbb{R}^m \rightarrow \mathbb{R}^n$, expanding back to the original dimension via a multi-layer perceptron

MLP Structure: $\sigma_1 = h_L \circ h_{NL_k} \circ \cdots \circ h_{NL_1}$, where $h_{NL_i}(x) = \text{ReLU}(W_i \cdot x + b_i)$ and $h_L(x) = W_L \cdot x + b_L$
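A minimal NumPy sketch of the shrinkage-expansion forward pass may help make the composition concrete. The function names, the random initialisation, and the default layer widths are illustrative assumptions; in particular, treating the first width as $m$ is a guess about how the adaptor is laid out, not something the summary states.

```python
import numpy as np

def init_adaptor(V, n, widths=(16, 384, 512), seed=0):
    """Build sigma_0 (a V x m table) and the MLP sigma_1 (m -> ... -> n).

    widths[0] plays the role of m; the remaining entries are hidden sizes.
    These defaults are placeholders, not confirmed settings.
    """
    rng = np.random.default_rng(seed)
    params = {"table": rng.normal(0.0, 0.02, size=(V, widths[0]))}
    dims = list(widths) + [n]
    params["layers"] = [
        (rng.normal(0.0, 0.02, size=(d_out, d_in)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]
    return params

def adaptor_forward(params, token_ids):
    """Compute sigma_1(sigma_0(token)) for a batch of token ids."""
    x = params["table"][token_ids]          # sigma_0: lookup, shape (B, m)
    *hidden, (W_L, b_L) = params["layers"]
    for W, b in hidden:
        x = np.maximum(x @ W.T + b, 0.0)    # h_NL_i(x) = ReLU(W_i x + b_i)
    return x @ W_L.T + b_L                  # h_L(x) = W_L x + b_L

def count_params(params):
    """Total number of stored values in the table and the MLP."""
    mlp = sum(W.size + b.size for W, b in params["layers"])
    return params["table"].size + mlp
```

Because the per-token table has only $m$ columns and the MLP is shared across the whole vocabulary, `count_params` grows slowly with $V$ compared with the $nV$ entries of the full embedding matrix.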

3. CARVQ Overall Framework

Combination Strategy: Final embedding = Group RVQ output + Corrective adaptor output

Training Objective: Minimize the L1 reconstruction error $\mathcal{L} = \sum_{i=1}^{V} \| M_i - (\text{RVQ}(M_i) + \sigma_1(\sigma_0(T_i))) \|_1$
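The objective can be written out in a few lines of NumPy. This is only a sketch of the loss itself, assuming the RVQ reconstruction and the adaptor output are already available as arrays of the same shape as $M$; the function names are hypothetical.

```python
import numpy as np

def carvq_embedding(rvq_out, correction):
    """Final embedding: RVQ reconstruction plus the adaptor's correction."""
    return rvq_out + correction

def l1_reconstruction_loss(M, rvq_out, correction):
    """Sum over all rows of the L1 norm of the reconstruction error."""
    return np.abs(M - carvq_embedding(rvq_out, correction)).sum()

def l1_subgradient(M, rvq_out, correction):
    """Subgradient of the loss with respect to the correction term."""
    return -np.sign(M - carvq_embedding(rvq_out, correction))
```

Only the correction term depends on trainable parameters here, so an optimizer would push the adaptor output toward the residual `M - rvq_out` that RVQ leaves behind.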

Technical Innovations

  1. Nonlinear Error Compensation: Corrective adaptor compensates RVQ quantization error through nonlinear mapping
  2. Hardware-Friendly Design: Uses only 4-bit and 16-bit data types, compatible with existing hardware
  3. Parameter Efficiency: The corrective adaptor holds far fewer parameters than the RVQ tables, so the overall compression ratio is dominated by RVQ
  4. Post-Training Characteristic: The LLM itself is not retrained; the method applies directly to pre-trained models

Compression Ratio Analysis

Average Bits-Per-Parameter: $B_{CARVQ} = B_{CA} + B_{RVQ}$, where:

$B_{RVQ} = p \times \frac{L h 2^{\kappa} \times p + g L \kappa}{g h \times p}$

$B_{CA} = p \times \frac{N_P}{nV}$
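As a worked check of these formulas, the short script below evaluates $B_{RVQ}$ for the hyperparameters listed under Implementation Details ($\kappa = 4$, $h = 8$, $g = 1024$). Two things are assumed rather than stated: that $p = 16$ (the 16-bit type the method relies on), and that the suffix in CARVQ-2/3/4 is the number of RVQ iterations $L$. $N_P$ is read as the adaptor's parameter count.

```python
def bits_rvq(L, h=8, g=1024, kappa=4, p=16):
    """Average bits per parameter spent on one RVQ group.

    Codebooks cost L * 2^kappa * h values at p bits each;
    indices cost g * L entries at kappa bits each;
    the group covers g * h original parameters.
    """
    codebook_bits = L * h * (2 ** kappa) * p
    index_bits = g * L * kappa
    return (codebook_bits + index_bits) / (g * h)

def bits_ca(num_adaptor_params, n, V, p=16):
    """Average bits per parameter spent on the corrective adaptor."""
    return p * num_adaptor_params / (n * V)

if __name__ == "__main__":
    # Reported averages from the results table, keyed by assumed L.
    reported = {2: 1.655, 3: 2.405, 4: 3.155}
    for L, total in reported.items():
        rvq = bits_rvq(L)  # 1.5, 2.25, 3.0
        print(L, rvq, round(total - rvq, 3))  # remainder attributed to B_CA
```

Under these assumptions the RVQ term comes to 1.5, 2.25, and 3.0 bits for $L = 2, 3, 4$, and the reported averages exceed it by the same 0.155 bits in each case, which is consistent with a fixed adaptor overhead $B_{CA}$.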

Experimental Setup

Datasets

  • Generative Tasks: WikiText-2 perplexity evaluation
  • Discriminative Tasks: HellaSwag, WinoGrande, PIQA
  • Mathematical Tasks: GSM8K
  • Reasoning Tasks: ARC Challenge, ARC Easy

Evaluation Metrics

  • Perplexity: Measures generation quality
  • Accuracy: Performance on discriminative and reasoning tasks
  • Average Bits-Per-Parameter: Compression efficiency metric
  • Memory Savings: Practical deployment benefits

Comparison Methods

  • Scalar Quantization: INT4, INT3, INT2 standard quantization
  • AWQ Quantization: Activation-aware weight quantization
  • Ablation Studies: CA+scalar quantization vs CARVQ

Implementation Details

  • Hyperparameters: $[m_1, m_2, m_3] = [16, 384, 512]$, $\kappa = 4$, $h = 8$, $g = 1024$
  • Training: Adam optimizer, learning rate 1e-3, 500 iterations
  • Hardware: RTX 4090, training time approximately 2 minutes

Experimental Results

Main Results

Generative Task Performance

Method     Average Bit-Width    Perplexity Increase
CARVQ-4    3.155                0.238
CARVQ-3    2.405                0.532
CARVQ-2    1.655                3.544
INT3       3.0                  0.750
INT2       2.0                  83.88

Discriminative Task Performance

  • CARVQ-3: Average accuracy decrease of 0.70%
  • CARVQ-2: Average accuracy decrease of 2.75%
  • INT2: Average accuracy decrease of 8.23%

Ablation Studies

RVQ vs Scalar Quantization Comparison:

  • CARVQ-2 (1.655 bits): WikiText-2 perplexity 16.34
  • CA+INT1 (1.155 bits): WikiText-2 perplexity 14528
  • Demonstrates significant advantages of RVQ over scalar quantization

Compatibility Verification

Combined with AWQ:

  • LLaMA-3.2-3B: CARVQ-3+AWQ perplexity increase only 0.95
  • Qwen2.5-3B: CARVQ-3+AWQ perplexity increase only 0.30
  • Demonstrates good compatibility with existing quantization methods

Experimental Findings

  1. Model Scale Effect: Larger models are more robust to embedding layer quantization
  2. Task Sensitivity: Mathematical tasks most sensitive to compression; reasoning tasks relatively robust
  3. Sweet Spot Configuration: CARVQ-3 achieves optimal balance between compression ratio and performance

Related Work

Architecture-Preserving Compression

  • Quantization Methods: AWQ, SmoothQuant and other activation-aware quantization
  • Pruning Methods: Structured pruning, attention head pruning
  • Our Advantage: Focuses on embedding layer, orthogonal and compatible with existing methods

Architecture-Adaptive Compression

  • LoRA: Low-rank adaptation for fine-tuning
  • Tensor Decomposition: Tensor train decomposition and related methods
  • Our Distinction: Post-training compression without requiring retraining

Embedding Layer Compression

  • TensorGPT: Based on tensor train decomposition, but linear nature limits high compression performance
  • Dynamic Vocabulary Pruning: Requires fine-tuning, poor generalization
  • Our Contribution: First efficient post-training compression method for embedding layers

Conclusions and Discussion

Main Conclusions

  1. CARVQ reaches an average of about 1.6 bits per parameter, well below the roughly 3-bit floor at which scalar quantization still performs acceptably
  2. The method exhibits good hardware compatibility, requiring only 4-bit and 16-bit data type support
  3. Orthogonally compatible with existing transformer quantization methods, enabling seamless integration

Limitations

  1. Applicable Scope: Primarily suited to small models, since the embedding layer accounts for a smaller share of memory in large models
  2. Computational Complexity: Cannot be applied directly to transformer layers, whose inputs are continuous activations rather than discrete token indices
  3. Semantic Information: May lose fine-grained semantic information, affecting tasks dependent on subtle representations
  4. Error Propagation: Combined with overly lossy transformer compression may impact overall robustness

Future Directions

  1. Extend application to larger-scale models
  2. Investigate deep integration with other compression techniques
  3. Develop specialized hardware acceleration for lookup table operations
  4. Explore compression methods preserving semantic structure

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to combine corrective adaptor with group RVQ, solving embedding layer compression challenge
  2. High Practical Value: Addresses actual deployment needs on edge devices with direct application value
  3. Comprehensive Experiments: Full evaluation across 7 models and 4 task categories
  4. Engineering-Friendly: Good hardware compatibility, easy to deploy

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why this combination is effective
  2. Limited Applicable Scenarios: Primarily targets small models; advantages not obvious for large models
  3. Unknown Long-Term Impact: Effects on downstream tasks like model fine-tuning and continual learning require further investigation

Impact

  1. Technical Contribution: Provides new technical pathway for LLM edge deployment
  2. Industrial Value: Significant importance for LLM deployment on mobile devices and IoT devices
  3. Research Inspiration: May catalyze more research on embedding layer compression and adaptor design

Applicable Scenarios

  1. Edge Computing: Memory-constrained mobile devices, IoT devices
  2. Real-Time Applications: Conversational systems and recommendation systems requiring fast response
  3. Cost-Sensitive Scenarios: Applications requiring LLM deployment with limited hardware resources

References

  1. Lin et al. (2024). AWQ: Activation-aware weight quantization for llm compression and acceleration
  2. Hu et al. (2022). LoRA: Low-rank adaptation of large language models
  3. Xu et al. (2023). TensorGPT: Efficient compression of the embedding layer in llms based on the tensor-train decomposition
  4. Xiao et al. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models

Overall Assessment: This is a high-quality technical paper addressing practical deployment requirements. The proposed CARVQ method represents an important breakthrough in embedding layer compression, providing an effective solution for LLM edge deployment. Despite certain limitations, its innovation, practicality, and engineering value make it a significant contribution to the field.