CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Gou, Byun, Malpeddi et al.

Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.

Basic Information

Paper ID: 2510.12721
Title: CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
Authors: Dayin Gou*, Sanghyun Byun*, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung†
Institution: LG Electronics USA
Classification: cs.LG
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12721v1

Abstract

Large Language Models (LLMs) typically devote a large number of parameters to token embeddings, resulting in substantial storage requirements and memory consumption. LLMs deployed on edge devices are memory-bound, so compressing the embedding layer not only frees memory bandwidth but also accelerates inference. To address this, we propose CARVQ, a post-training method that combines a corrective adaptor with group residual vector quantization. CARVQ composes linear and nonlinear mappings to mimic the original model embeddings, compressing them to approximately 1.6 bits per parameter without requiring specialized hardware support for low-bit storage. The method is evaluated on multiple pre-trained LLMs across generative, discriminative, mathematical, and reasoning tasks; in most cases CARVQ achieves a lower average bits-per-parameter than scalar quantization while maintaining reasonable perplexity and accuracy.

Research Background and Motivation

Problem Definition

Core Problem: The embedding layer of large language models consumes substantial memory and becomes a bottleneck when models are deployed on edge devices
Practical Requirement: Efficient deployment of LLMs on memory-constrained edge devices
Technical Challenge: Existing quantization methods suffer severe performance degradation at ultra-low bit-widths and require specialized hardware support

Problem Significance

Memory Proportion Issue: When transformer layers are quantized, the relative memory proportion of the embedding layer increases significantly (e.g., 52.06% in LLaMA-3.2-1B INT4 models)
Edge Computing Demand: Edge device memory is typically limited to a few GB; saving 0.5 GB frees room for roughly 1B additional 4-bit parameters (0.5 GB = 4 Gbit) or a longer context
Hardware Compatibility: Existing low-bit quantization methods require specialized hardware support, limiting deployment flexibility

Limitations of Existing Methods

Scalar Quantization: Performance degrades sharply at 2 bits and below, and low-bit storage requires special hardware support
Quantization-Aware Training (QAT): Requires original training data and substantial computational resources for retraining
Existing Embedding Compression Methods: Linear methods like TensorGPT suffer severe accuracy loss at high compression ratios

Core Contributions

Proposes CARVQ Method: A novel post-training compression technique combining corrective adaptor and group residual vector quantization, requiring no specialized hardware support
Achieves Ultra-Low Bit-Width Compression: Maintains reasonable performance at an average of about 1.6 bits per parameter, whereas scalar quantization degrades sharply below 3 bits
Hardware Compatibility: Compatible with existing transformer layer quantization methods, using only 4-bit and 16-bit data types
Extensive Validation: Verified on 7 pre-trained models of different scales, covering four task categories: generative, discriminative, mathematical, and reasoning

Method Details

Task Definition

Input: Embedding matrix $M \in \mathbb{R}^{V \times n}$ of a pre-trained LLM, where $V$ is the vocabulary size and $n$ is the embedding dimension
Output: Compressed embedding representation comprising a quantized lookup table and a corrective adaptor
Objective: Minimize reconstruction error while maximizing the compression ratio

Model Architecture

1. Group Residual Vector Quantization (Group RVQ)

Matrix Reshaping: Reshape the embedding matrix to $M' \in \mathbb{R}^{nV/h \times h}$, where $h$ is the sub-vector dimension
Grouping Operation: Partition $M'$ into $nV/(gh)$ groups, each of size $g \times h$
Iterative Quantization: Apply $L$ iterations of RVQ to each group, each iteration using a codebook with $2^{\kappa}$ centroids
Storage Format: Codebooks are stored at the original precision of $p$ bits, indices at $\kappa$ bits
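
The four steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the summary does not say how codebooks are learned, so plain Lloyd's k-means is swapped in here, and the helper names `kmeans` and `group_rvq`, the toy matrix sizes, and the `uint8` index storage (valid only for $\kappa \le 8$, with no bit packing) are all hypothetical choices for this sketch.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every sub-vector to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        for c in range(k):
            members = x[idx == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dist.argmin(axis=1)


def group_rvq(M, h, g, L, kappa):
    """Quantize M (V x n) group by group with L residual codebooks of 2**kappa centroids."""
    sub = M.reshape(-1, h)                  # M' with nV/h rows of length h
    assert sub.shape[0] % g == 0, "this sketch assumes g divides nV/h"
    groups = sub.reshape(-1, g, h)          # nV/(gh) groups of size g x h
    recon = np.zeros_like(groups)
    codebooks, indices = [], []
    for gi, block in enumerate(groups):
        residual = block.copy()
        group_cbs, group_idx = [], []
        for _ in range(L):
            cb, idx = kmeans(residual, 2 ** kappa)
            residual = residual - cb[idx]   # the next iteration quantizes what is left
            group_cbs.append(cb)
            group_idx.append(idx.astype(np.uint8))
        recon[gi] = block - residual        # equals the sum of the selected centroids
        codebooks.append(group_cbs)
        indices.append(group_idx)
    return recon.reshape(M.shape), codebooks, indices


# Toy usage: V=32, n=16, so M' has 64 rows and splits into 2 groups of 32 x 8.
rng = np.random.default_rng(0)
M = rng.standard_normal((32, 16))
recon1, _, _ = group_rvq(M, h=8, g=32, L=1, kappa=2)
recon2, codebooks, indices = group_rvq(M, h=8, g=32, L=2, kappa=2)
```

In this sketch each extra iteration can only lower the squared reconstruction error, because a stage's centroids are cluster means of the current residual.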

2. Corrective Adaptor

Design Philosophy: Employs a shrinkage-expansion strategy to reduce parameter count

Shrinkage Mapping: $\sigma_0: W \rightarrow \mathbb{R}^m$, mapping tokens to low-dimensional vectors ($m \ll n$)
Expansion Mapping: $\sigma_1: \mathbb{R}^m \rightarrow \mathbb{R}^n$, expanding back to the original dimension via a multi-layer perceptron

MLP Structure: $\sigma_1 = h_L \circ h_{NL_k} \circ \cdots \circ h_{NL_1}$, where $h_{NL_i}(x) = \text{ReLU}(W_i \cdot x + b_i)$ and $h_L(x) = W_L \cdot x + b_L$

3. CARVQ Overall Framework

Combination Strategy: Final embedding = Group RVQ output + Corrective adaptor output
Training Objective: Minimize the L1 reconstruction error $\mathcal{L} = \sum_{i=1}^{V} \lVert M_i - (\text{RVQ}(M_i) + \sigma_1(\sigma_0(T_i))) \rVert_1$, where $M_i$ is the $i$-th row of $M$ and $T_i \in W$ is the corresponding token
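
To make the composition concrete, the following NumPy sketch evaluates the adaptor $\sigma_1 \circ \sigma_0$ and the L1 objective on toy data. It rests on several assumptions that are not in the source: the names `corrective_adaptor` and `l1_loss`, the toy sizes, the zero-initialized output layer, and a rounding step that merely stands in for the Group RVQ output. Weights are stored as (input, output) matrices and applied to row vectors, which is the transpose of the $W \cdot x$ convention used above.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def corrective_adaptor(token_ids, E, hidden, out):
    """Compute sigma_1(sigma_0(T)) for a batch of token ids."""
    x = E[token_ids]              # sigma_0: lookup of an m-dimensional vector per token
    for W, b in hidden:           # h_NL_i(x) = ReLU(W_i x + b_i)
        x = relu(x @ W + b)
    W_L, b_L = out                # h_L(x) = W_L x + b_L
    return x @ W_L + b_L


def l1_loss(M, rvq_recon, correction):
    """Sum over tokens of || M_i - (RVQ(M_i) + correction_i) ||_1."""
    return np.abs(M - (rvq_recon + correction)).sum()


# Toy sizes (hypothetical): V tokens, shrink to m, one hidden layer, expand to n.
rng = np.random.default_rng(0)
V, m, width, n = 10, 4, 8, 6
M = rng.standard_normal((V, n))
rvq_recon = np.round(M * 2) / 2   # stand-in for the Group RVQ reconstruction
E = rng.standard_normal((V, m))
hidden = [(rng.standard_normal((m, width)), np.zeros(width))]
out = (np.zeros((width, n)), np.zeros(n))   # zero output layer: correction starts at 0

correction = corrective_adaptor(np.arange(V), E, hidden, out)
baseline = l1_loss(M, rvq_recon, correction)
```

With the output layer at zero, `baseline` is just the quantizer's own L1 error; training the adaptor aims to push the loss below that value.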

Technical Innovations

Nonlinear Error Compensation: Corrective adaptor compensates RVQ quantization error through nonlinear mapping
Hardware-Friendly Design: Uses only 4-bit and 16-bit data types, compatible with existing hardware
Parameter Efficiency: The corrective adaptor has far fewer parameters than the RVQ tables, so the overall compression ratio is dominated by RVQ
Post-Training Characteristic: The LLM itself is not retrained; only the small adaptor is fitted against the pre-trained embedding matrix

Compression Ratio Analysis

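Average Bits-Per-Parameter: $B_{CARVQ} = B_{CA} + B_{RVQ}$, where:
$B_{RVQ} = p \times \frac{L h 2^{\kappa} \times p + g L \kappa}{g h \times p}$
$B_{CA} = p \times \frac{N_P}{nV}$
Here $p$ is the original precision in bits and $N_P$ is the number of corrective adaptor parameters.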
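
As a worked check of these formulas, the sketch below plugs in the hyperparameters listed under Implementation Details ($\kappa=4$, $h=8$, $g=1024$) and assumes $p=16$, in line with the 16-bit data type mentioned elsewhere in this summary. The function names are hypothetical. Each RVQ iteration then costs 0.75 bits per parameter. The 3.155, 2.405 and 1.655 averages reported in the results table each exceed the $L = 4, 3, 2$ values by 0.155, which would be consistent with reading CARVQ-$k$ as $L = k$ plus a fixed adaptor overhead; that reading is an inference from the numbers, not something the summary states.

```python
def rvq_bits_per_param(L, h, g, kappa, p=16):
    """B_RVQ: codebook bits plus index bits, divided by the g*h parameters in a group."""
    codebook_bits = L * h * (2 ** kappa) * p   # L codebooks, 2**kappa centroids of length h
    index_bits = g * L * kappa                 # one kappa-bit index per sub-vector per iteration
    return (codebook_bits + index_bits) / (g * h)


def adaptor_bits_per_param(n_params, n, V, p=16):
    """B_CA: adaptor storage spread over the n*V original embedding parameters."""
    return p * n_params / (n * V)


for L in (4, 3, 2):
    print(L, rvq_bits_per_param(L, h=8, g=1024, kappa=4))
# 4 3.0
# 3 2.25
# 2 1.5
```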

Experimental Setup

Datasets

Generative Tasks: WikiText-2 perplexity evaluation
Discriminative Tasks: HellaSwag, WinoGrande, PIQA
Mathematical Tasks: GSM8K
Reasoning Tasks: ARC Challenge, ARC Easy

Evaluation Metrics

Perplexity: Measures generation quality
Accuracy: Performance on discriminative and reasoning tasks
Average Bits-Per-Parameter: Compression efficiency metric
Memory Savings: Practical deployment benefits

Comparison Methods

Scalar Quantization: INT4, INT3, INT2 standard quantization
AWQ Quantization: Activation-aware weight quantization
Ablation Studies: CA+scalar quantization vs CARVQ

Implementation Details

Hyperparameters: $[m_1, m_2, m_3] = [16, 384, 512]$, $\kappa=4$, $h=8$, $g=1024$
Training: Adam optimizer, learning rate 1e-3, 500 iterations
Hardware: RTX 4090, training time approximately 2 minutes
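
The training recipe can be illustrated with a toy loop. Only the optimizer (Adam), the learning rate (1e-3) and the iteration count (500) come from the settings above; everything else is an assumption for this sketch: the tiny sizes, the single hidden layer, the weight initialization, the random stand-in for the residual $M - \text{RVQ}(M)$, full-batch updates, and the hand-written Adam step with its usual default constants. A real run would more likely use an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m, width = 64, 16, 4, 32            # toy sizes, not the paper's

# Stand-in for M - RVQ(M): the residual the adaptor has to predict.
R = 0.05 * rng.standard_normal((V, n))

params = {
    "E": 0.1 * rng.standard_normal((V, m)),       # sigma_0 lookup table
    "W1": 0.1 * rng.standard_normal((m, width)),  # hidden ReLU layer
    "b1": np.zeros(width),
    "W2": 0.1 * rng.standard_normal((width, n)),  # final linear layer
    "b2": np.zeros(n),
}


def loss_and_grads(p):
    """L1 loss over all tokens and its (sub)gradients, by manual backpropagation."""
    pre = p["E"] @ p["W1"] + p["b1"]
    hid = np.maximum(pre, 0.0)
    out = hid @ p["W2"] + p["b2"]
    diff = R - out
    g_out = -np.sign(diff)                # subgradient of |R - out| with respect to out
    g_pre = (g_out @ p["W2"].T) * (pre > 0)
    grads = {
        "W2": hid.T @ g_out,
        "b2": g_out.sum(axis=0),
        "W1": p["E"].T @ g_pre,
        "b1": g_pre.sum(axis=0),
        "E": g_pre @ p["W1"].T,
    }
    return np.abs(diff).sum(), grads


# Adam with the learning rate and iteration count quoted above.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
mom = {name: np.zeros_like(v) for name, v in params.items()}
var = {name: np.zeros_like(v) for name, v in params.items()}
history = []
for t in range(1, 501):
    loss, grads = loss_and_grads(params)
    history.append(loss)
    for name in params:
        mom[name] = beta1 * mom[name] + (1 - beta1) * grads[name]
        var[name] = beta2 * var[name] + (1 - beta2) * grads[name] ** 2
        m_hat = mom[name] / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = var[name] / (1 - beta2 ** t)      # bias-corrected second moment
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)

final_loss, _ = loss_and_grads(params)
```

Comparing `history[0]` with `final_loss` shows how much of the stand-in residual the adaptor has absorbed after 500 steps.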

Experimental Results

Main Results

Generative Task Performance

Method	Average Bit-Width	Perplexity Increase
CARVQ-4	3.155	0.238
CARVQ-3	2.405	0.532
CARVQ-2	1.655	3.544
INT3	3.0	0.750
INT2	2.0	83.88

Discriminative Task Performance

CARVQ-3: Average accuracy decrease of 0.70%
CARVQ-2: Average accuracy decrease of 2.75%
INT2: Average accuracy decrease of 8.23%

Ablation Studies

RVQ vs Scalar Quantization Comparison:

CARVQ-2 (1.655 bits): WikiText-2 perplexity 16.34
CA+INT1 (1.155 bits): WikiText-2 perplexity 14528
Demonstrates significant advantages of RVQ over scalar quantization

Compatibility Verification

Combined with AWQ:

LLaMA-3.2-3B: CARVQ-3+AWQ perplexity increase only 0.95
Qwen2.5-3B: CARVQ-3+AWQ perplexity increase only 0.30
Demonstrates good compatibility with existing quantization methods

Experimental Findings

Model Scale Effect: Larger models are more robust to embedding layer quantization
Task Sensitivity: Mathematical tasks most sensitive to compression; reasoning tasks relatively robust
Sweet Spot Configuration: CARVQ-3 achieves optimal balance between compression ratio and performance

Related Work

Architecture-Preserving Compression

Quantization Methods: AWQ, SmoothQuant and other activation-aware quantization
Pruning Methods: Structured pruning, attention head pruning
Our Advantage: Focuses on embedding layer, orthogonal and compatible with existing methods

Architecture-Adaptive Compression

LoRA: Low-rank adaptation for fine-tuning
Tensor Decomposition: Tensor train decomposition and related methods
Our Distinction: Post-training compression without requiring retraining

Embedding Layer Compression

TensorGPT: Based on tensor train decomposition, but linear nature limits high compression performance
Dynamic Vocabulary Pruning: Requires fine-tuning, poor generalization
Our Contribution: First efficient post-training compression method for embedding layers

Conclusions and Discussion

Main Conclusions

CARVQ reaches an average of about 1.6 bits per parameter, well below the roughly 3-bit point at which scalar quantization still performs acceptably
The method exhibits good hardware compatibility, requiring only 4-bit and 16-bit data type support
It is orthogonal to existing transformer quantization methods, enabling seamless integration

Limitations

Applicable Scope: Primarily suitable for small models; the embedding layer accounts for a smaller share of memory in large models
Restricted to Embeddings: Cannot be directly applied to transformer layers, whose inputs are continuous activations rather than discrete token indices
Semantic Information: May lose fine-grained semantic information, affecting tasks that depend on subtle representations
Error Propagation: Combining CARVQ with overly lossy transformer compression may reduce overall robustness

Future Directions

Extend application to larger-scale models
Investigate deep integration with other compression techniques
Develop specialized hardware acceleration for lookup table operations
Explore compression methods preserving semantic structure

In-Depth Evaluation

Strengths

Strong Innovation: First to combine corrective adaptor with group RVQ, solving embedding layer compression challenge
High Practical Value: Addresses actual deployment needs on edge devices with direct application value
Comprehensive Experiments: Full evaluation across 7 models and 4 task categories
Engineering-Friendly: Good hardware compatibility, easy to deploy

Weaknesses

Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why this combination is effective
Limited Applicable Scenarios: Primarily targets small models; advantages not obvious for large models
Unknown Long-Term Impact: Effects on downstream tasks like model fine-tuning and continual learning require further investigation

Impact

Technical Contribution: Provides new technical pathway for LLM edge deployment
Industrial Value: Significant importance for LLM deployment on mobile devices and IoT devices
Research Inspiration: May catalyze more research on embedding layer compression and adaptor design

Applicable Scenarios

Edge Computing: Memory-constrained mobile devices, IoT devices
Real-Time Applications: Conversational systems and recommendation systems requiring fast response
Cost-Sensitive Scenarios: Applications requiring LLM deployment with limited hardware resources

References

Lin et al. (2024). AWQ: Activation-aware weight quantization for LLM compression and acceleration
Hu et al. (2022). LoRA: Low-rank adaptation of large language models
Xu et al. (2023). TensorGPT: Efficient compression of the embedding layer in LLMs based on the tensor-train decomposition
Xiao et al. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models

Overall Assessment: This is a high-quality technical paper addressing practical deployment requirements. The proposed CARVQ method represents an important breakthrough in embedding layer compression, providing an effective solution for LLM edge deployment. Despite certain limitations, its innovation, practicality, and engineering value make it a significant contribution to the field.