COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Kwek, Yin

Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Moreover, while many pruning approaches are effective on LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (it trades off vocabulary against FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.

Basic Information

Paper ID: 2509.06836
Title: COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Authors: Eugene Kwek, Wenpeng Yin (Penn State University)
Categories: cs.CL cs.AI cs.LG
Publication Status: Preprint under review
Paper Link: https://arxiv.org/abs/2509.06836v3

Abstract

This paper proposes COMPACT, a pruning method to address efficiency optimization of Large Language Models (LLMs) in terms of memory, latency, and serving costs. The method combines vocabulary pruning with FFN channel pruning weighted by common tokens, achieving parameter compression while maintaining the standard transformer architecture. The approach is validated on model families including Qwen, LLaMA, and Gemma (0.5B-70B parameters).

Research Background and Motivation

Problem Definition

Although Large Language Models demonstrate excellent performance across various NLP tasks, their enormous parameter counts (billions to hundreds of billions) make deployment prohibitively expensive, limiting their use on edge devices, in interactive applications, and for large-scale inference.

Limitations of Existing Methods

Width Pruning: Removes hidden dimensions or channels but disrupts the standard transformer architecture, requiring custom inference code
Depth Pruning: Removes entire transformer blocks, preserving architecture but causing dramatic performance degradation
Poor Scalability: Existing methods are effective on large models but perform poorly on Small Language Models (SLMs)
Neglects Linguistic Properties: Fails to account for token importance differences, treating all tokens equally

Research Motivation

Through analysis, the authors discovered:

Significant differences in parameter distribution across models of different scales: vocabulary parameters dominate in small models, while FFN parameters dominate in large models
Token frequencies in natural language follow a Zipf distribution: rare tokens appear with extremely low frequency and contribute minimally to downstream performance
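
To give a feel for the second observation, the following minimal sketch computes how much probability mass the lowest-ranked tokens carry under an idealized Zipf law. The vocabulary sizes, the exponent values, and the function name `zipf_tail_mass` are assumptions chosen for illustration; real token frequencies only approximate this law, and nothing here reproduces a measurement from the paper.

```python
def zipf_tail_mass(vocab_size, kept, exponent=1.0):
    """Probability mass of tokens ranked below `kept` under an ideal Zipf law.

    Rank r (1-based) gets weight 1 / r**exponent; this is an idealized model,
    not a measured token distribution.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[kept:]) / sum(weights)


# Illustrative sizes (assumed): keep the 50,000 most frequent of 150,000 tokens.
for s in (1.0, 1.2):
    mass = zipf_tail_mass(150_000, 50_000, exponent=s)
    print(f"exponent={s}: dropped two thirds of the vocabulary, tail mass={mass:.4f}")
```

Under these assumptions the discarded two thirds of the vocabulary account for only a small fraction of token occurrences, and the fraction shrinks as the exponent grows.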

Core Contributions

Systematic Analysis: First systematic analysis of embedding, FFN, and attention parameter distributions across LLMs of different scales
COMPACT Method: Proposes a novel framework combining vocabulary pruning and common-token-weighted FFN pruning
Architecture Compatibility: Maintains standard transformer architecture, compatible with existing inference frameworks
Scale Adaptivity: Achieves SOTA performance across multiple model families from 0.5B to 70B parameters

Method Details

Parameter Distribution Analysis

The authors first analyze parameter distribution in modern decoder-only transformers:

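Vocabulary Parameters: $N_{vocab} = 2VD$ (embedding and LM head layers; $V$ is the vocabulary size, $D$ the hidden dimension)
FFN Parameters: $N_{FFN} = 3LDI$ ($L$ layers, intermediate dimension $I$)
Attention Parameters: $N_{attention} = 2LD^2(1 + \frac{1}{H})$ ($H$ is the head ratio, i.e. the number of query heads per key/value head)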

As model scale increases, $N_{FFN}$ and $N_{attention}$ grow as $O(LD^2)$ (the intermediate dimension $I$ scales roughly in proportion to $D$), while $N_{vocab}$ grows only as $O(D)$ for a fixed vocabulary size $V$. Therefore, vocabulary parameters constitute a larger proportion in small models.
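
The three formulas can be evaluated directly to see how the parameter shares shift with scale. The sketch below does this for two configurations; the concrete values of `V`, `D`, `L`, `I`, and `H` are illustrative assumptions for a 0.5B-class and an 8B-class model, not figures taken from the paper, and the formulas ignore small terms such as norms and biases.

```python
# Illustrative configurations (assumed values, not taken from the source).
CONFIGS = {
    "small (0.5B-class)": dict(V=151936, D=896, L=24, I=4864, H=7),
    "large (8B-class)": dict(V=128256, D=4096, L=32, I=14336, H=4),
}


def param_breakdown(V, D, L, I, H):
    """Parameter counts per group, using the three formulas above."""
    return {
        "vocab": 2 * V * D,                        # embedding + LM head
        "ffn": 3 * L * D * I,                      # three D x I projections per layer
        "attention": 2 * L * D * D * (1 + 1 / H),  # attention projections
    }


def shares(counts):
    """Fraction of the total held by each parameter group."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for name, cfg in CONFIGS.items():
    rounded = {k: round(v, 3) for k, v in shares(param_breakdown(**cfg)).items()}
    print(name, rounded)
```

With these assumed configurations the vocabulary share is several times larger in the small model than in the large one, which is the motivation for pruning vocabulary more aggressively at small scale.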

COMPACT Architecture

1. Vocabulary Pruning

Principle: Token frequencies under a BPE tokenizer follow a Zipf distribution, so the rarest $V-V'$ tokens are removed
Implementation: Directly deletes corresponding rows from embedding and LM head matrices, as well as merge rules in the tokenizer
Advantages: Requires no calibration data, computationally efficient
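
A minimal sketch of this step on a toy tokenizer is given below. It assumes that token ids are ordered so that the rarest tokens have the highest ids, and that each merge rule `(left, right)` produces the token `left + right`; the function name `prune_vocabulary` and the list-of-rows data layout are hypothetical and not the authors' implementation.

```python
def prune_vocabulary(embedding, lm_head, vocab, merges, new_size):
    """Keep the first `new_size` token ids and drop everything else.

    embedding, lm_head: lists of rows, one row per token id.
    vocab: dict mapping token string -> token id.
    merges: list of (left, right) pairs; each merge produces left + right.
    Assumes the rarest tokens have the highest ids.
    """
    kept_vocab = {tok: i for tok, i in vocab.items() if i < new_size}
    # A merge rule stays usable only if its inputs and its output all survive.
    kept_merges = [
        (a, b)
        for a, b in merges
        if a in kept_vocab and b in kept_vocab and (a + b) in kept_vocab
    ]
    return embedding[:new_size], lm_head[:new_size], kept_vocab, kept_merges


# Toy example: five tokens, the last one ("abc") is treated as the rarest.
vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
merges = [("a", "b"), ("ab", "c")]
embedding = [[float(i), float(i)] for i in range(5)]
lm_head = [[-float(i), -float(i)] for i in range(5)]

emb2, head2, vocab2, merges2 = prune_vocabulary(embedding, lm_head, vocab, merges, new_size=4)
print(len(emb2), len(head2), sorted(vocab2), merges2)
```

Text that previously used a removed token is then re-encoded with the remaining shorter tokens, which is why no calibration data is needed for this step.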

2. Common-Token-Weighted FFN Pruning

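Traditional act² method computes channel importance as: $I_k = \sum_{i=1}^{N} \left(\mathrm{SiLU}(X_iW_{gate}) \odot X_iW_{up}\right)^2_k$

COMPACT's proposed common act² method: $I_k = \sum_{i=1}^{N} w_i\left(\mathrm{SiLU}(X_iW_{gate}) \odot X_iW_{up}\right)^2_k, \quad w_i = \begin{cases} 0 & x_i \in S \\ 1 & \text{otherwise} \end{cases}$

where $X_i$ is the FFN input at calibration token position $i$, $x_i$ is the token at that position, $\odot$ is the elementwise product, and $S$ is the set of rare tokens to be pruned. Positions holding rare tokens therefore contribute nothing to channel importance.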
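
The difference between the two scores can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the helper names `silu`, `matvec`, and `channel_importance` are hypothetical, weights are stored as lists of rows of shape D x I, and the toy inputs are made up.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, W):
    """Row vector x (length D) times matrix W (D x I), giving a length-I list."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) for k in range(len(W[0]))]


def channel_importance(X, token_ids, W_gate, W_up, rare_tokens=frozenset()):
    """Sum of squared gated activations per channel, skipping rare-token positions.

    With an empty `rare_tokens` set this reduces to plain act^2.
    """
    importance = [0.0] * len(W_gate[0])
    for x, tok in zip(X, token_ids):
        if tok in rare_tokens:  # w_i = 0: this position is ignored
            continue
        gate = matvec(x, W_gate)
        up = matvec(x, W_up)
        for k in range(len(importance)):
            importance[k] += (silu(gate[k]) * up[k]) ** 2
    return importance


# Toy data: two positions; the second holds a token (id 9) that will be pruned.
X = [[1.0, 0.0], [0.0, 1.0]]
token_ids = [0, 9]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0]]

plain = channel_importance(X, token_ids, W_gate, W_up)
common = channel_importance(X, token_ids, W_gate, W_up, rare_tokens={9})
print("act^2:       ", plain)
print("common act^2:", common)
```

In this toy case channel 1 is activated only by the rare token, so it looks important under plain act² but receives zero importance under common act², making it a pruning candidate.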

Algorithm Flow

Algorithm 1 COMPACT
Input: Model M, calibration dataset D, target vocabulary size V', target intermediate dimension I'
1. Identify set S of rarest V-V' tokens
2. Run forward pass on dataset D, collect squared activations
3. For each channel k, compute importance I_k using common act²
4. For each layer: prune I-I' least important channels
5. Prune vocabulary parameters: remove last V-V' rows from embedding and LM head matrices
6. Return pruned model M'
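
Step 4 of the algorithm can be sketched as selecting the highest-scoring channels and slicing the three FFN matrices accordingly. The code below is a simplified illustration: the names `top_channels` and `prune_ffn` are hypothetical, weights are plain lists of rows (`W_gate` and `W_up` of shape D x I, `W_down` of shape I x D), and the importance scores are made-up numbers.

```python
def top_channels(importance, keep):
    """Indices of the `keep` highest-importance channels, in original order."""
    ranked = sorted(range(len(importance)), key=lambda k: importance[k], reverse=True)
    return sorted(ranked[:keep])


def prune_ffn(W_gate, W_up, W_down, importance, keep):
    """Slice gate/up columns and down rows for the kept channels."""
    idx = top_channels(importance, keep)
    gate = [[row[k] for k in idx] for row in W_gate]
    up = [[row[k] for k in idx] for row in W_up]
    down = [W_down[k] for k in idx]
    return gate, up, down


# Toy layer with D = 2 and I = 3; importance scores are illustrative.
W_gate = [[1, 2, 3], [4, 5, 6]]
W_up = [[7, 8, 9], [1, 2, 3]]
W_down = [[1, 1], [2, 2], [3, 3]]
importance = [0.5, 0.1, 0.9]

gate, up, down = prune_ffn(W_gate, W_up, W_down, importance, keep=2)
print(gate, up, down)
```

The result is still an ordinary gated FFN, only with a smaller intermediate dimension, which is why the pruned model keeps a standard transformer layout.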

Technical Innovations

Dual Pruning Strategy: Combines vocabulary and FFN pruning, targeting parameter distribution characteristics of different-scale models
Common-Token Weighting: FFN pruning considers only tokens that remain valid after pruning, avoiding misleading guidance from rare tokens
Architecture Preservation: Only prunes vocabulary size and intermediate dimension, maintaining standard transformer structure
Scale Adaptivity: Adapts to different-scale requirements by adjusting two hyperparameters $V'$ and $I'$

Experimental Setup

Evaluated Models

Small Language Models: Qwen 2.5-0.5B, LLaMA 3.2-1B, Gemma 3-1B
Large Language Models: LLaMA 3.1-8B, LLaMA 3.1-70B

Datasets and Tasks

Calibration Data: 256 samples from C4 dataset
Evaluation Tasks: MMLU, HellaSwag, WinoGrande, ARC-C/E, PIQA, GSM8K

Baseline Methods

Depth Pruning: ShortGPT, LaCo
Width Pruning: SliceGPT, 2SSP, FLAP

Evaluation Metrics

Parameter pruning ratio, average accuracy, relative performance retention
Pruning time, inference throughput, GPU memory usage

Experimental Results

Main Results

Small Language Model Performance

On Qwen 2.5-0.5B at 35% pruning ratio:

COMPACT: Average accuracy 35.3% (70.4% relative performance)
Best baseline: 31.4% (62.5% relative performance)

On LLaMA 3.2-1B at 35% pruning ratio:

COMPACT: Average accuracy 36.9% (76.4% relative performance)
Best baseline: 33.6% (69.6% relative performance)

Large Language Model Performance

On LLaMA 3.1-70B at 35% pruning ratio:

COMPACT: Average accuracy 63.7% (80.2% relative performance)
2SSP: 62.8% (79.1% relative performance)
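
As a consistency check on the figures reported in this section, the sketch below backs out the dense (unpruned) average accuracy implied by each pair of numbers. It assumes that relative performance is defined as pruned average accuracy divided by dense average accuracy; the helper name `implied_dense_accuracy` is hypothetical.

```python
def implied_dense_accuracy(pruned_acc, relative_perf):
    """Dense accuracy (in %) implied by pruned accuracy and relative performance (both in %)."""
    return 100.0 * pruned_acc / relative_perf


# (method, average accuracy %, relative performance %) as listed above.
reported = {
    "Qwen 2.5-0.5B": [("COMPACT", 35.3, 70.4), ("best baseline", 31.4, 62.5)],
    "LLaMA 3.2-1B": [("COMPACT", 36.9, 76.4), ("best baseline", 33.6, 69.6)],
    "LLaMA 3.1-70B": [("COMPACT", 63.7, 80.2), ("2SSP", 62.8, 79.1)],
}

for model, rows in reported.items():
    for method, acc, rel in rows:
        dense = implied_dense_accuracy(acc, rel)
        print(f"{model:14s} {method:13s} implied dense accuracy ~ {dense:.1f}%")
```

Within each model, the two rows should imply nearly the same dense accuracy if the assumed definition is the one used.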

Efficiency Analysis

Pruning Time Comparison (LLaMA 3.1-8B, 35% pruning)

COMPACT: 0:32
2SSP: 1:26
SliceGPT: 10:48

Inference Efficiency (LLaMA 3.1-8B, 35% pruning)

Memory Usage: COMPACT reduces GPU memory by 36% (best); ShortGPT/LaCo reduce it by 25%
Throughput Improvement: COMPACT improves throughput by 37%; ShortGPT/LaCo improve it by 57%

Ablation Studies

Common act² Effectiveness

On Qwen 2.5-0.5B at 35% pruning:

Common act²: 70.4% relative performance
Standard act²: 69.2% relative performance
|act| method: 67.6% relative performance

Vocabulary-FFN Trade-off Analysis

At a fixed 37% pruning ratio, with different combinations of $V'$ and $I'$:

Pure FFN pruning (V'=151936): 63.0% relative performance
Optimal combination (V'=49536): 70.4% relative performance
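
For a fixed pruning ratio, choosing $V'$ determines $I'$ through the parameter-count formulas from the Parameter Distribution Analysis section. The sketch below solves for $I'$; the model configuration is an illustrative assumption for a 0.5B-class model rather than a value stated in this summary, the function name `ffn_width_for_budget` is hypothetical, and small terms such as norms and biases are ignored, so the resulting widths are rough estimates only.

```python
def ffn_width_for_budget(V, D, L, I, H, ratio, new_V):
    """Intermediate dimension I' that meets the target pruning ratio for a given V'.

    Uses N_vocab = 2VD, N_FFN = 3LDI, N_attention = 2LD^2(1 + 1/H);
    attention parameters are left unpruned.
    """
    attn = 2 * L * D * D * (1 + 1 / H)
    total = 2 * V * D + 3 * L * D * I + attn
    budget = (1 - ratio) * total
    remaining = budget - 2 * new_V * D - attn  # parameters left for the FFN layers
    return remaining / (3 * L * D)


# Assumed, illustrative configuration (not taken from the source).
cfg = dict(V=151936, D=896, L=24, I=4864, H=7)

for new_V in (151936, 49536):
    width = ffn_width_for_budget(**cfg, ratio=0.37, new_V=new_V)
    print(f"V'={new_V}: I' ~ {width:.0f} of {cfg['I']}")
```

Under these assumptions, pure FFN pruning has to remove most of the intermediate channels, whereas shrinking the vocabulary leaves the FFN layers largely intact at the same overall ratio.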

Key Findings

Smooth Degradation: COMPACT exhibits smooth performance decay, while depth pruning methods show sudden performance drops
Architecture Agnosticism: COMPACT can be directly applied to new architectures like Gemma 3, while other methods require architecture-specific modifications
Limited Impact of Rare Tokens: Removing 67% of the vocabulary changes the tokenization of only 4% of text

Related Work

Depth Pruning

Representative Methods: Shortened LLaMA, SLEB, LLM-Streamline
Advantages: Maintains standard architecture, significant inference acceleration
Disadvantages: Coarse-grained removal causes dramatic performance degradation

Width Pruning

Representative Methods: LLM-Pruner, SliceGPT, FLAP, 2SSP
Advantages: Fine-grained control, relatively smooth performance decay
Disadvantages: Disrupts standard architecture, requires custom inference code

Vocabulary Pruning

Existing Work: Primarily focused on language/domain-specific vocabulary trimming
This Paper's Contribution: General-purpose LLM vocabulary pruning, forming a complete framework combined with FFN pruning

Conclusions and Discussion

Main Conclusions

COMPACT achieves SOTA pruning performance across multiple model families and scales
The method maintains standard transformer architecture with good deployment compatibility
The dual pruning strategy effectively adapts to parameter distribution characteristics of different-scale models

Limitations

Limited Throughput Improvement: Compared to depth pruning methods, still lags in inference throughput gains
Domain Adaptivity of Vocabulary Pruning: May require retaining more domain-specific vocabulary in specialized domains
Hyperparameter Tuning: Requires finding optimal $V'$ and $I'$ combinations for different pruning ratios

Future Directions

The authors identify narrowing the throughput gap between width and depth pruning methods as the main direction for future work.

In-Depth Evaluation

Strengths

Solid Theoretical Foundation: Grounded in parameter distribution analysis and Zipf distribution characteristics
Clever Method Design: Common act² elegantly combines vocabulary and FFN pruning
Comprehensive Experiments: Systematic evaluation across multiple model families, scales, and tasks
High Practical Value: Architecture compatibility enables easy deployment

Weaknesses

Limited Novelty: Both vocabulary and FFN pruning are existing techniques; main contribution lies in their combination
Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why this combination is effective
Limited Inference Acceleration: Falls short of depth pruning methods on key performance metrics (throughput)

Impact

Academic Contribution: Provides new perspective on LLM pruning, particularly scale-adaptive approaches
Practical Value: Simple and effective method, easy to implement and deploy
Reproducibility: Authors commit to open-sourcing code, facilitating method adoption

Applicable Scenarios

Edge Deployment: Model compression in memory-constrained environments
Multi-scale Deployment: Scenarios requiring simultaneous support for small and large models
Rapid Pruning: Applications requiring model compression in short timeframes

References

The paper cites extensive related work, primarily including:

Quantization Methods: GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2024)
Depth Pruning: Shortened LLaMA (Kim et al., 2024), LaCo (Yang et al., 2024)
Width Pruning: SliceGPT (Ashkboos et al., 2024), FLAP (An et al., 2024)
Vocabulary Processing: Related multilingual and domain-specific vocabulary pruning work

Overall Assessment: This is a technically sound and highly practical paper. While relatively limited in theoretical innovation, it contributes an effective and easily deployable solution to LLM pruning through clever method combination and comprehensive experimental validation. Its particular advantages in small language model pruning and architecture compatibility position it well for practical applications.