COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Kwek, Yin
Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout and requires custom inference code, while depth pruning can cause abrupt accuracy drops. Moreover, many pruning approaches that are effective on LLMs struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding and LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (it can trade off vocabulary pruning against FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
This paper proposes COMPACT, a pruning method that improves the efficiency of Large Language Models (LLMs) in terms of memory, latency, and serving cost. The method combines vocabulary pruning with FFN channel pruning weighted by common tokens, reducing the parameter count while preserving the standard transformer architecture. The approach is validated on the Qwen, LLaMA, and Gemma model families (0.5B-70B parameters).
Although Large Language Models perform well across a wide range of NLP tasks, their enormous parameter counts (billions to hundreds of billions) make deployment prohibitively expensive, limiting their use on edge devices, in interactive applications, and for large-scale inference.
Parameter distribution differs significantly across model scales: vocabulary parameters dominate in small models, while FFN parameters dominate in large models
Natural language follows a Zipfian distribution: rare tokens appear with extremely low frequency and contribute minimally to downstream performance
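As a rough illustration of the Zipf observation above, the following sketch computes how much of the token mass the most frequent tokens carry under an idealized Zipf law. The function name zipf_coverage, the exponent of 1.0, and the vocabulary sizes are assumptions chosen for illustration and are not taken from the paper; real token frequencies only approximate this law.

```python
def zipf_coverage(vocab_size: int, kept: int, exponent: float = 1.0) -> float:
    """Fraction of token occurrences covered by the `kept` most frequent tokens,
    assuming frequency is proportional to 1 / rank**exponent (idealized Zipf)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    return sum(weights[:kept]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical sizes: keep one third of a 150k-token vocabulary.
    coverage = zipf_coverage(vocab_size=150_000, kept=50_000)
    print(f"kept tokens cover {coverage:.1%} of occurrences")
```

Under this idealized law, coverage rises quickly with the number of kept tokens, which is the intuition behind removing the rare tail of the vocabulary first.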
Attention Parameters: $N_{\text{attention}} = 2LD^2\left(1 + \frac{1}{H}\right)$, where $H$ is the head ratio
As model scale increases, $N_{\text{FFN}}$ and $N_{\text{attention}}$ grow as $O(LD^2)$, while $N_{\text{vocab}}$ grows only as $O(D)$. Therefore, vocabulary parameters constitute a larger proportion in small models.
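To make the scaling argument concrete, here is a minimal sketch that estimates the share of vocabulary parameters at two scales. Only the attention formula comes from the text above. The vocabulary count 2VD (untied embedding and LM head) and the FFN count 3LDI (gated FFN with intermediate size I) are assumptions, and both configurations are hypothetical rather than the settings of any particular model.

```python
def param_counts(V: int, D: int, L: int, I: int, H: int):
    """Return (vocab, ffn, attention) parameter counts for a decoder-only model."""
    n_vocab = 2 * V * D                   # assumption: untied embedding + LM head
    n_ffn = 3 * L * D * I                 # assumption: gated FFN, three D x I matrices per layer
    n_attn = 2 * L * D * D * (1 + 1 / H)  # attention formula from the text
    return n_vocab, n_ffn, n_attn


def vocab_share(V: int, D: int, L: int, I: int, H: int) -> float:
    """Fraction of all counted parameters that sit in the vocabulary matrices."""
    n_vocab, n_ffn, n_attn = param_counts(V, D, L, I, H)
    return n_vocab / (n_vocab + n_ffn + n_attn)


if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    small = dict(V=150_000, D=1024, L=24, I=4096, H=4)
    large = dict(V=150_000, D=8192, L=80, I=28672, H=8)
    print(f"small model vocab share: {vocab_share(**small):.1%}")
    print(f"large model vocab share: {vocab_share(**large):.1%}")
```

With the vocabulary size held fixed, the vocabulary share shrinks as D and L grow, which matches the observation that vocabulary pruning matters most for small models.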
Algorithm 1 COMPACT
Input: Model M, calibration dataset D, target vocabulary size V', target intermediate dimension I'
1. Identify set S of rarest V-V' tokens
2. Run a forward pass on dataset D and collect squared FFN intermediate activations
3. For each channel k, compute importance I_k from the squared activations of common tokens (tokens not in S)
4. For each layer: prune I-I' least important channels
5. Prune vocabulary parameters: remove last V-V' rows from embedding and LM head matrices
6. Return pruned model M'
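The steps of Algorithm 1 can be sketched on plain Python lists as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, importance is taken as a sum of squared activations with weight 1 for common tokens and 0 for rare ones, the FFN is reduced to one up-projection (I x D) and one down-projection (D x I), and token ids are assumed to be ordered from common to rare so that the rarest tokens occupy the last rows.

```python
from typing import List, Sequence, Set

Matrix = List[List[float]]


def common_token_importance(acts: Matrix, token_ids: Sequence[int], rare: Set[int]) -> List[float]:
    """Per-channel sum of squared activations over positions holding common tokens.

    acts[t][k] is the FFN intermediate activation of channel k at position t.
    Assumption: rare tokens get weight 0 and common tokens weight 1.
    """
    importance = [0.0] * len(acts[0])
    for row, tok in zip(acts, token_ids):
        if tok in rare:
            continue  # skip positions whose token will be pruned
        for k, a in enumerate(row):
            importance[k] += a * a
    return importance


def select_channels(importance: Sequence[float], target_dim: int) -> List[int]:
    """Indices of the target_dim most important channels, in ascending order."""
    ranked = sorted(range(len(importance)), key=lambda k: importance[k], reverse=True)
    return sorted(ranked[:target_dim])


def prune_ffn(w_up: Matrix, w_down: Matrix, keep: Sequence[int]):
    """Keep the selected channels: rows of w_up (I x D) and columns of w_down (D x I)."""
    return [w_up[k] for k in keep], [[row[k] for k in keep] for row in w_down]


def prune_vocab(matrix: Matrix, target_vocab: int) -> Matrix:
    """Drop the last rows, assuming token ids are ordered from common to rare."""
    return matrix[:target_vocab]


if __name__ == "__main__":
    # Three positions, three channels; token 9 is treated as rare.
    acts = [[1.0, 0.1, 3.0], [0.5, 0.2, 0.1], [0.0, 5.0, 0.0]]
    importance = common_token_importance(acts, token_ids=[0, 1, 9], rare={9})
    print(select_channels(importance, target_dim=2))
```

In the toy data, channel 1 fires strongly only on the rare token, so it is ranked last once rare-token positions are excluded, whereas an unweighted sum would have kept it.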
Smooth Degradation: COMPACT degrades smoothly as the pruning ratio increases, while depth pruning methods show abrupt performance drops
Architecture Agnosticism: COMPACT can be directly applied to new architectures like Gemma 3, while other methods require architecture-specific modifications
Limited Impact of Rare Tokens: A 67% vocabulary reduction changes the tokenization of only 4% of text
The paper cites extensive related work, primarily including:
Quantization Methods: GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2024)
Depth Pruning: Shortened LLaMA (Kim et al., 2024), LaCo (Yang et al., 2024)
Width Pruning: SliceGPT (Ashkboos et al., 2024), FLAP (An et al., 2024)
Vocabulary Processing: Related multilingual and domain-specific vocabulary pruning work
Overall Assessment: This is a technically sound and highly practical paper. While relatively limited in theoretical innovation, it contributes an effective and easily deployable solution to LLM pruning through clever method combination and comprehensive experimental validation. Its particular advantages in small language model pruning and architecture compatibility position it well for practical applications.