2025-11-10T03:09:53.117606

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Kwek, Yin
Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Also, while many pruning approaches are effective against LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab. vs. FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.

Basic Information

  • Paper ID: 2509.06836
  • Title: COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
  • Authors: Eugene Kwek, Wenpeng Yin (Penn State University)
  • Categories: cs.CL cs.AI cs.LG
  • Publication Status: Preprint under review
  • Paper Link: https://arxiv.org/abs/2509.06836v3

Abstract

This paper proposes COMPACT, a pruning method to address efficiency optimization of Large Language Models (LLMs) in terms of memory, latency, and serving costs. The method combines vocabulary pruning with FFN channel pruning weighted by common tokens, achieving parameter compression while maintaining the standard transformer architecture. The approach is validated on model families including Qwen, LLaMA, and Gemma (0.5B-70B parameters).

Research Background and Motivation

Problem Definition

Although Large Language Models demonstrate excellent performance across various NLP tasks, their enormous parameter counts (billions to hundreds of billions) result in prohibitively high deployment costs, limiting their application in edge devices, interactive applications, and large-scale inference.

Limitations of Existing Methods

  1. Width Pruning: Removes hidden dimensions or channels but disrupts the standard transformer architecture, requiring custom inference code
  2. Depth Pruning: Removes entire transformer blocks, preserving architecture but causing dramatic performance degradation
  3. Poor Scalability: Existing methods are effective on large models but perform poorly on Small Language Models (SLMs)
  4. Neglects Linguistic Properties: Fails to account for token importance differences, treating all tokens equally

Research Motivation

Through analysis, the authors discovered:

  • Significant differences in parameter distribution across models of different scales: vocabulary parameters dominate in small models, while FFN parameters dominate in large models
  • Natural language follows a Zipf distribution: rare tokens appear with extremely low frequency and contribute minimally to downstream performance

Core Contributions

  1. Systematic Analysis: First systematic analysis of embedding, FFN, and attention parameter distributions across LLMs of different scales
  2. COMPACT Method: Proposes a novel framework combining vocabulary pruning and common-token-weighted FFN pruning
  3. Architecture Compatibility: Maintains standard transformer architecture, compatible with existing inference frameworks
  4. Scale Adaptivity: Achieves SOTA performance across multiple model families from 0.5B to 70B parameters

Method Details

Parameter Distribution Analysis

The authors first analyze parameter distribution in modern decoder-only transformers:

  • Vocabulary Parameters: $N_{vocab} = 2VD$ (embedding and LM head layers)
  • FFN Parameters: $N_{FFN} = 3LDI$ ($L$ layers, intermediate dimension $I$)
  • Attention Parameters: $N_{attention} = 2LD^2(1 + \frac{1}{H})$ ($H$ as head ratio)

As model scale increases, $N_{FFN}$ and $N_{attention}$ grow as $O(LD^2)$, while $N_{vocab}$ grows only as $O(D)$. Therefore, vocabulary parameters constitute a larger proportion in small models.
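
As a rough illustration of this scaling argument, the sketch below evaluates the three formulas for two made-up configurations. The function names `param_counts` and `vocab_share` and both configurations are assumptions chosen for illustration; they are not values reported in the paper.

```python
def param_counts(V, D, L, I, H):
    """Approximate parameter counts from the three formulas above.

    V: vocabulary size, D: hidden dimension, L: number of layers,
    I: FFN intermediate dimension, H: head ratio.
    """
    return {
        "vocab": 2 * V * D,                        # embedding + LM head
        "ffn": 3 * L * D * I,                      # FFN projection matrices
        "attention": 2 * L * D * D * (1 + 1 / H),  # attention projections
    }


def vocab_share(V, D, L, I, H):
    """Fraction of all counted parameters that sit in the vocabulary layers."""
    counts = param_counts(V, D, L, I, H)
    return counts["vocab"] / sum(counts.values())


# Hypothetical configurations (assumed for illustration only).
small = dict(V=150_000, D=1024, L=24, I=4096, H=4)
large = dict(V=150_000, D=8192, L=80, I=28_672, H=8)

print(f"small model vocab share: {vocab_share(**small):.1%}")
print(f"large model vocab share: {vocab_share(**large):.1%}")
```

With the vocabulary size held fixed, the vocabulary share shrinks as the hidden dimension and layer count grow, which is the pattern the analysis describes.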

COMPACT Architecture

1. Vocabulary Pruning

  • Principle: Token frequencies under a BPE tokenizer follow a Zipf distribution, so the rarest $V - V'$ tokens are removed
  • Implementation: Directly deletes corresponding rows from embedding and LM head matrices, as well as merge rules in the tokenizer
  • Advantages: Requires no calibration data, computationally efficient
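
A minimal sketch of this step on toy data, assuming that token ids are ordered so the rarest tokens have the largest ids (consistent with Algorithm 1, which removes the last rows) and that both matrices store one row per token. The function `prune_vocabulary`, the toy vocabulary, and the merge-filtering rule are illustrative assumptions, not the authors' implementation.

```python
def prune_vocabulary(embedding, lm_head, vocab, merges, new_size):
    """Keep only the first `new_size` tokens.

    embedding, lm_head: lists of rows, one row per token id.
    vocab: dict mapping token string -> token id.
    merges: list of (left, right) BPE merge rules.
    """
    new_vocab = {tok: idx for tok, idx in vocab.items() if idx < new_size}
    # Drop any merge rule whose inputs or output no longer exist.
    new_merges = [
        (a, b) for a, b in merges
        if a in new_vocab and b in new_vocab and (a + b) in new_vocab
    ]
    return embedding[:new_size], lm_head[:new_size], new_vocab, new_merges


# Toy example: five tokens, the last one ("abc") is treated as the rarest.
vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
merges = [("a", "b"), ("ab", "c")]
embedding = [[0.0], [0.1], [0.2], [0.3], [0.4]]
lm_head = [[1.0], [1.1], [1.2], [1.3], [1.4]]

emb2, head2, vocab2, merges2 = prune_vocabulary(
    embedding, lm_head, vocab, merges, new_size=4
)
print(len(emb2), len(head2), sorted(vocab2), merges2)
```

No forward pass is involved, which matches the point that this step needs no calibration data.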

2. Common-Token-Weighted FFN Pruning

The traditional act² method computes channel importance as:

$$I_k = \sum_{i=1}^{N} (SiLU(X_iW_{gate})X_iW_{up})^2_k$$

COMPACT's proposed common act² method:

$$I_k = \sum_{i=1}^{N} w_i(SiLU(X_iW_{gate})X_iW_{up})^2_k, \quad w_i = \begin{cases} 0 & x_i \in S \\ 1 & \text{otherwise} \end{cases}$$

where $S$ is the set of rare tokens to be pruned.
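
The two scores can be compared on a toy example. The sketch below is a pure-Python illustration, assuming the product of the gate and up branches is taken elementwise (as in a standard gated FFN) and that each calibration position carries both a hidden vector and its token id. The names `channel_importance` and `matvec` are hypothetical helpers, not code from the paper.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(x, W):
    """Multiply a length-D vector by a D x I matrix stored as nested lists."""
    return [sum(x[d] * W[d][k] for d in range(len(x))) for k in range(len(W[0]))]


def channel_importance(X, token_ids, W_gate, W_up, rare_ids=frozenset()):
    """Sum of squared gated activations per channel.

    With an empty `rare_ids` this is the standard act² score; passing the
    set S of tokens to be pruned gives the common act² score (w_i = 0 for
    positions whose token is in S).
    """
    scores = [0.0] * len(W_gate[0])
    for x, tok in zip(X, token_ids):
        if tok in rare_ids:
            continue  # w_i = 0: rare tokens do not contribute
        gate, up = matvec(x, W_gate), matvec(x, W_up)
        for k in range(len(scores)):
            h = silu(gate[k]) * up[k]  # elementwise gate * up (assumed)
            scores[k] += h * h
    return scores


# Toy data: two positions; token id 9 plays the role of a rare token.
X = [[1.0, 0.0], [0.0, 1.0]]
token_ids = [0, 9]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0]]

standard = channel_importance(X, token_ids, W_gate, W_up)
common = channel_importance(X, token_ids, W_gate, W_up, rare_ids={9})
print(standard, common)
```

In this toy case channel 1 is driven only by the rare token, so its common act² score is zero even though its standard act² score is not.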

Algorithm Flow

Algorithm 1 COMPACT
Input: Model M, calibration dataset D, target vocabulary size V', target intermediate dimension I'
1. Identify set S of rarest V-V' tokens
2. Run forward pass on dataset D, collect squared activations
3. For each channel k, compute importance Ik using common act²
4. For each layer: prune I-I' least important channels
5. Prune vocabulary parameters: remove last V-V' rows from embedding and LM head matrices
6. Return pruned model M'
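
Step 4 of the algorithm can be sketched as follows for a single layer, assuming the gate and up matrices have shape D x I and the down matrix has shape I x D, stored as nested lists. The helpers `select_channels` and `prune_ffn` are hypothetical names introduced for illustration.

```python
def select_channels(scores, keep):
    """Return the indices of the `keep` highest-scoring channels, in original order."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:keep])


def prune_ffn(W_gate, W_up, W_down, kept):
    """Slice one FFN layer down to the kept intermediate channels.

    Columns are removed from W_gate and W_up (D x I); the matching rows
    are removed from W_down (I x D), so the layer keeps its input and
    output dimension D.
    """
    new_gate = [[row[k] for k in kept] for row in W_gate]
    new_up = [[row[k] for k in kept] for row in W_up]
    new_down = [W_down[k] for k in kept]
    return new_gate, new_up, new_down


# Toy layer with D = 2 and I = 3; keep I' = 2 channels.
scores = [0.1, 0.9, 0.5]
W_gate = [[1, 2, 3], [4, 5, 6]]
W_up = [[7, 8, 9], [10, 11, 12]]
W_down = [[1, 1], [2, 2], [3, 3]]

kept = select_channels(scores, keep=2)
g, u, d = prune_ffn(W_gate, W_up, W_down, kept)
print(kept, g, u, d)
```

Because only the intermediate dimension changes, the pruned layer still has the shape of a standard gated FFN, which is what lets the result load into unmodified inference code.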

Technical Innovations

  1. Dual Pruning Strategy: Combines vocabulary and FFN pruning, targeting parameter distribution characteristics of different-scale models
  2. Common-Token Weighting: FFN pruning considers only tokens that remain valid after pruning, avoiding misleading guidance from rare tokens
  3. Architecture Preservation: Only prunes vocabulary size and intermediate dimension, maintaining standard transformer structure
  4. Scale Adaptivity: Adapts to different-scale requirements by adjusting two hyperparameters $V'$ and $I'$

Experimental Setup

Evaluated Models

  • Small Language Models: Qwen 2.5-0.5B, LLaMA 3.2-1B, Gemma 3-1B
  • Large Language Models: LLaMA 3.1-8B, LLaMA 3.1-70B

Datasets and Tasks

  • Calibration Data: 256 samples from C4 dataset
  • Evaluation Tasks: MMLU, HellaSwag, WinoGrande, ARC-C/E, PIQA, GSM8K

Baseline Methods

  • Depth Pruning: ShortGPT, LaCo
  • Width Pruning: SliceGPT, 2SSP, FLAP

Evaluation Metrics

  • Parameter pruning ratio, average accuracy, relative performance retention
  • Pruning time, inference throughput, GPU memory usage

Experimental Results

Main Results

Small Language Model Performance

On Qwen 2.5-0.5B at 35% pruning ratio:

  • COMPACT: Average accuracy 35.3% (70.4% relative performance)
  • Best baseline: 31.4% (62.5% relative performance)

On LLaMA 3.2-1B at 35% pruning ratio:

  • COMPACT: Average accuracy 36.9% (76.4% relative performance)
  • Best baseline: 33.6% (69.6% relative performance)
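
The relative-performance figures can be sanity-checked against the average accuracies. The sketch below assumes that relative performance is the pruned model's average accuracy divided by the unpruned model's average accuracy; that definition is an assumption made here, and `implied_dense_accuracy` is a hypothetical helper.

```python
def implied_dense_accuracy(pruned_acc, relative_perf):
    """Unpruned accuracy implied by a pruned accuracy and its relative performance."""
    return pruned_acc / relative_perf


# (average accuracy in %, relative performance as a fraction), from the results above.
results = {
    "Qwen 2.5-0.5B": {"COMPACT": (35.3, 0.704), "best baseline": (31.4, 0.625)},
    "LLaMA 3.2-1B": {"COMPACT": (36.9, 0.764), "best baseline": (33.6, 0.696)},
}

for model, rows in results.items():
    for method, (acc, rel) in rows.items():
        dense = implied_dense_accuracy(acc, rel)
        print(f"{model:14s} {method:14s} implied unpruned accuracy: {dense:.1f}%")
```

Under that assumption, both rows for a given model should imply roughly the same unpruned accuracy, up to rounding in the reported numbers.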

Large Language Model Performance

On LLaMA 3.1-70B at 35% pruning ratio:

  • COMPACT: Average accuracy 63.7% (80.2% relative performance)
  • 2SSP: 62.8% (79.1% relative performance)

Efficiency Analysis

Pruning Time Comparison (LLaMA 3.1-8B, 35% pruning)

  • COMPACT: 0:32
  • 2SSP: 1:26
  • SliceGPT: 10:48

Inference Efficiency (LLaMA 3.1-8B, 35% pruning)

  • Memory Usage: COMPACT reduces memory usage by 36% (best); ShortGPT and LaCo reduce it by 25%
  • Throughput Improvement: COMPACT improves throughput by 37%; ShortGPT and LaCo improve it by 57%

Ablation Studies

Common act² Effectiveness

On Qwen 2.5-0.5B at 35% pruning:

  • Common act²: 70.4% relative performance
  • Standard act²: 69.2% relative performance
  • |act| method: 67.6% relative performance

Vocabulary-FFN Trade-off Analysis

Fixed 37% pruning ratio, different $V'$ and $I'$ combinations:

  • Pure FFN pruning (V'=151936): 63.0% relative performance
  • Optimal combination (V'=49536): 70.4% relative performance
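
To see how the two hyperparameters trade off at a fixed pruning ratio, the sketch below solves the parameter-count formulas from the Parameter Distribution Analysis section for the intermediate dimension that meets a target ratio once a vocabulary size is chosen. The function `intermediate_dim_for_ratio` and the configuration are illustrative assumptions, not the paper's procedure or a real model's settings.

```python
import math


def intermediate_dim_for_ratio(V, D, L, I, H, ratio, V_new):
    """Largest I' such that the pruned model has at most (1 - ratio) of the
    original parameters, given a chosen vocabulary size V_new.

    Solves 2*V_new*D + 3*L*D*I' + N_attention <= (1 - ratio) * N_total.
    """
    n_attention = 2 * L * D * D * (1 + 1 / H)
    n_total = 2 * V * D + 3 * L * D * I + n_attention
    budget = (1 - ratio) * n_total
    i_new = (budget - 2 * V_new * D - n_attention) / (3 * L * D)
    return max(0, min(I, math.floor(i_new)))


# Hypothetical small-model configuration (assumed for illustration only).
cfg = dict(V=150_000, D=1024, L=24, I=4096, H=4)

ffn_only = intermediate_dim_for_ratio(**cfg, ratio=0.35, V_new=150_000)
mixed = intermediate_dim_for_ratio(**cfg, ratio=0.35, V_new=50_000)
print(f"I' with full vocabulary:   {ffn_only}")
print(f"I' with pruned vocabulary: {mixed}")
```

At the same overall ratio, a smaller vocabulary leaves room for many more FFN channels, which is the trade-off the ablation explores.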

Key Findings

  1. Smooth Degradation: COMPACT exhibits smooth performance decay, while depth pruning methods show sudden performance drops
  2. Architecture Agnosticism: COMPACT can be directly applied to new architectures like Gemma 3, while other methods require architecture-specific modifications
  3. Limited Impact of Rare Tokens: Removing 67% of the vocabulary affects only 4% of the text when it is retokenized
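
The intuition behind the third finding can be illustrated with an idealized Zipf distribution, in which the token of rank r has probability proportional to 1/r. The sketch below is only a back-of-the-envelope model under that assumption (exponent 1, hypothetical vocabulary size); it does not reproduce the paper's measurement, and `rare_token_mass` is a hypothetical helper.

```python
def rare_token_mass(vocab_size, kept_size, exponent=1.0):
    """Probability mass of the tokens ranked below `kept_size` under an
    idealized Zipf distribution with the given exponent."""
    total = 0.0
    kept = 0.0
    for rank in range(1, vocab_size + 1):
        weight = 1.0 / rank ** exponent
        total += weight
        if rank <= kept_size:
            kept += weight
    return 1.0 - kept / total


# Hypothetical vocabulary of 150,000 tokens, keeping the most frequent third.
mass = rare_token_mass(vocab_size=150_000, kept_size=50_000)
print(f"share of token occurrences in the rarest two thirds: {mass:.1%}")
```

Even in this simplified model, the rarest two thirds of the vocabulary account for only a small share of token occurrences; the exact figure depends on the real token distribution.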

Related Work

Depth Pruning

  • Representative Methods: Shortened LLaMA, SLEB, LLM-Streamline
  • Advantages: Maintains standard architecture, significant inference acceleration
  • Disadvantages: Coarse-grained removal causes dramatic performance degradation

Width Pruning

  • Representative Methods: LLM-Pruner, SliceGPT, FLAP, 2SSP
  • Advantages: Fine-grained control, relatively smooth performance decay
  • Disadvantages: Disrupts standard architecture, requires custom inference code

Vocabulary Pruning

  • Existing Work: Primarily focused on language/domain-specific vocabulary trimming
  • This Paper's Contribution: General-purpose LLM vocabulary pruning, forming a complete framework combined with FFN pruning

Conclusions and Discussion

Main Conclusions

  1. COMPACT achieves SOTA pruning performance across multiple model families and scales
  2. The method maintains standard transformer architecture with good deployment compatibility
  3. The dual pruning strategy effectively adapts to parameter distribution characteristics of different-scale models

Limitations

  1. Limited Throughput Improvement: Compared to depth pruning methods, still lags in inference throughput gains
  2. Domain Adaptivity of Vocabulary Pruning: May require retaining more domain-specific vocabulary in specialized domains
  3. Hyperparameter Tuning: Requires finding optimal $V'$ and $I'$ combinations for different pruning ratios

Future Directions

The authors identify a need to further narrow the throughput gap between width and depth pruning methods.

In-Depth Evaluation

Strengths

  1. Solid Theoretical Foundation: Grounded in parameter distribution analysis and Zipf distribution characteristics
  2. Clever Method Design: Common act² elegantly combines vocabulary and FFN pruning
  3. Comprehensive Experiments: Systematic evaluation across multiple model families, scales, and tasks
  4. High Practical Value: Architecture compatibility enables easy deployment

Weaknesses

  1. Limited Novelty: Both vocabulary and FFN pruning are existing techniques; main contribution lies in their combination
  2. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why this combination is effective
  3. Limited Inference Acceleration: Falls short of depth pruning methods on key performance metrics (throughput)

Impact

  1. Academic Contribution: Provides new perspective on LLM pruning, particularly scale-adaptive approaches
  2. Practical Value: Simple and effective method, easy to implement and deploy
  3. Reproducibility: Authors commit to open-sourcing code, facilitating method adoption

Applicable Scenarios

  1. Edge Deployment: Model compression in memory-constrained environments
  2. Multi-scale Deployment: Scenarios requiring simultaneous support for small and large models
  3. Rapid Pruning: Applications requiring model compression in short timeframes

References

The paper cites extensive related work, primarily including:

  • Quantization Methods: GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2024)
  • Depth Pruning: Shortened LLaMA (Kim et al., 2024), LaCo (Yang et al., 2024)
  • Width Pruning: SliceGPT (Ashkboos et al., 2024), FLAP (An et al., 2024)
  • Vocabulary Processing: Related multilingual and domain-specific vocabulary pruning work

Overall Assessment: This is a technically sound and highly practical paper. While relatively limited in theoretical innovation, it contributes an effective and easily deployable solution to LLM pruning through clever method combination and comprehensive experimental validation. Its particular advantages in small language model pruning and architecture compatibility position it well for practical applications.