Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
- Paper ID: 2505.16743
- Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
- Authors: Florentin Beck (University of Tübingen), William Rudman (University of Texas at Austin), Carsten Eickhoff (University of Tübingen)
- Categories: cs.CL cs.AI cs.LG
- Publication Date: October 11, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2505.16743
- Code Link: https://github.com/flobk/TRIM
As the parameter scale of large language models grows exponentially, model deployment faces severe memory and computational resource constraints. While parameter growth brings performance improvements and emergent capabilities, it makes inference in resource-limited environments challenging.
- Uniform Sparsity Constraints: Existing one-shot pruning methods (e.g., Wanda, OWL, AlphaPruning) typically apply identical sparsity rates across all layers or all output dimensions within layers
- Sharp Performance Degradation at High Sparsity: At extreme sparsity levels (>70%), uniform strategies lead to significant performance deterioration
- Neglect of Dimension Heterogeneity: Different output dimensions exhibit significant variations in pruning sensitivity and importance
The paper observes that LLMs possess unique weight and activation characteristics, such as prominent outlier features and highly skewed activation distributions. These characteristics suggest that different output dimensions within layers have varying pruning sensitivities, necessitating more fine-grained sparsity allocation strategies.
- First Dimension-Level Sparsity Allocation: Proposes the first algorithm to compute different sparsity rates for individual output dimensions within each layer
- SOTA Performance at Extreme Sparsity: At 80% sparsity, significantly reduces perplexity compared to existing methods (48% reduction for Qwen2.5-14B, 90%+ for OPT-13B)
- In-depth Empirical Analysis: Reveals heterogeneity in output dimensions regarding pruning sensitivity and downstream task importance
- Plug-and-Play Design: TRIM integrates with any importance-score-based pruning algorithm, demonstrating good generalizability
Given a weight matrix W ∈ R^(D×N), where D is the number of output dimensions and N is the number of input dimensions, the objective is to determine optimal sparsity rates Si for each output dimension Wi,: to maximize overall layer quality while satisfying average sparsity constraints.
TRIM defines a dimension-level sparsity vector S = [S1, S2, ..., SD], where Si ∈ [0, 1] specifies the target sparsity rate for the i-th output dimension. The constraint is:
(1/D) Σi Si = T
where T is the target sparsity rate for the layer.
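To make the per-row sparsity vector concrete, the following is a minimal sketch of row-wise pruning under stated assumptions. The helper name `prune_rows`, the random toy weights, and the example rates in `S` are all hypothetical; magnitude scores stand in for whichever importance score is used.

```python
import numpy as np

def prune_rows(W, scores, S):
    """Zero the lowest-scoring weights of row i until a fraction S[i] is removed."""
    D, N = W.shape
    W_pruned = W.copy()
    for i in range(D):
        k = int(round(S[i] * N))              # number of weights to drop in row i
        if k > 0:
            drop = np.argsort(scores[i])[:k]  # indices of the k least important weights
            W_pruned[i, drop] = 0.0
    return W_pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))                  # toy layer: D = 4 rows, N = 10 inputs
scores = np.abs(W)                            # magnitude scores (illustrative choice)
S = np.array([0.3, 0.5, 0.7, 0.5])            # hypothetical per-row rates, mean = T = 0.5

W_pruned = prune_rows(W, scores, S)
row_sparsity = (W_pruned == 0).mean(axis=1)   # realized sparsity of each row
print(row_sparsity, row_sparsity.mean())
```

Each row ends up with its own sparsity rate while the layer average stays at the target T.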
Algorithm 1: Iterative Dimension-Wise Sparsity Adjustment
- Initialization: Compute unpruned output Y ← WX, initialize Si = T (uniform distribution)
- Iterative Optimization (K iterations):
- Prune according to current S to obtain Wpruned
- Compute pruned output Ŷ ← WprunedX
- Evaluate overall quality qk ← Qmetric(Y, Ŷ)
- Update best configuration (if qk > qbest)
- Compute dimension-wise quality ci ← QmetricDimwise(Yi,:, Ŷi,:)
- Normalize quality scores to the [0, 1] range, giving c'i
- Adjust sparsity rates based on learning rate α: δi ← αc'i
- Re-center to maintain average constraint: Si ← δi - (1/D)Σδj + T
- Return: Optimal sparsity allocation Sbest
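The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the step list above, not the authors' implementation: the function names, the min-max normalization, the Wanda-style score used in the toy example, and the final clipping step are assumptions.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two arrays, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def prune_rows(W, scores, S):
    """Zero the lowest-scoring fraction S[i] of the weights in each row i."""
    W_pruned = W.copy()
    N = W.shape[1]
    for i, s in enumerate(S):
        k = int(round(s * N))
        if k > 0:
            W_pruned[i, np.argsort(scores[i])[:k]] = 0.0
    return W_pruned

def trim_sparsity(W, X, scores, T, alpha=0.1, K=10, s_max=0.95):
    D = W.shape[0]
    Y = W @ X                                   # unpruned layer output
    S = np.full(D, T)                           # start from uniform sparsity
    S_best, q_best = S.copy(), -np.inf
    for _ in range(K):
        Y_hat = prune_rows(W, scores, S) @ X    # output under the current allocation
        q = cosine(Y, Y_hat)                    # layer-level quality
        if q > q_best:
            q_best, S_best = q, S.copy()
        c = np.array([cosine(Y[i], Y_hat[i]) for i in range(D)])  # per-row quality
        span = c.max() - c.min()
        c_norm = (c - c.min()) / span if span > 0 else np.zeros(D)
        delta = alpha * c_norm                  # adjustment scaled by learning rate
        S = delta - delta.mean() + T            # re-center so the mean stays at T
        S = np.clip(S, 0.0, s_max)              # assumed cap; may shift the mean if it triggers
    return S_best, q_best

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))                    # toy weights: D = 8, N = 32
X = rng.normal(size=(32, 64))                   # toy calibration activations
scores = np.abs(W) * np.linalg.norm(X, axis=1)  # Wanda-style score (assumption)
T = 0.7

S_best, q_best = trim_sparsity(W, X, scores, T)
q_uniform = cosine(W @ X, prune_rows(W, scores, np.full(8, T)) @ X)
print(q_uniform, q_best)
```

Because the first iteration evaluates the uniform allocation, the returned configuration is never worse than uniform under the layer-level metric. With alpha = 0.1 and T = 0.7 the clipping step never triggers, so the average constraint holds exactly.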
- Layer-Level Quality: Uses cosine similarity to evaluate pruning quality across the entire layer
- Dimension-Level Quality: Computes cosine similarity for each output dimension to guide sparsity rate adjustment
- Adaptive Learning Rate: Supports both positive and negative learning rates; positive learning rates reduce quality variance, while negative learning rates apply to layers with concentrated outliers
- Quality Variance Minimization: Enhances overall performance by reducing variance in quality degradation across dimensions
- Compatibility Design: Integrates with existing scoring rules (Wanda, Magnitude, SparseGPT, GBLM)
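The effect of the learning rate's sign can be illustrated with a single update step. In this sketch the function name `update_sparsity` and the three per-row quality values are hypothetical, and min-max normalization is assumed.

```python
import numpy as np

def update_sparsity(c, alpha, T):
    """One re-centered update: S_i = alpha * c'_i - mean(alpha * c') + T."""
    c_norm = (c - c.min()) / (c.max() - c.min())  # assumes the values in c are not all equal
    delta = alpha * c_norm
    return delta - delta.mean() + T

c = np.array([0.99, 0.90, 0.60])                 # hypothetical per-row cosine similarities
S_pos = update_sparsity(c, alpha=0.1, T=0.7)     # positive learning rate
S_neg = update_sparsity(c, alpha=-0.1, T=0.7)    # negative learning rate
print(S_pos, S_neg)
```

With a positive rate, the best-preserved row receives the highest sparsity and the worst-preserved row the lowest, which pulls per-row quality closer together. A negative rate reverses that ordering. In both cases the mean stays at T.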
- Models: Qwen2.5 (3B/7B/14B/32B/72B), LLaMA-2 (7B/13B), OPT (6.7B/13B)
- Evaluation Data: WikiText validation set (perplexity), C4 and Pile (generalization verification)
- Downstream Tasks: BoolQ, RTE, HellaSwag, WinoGrande, ARC Easy/Challenge, OpenBookQA
- Perplexity: Evaluates language modeling capability on WikiText validation set
- Zero-Shot Accuracy: Average performance on 7 downstream tasks
- Baseline Methods: OWL, AlphaPruning (based on Wanda)
- Ablation Studies: Impact of different quality metrics, learning rate settings, and iteration counts
- Calibration Samples: Randomly selected from C4 dataset, sequence length 2048
- Sparsity Limits: Maximum 95% per dimension to prevent overfitting
- Hyperparameters: K=10 iterations, learning rate α determined via grid search
| Model | OWL Baseline (perplexity) | OWL+TRIM (perplexity) | Improvement |
|---|---|---|---|
| Qwen2.5-14B | 348.48 | 180.67 | -48% |
| OPT-13B | 6461.43 | 324.14 | -95% |
| LLaMA-2-13B | 225.04 | 154.83 | -31% |
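The last column can be checked as the relative change in perplexity from the baseline. A short sketch using only the values in the table (the variable and function names are illustrative):

```python
# Perplexity pairs (OWL baseline, OWL+TRIM) copied from the table above.
results = {
    "Qwen2.5-14B": (348.48, 180.67),
    "OPT-13B": (6461.43, 324.14),
    "LLaMA-2-13B": (225.04, 154.83),
}

def relative_change(baseline, new):
    """Percentage change from baseline; negative means lower perplexity."""
    return 100.0 * (new - baseline) / baseline

changes = {model: round(relative_change(b, n)) for model, (b, n) in results.items()}
print(changes)
```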
TRIM achieves performance improvements across all tested models and sparsity levels, with average gains of 0.46-0.65 percentage points at 80% sparsity.
- Layer-Level Quality: Cosine similarity demonstrates the most stable performance
- Dimension-Level Quality: Cosine similarity proves more reliable than MSE and PSNR
TRIM shows improvements across different scoring rules (Magnitude, SparseGPT, GBLM), validating the method's generalizability.
Gini coefficient analysis reveals significant variations in importance score concentration across different output dimensions, leading to different pruning sensitivities.
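As a rough illustration of how score concentration can be measured per output dimension, the sketch below computes a Gini coefficient for each row of a non-negative score matrix. The exact formula used in the paper is not given here, so the standard sorted-rank form is assumed, and the toy scores are hypothetical.

```python
import numpy as np

def gini_per_row(scores):
    """Gini coefficient of each row (non-negative values, nonzero row sums)."""
    x = np.sort(scores, axis=1)          # ascending order within each row
    n = x.shape[1]
    ranks = np.arange(1, n + 1)
    return 2.0 * (x * ranks).sum(axis=1) / (n * x.sum(axis=1)) - (n + 1) / n

scores = np.array([
    [1.0, 1.0, 1.0, 1.0],                # importance spread evenly across weights
    [0.0, 0.0, 0.0, 4.0],                # importance concentrated in one weight
])
g = gini_per_row(scores)
print(g)
```

The evenly spread row has a coefficient of 0, while the concentrated row reaches (n - 1)/n, the maximum for n values. Rows with high concentration depend on a few weights, which is one way pruning sensitivity can differ between dimensions.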
Quality degradation accelerates as sparsity increases, making fine-grained allocation increasingly important.
Experiments show enormous variations in the impact of completely removing individual dimensions:
- Minimum L2 norm dimension: Perplexity increases by only 0.16
- Maximum L2 norm dimension: Perplexity surges to 273.10
- Gradient-Based Methods: SNIP, GraSP, SynFlow, etc., requiring gradient information and retraining
- One-Shot Pruning Methods: SparseGPT, Wanda, etc., requiring no retraining but with limited performance
- Layer-Adaptive Methods: OWL, AlphaPruning, etc., allocating different sparsity rates to different layers
TRIM is the first method to allocate sparsity at the dimension level within layers, filling a gap left by existing methods, which offer no such fine-grained control.
- Necessity of Dimension-Level Sparsity Allocation: At extreme sparsity levels, fine-grained control is crucial for maintaining model performance
- Effectiveness of Quality Variance Minimization: Balancing quality degradation across dimensions significantly enhances overall performance
- Method Generalizability: TRIM integrates with multiple existing pruning algorithms, demonstrating good extensibility
- Complexity of Learning Rate Selection: Layers with concentrated outliers require negative learning rates, increasing hyperparameter tuning complexity
- Unstructured Sparsity: Current method does not directly support structured sparsity patterns like n:m
- Computational Overhead: Iterative process adds approximately 8% runtime overhead
- Structured Sparsity Support: Extend TRIM to support hardware-friendly sparsity patterns
- Automatic Learning Rate Selection: Develop adaptive mechanisms to reduce hyperparameter tuning requirements
- Theoretical Analysis: Establish theoretical frameworks for dimension importance and pruning sensitivity
- Strong Novelty: First to propose dimension-level sparsity allocation within layers
- Comprehensive Experiments: Validates method effectiveness across multiple model families and tasks
- Analytical Support: In-depth empirical analysis explains why the method is effective
- High Practical Value: Plug-and-play design facilitates easy integration into existing systems
- Method Complexity: Increases algorithmic complexity and hyperparameters compared to baseline methods
- Hardware Adaptability: Unstructured sparsity limits acceleration effects on specialized hardware
- Insufficient Theoretical Analysis: Lacks theoretical guarantees for optimal sparsity allocation
- Academic Contribution: Provides new research directions for LLM pruning
- Practical Value: Significant for deploying large models in resource-constrained environments
- Reproducibility: Open-source code facilitates subsequent research
- Extreme Sparsity Requirements: Particularly suitable for scenarios requiring >70% sparsity
- Resource-Constrained Environments: Edge devices, mobile platforms, and other computationally limited settings
- Research Purposes: Provides new benchmarks and insights for pruning algorithm research
The paper cites important works in the pruning domain, including:
- Classical pruning methods: Le Cun et al. (1989), Han et al. (2015)
- Modern LLM pruning: Wanda (Sun et al., 2024), SparseGPT (Frantar and Alistarh, 2023)
- Layer-adaptive methods: OWL (Yin et al., 2024), AlphaPruning (Lu et al., 2024)
Summary: TRIM significantly improves LLM pruning performance at extreme sparsity levels by introducing dimension-level sparsity allocation. This method possesses important theoretical value and practical significance, opening new research directions in large model compression. Despite certain limitations, its innovation and effectiveness make it an important contribution to the field.