TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Beck, Rudman, Eickhoff
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM

Basic Information

  • Paper ID: 2505.16743
  • Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
  • Authors: Florentin Beck (University of Tübingen), William Rudman (University of Texas at Austin), Carsten Eickhoff (University of Tübingen)
  • Categories: cs.CL cs.AI cs.LG
  • Publication Date: October 11, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2505.16743
  • Code Link: https://github.com/flobk/TRIM

Abstract

Large Language Models (LLMs) face significant computational and memory challenges due to their massive parameter scale, making model pruning essential for efficient deployment. Existing one-shot pruning methods typically apply uniform sparsity constraints across layers or within layers, exhibiting poor performance at high sparsity rates. This paper proposes TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies different sparsity rates to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-level sparsity allocation, focusing on reducing variance in quality preservation across outputs to retain critical information. TRIM seamlessly integrates with existing layer-level pruning strategies. Perplexity and zero-shot task evaluations across multiple LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves state-of-the-art results and enhanced stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods.

Research Background and Motivation

Problem Definition

As the parameter count of large language models grows, deployment runs into severe memory and compute constraints. Larger models bring better performance and emergent capabilities, but they also make inference in resource-limited environments difficult.

Limitations of Existing Methods

  1. Uniform Sparsity Constraints: Existing one-shot pruning methods typically apply identical sparsity rates either across all layers (e.g., Wanda) or across all output dimensions within a layer (e.g., OWL and AlphaPruning, which vary sparsity only between layers)
  2. Sharp Performance Degradation at High Sparsity: At extreme sparsity levels (>70%), uniform strategies lead to significant performance deterioration
  3. Neglect of Dimension Heterogeneity: Different output dimensions exhibit significant variations in pruning sensitivity and importance

Research Motivation

The paper observes that LLMs possess unique weight and activation characteristics, such as prominent outlier features and highly skewed activation distributions. These characteristics suggest that different output dimensions within layers have varying pruning sensitivities, necessitating more fine-grained sparsity allocation strategies.

Core Contributions

  1. First Dimension-Level Sparsity Allocation: Proposes the first algorithm to compute different sparsity rates for individual output dimensions within each layer
  2. SOTA Performance at Extreme Sparsity: At 80% sparsity, significantly reduces perplexity compared to existing methods (48% reduction for Qwen2.5-14B, 90%+ for OPT-13B)
  3. In-depth Empirical Analysis: Reveals heterogeneity in output dimensions regarding pruning sensitivity and downstream task importance
  4. Plug-and-Play Design: TRIM integrates with any importance-score-based pruning algorithm, demonstrating good generalizability

Methodology Details

Task Definition

Given a weight matrix W ∈ R^(D×N), where D is the number of output dimensions and N is the number of input dimensions, the objective is to determine optimal sparsity rates Si for each output dimension Wi,: to maximize overall layer quality while satisfying average sparsity constraints.

Core Algorithm: TRIM

Dimension-Level Sparsity Vector

TRIM defines a dimension-level sparsity vector S = [S1, S2, ..., SD], where Si ∈ [0, 1] specifies the target sparsity rate for the i-th output dimension. The constraint is:

1/D * Σ(i=1 to D) Si = T

where T is the target sparsity rate for the layer.
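
As a small illustration of the constraint above, the following sketch checks that a per-row sparsity vector is valid and converts it into a number of weights to remove per row. The function names and the example values are hypothetical and are not taken from the paper or its code.

```python
def check_allocation(S, T, tol=1e-9):
    """Return True if every S_i lies in [0, 1] and the mean of S equals T."""
    in_range = all(0.0 <= s <= 1.0 for s in S)
    mean_ok = abs(sum(S) / len(S) - T) <= tol
    return in_range and mean_ok


def weights_pruned_per_row(S, n_inputs):
    """Number of weights removed in each row of a D x N matrix (rounded)."""
    return [round(s * n_inputs) for s in S]


# Hypothetical allocation for a layer with D = 4 rows and N = 10 inputs.
S = [0.5, 0.7, 0.9, 0.7]
T = 0.7
print(check_allocation(S, T))
print(weights_pruned_per_row(S, n_inputs=10))
```

Rows may differ in sparsity as long as the layer average stays at T, which is the degree of freedom TRIM exploits.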

Iterative Adjustment Algorithm

Algorithm 1: Iterative Dimension-Wise Sparsity Adjustment

  1. Initialization: Compute unpruned output Y ← WX, initialize Si = T (uniform distribution)
  2. Iterative Optimization (K iterations):
    • Prune according to current S to obtain Wpruned
    • Compute pruned output Ŷ ← WprunedX
    • Evaluate overall quality qk ← Qmetric(Y, Ŷ)
    • Update best configuration (if qk > qbest)
    • Compute dimension-wise quality ci ← QmetricDimwise(Yi,:, Ŷi,:)
    • Normalize quality scores to the [0, 1] range, giving c'i
    • Scale the normalized scores by the learning rate α: δi ← α c'i
    • Re-center to maintain the average constraint: Si ← δi - (1/D) Σj δj + T
  3. Return: Optimal sparsity allocation Sbest
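
The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal, dependency-free sketch under stated assumptions, not the authors' implementation: it uses weight magnitude as a stand-in importance score (the paper builds on scores such as Wanda), nested lists instead of tensors, and hypothetical names (`trim_allocate`, `prune_rows`). It assumes α is small enough that every Si stays inside [0, 1].

```python
import math


def matvec_rows(W, X):
    """Y = W X for W (D x N) and X (N x M), both nested lists."""
    cols = list(zip(*X))
    return [[sum(w * x for w, x in zip(row, col)) for col in cols] for row in W]


def cosine(a, b):
    """Cosine similarity of two flat vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def flatten(M):
    return [v for row in M for v in row]


def prune_rows(W, S):
    """Zero the round(S_i * N) smallest-magnitude weights of row i.

    Magnitude is an assumed stand-in for the importance score.
    """
    pruned = []
    for row, s in zip(W, S):
        k = max(0, min(len(row), round(s * len(row))))
        drop = set(sorted(range(len(row)), key=lambda j: abs(row[j]))[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


def trim_allocate(W, X, T, K=10, alpha=0.1):
    """Return (S_best, q_best) following the steps listed above."""
    D = len(W)
    Y = matvec_rows(W, X)                      # unpruned output
    S = [T] * D                                # uniform start
    q_best, S_best = -math.inf, list(S)
    for _ in range(K):
        Y_hat = matvec_rows(prune_rows(W, S), X)
        q = cosine(flatten(Y), flatten(Y_hat))  # layer-level quality
        if q > q_best:
            q_best, S_best = q, list(S)
        c = [cosine(y, yh) for y, yh in zip(Y, Y_hat)]  # per-row quality
        lo, hi = min(c), max(c)
        c_norm = [(ci - lo) / (hi - lo) if hi > lo else 0.0 for ci in c]
        delta = [alpha * ci for ci in c_norm]
        mean_delta = sum(delta) / D
        S = [d - mean_delta + T for d in delta]  # re-centered: mean(S) == T
    return S_best, q_best


# Hypothetical toy layer: D = 3 rows, N = 8 inputs, M = 3 calibration columns.
W = [[4.0, 0.1, 0.2, 3.0, 0.3, 2.0, 0.1, 1.0],
     [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1],
     [0.05, 5.0, 0.1, 0.02, 0.3, 0.1, 4.0, 0.2]]
X = [[1.0, 0.5, -1.0], [0.2, -0.3, 0.4], [0.7, 0.1, 0.3], [-0.5, 0.9, 0.2],
     [0.3, 0.3, -0.6], [0.8, -0.2, 0.1], [-0.1, 0.4, 0.5], [0.6, -0.7, 0.2]]
S_best, q_best = trim_allocate(W, X, T=0.5, K=5, alpha=0.3)
print(S_best, q_best)
```

Because the uniform allocation is evaluated in the first iteration and only strictly better allocations replace it, the returned quality is never worse than the uniform baseline under the chosen layer-level metric.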

Quality Metrics

  • Layer-Level Quality: Uses cosine similarity to evaluate pruning quality across the entire layer
  • Dimension-Level Quality: Computes cosine similarity for each output dimension to guide sparsity rate adjustment
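
To make the difference between the two metrics concrete, the sketch below computes cosine similarity once over the flattened layer output and once per output row. The helper names and the toy outputs are assumptions made for illustration. The example is chosen so that the layer-level score looks nearly perfect while one row is badly degraded, which is the kind of variance the dimension-level metric is meant to expose.

```python
import math


def cosine(a, b):
    """Cosine similarity of two flat vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def layer_quality(Y, Y_hat):
    """Cosine similarity over all outputs of the layer, flattened."""
    flat = lambda M: [v for row in M for v in row]
    return cosine(flat(Y), flat(Y_hat))


def row_quality(Y, Y_hat):
    """Cosine similarity computed separately for each output dimension."""
    return [cosine(y, yh) for y, yh in zip(Y, Y_hat)]


# Hypothetical outputs: row 0 is large and intact, row 1 is small and distorted.
Y = [[10.0, 10.0], [0.1, 0.1]]
Y_hat = [[10.0, 10.0], [0.1, -0.1]]
print(layer_quality(Y, Y_hat))
print(row_quality(Y, Y_hat))
```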

Technical Innovations

  1. Adaptive Learning Rate: Supports both positive and negative learning rates; positive learning rates reduce quality variance, while negative learning rates apply to layers with concentrated outliers
  2. Quality Variance Minimization: Enhances overall performance by reducing variance in quality degradation across dimensions
  3. Compatibility Design: Integrates with existing scoring rules (Wanda, Magnitude, SparseGPT, GBLM)
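
The effect of the learning rate's sign can be seen directly in the update rule. In this sketch (the function name `recenter` and the input values are hypothetical), a positive α assigns more sparsity to rows that currently retain more quality, which evens out quality across rows, while a negative α reverses the allocation.

```python
def recenter(c_norm, alpha, T):
    """Apply delta_i = alpha * c'_i and re-center so that mean(S) == T."""
    delta = [alpha * c for c in c_norm]
    mean_delta = sum(delta) / len(delta)
    return [d - mean_delta + T for d in delta]


# Normalized per-row quality: row 0 is the worst preserved, row 2 the best.
c_norm = [0.0, 0.5, 1.0]
print(recenter(c_norm, alpha=0.1, T=0.7))   # best-preserved row is pruned most
print(recenter(c_norm, alpha=-0.1, T=0.7))  # best-preserved row is pruned least
```

In both cases the average sparsity stays at T, so the sign of α only redistributes the pruning budget within the layer.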

Experimental Setup

Datasets

  • Models: Qwen2.5 (3B/7B/14B/32B/72B), LLaMA-2 (7B/13B), OPT (6.7B/13B)
  • Evaluation Data: WikiText validation set (perplexity), C4 and Pile (generalization verification)
  • Downstream Tasks: BoolQ, RTE, HellaSwag, WinoGrande, ARC Easy/Challenge, OpenBookQA

Evaluation Metrics

  • Perplexity: Evaluates language modeling capability on WikiText validation set
  • Zero-Shot Accuracy: Average performance on 7 downstream tasks
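
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows the standard definition only; the function name and the toy log-probabilities are illustrative and it says nothing about how the paper's evaluation pipeline batches or tokenizes WikiText.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it chose uniformly among four tokens.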

Comparison Methods

  • Baseline Methods: OWL, AlphaPruning (based on Wanda)
  • Ablation Studies: Impact of different quality metrics, learning rate settings, and iteration counts

Implementation Details

  • Calibration Samples: Randomly selected from C4 dataset, sequence length 2048
  • Sparsity Limits: Maximum 95% per dimension to prevent overfitting
  • Hyperparameters: K=10 iterations, learning rate α determined via grid search

Experimental Results

Main Results

Perplexity Performance (80% Sparsity)

Model          OWL Baseline   OWL+TRIM   Improvement
Qwen2.5-14B    348.48         180.67     -48%
OPT-13B        6461.43        324.14     -95%
LLaMA-2-13B    225.04         154.83     -31%
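
The improvement figures are relative changes in perplexity, which can be checked from the two perplexity values reported for each model. The helper name below is hypothetical; the numbers are the ones reported above.

```python
def relative_change_pct(baseline, new):
    """Relative change from baseline to new, in percent (negative = reduction)."""
    return (new - baseline) / baseline * 100.0


results = {
    "Qwen2.5-14B": (348.48, 180.67),
    "OPT-13B": (6461.43, 324.14),
    "LLaMA-2-13B": (225.04, 154.83),
}
for model, (owl, owl_trim) in results.items():
    print(f"{model}: {relative_change_pct(owl, owl_trim):.1f}%")
```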

Zero-Shot Task Performance

TRIM achieves performance improvements across all tested models and sparsity levels, with average gains of 0.46-0.65 percentage points at 80% sparsity.

Ablation Studies

Quality Metric Comparison

  • Layer-Level Quality: Cosine similarity demonstrates the most stable performance
  • Dimension-Level Quality: Cosine similarity proves more reliable than MSE and PSNR

Generalization Across Different Pruning Metrics

TRIM shows improvements across different scoring rules (Magnitude, SparseGPT, GBLM), validating the method's generalizability.
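
The reason a per-row allocation combines with different scoring rules is that row-wise pruning only needs a score per weight. The sketch below illustrates this with two interchangeable score functions: plain magnitude, and the Wanda score |W_ij| * ||X_j||_2, where X_j is the j-th input feature across calibration tokens. The function names and toy data are assumptions for illustration, not the API of the TRIM repository.

```python
import math


def magnitude_scores(W, X=None):
    """Importance of each weight is its absolute value."""
    return [[abs(w) for w in row] for row in W]


def wanda_scores(W, X):
    """|W_ij| * ||X_j||_2, with X given as N input features x M tokens."""
    norms = [math.sqrt(sum(v * v for v in feature)) for feature in X]
    return [[abs(w) * n for w, n in zip(row, norms)] for row in W]


def prune_by_scores(W, scores, S):
    """Zero the round(S_i * N) lowest-scoring weights in each row i."""
    pruned = []
    for row, sc, s in zip(W, scores, S):
        k = max(0, min(len(row), round(s * len(row))))
        drop = set(sorted(range(len(row)), key=lambda j: sc[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# One row, four inputs; input feature 0 has much larger activations.
W = [[1.0, 2.0, 3.0, 4.0]]
X = [[10.0, 10.0], [0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]
S = [0.5]
print(prune_by_scores(W, magnitude_scores(W), S))
print(prune_by_scores(W, wanda_scores(W, X), S))
```

With magnitude scoring the smallest weight is dropped even though it multiplies the largest activations; the activation-aware score keeps it. The per-row sparsity vector S is used identically in both cases.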

Key Findings

Observation 1: Dimension Heterogeneity

Gini coefficient analysis reveals significant variations in importance score concentration across different output dimensions, leading to different pruning sensitivities.
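
As an illustration of the measure, the sketch below computes the Gini coefficient of a row's importance scores using the standard sorted-values formula. A value near 0 means importance is spread evenly across the row; a value near 1 means it is concentrated in a few weights. How exactly the paper applies the coefficient (for example, which scores it is computed over) is not specified here, so the example rows are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (absolute values are used)."""
    xs = sorted(abs(v) for v in values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


flat_row = [1.0, 1.0, 1.0, 1.0]          # evenly spread importance
concentrated_row = [0.0, 0.0, 0.0, 1.0]  # one dominant weight
print(gini(flat_row), gini(concentrated_row))
```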

Observation 2: Non-linear Quality Degradation

Quality degradation accelerates as sparsity increases, making fine-grained allocation increasingly important.

Observation 3: Dimension Importance Differences

Experiments show enormous variations in the impact of completely removing individual dimensions:

  • Minimum L2 norm dimension: Perplexity increases by only 0.16
  • Maximum L2 norm dimension: Perplexity surges to 273.10
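
The removal experiment can be pictured as ranking rows by L2 norm and zeroing one of them. The following sketch shows only that selection and removal step on a toy matrix; the helper names are hypothetical, and measuring the resulting perplexity requires running the full model, which is outside this example.

```python
import math


def row_l2_norms(W):
    """L2 norm of each output dimension (row) of W."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def remove_row(W, i):
    """Return a copy of W with output dimension i zeroed out entirely."""
    return [[0.0] * len(row) if r == i else list(row) for r, row in enumerate(W)]


W = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0]]
norms = row_l2_norms(W)
i_min = min(range(len(W)), key=lambda r: norms[r])
i_max = max(range(len(W)), key=lambda r: norms[r])
print(i_min, i_max)
print(remove_row(W, i_max))
```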

Related Work

Pruning Method Classification

  1. Gradient-Based Methods: SNIP, GraSP, SynFlow, etc., requiring gradient information and retraining
  2. One-Shot Pruning Methods: SparseGPT, Wanda, etc., requiring no retraining but with limited performance
  3. Layer-Adaptive Methods: OWL, AlphaPruning, etc., allocating different sparsity rates to different layers

TRIM's Positioning

TRIM is the first method to perform dimension-level sparsity allocation within layers, filling a gap left by existing methods, which control sparsity no finer than the layer level.

Conclusions and Discussion

Main Conclusions

  1. Necessity of Dimension-Level Sparsity Allocation: At extreme sparsity levels, fine-grained control is crucial for maintaining model performance
  2. Effectiveness of Quality Variance Minimization: Balancing quality degradation across dimensions significantly enhances overall performance
  3. Method Generalizability: TRIM integrates with multiple existing pruning algorithms, demonstrating good extensibility

Limitations

  1. Complexity of Learning Rate Selection: Layers with concentrated outliers require negative learning rates, increasing hyperparameter tuning complexity
  2. Unstructured Sparsity: Current method does not directly support structured sparsity patterns like n:m
  3. Computational Overhead: Iterative process adds approximately 8% runtime overhead

Future Directions

  1. Structured Sparsity Support: Extend TRIM to support hardware-friendly sparsity patterns
  2. Automatic Learning Rate Selection: Develop adaptive mechanisms to reduce hyperparameter tuning requirements
  3. Theoretical Analysis: Establish theoretical frameworks for dimension importance and pruning sensitivity

In-Depth Evaluation

Strengths

  1. Strong Novelty: First method to propose dimension-level sparsity allocation within layers
  2. Comprehensive Experiments: Validates the method's effectiveness across multiple model families and tasks
  3. Empirical Support: In-depth empirical analysis helps explain why the method is effective
  4. High Practical Value: Plug-and-play design facilitates easy integration into existing systems

Weaknesses

  1. Method Complexity: Increases algorithmic complexity and hyperparameters compared to baseline methods
  2. Hardware Adaptability: Unstructured sparsity limits acceleration effects on specialized hardware
  3. Insufficient Theoretical Analysis: Lacks theoretical guarantees for optimal sparsity allocation

Impact

  1. Academic Contribution: Provides new research directions for LLM pruning
  2. Practical Value: Significant for deploying large models in resource-constrained environments
  3. Reproducibility: Open-source code facilitates subsequent research

Applicable Scenarios

  1. Extreme Sparsity Requirements: Particularly suitable for scenarios requiring >70% sparsity
  2. Resource-Constrained Environments: Edge devices, mobile platforms, and other computationally limited settings
  3. Research Purposes: Provides new benchmarks and insights for pruning algorithm research

References

The paper cites important works in the pruning domain, including:

  • Classical pruning methods: Le Cun et al. (1989), Han et al. (2015)
  • Modern LLM pruning: Sun et al. (2024) Wanda, Frantar and Alistarh (2023) SparseGPT
  • Layer-adaptive methods: Yin et al. (2024) OWL, Lu et al. (2024) AlphaPruning

Summary: TRIM significantly improves LLM pruning performance at extreme sparsity levels by introducing dimension-level sparsity allocation. This method possesses important theoretical value and practical significance, opening new research directions in large model compression. Despite certain limitations, its innovation and effectiveness make it an important contribution to the field.