2025-11-17T08:49:21.061208

PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models

Zou, Yin, Pei et al.
Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/PermLLM.
academic

PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models

Basic Information

  • Paper ID: 2510.10136
  • Title: PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
  • Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu (Chinese University of Hong Kong)
  • Classification: cs.LG cs.AI
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.10136
  • Code Link: https://github.com/lanchengzou/PermLLM

Abstract

Channel permutation is a powerful technique that improves the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the preservation of important weights. However, traditional channel permutation methods rely on hand-crafted quality metrics that often fail to accurately capture the true impact of pruning on model performance. To address this limitation, this paper proposes PermLLM, a post-training pruning framework for N:M sparsity that introduces Learnable Channel Permutation (LCP). LCP leverages Sinkhorn normalization to convert discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Furthermore, PermLLM employs an efficient block-wise channel permutation strategy that significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods and adaptively optimizes channel permutation to effectively mitigate pruning-induced errors.

Research Background and Motivation

Problem Definition

  1. Core Problem: Traditional channel permutation methods use hand-crafted quality metrics (such as the sum of retained weight importance) to evaluate permutation schemes, but these metrics diverge from actual pruning errors.
  2. Significance: With the rapid growth in large language model scale, model compression techniques (such as pruning) are crucial for efficient deployment. N:M sparsity has gained considerable attention due to its hardware-friendly nature (supported by NVIDIA Sparse Tensor Cores).
  3. Existing Limitations:
    • Hand-crafted quality metrics fail to accurately reflect the true impact of pruning on model performance
    • Traditional methods cannot adequately capture complex inter-layer interactions
    • The optimization space is enormous (Cin! possible permutations for Cin input channels)

Research Motivation

The paper demonstrates the problem through a concrete example (Figure 1): channel permutations that maximize importance scores may lead to larger output errors, indicating a fundamental discrepancy between hand-crafted metrics and actual performance.

Core Contributions

  1. First Introduction of Learnable Channel Permutation (LCP): Transforms the discrete channel permutation problem into a differentiable optimization problem, enabling end-to-end learning.
  2. Sinkhorn Normalization Technique: Leverages Sinkhorn normalization to relax discrete permutation matrices into soft permutation matrices, addressing the non-differentiability of permutation matrices.
  3. Block-wise Channel Permutation Strategy: Significantly reduces parameter complexity from O(C²ᵢₙ) to O(Cᵢₙ×B) and computational complexity from O(C³ᵢₙ) to O(Cᵢₙ×B²).
  4. Universal Framework Design: Seamlessly integrates with existing one-shot pruning methods (Wanda, RIA, etc.).
  5. Superior Experimental Performance: Validates the method's effectiveness across multiple models including LLaMA series, Qwen, and OPT.

Methodology Details

Task Definition

Given a pre-trained weight matrix W ∈ R^(Cout×Cin), the objective is to find the optimal permutation matrix P such that the reordered weight matrix Ŵ = WP minimizes the output difference from the original dense model after applying N:M sparsity.

Core Technical Architecture

1. Soft Permutation Matrix Relaxation

Relaxes the hard permutation matrix P into a soft permutation matrix P̂:

S₀(X) = exp(X)
Sᵢ(X) = Tc(Tr(Sᵢ₋₁(X)))
S(X) = lim(l→∞) Sl(X)
P̂ = SL(WP/τ)

where Tr and Tc denote row and column normalization operations respectively, and τ is a temperature parameter controlling the hardness of the soft permutation matrix.

2. Hardening Process and Gradient Approximation

During forward propagation, the soft permutation matrix is hardened into a strict permutation matrix via the Hungarian algorithm:

P = argmax P∈P Tr(P⊤P̂)

During backpropagation, the Straight-Through Estimator (STE) is used to approximate gradients: ∂P/∂P̂ = 1.

3. Block-wise Channel Permutation

To reduce computational complexity, channels are partitioned into multiple blocks of size B, with permutation performed independently within each block:

PB = diag(P₁, P₂, ..., PNB)
ŴB = WPB

Parameter count is reduced from C²ᵢₙ to Cᵢₙ×B, and computational complexity from O(C³ᵢₙ) to O(Cᵢₙ×B²).

Optimization Objective

PermLLM directly minimizes the cosine similarity loss between dense and sparse model outputs:

Lcosine(y, ỹ) = 1 - (y·ỹ)/(||y||·||ỹ||)

Integration with Existing Pruning Methods

PermLLM can be integrated with any importance-metric-based one-shot pruning method. For a given importance matrix S, the permuted importance matrix is Ŝ = SPB, and the mask is obtained as:

argmax M ∑∑ (M⊙Ŝ)i,kM:(k+1)M

STE is used to handle the non-differentiability of argmax.

Experimental Setup

Datasets and Models

  • Models: LLaMA 7B-13B, LLaMA-2 7B-13B, LLaMA-3.1 8B, Qwen-2.5 7B, OPT 6.7B
  • Calibration Data: 128 randomly selected samples from the C4 dataset, each containing 1024 tokens
  • Evaluation Tasks:
    • Language Modeling: Wikitext2 (perplexity)
    • Zero-shot Tasks: HellaSwag, ARC-Easy/Challenge, OpenBookQA, RTE

Comparison Methods

  • Baseline Methods: SparseGPT, Wanda, RIA
  • Traditional Channel Permutation: Wanda+CP, RIA+CP
  • Proposed Method: PermLLMWanda, PermLLMRIA

Implementation Details

  • Optimizer: AdamW
  • Learning Rate: {1e-3, 5e-3}
  • Sinkhorn Iterations: 5
  • Temperature Parameter: Linearly decayed from 1 to 0.1
  • Block Size: 64
  • Training Time: ~2.5 hours for 7B models (4 GPUs), ~5.5 hours for 13B models (8 GPUs)

Experimental Results

Main Results

Language Modeling Performance (Wikitext2 Perplexity)

MethodLLaMA 7BLLaMA-2 7BLLaMA-3.1 8BQwen-2.5 7B
Dense5.685.476.247.74
Wanda11.5912.1623.4224.44
Wanda+CP11.0711.0021.0918.76
PermLLMWanda9.419.3914.0313.58
RIA+CP10.9910.2619.8017.58
PermLLMRIA9.959.6015.7915.93

Zero-shot Task Average Accuracy

ModelWandaWanda+CPPermLLMWandaImprovement
LLaMA 7B41.3743.9445.67+4.3%
LLaMA-2 7B42.1243.4446.59+4.47%
LLaMA-3.1 8B38.9140.7243.33+4.42%

Inference Acceleration

Using custom CUDA kernels, the channel permutation operation achieves 84× speedup compared to PyTorch implementation, with overall inference speed improvement of approximately 1.67×.

Ablation Studies

Impact of Sinkhorn Normalization Iterations

Experiments demonstrate that 5 Sinkhorn normalization iterations achieve a good performance balance.

Impact of Block Size

Block SizeAverage AccuracyWikitext2 PerplexityTraining Time
3243.589.502h
6446.599.392.5h
12847.099.076h

Block size 64 provides the best balance between performance and efficiency.

Calibration Dataset Robustness

Experiments on different calibration datasets (Pile, Wikitext2, C4) demonstrate good robustness of the method.

Case Analysis

The paper provides mask visualizations (Figure 3), showing that the permutations learned by PermLLM produce different weight retention patterns compared to traditional methods, validating the effectiveness of end-to-end optimization.

Large Language Model Pruning

  • Structured Pruning: Removes coarse-grained structures (channels, layers, blocks)
  • Unstructured Pruning: Most flexible but difficult to accelerate in hardware
  • Semi-structured Pruning: N:M sparsity balances flexibility and hardware-friendliness

Channel Permutation Techniques

  • Early work primarily focused on exhaustive search for small-scale networks
  • RIA proposed heuristic channel assignment methods
  • This paper is the first to introduce learnable end-to-end optimization

N:M Sparsity Learning

  • Methods like SR-STE train N:M sparse models from scratch
  • Methods like MaskLLM learn semi-structured sparsity
  • This paper focuses on post-training pruning scenarios

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: PermLLM significantly outperforms traditional channel permutation methods across multiple models and tasks
  2. Generality: Seamlessly integrates with existing pruning methods
  3. Practicality: Achieves practical computational efficiency through block-wise strategy and custom CUDA kernels

Limitations

  1. Computational Overhead: Despite the block-wise strategy significantly reducing complexity, the method still requires more computational resources compared to traditional approaches
  2. Application Scope: The method is specifically designed for semi-structured pruning; its application to other compression tasks (such as quantization) remains to be explored
  3. Convergence: Larger block sizes require more iterations to converge

Future Directions

  1. Explore applications in other model compression tasks such as quantization
  2. Further improve training efficiency
  3. Investigate more efficient partial-layer optimization strategies

In-Depth Evaluation

Strengths

  1. Strong Technical Innovation: First to transform channel permutation into an end-to-end learnable problem with novel technical approach
  2. Solid Theoretical Foundation: The combination of Sinkhorn normalization and STE is theoretically sound
  3. Comprehensive Experiments: Thorough evaluation across multiple models, datasets, and tasks
  4. Refined Engineering Implementation: Provides custom CUDA kernels with consideration for practical deployment
  5. Clear Writing: Well-structured paper with accurate technical descriptions

Weaknesses

  1. Computational Overhead: Despite block-wise strategy, training cost remains relatively high
  2. Insufficient Theoretical Analysis: Lacks convergence analysis and theoretical guarantees
  3. Limited Applicability: Primarily applicable to N:M sparsity; generalization remains to be verified
  4. Incomplete Baseline Comparison: Comparison with some recent pruning methods is insufficient

Impact

  1. Academic Value: Opens new technical pathways for channel permutation research
  2. Practical Value: Has direct application value in large language model compression
  3. Reproducibility: Provides complete code implementation and detailed experimental settings

Applicable Scenarios

  1. Large Language Model Deployment: Particularly suitable for N:M sparse deployment scenarios requiring hardware acceleration
  2. Resource-Constrained Environments: Pursuing higher compression quality when computational resources are sufficient
  3. Research Prototypes: Provides technical foundation for further pruning and compression research

References

The paper cites 66 related references, primarily covering:

  • Foundational large language model work (GPT, LLaMA, etc.)
  • Classical network pruning methods (Magnitude Pruning, SparseGPT, etc.)
  • N:M sparsity-related research (RIA, SR-STE, etc.)
  • Optimization theory foundations (Sinkhorn normalization, Hungarian algorithm, etc.)

Overall Assessment: This is a high-quality paper with strong technical innovation, comprehensive experiments, and refined engineering implementation. By transforming discrete optimization problems into continuous optimization problems, it brings breakthrough progress to channel permutation techniques. Despite limitations in computational overhead and applicability scope, its contributions to large language model compression are significant, with important academic and practical value.