PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
Zou, Yin, Pei et al.
Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/PermLLM.
academic
PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models
Channel permutation is a powerful technique that improves the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the preservation of important weights. However, traditional channel permutation methods rely on hand-crafted quality metrics that often fail to accurately capture the true impact of pruning on model performance. To address this limitation, this paper proposes PermLLM, a post-training pruning framework for N:M sparsity that introduces Learnable Channel Permutation (LCP). LCP leverages Sinkhorn normalization to convert discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Furthermore, PermLLM employs an efficient block-wise channel permutation strategy that significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods and adaptively optimizes channel permutation to effectively mitigate pruning-induced errors.
Core Problem: Traditional channel permutation methods use hand-crafted quality metrics (such as the sum of retained weight importance) to evaluate permutation schemes, but these metrics diverge from actual pruning errors.
Significance: With the rapid growth in large language model scale, model compression techniques (such as pruning) are crucial for efficient deployment. N:M sparsity has gained considerable attention due to its hardware-friendly nature (supported by NVIDIA Sparse Tensor Cores).
Existing Limitations:
Hand-crafted quality metrics fail to accurately reflect the true impact of pruning on model performance
Traditional methods cannot adequately capture complex inter-layer interactions
The optimization space is enormous (Cin! possible permutations for Cin input channels)
The paper demonstrates the problem through a concrete example (Figure 1): channel permutations that maximize importance scores may lead to larger output errors, indicating a fundamental discrepancy between hand-crafted metrics and actual performance.
First Introduction of Learnable Channel Permutation (LCP): Transforms the discrete channel permutation problem into a differentiable optimization problem, enabling end-to-end learning.
Sinkhorn Normalization Technique: Leverages Sinkhorn normalization to relax discrete permutation matrices into soft permutation matrices, addressing the non-differentiability of permutation matrices.
Block-wise Channel Permutation Strategy: Significantly reduces parameter complexity from O(C²ᵢₙ) to O(Cᵢₙ×B) and computational complexity from O(C³ᵢₙ) to O(Cᵢₙ×B²).
Given a pre-trained weight matrix W ∈ R^(Cout×Cin), the objective is to find the optimal permutation matrix P such that the reordered weight matrix Ŵ = WP minimizes the output difference from the original dense model after applying N:M sparsity.
where Tr and Tc denote row and column normalization operations respectively, and τ is a temperature parameter controlling the hardness of the soft permutation matrix.
To reduce computational complexity, channels are partitioned into multiple blocks of size B, with permutation performed independently within each block:
PB = diag(P₁, P₂, ..., PNB)
ŴB = WPB
Parameter count is reduced from C²ᵢₙ to Cᵢₙ×B, and computational complexity from O(C³ᵢₙ) to O(Cᵢₙ×B²).
PermLLM can be integrated with any importance-metric-based one-shot pruning method. For a given importance matrix S, the permuted importance matrix is Ŝ = SPB, and the mask is obtained as:
argmax M ∑∑ (M⊙Ŝ)i,kM:(k+1)M
STE is used to handle the non-differentiability of argmax.
Using custom CUDA kernels, the channel permutation operation achieves 84× speedup compared to PyTorch implementation, with overall inference speed improvement of approximately 1.67×.
The paper provides mask visualizations (Figure 3), showing that the permutations learned by PermLLM produce different weight retention patterns compared to traditional methods, validating the effectiveness of end-to-end optimization.
Computational Overhead: Despite the block-wise strategy significantly reducing complexity, the method still requires more computational resources compared to traditional approaches
Application Scope: The method is specifically designed for semi-structured pruning; its application to other compression tasks (such as quantization) remains to be explored
Convergence: Larger block sizes require more iterations to converge
Optimization theory foundations (Sinkhorn normalization, Hungarian algorithm, etc.)
Overall Assessment: This is a high-quality paper with strong technical innovation, comprehensive experiments, and refined engineering implementation. By transforming discrete optimization problems into continuous optimization problems, it brings breakthrough progress to channel permutation techniques. Despite limitations in computational overhead and applicability scope, its contributions to large language model compression are significant, with important academic and practical value.