
MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure

Kang, Yin

Low-Rank Adaptation (LoRA) is a widely adopted technique for parameter-efficient fine-tuning, but its slow convergence has spurred the development of numerous variants. Nevertheless, existing methods often fail to improve performance, memory footprint, and computational efficiency simultaneously. To address this challenge, we revisit the causes of LoRA's slow convergence. Building on these insights, we propose Matrix Shard Sharing (MiSS), which updates shards of the original weight matrix using a single shared trainable matrix $\boldsymbol{D}$, initialized to zeros. To simultaneously ensure computational efficiency, low memory footprint, and scalable serving, we introduce MiSS$^e$. Both theoretical analysis and empirical results demonstrate that our method reduces optimization complexity without compromising performance, thereby achieving a more favorable trade-off among performance, memory, and efficiency. Furthermore, we conduct a comprehensive comparative analysis of various PEFT methods, evaluating their memory usage, initialization overhead, and computational efficiency. By mapping the Pareto frontier across these dimensions, we show that MiSS occupies a favorable position, effectively capturing the advantages of prior approaches.


Basic Information

Paper ID: 2409.15371
Title: MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure
Authors: Jiale Kang (Yuanshi Inc), Qingyu Yin (Zhejiang University)
Classification: cs.CL cs.AI
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2409.15371v11

Abstract

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning technique, yet its slow convergence has motivated the development of numerous variants. However, existing methods often fail to simultaneously improve performance, memory consumption, and computational efficiency. To address this challenge, this paper revisits the fundamental causes of LoRA's slow convergence. Based on these insights, the authors propose the Matrix Shard Sharing (MiSS) method, which uses a single shared trainable matrix $\boldsymbol{D}$ (initialized to zero) to update shards of the original weight matrix. To simultaneously ensure computational efficiency, low memory consumption, and scalable deployment, the authors introduce MiSS$^e$. Both theoretical analysis and experimental results demonstrate that the method reduces optimization complexity without compromising performance, thereby achieving a more favorable trade-off among performance, memory, and efficiency.

Research Background and Motivation

Problem Definition

Full-parameter fine-tuning of large language models (LLMs) is computationally prohibitive, which motivates parameter-efficient fine-tuning (PEFT) techniques. LoRA, one of the most prominent PEFT methods, approximates weight updates through a low-rank decomposition: $\Delta W \approx BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$.
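As a minimal sketch of the decomposition above, the NumPy snippet below builds a LoRA-style update and compares trainable parameter counts. The layer sizes, the helper name `lora_delta`, and the random seed are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lora_delta(B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Return the low-rank update Delta W = B @ A, with shape (d, k)."""
    return B @ A

# Illustrative sizes only (assumed, not from the paper).
d, k, r = 64, 48, 4
rng = np.random.default_rng(0)
B = np.zeros((d, r))              # standard LoRA initialization: B starts at zero
A = rng.normal(size=(r, k))       # A starts random
delta_w = lora_delta(B, A)        # all zeros at initialization, so W = W_0

full_params = d * k               # parameters updated by full fine-tuning
lora_params = d * r + r * k       # trainable parameters in B and A
```

With these toy sizes the adapter trains `d*r + r*k` parameters instead of `d*k`, and the gap widens as `d` and `k` grow while `r` stays small.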

Limitations of Existing Methods

Slow Convergence: LoRA converges significantly more slowly than full-parameter fine-tuning
Optimization Complexity: LoRA must update matrices A and B simultaneously, which increases optimization complexity
Difficult Trade-offs: Existing LoRA variants struggle to improve performance, memory, and efficiency at the same time

Research Motivation

By analyzing methods such as S2FT and LoRA+, the authors identify that the key reason for LoRA's slow convergence is the need to optimize two matrices simultaneously. Based on the hypothesis that "training a single matrix can simplify optimization without sacrificing expressiveness," the authors propose the MiSS method.

Core Contributions

Proposes the MiSS Method: An efficient, adaptive structure with a shard-sharing mechanism that balances three key attributes: performance, memory efficiency, and computational efficiency
Theoretical and Experimental Validation: Validates MiSS's superiority across diverse datasets and model architectures through large-scale experiments
Comprehensive PEFT Method Comparison: Provides integrated assessment of multiple PEFT methods in terms of memory usage, initialization overhead, and computational efficiency
Pareto Frontier Analysis: Demonstrates MiSS's advantageous position by mapping the Pareto frontier across these dimensions

Method Details

Task Definition

Given a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the objective is to learn a parameter-efficient update $\Delta W$ such that the fine-tuned model performs well on downstream tasks while minimizing the number of trainable parameters and computational overhead.

Model Architecture

MiSS Basic Form

MiSS defines the weight update as a large matrix generated from a small trainable matrix $D$ through an expansion operation:

$W = W_0 + \Delta W = W_0 + \text{expand}(D)$

$y = W_0x + \text{expand}(D)x$

where $D \in \mathbb{R}^{r_1 \times r_2}$ and $(r_1, r_2) \ll \min(d,k)$.

Expansion Mechanism

The output dimension $d$ is partitioned into $N$ shards with sizes $\{s_1, s_2, \ldots, s_N\}$, where $\sum_{i=1}^N s_i = d$. For each shard $i$, its update is determined by repeating the $i$-th row $D_i$ of $D$ a total of $s_i$ times:

$(\text{expand}(D))^T = [(1_{s_1}D_1)^T \quad (1_{s_2}D_2)^T \quad \ldots \quad (1_{s_N}D_N)^T]$
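The expansion can be sketched in a few lines of NumPy. Here $1_{s_i}$ is read as an all-ones column vector of length $s_i$, so $1_{s_i}D_i$ stacks $s_i$ copies of row $D_i$. The function name `expand`, the assumption that $D$ has one row per shard, and the toy values are illustrative only.

```python
from typing import Sequence

import numpy as np

def expand(D: np.ndarray, shard_sizes: Sequence[int]) -> np.ndarray:
    """Repeat row i of D shard_sizes[i] times and stack the results vertically."""
    if len(shard_sizes) != D.shape[0]:
        raise ValueError("need exactly one shard size per row of D")
    return np.repeat(D, shard_sizes, axis=0)

# Toy example: N = 2 shards of sizes 2 and 3, so the output dimension is d = 5.
D = np.array([[1.0, 2.0],
              [3.0, 4.0]])
delta_w = expand(D, [2, 3])   # shape (5, 2): two copies of row 0, three of row 1
```

Because every row of the expanded matrix is a copy of some row of `D`, its rank cannot exceed the number of shards, which is how the structure stays low-rank while training only `D`.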

MiSS$^e$ Efficient Implementation

To avoid explicitly forming large matrices, MiSS$^e$ redefines $D \in \mathbb{R}^{r \times d}$ and partitions the input dimension $k$ into $r$ chunks of width $g = k/r$, where $b$ is the batch size and $l$ the sequence length:

$x = [x^{(1)}, x^{(2)}, \ldots, x^{(r)}], \quad x^{(i)} \in \mathbb{R}^{b \times l \times g}$

$S = \left[\sum_{j=1}^g x^{(1)}_{[:,:,j]}, \sum_{j=1}^g x^{(2)}_{[:,:,j]}, \ldots, \sum_{j=1}^g x^{(r)}_{[:,:,j]}\right] \in \mathbb{R}^{b \times l \times r}$

$\Delta Wx = D^T S, \quad y = W_0x + D^T S$
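A hedged NumPy sketch of this computation follows. It assumes a batch-first input of shape $(b, l, k)$, that $r$ divides $k$, and it writes the batched form of $D^T S$ as `S @ D`. The function name `miss_e_update` and the explicit reference matrix used for the check are derived from the equations above rather than taken from the authors' code.

```python
import numpy as np

def miss_e_update(x: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Compute the MiSS^e update term for x of shape (b, l, k) and D of shape (r, d)."""
    b, l, k = x.shape
    r, d = D.shape
    if k % r != 0:
        raise ValueError("this sketch assumes r divides k")
    g = k // r
    # Sum each chunk of g consecutive input features: (b, l, r, g) -> (b, l, r).
    S = x.reshape(b, l, r, g).sum(axis=-1)
    # Batched form of D^T S: (b, l, r) @ (r, d) -> (b, l, d).
    return S @ D

rng = np.random.default_rng(0)
b, l, k, d, r = 2, 3, 8, 5, 4     # illustrative sizes (assumed)
x = rng.normal(size=(b, l, k))
D = rng.normal(size=(r, d))

fast = miss_e_update(x, D)
# Reference: the explicit (d, k) update whose columns in input chunk i all equal D[i].
explicit_delta_w = np.repeat(D, k // r, axis=0).T
slow = x @ explicit_delta_w.T
```

The two results agree, but the fast path never materializes the $d \times k$ matrix: it stores only `D` and the small aggregate `S`.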

Technical Innovations

Single-Matrix Optimization: Compared to LoRA's requirement to optimize both matrices A and B simultaneously, MiSS only optimizes a single matrix D, reducing optimization complexity
Shard-Sharing Mechanism: Achieves low-rank properties through repeated matrix structure while maintaining expressiveness
Efficient Implementation: MiSS$^e$ avoids explicit storage of large matrices through block-level input aggregation, significantly reducing memory usage

Experimental Setup

Datasets

Natural Language Understanding (NLU): GLUE benchmark subsets including MNLI, SST-2, CoLA, QNLI, MRPC
Natural Language Generation (NLG):
- Mathematical Tasks: MetaMathQA dataset (395k subset), evaluated on GSM8K and MATH
- Code Tasks: CodeFeedback dataset (100k subset), evaluated on HumanEval and MBPP

Evaluation Metrics

NLU tasks: Accuracy
Mathematical tasks: Accuracy on GSM8K and MATH benchmarks
Code tasks: Pass rate on HumanEval and MBPP
Efficiency metrics: Training time, memory usage, initialization time

Comparison Methods

Multiple PEFT methods including LoRA, PiSSA, DoRA, VeRA, AdaLoRA, ProLoRA, MoS, etc.

Implementation Details

Optimizer: AdamW
Learning rate: 2e-5
Batch size: 64-128
Learning rate schedule: Cosine decay
MiSS rank settings: 16-128 (adjusted per task)

Experimental Results

Main Results

NLU Task Performance

On the GLUE benchmark with RoBERTa-base, MiSS performs particularly well on the CoLA dataset, achieving a score of 72.86, significantly outperforming LoRA (62.40) and PiSSA (67.28).

NLG Task Performance

Experimental results across multiple large language models show:

LLaMA2-7B:

GSM8K: MiSS(48.16) > PiSSA(43.89) > DoRA(42.93) > LoRA(40.75)
MATH: MiSS(8.58) > PiSSA(6.92) > DoRA(6.51) > LoRA(5.22)
HumanEval: MiSS(23.63) > PiSSA(22.15) > DoRA(21.95) > LoRA(17.74)

Qwen3-4B:

MATH: MiSS(34.82) substantially outperforms the other methods: PiSSA(26.00), DoRA(21.73), LoRA(15.20)

Gradient Norm Analysis

Initial gradient norm analysis validates MiSS's design philosophy. Experiments demonstrate that MiSS, like other improved LoRA variants, exhibits larger initial gradient norms compared to standard LoRA, which correlates with faster early convergence.
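One way to build intuition for this observation is to write out the initial gradients for a single linear layer. The sketch below is an illustration under stated assumptions (a squared-error loss, random data, the standard LoRA initialization with $B = 0$, and the MiSS$^e$ form with $D = 0$); it is not a reproduction of the paper's experiment. In LoRA the gradient with respect to $A$ is exactly zero at initialization because it is multiplied by $B^T$, whereas the single matrix $D$ receives a nonzero gradient from the first step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 6, 8, 2, 5           # illustrative sizes (assumed)
x = rng.normal(size=(n, k))       # n input tokens
target = rng.normal(size=(n, d))
W0 = rng.normal(size=(d, k))

# At initialization both adapters contribute nothing, so the output is W0 x.
y = x @ W0.T
G = y - target                    # dL/dy for L = 0.5 * ||y - target||^2

# LoRA: y = x (W0 + B A)^T with B = 0 and A random.
B = np.zeros((d, r))
A = rng.normal(size=(r, k))
grad_B = G.T @ (x @ A.T)          # dL/dB = G^T (x A^T), shape (d, r)
grad_A = (G @ B).T @ x            # dL/dA = B^T G^T x, shape (r, k); zero since B = 0

# MiSS^e: y = x W0^T + S D with D = 0, where S sums chunks of the input.
S = x.reshape(n, r, k // r).sum(axis=-1)   # shape (n, r)
grad_D = S.T @ G                  # dL/dD = S^T G, shape (r, d)
```

The snippet only shows which gradients vanish at initialization; it makes no claim about the relative magnitudes reported in the paper.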

Efficiency Analysis

Complexity Comparison

Method	Space Complexity	Time Complexity
Full	O(dk)	O(bld(d+k))
LoRA	O(dr+rk)	O(blr(d+k))
MiSS	O(dr)	O(bldk)
MiSS$^e$	O(dr)	O(blr(d+k/r))
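The LoRA and MiSS$^e$ rows can be sanity-checked by counting operations per token and multiplying by the $b \cdot l$ tokens in a batch. The helper names and the layer sizes below are illustrative assumptions; the counts follow directly from the shapes of $A$, $B$, and $D$ given earlier.

```python
def lora_adapter_ops(d: int, k: int, r: int) -> int:
    """Multiply-adds per token for B (A x): r*k for A x, then d*r for B (A x)."""
    return r * k + d * r

def miss_e_adapter_ops(d: int, k: int, r: int) -> int:
    """Operations per token for D^T S: k additions to build S, then r*d for D^T S."""
    return k + r * d

def trainable_params(d: int, k: int, r: int) -> dict:
    """Trainable parameters per layer for LoRA (B and A) and MiSS^e (D only)."""
    return {"lora": d * r + r * k, "miss_e": r * d}

# Illustrative square layer (assumed sizes).
d = k = 4096
r = 64
lora_ops = lora_adapter_ops(d, k, r)       # r * (d + k)
miss_e_ops = miss_e_adapter_ops(d, k, r)   # r * (d + k / r)
params = trainable_params(d, k, r)
```

For a square layer at equal rank, the MiSS$^e$ adapter needs roughly half the per-token operations and half the trainable parameters of LoRA under this count.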

Pareto Frontier Analysis

Comprehensive evaluation on LLaMA-3.2-3B demonstrates that MiSS occupies an optimal position in the performance-efficiency trade-off, achieving the best test accuracy (0.5080) while maintaining low memory usage and training time.
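For readers unfamiliar with the term, a method lies on the Pareto frontier if no other method is at least as good on every axis and strictly better on at least one. The sketch below shows that test for three axes (accuracy, memory, time). The function names are hypothetical and the numbers are made-up placeholders, not measurements from the paper.

```python
from typing import Dict, List, Tuple

# (accuracy: higher is better, memory: lower is better, time: lower is better)
Point = Tuple[float, float, float]

def dominates(p: Point, q: Point) -> bool:
    """True if p is at least as good as q on every axis and strictly better on one."""
    at_least_as_good = p[0] >= q[0] and p[1] <= q[1] and p[2] <= q[2]
    strictly_better = p[0] > q[0] or p[1] < q[1] or p[2] < q[2]
    return at_least_as_good and strictly_better

def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Return the names of points that no other point dominates."""
    return [
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]

# Made-up placeholder values, NOT measurements from the paper.
toy = {
    "method_a": (0.50, 10.0, 1.0),
    "method_b": (0.45, 12.0, 1.5),   # dominated by method_a on all three axes
    "method_c": (0.48, 8.0, 1.2),    # uses less memory than method_a
}
front = pareto_front(toy)
```

In this toy set `method_b` drops out, while `method_a` and `method_c` both remain because each wins on at least one axis.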

Ablation Studies

Rank Parameter Impact

Testing different rank values on LLaMA2-7B:

rank=16: GSM8K(45.90), MATH(3.77), Parameters 21.7M
rank=32: GSM8K(46.18), MATH(7.43), Parameters 43.5M
rank=64: GSM8K(48.16), MATH(8.58), Parameters 87.0M
rank=128: GSM8K(53.49), MATH(10.08), Parameters 174.0M

Results show monotonic performance improvement with increasing rank, with rank=64 providing a good performance-parameter trade-off.
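The parameter counts above scale linearly with rank, and they can be reproduced with a short calculation. The sketch assumes that a matrix $D \in \mathbb{R}^{r \times d_{out}}$ is attached to all seven linear projections in each of the 32 LLaMA2-7B decoder blocks (hidden size 4096, MLP intermediate size 11008). This summary does not state which modules are adapted, so the module list is an assumption that happens to match the reported figures; the function name is hypothetical.

```python
def miss_trainable_params(rank: int,
                          hidden: int = 4096,
                          intermediate: int = 11008,
                          num_layers: int = 32) -> int:
    """Count trainable parameters if every linear projection gets D of shape (rank, d_out)."""
    # Output widths of the seven projections in one decoder block:
    # q, k, v, o, gate, up, down (assumed set of adapted modules).
    out_dims = [hidden, hidden, hidden, hidden, intermediate, intermediate, hidden]
    return num_layers * rank * sum(out_dims)

counts = {r: miss_trainable_params(r) for r in (16, 32, 64, 128)}
counts_in_millions = {r: n / 1e6 for r, n in counts.items()}
```

Under these assumptions the totals come to about 21.76M, 43.52M, 87.03M, and 174.06M, in line with the figures listed for ranks 16 through 128.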

Classification of LoRA Improvement Methods

Adaptive Improvements: PiSSA, LoRA-GA, LoRA+, etc., primarily accelerating convergence through modified initialization strategies
Efficiency Optimization: VeRA, ProLoRA, MoS, etc., focusing on reducing computational and memory overhead

Advantages Relative to Existing Methods

Compared to existing methods, MiSS improves efficiency while maintaining performance through a single-matrix optimization strategy. It avoids the expensive initialization required by methods like PiSSA and the special optimizer requirements of methods like LoRA-GA.

Conclusions and Discussion

Main Conclusions

Single-Matrix Optimization: Demonstrates that single-matrix optimization reduces optimization complexity and accelerates convergence compared to dual-matrix optimization
Effective Trade-offs: MiSS achieves better balance among performance, memory, and computational efficiency
Broad Applicability: Demonstrates consistent superiority across diverse model architectures and task types

Limitations

Depth of Theoretical Analysis: While providing complexity analysis, theoretical explanations for why single-matrix optimization is more effective remain insufficient
Hyperparameter Sensitivity: Optimal rank parameter selection may require additional tuning for different tasks and models
Generality of Expansion Mechanism: Current shard expansion strategy may not be optimal and has room for improvement

Future Directions

Theoretical Foundations: Deeper investigation of theoretical foundations for single-matrix optimization
Adaptive Rank Selection: Development of methods for automatic optimal rank selection
Multimodal Extensions: Extension of MiSS to multimodal tasks

In-Depth Evaluation

Strengths

Strong Novelty: The proposed shard-sharing mechanism represents a novel and effective approach
Comprehensive Experiments: Covers multiple models, datasets, and evaluation dimensions with well-designed experimental setup
High Practical Value: Significantly improves efficiency while maintaining performance, demonstrating strong practical utility
Thorough Analysis: Provides in-depth analysis from multiple perspectives including gradient norms, complexity, and Pareto frontiers

Weaknesses

Theoretical Explanation: Theoretical explanations for why MiSS maintains expressiveness under single-matrix optimization remain insufficient
Benchmark Comparisons: Lacks comparison with some recent PEFT methods
Long-Sequence Performance: Insufficient testing on long-sequence tasks

Impact

Academic Contribution: Provides new design perspectives for the PEFT field, potentially inspiring related research
Practical Value: Simple and effective method, easy to implement and deploy
Reproducibility: Provides detailed implementation details and open-source code

Applicable Scenarios

Resource-Constrained Environments: Particularly suitable for scenarios with limited GPU memory
Large-Scale Deployment: Due to its efficiency, suitable for applications requiring large-scale deployment
Multi-Task Learning: Can serve as an efficient adapter in multi-task learning

References

The paper cites important PEFT methods including LoRA, PiSSA, DoRA, and standard evaluation benchmarks such as GSM8K and MATH, providing comprehensive background and comparison basis for related research.

Overall Assessment: This is a high-quality PEFT methodology paper. It proposes MiSS, which combines a degree of theoretical novelty with comprehensive experimental validation and high practical value. Its main contribution is a better performance-efficiency trade-off achieved through single-matrix optimization, which opens new research directions for the PEFT field.