2025-11-20T05:49:14.768535

MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure

Kang, Yin
Low-Rank Adaptation (LoRA) is a widely adopted technique for parameter-efficient fine-tuning, but its slow convergence has spurred the development of numerous variants. Nevertheless, existing methods often fail to improve performance, memory footprint, and computational efficiency simultaneously. To address this challenge, we revisit the causes of LoRA's slow convergence. Building on these insights, we propose Matrix Shard Sharing (MiSS), which updates shards of the original weight matrix using a single shared trainable matrix $\boldsymbol{D}$, initialized to zeros. To simultaneously ensure computational efficiency, low memory footprint, and scalable serving, we introduce MiSS$^e$. Both theoretical analysis and empirical results demonstrate that our method reduces optimization complexity without compromising performance, thereby achieving a more favorable trade-off among performance, memory, and efficiency. Furthermore, we conduct a comprehensive comparative analysis of various PEFT methods, evaluating their memory usage, initialization overhead, and computational efficiency. By mapping the Pareto frontier across these dimensions, we show that MiSS occupies a favorable position, effectively capturing the advantages of prior approaches.

Basic Information

  • Paper ID: 2409.15371
  • Title: MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure
  • Authors: Jiale Kang (Yuanshi Inc), Qingyu Yin (Zhejiang University)
  • Classification: cs.CL cs.AI
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2409.15371v11

Abstract

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning technique, yet its slow convergence has motivated the development of numerous variants. However, existing methods often fail to simultaneously improve performance, memory consumption, and computational efficiency. To address this challenge, this paper revisits the fundamental causes of LoRA's slow convergence. Based on these insights, the authors propose the Matrix Shard Sharing (MiSS) method, which uses a single shared trainable matrix $\boldsymbol{D}$ (initialized to zero) to update shards of the original weight matrix. To simultaneously ensure computational efficiency, low memory consumption, and scalable deployment, the authors introduce MiSS$^e$. Both theoretical analysis and experimental results demonstrate that the method reduces optimization complexity without compromising performance, thereby achieving a more favorable trade-off among performance, memory, and efficiency.

Research Background and Motivation

Problem Definition

Full-parameter fine-tuning of large language models (LLMs) is computationally prohibitive, which motivates parameter-efficient fine-tuning (PEFT) techniques. LoRA, one of the most prominent PEFT methods, approximates the weight update through a low-rank decomposition: $\Delta W \approx BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$.
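To make the parameter saving concrete, here is a minimal NumPy sketch of the LoRA update under the definitions above. The dimensions, the random seed, and the scale of $A$ are illustrative assumptions rather than values from the paper; initializing $B$ to zero follows the usual LoRA convention.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the paper).
d, k, r = 512, 256, 8

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))         # frozen pre-trained weight
B = np.zeros((d, r))                     # usual LoRA init: B starts at zero
A = rng.standard_normal((r, k)) * 0.01   # A starts small and random

delta_W = B @ A                          # low-rank update, rank at most r
W = W0 + delta_W                         # equals W0 before any training step

full_params = d * k                      # entries updated by full fine-tuning
lora_params = d * r + r * k              # trainable entries in B and A
```

With these toy sizes, LoRA trains 6,144 values instead of 131,072, and both $A$ and $B$ must be optimized jointly.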

Limitations of Existing Methods

  1. Slow Convergence: LoRA converges significantly more slowly than full-parameter fine-tuning
  2. Optimization Complexity: Matrices A and B must be updated simultaneously, which makes optimization harder
  3. Difficult Trade-offs: Existing LoRA variants struggle to balance performance, memory, and efficiency simultaneously

Research Motivation

By analyzing methods such as S2FT and LoRA+, the authors identify that the key reason for LoRA's slow convergence is the need to optimize two matrices simultaneously. Based on the hypothesis that "training a single matrix can simplify optimization without sacrificing expressiveness," the authors propose the MiSS method.

Core Contributions

  1. Proposes MiSS Method: An efficient and adaptive structure with shard-sharing mechanism that achieves effective balance among three key attributes: performance, memory efficiency, and computational efficiency
  2. Theoretical and Experimental Validation: Validates MiSS's superiority across diverse datasets and model architectures through large-scale experiments
  3. Comprehensive PEFT Method Comparison: Provides integrated assessment of multiple PEFT methods in terms of memory usage, initialization overhead, and computational efficiency
  4. Pareto Frontier Analysis: Demonstrates MiSS's advantageous position by mapping the Pareto frontier across these dimensions

Method Details

Task Definition

Given a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the objective is to learn a parameter-efficient update $\Delta W$ such that the fine-tuned model performs well on downstream tasks while minimizing the number of trainable parameters and computational overhead.

Model Architecture

MiSS Basic Form

MiSS defines the weight update as a large matrix generated from a small trainable matrix $D$ through an expansion operation:

$$W = W_0 + \Delta W = W_0 + \text{expand}(D)$$

$$y = W_0 x + \text{expand}(D)\,x$$

where $D \in \mathbb{R}^{r_1 \times r_2}$ and $(r_1, r_2) \ll \min(d,k)$.

Expansion Mechanism

The output dimension $d$ is partitioned into $N$ shards with sizes $\{s_1, s_2, \ldots, s_N\}$, where $\sum_{i=1}^N s_i = d$. For each shard $i$, its update is determined by repeating the $i$-th row $D_i$ of $D$ a total of $s_i$ times:

$$(\text{expand}(D))^T = [(1_{s_1}D_1)^T \quad (1_{s_2}D_2)^T \quad \ldots \quad (1_{s_N}D_N)^T]$$

MiSS$^e$ Efficient Implementation

To avoid explicitly forming large matrices, MiSS$^e$ redefines $D \in \mathbb{R}^{r \times d}$ and partitions the input dimension $k$ into $r$ chunks, each of width $g$ (so that $rg = k$):

$$x = [x^{(1)}, x^{(2)}, \ldots, x^{(r)}], \quad x^{(i)} \in \mathbb{R}^{b \times l \times g}$$

$$S = \left[\sum_{j=1}^g x^{(1)}_{[:,:,j]}, \sum_{j=1}^g x^{(2)}_{[:,:,j]}, \ldots, \sum_{j=1}^g x^{(r)}_{[:,:,j]}\right] \in \mathbb{R}^{b \times l \times r}$$

$$\Delta W x = D^T S, \quad y = W_0 x + D^T S$$

Technical Innovations

  1. Single-Matrix Optimization: Compared to LoRA's requirement to optimize both matrices A and B simultaneously, MiSS only optimizes a single matrix D, reducing optimization complexity
  2. Shard-Sharing Mechanism: Achieves low-rank properties through repeated matrix structure while maintaining expressiveness
  3. Efficient Implementation: MiSS$^e$ avoids explicit storage of large matrices through block-level input aggregation, significantly reducing memory usage

Experimental Setup

Datasets

  1. Natural Language Understanding (NLU): GLUE benchmark subsets including MNLI, SST-2, CoLA, QNLI, MRPC
  2. Natural Language Generation (NLG):
    • Mathematical Tasks: MetaMathQA dataset (395k subset), evaluated on GSM8K and MATH
    • Code Tasks: CodeFeedback dataset (100k subset), evaluated on HumanEval and MBPP

Evaluation Metrics

  • NLU tasks: Accuracy
  • Mathematical tasks: Accuracy on GSM8K and MATH benchmarks
  • Code tasks: Pass rate on HumanEval and MBPP
  • Efficiency metrics: Training time, memory usage, initialization time

Comparison Methods

The baselines are PEFT methods including LoRA, PiSSA, DoRA, VeRA, AdaLoRA, ProLoRA, and MoS, among others.

Implementation Details

  • Optimizer: AdamW
  • Learning rate: 2e-5
  • Batch size: 64-128
  • Learning rate schedule: Cosine decay
  • MiSS rank settings: 16-128 (adjusted per task)

Experimental Results

Main Results

NLU Task Performance

On the GLUE benchmark with RoBERTa-base, MiSS performs particularly well on the CoLA dataset, achieving a score of 72.86, significantly outperforming LoRA (62.40) and PiSSA (67.28).

NLG Task Performance

Experimental results across multiple large language models show:

LLaMA2-7B:

  • GSM8K: MiSS(48.16) > PiSSA(43.89) > DoRA(42.93) > LoRA(40.75)
  • Math: MiSS(8.58) > PiSSA(6.92) > DoRA(6.51) > LoRA(5.22)
  • HumanEval: MiSS(23.63) > PiSSA(22.15) > DoRA(21.95) > LoRA(17.74)

Qwen3-4B:

  • Math: MiSS(34.82) substantially outperforms other methods, with PiSSA(26.00), DoRA(21.73), LoRA(15.20)

Gradient Norm Analysis

Initial gradient norm analysis validates MiSS's design philosophy. Experiments demonstrate that MiSS, like other improved LoRA variants, exhibits larger initial gradient norms compared to standard LoRA, which correlates with faster early convergence.

Efficiency Analysis

Complexity Comparison

Method      Space Complexity    Time Complexity
Full        O(dk)               O(bld(d+k))
LoRA        O(dr+rk)            O(blr(d+k))
MiSS        O(dr)               O(bldk)
MiSS$^e$    O(dr)               O(blr(d+k/r))
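As a rough worked example, the time-complexity expressions for LoRA and MiSS$^e$ listed above can be evaluated for one layer. The sizes below are hypothetical and constant factors are ignored, so the numbers indicate relative scaling only, not measured runtime.

```python
def lora_cost(b, l, d, k, r):
    """Leading-order term for the LoRA branch: b*l*r*(d + k)."""
    return b * l * r * (d + k)

def miss_e_cost(b, l, d, k, r):
    """Leading-order term for MiSS^e: b*l*r*(d + k/r) = b*l*(r*d + k)."""
    return b * l * (r * d + k)

# Hypothetical layer and batch sizes, chosen only for illustration.
b, l, d, k, r = 1, 1, 4096, 4096, 64
lora = lora_cost(b, l, d, k, r)       # 64 * 8192 = 524288
miss_e = miss_e_cost(b, l, d, k, r)   # 64 * 4096 + 4096 = 266240
```

For a square layer the MiSS$^e$ term is roughly half the LoRA term, because the input side costs $k$ additions instead of an $r \times k$ projection.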

Pareto Frontier Analysis

Comprehensive evaluation on LLaMA-3.2-3B demonstrates that MiSS occupies an optimal position in the performance-efficiency trade-off, achieving the best test accuracy (0.5080) while maintaining low memory usage and training time.
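For readers unfamiliar with the term, a method lies on the Pareto frontier when no other method is at least as good on every axis and strictly better on one. The sketch below shows one way to compute that set; the method names and measurements are invented placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the names of points that no other point dominates.

    Each point is (name, accuracy, memory, time): higher accuracy is better,
    lower memory and lower time are better.
    """
    def dominates(p, q):
        no_worse = p[1] >= q[1] and p[2] <= q[2] and p[3] <= q[3]
        better = p[1] > q[1] or p[2] < q[2] or p[3] < q[3]
        return no_worse and better

    return [q[0] for q in points
            if not any(dominates(p, q) for p in points if p is not q)]

# Hypothetical measurements, invented purely to exercise the function.
methods = [
    ("method_a", 0.50, 10.0, 1.0),
    ("method_b", 0.48, 12.0, 1.2),   # dominated by method_a on all axes
    ("method_c", 0.45, 8.0, 0.9),    # cheaper but less accurate
]
front = pareto_front(methods)
```

Here `method_a` and `method_c` form the frontier, while `method_b` drops out because `method_a` beats it on every axis.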

Ablation Studies

Rank Parameter Impact

Testing different rank values on LLaMA2-7B:

  • rank=16: GSM8K(45.90), Math(3.77), Parameters 21.7M
  • rank=32: GSM8K(46.18), Math(7.43), Parameters 43.5M
  • rank=64: GSM8K(48.16), Math(8.58), Parameters 87.0M
  • rank=128: GSM8K(53.49), Math(10.08), Parameters 174.0M

Results show monotonic performance improvement with increasing rank, with rank=64 providing a good performance-parameter trade-off.
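The reported parameter counts are consistent with one trainable matrix $D \in \mathbb{R}^{r \times d}$ per adapted linear layer, where $d$ is that layer's output dimension. The sketch below checks this under two assumptions the summary does not state: that all seven linear projections in each transformer block are adapted, and that the standard LLaMA2-7B shapes apply (hidden size 4096, MLP intermediate size 11008, 32 blocks).

```python
# Assumed LLaMA2-7B shapes; which modules are adapted is also an assumption.
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32

# Output dimension of each linear projection in one transformer block.
OUT_DIMS = {
    "q_proj": HIDDEN, "k_proj": HIDDEN, "v_proj": HIDDEN, "o_proj": HIDDEN,
    "gate_proj": INTERMEDIATE, "up_proj": INTERMEDIATE, "down_proj": HIDDEN,
}

def miss_trainable_params(rank):
    """Total entries across one (rank x d_out) matrix D per adapted layer."""
    per_block = rank * sum(OUT_DIMS.values())
    return per_block * NUM_LAYERS

counts = {r: miss_trainable_params(r) for r in (16, 32, 64, 128)}
```

Under these assumptions the totals are about 21.8M, 43.5M, 87.0M, and 174.1M, close to the figures listed above, and the count grows linearly with rank.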

Classification of LoRA Improvement Methods

  1. Adaptive Improvements: PiSSA, LoRA-GA, LoRA+, etc., primarily accelerating convergence through modified initialization strategies
  2. Efficiency Optimization: VeRA, ProLoRA, MoS, etc., focusing on reducing computational and memory overhead

Advantages Relative to Existing Methods

Compared to existing methods, MiSS improves efficiency while maintaining performance by optimizing a single matrix. It avoids the expensive initialization required by methods such as PiSSA and does not need the special optimizer required by methods such as LoRA-GA.

Conclusions and Discussion

Main Conclusions

  1. Single-Matrix Optimization: Demonstrates that single-matrix optimization reduces optimization complexity and accelerates convergence compared to dual-matrix optimization
  2. Effective Trade-offs: MiSS achieves better balance among performance, memory, and computational efficiency
  3. Broad Applicability: Demonstrates consistent superiority across diverse model architectures and task types

Limitations

  1. Depth of Theoretical Analysis: The paper provides a complexity analysis, but its theoretical explanation of why single-matrix optimization is more effective remains insufficient
  2. Hyperparameter Sensitivity: Optimal rank parameter selection may require additional tuning for different tasks and models
  3. Generality of Expansion Mechanism: Current shard expansion strategy may not be optimal and has room for improvement

Future Directions

  1. Theoretical Foundations: Deeper investigation of theoretical foundations for single-matrix optimization
  2. Adaptive Rank Selection: Development of methods for automatic optimal rank selection
  3. Multimodal Extensions: Extension of MiSS to multimodal tasks

In-Depth Evaluation

Strengths

  1. Strong Novelty: The proposed shard-sharing mechanism represents a novel and effective approach
  2. Comprehensive Experiments: Covers multiple models, datasets, and evaluation dimensions with well-designed experimental setup
  3. High Practical Value: Significantly improves efficiency while maintaining performance, demonstrating strong practical utility
  4. Thorough Analysis: Provides in-depth analysis from multiple perspectives including gradient norms, complexity, and Pareto frontiers

Weaknesses

  1. Theoretical Explanation: Theoretical explanations for why MiSS maintains expressiveness under single-matrix optimization remain insufficient
  2. Benchmark Comparisons: Lacks comparison with some recent PEFT methods
  3. Long-Sequence Performance: Insufficient testing on long-sequence tasks

Impact

  1. Academic Contribution: Provides new design perspectives for the PEFT field, potentially inspiring related research
  2. Practical Value: Simple and effective method, easy to implement and deploy
  3. Reproducibility: Provides detailed implementation details and open-source code

Applicable Scenarios

  1. Resource-Constrained Environments: Particularly suitable for scenarios with limited GPU memory
  2. Large-Scale Deployment: Due to its efficiency, suitable for applications requiring large-scale deployment
  3. Multi-Task Learning: Can serve as an efficient adapter in multi-task learning

References

The paper cites important PEFT methods including LoRA, PiSSA, DoRA, and standard evaluation benchmarks such as GSM8K and MATH, providing comprehensive background and comparison basis for related research.


Overall Assessment: This is a high-quality PEFT methodology paper. It proposes MiSS with some theoretical novelty, comprehensive experimental validation, and high practical value. Its main contribution is a better performance-efficiency trade-off achieved through single-matrix optimization, which opens new research directions for the PEFT field.