SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

Zhou, Wang, Xu

In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of updates to preserve optimizer state consistency, hindering its ability to closely approximate full-rank training behavior. Meanwhile, GaLore relies on Singular Value Decomposition (SVD) to approximate the full-rank space, which introduces accuracy loss during the approximation process. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, thereby enhancing accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA actually surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model, while also cutting communication overhead by 54% and memory usage by 13%. Furthermore, after full fine-tuning the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model.

Basic Information

Paper ID: 2406.06564v3
Title: SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information
Authors: Kaiye Zhou, Shucheng Wang, Jun Xu (China Mobile (Suzhou) Software Technology Co. Ltd.)
Classification: cs.LG, cs.AI, cs.CL
Publication Date: January 2, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2406.06564v3

Abstract

This paper proposes SwitchLoRA to address the challenges of parameter-efficient training during the pretraining phase of large language models. While traditional low-rank adaptation techniques such as LoRA excel in the fine-tuning stage, their direct application to pretraining results in significant performance degradation. Existing methods like ReLoRA and GaLore attempt to address this by updating low-rank subspaces but still fail to achieve full-rank training accuracy. SwitchLoRA incrementally updates the low-rank subspace by frequently and smoothly replacing trainable parameters in LoRA adapters, updating only a small number of dimensions at each step to minimize impact on optimizer states. Experimental results demonstrate that SwitchLoRA reduces perplexity on the LLaMA 1.3B model from 15.23 to 15.01, surpassing full-rank training while reducing communication overhead by 54% and memory usage by 13%.

Research Background and Motivation

Core Problem

With the rise of the Transformer architecture, the scale of large language models has grown dramatically, and distributed training of trillion-parameter models faces enormous inter-node communication overhead. While parameter-efficient techniques such as LoRA demonstrate excellent performance during fine-tuning, their direct application to the pretraining stage results in significant performance degradation.

Limitations of Existing Methods

ReLoRA: To maintain optimizer state consistency, it restricts update frequency, failing to sufficiently approximate full-rank training behavior
GaLore: Relies on SVD to approximate full-rank space, introducing precision loss during approximation

Research Motivation

Neural networks exhibit full-rank characteristics in early training stages, with internal rank gradually decreasing as training progresses. Therefore, a method is needed that can train large numbers of parameters during pretraining while selectively updating a subset of parameters to reduce memory usage and communication overhead.

Core Contributions

Proposes SwitchLoRA method: Frequently and smoothly adjusts trainable parameters in LoRA matrices, reducing memory usage and communication overhead while maintaining full-rank training accuracy
Optimizer state management strategy: Designs optimizer state reset and temporary freezing mechanisms during parameter switching to reduce the impact of state inconsistency
Improved initialization rules: Provides new initialization strategies for LoRA adapter parameters and their candidate vectors, improving training efficiency
Comprehensive experimental validation: Validates the method's effectiveness across LLaMA models of various scales and verifies inference capabilities through GLUE benchmark tests

Method Details

Task Definition

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ of a pretrained model, traditional LoRA transforms it to $W + \frac{\alpha}{r}BA$ , where $B \in \mathbb{R}^{m \times r}$ , $A \in \mathbb{R}^{r \times n}$ , and $r \ll \min(m,n)$ . SwitchLoRA dynamically switches vectors in B and A to increase effective rank.
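The LoRA parameterization above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the function name `lora_forward`, the toy shapes, and the value of `alpha` are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, B, A, alpha):
    """Apply the adapted weight W + (alpha / r) * B @ A to a batch x of shape (batch, n)."""
    r = A.shape[0]
    # Multiply by A first, then B, so the full m x n product B @ A is never formed.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
m, n, r, alpha = 6, 4, 2, 4.0          # toy sizes (assumed, not from the paper)
W = rng.normal(size=(m, n))
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
x = rng.normal(size=(3, n))

y = lora_forward(x, W, B, A, alpha)

# Parameter counts: the adapter holds r * (m + n) values instead of m * n.
full_params = m * n
lora_params = r * (m + n)
```

The saving only becomes meaningful when r is much smaller than min(m, n); with the toy sizes above the two counts are nearly equal.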

Model Architecture

Core Switching Mechanism

Vector Decomposition: Decomposes matrix B into column vectors $b_k \in \mathbb{R}^{m \times 1}$ and matrix A into row vectors $a_k^T \in \mathbb{R}^{1 \times n}$
Candidate Vector Sets: Maintains candidate vector sets $C(B)$ and $C(A^T)$, each containing $\min(m,n)$ vectors
Dynamic Replacement: During training steps, replaces $b_k$ and $a_k$ with candidate vectors $b_k' \in C(B)$ and $a_k' \in C(A^T)$

Weight Adjustment Strategy

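When vectors are replaced, the corresponding weight matrix is adjusted: $W \leftarrow W + \frac{\alpha}{r}\left(b_k a_k^T - b_k' a_k'^T\right)$, so that the effective weight $W + \frac{\alpha}{r}BA$, and hence the model output, is unchanged at the moment of the switch.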
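A small numerical check makes the purpose of the weight adjustment concrete: the rank-one term being swapped out is folded into W, so the effective weight is identical before and after the switch. The sketch below is illustrative only. The helper `switch_pair` is a hypothetical name, the candidates are drawn at random instead of from managed candidate sets, and the compensation is multiplied by a `scale` argument set to α/r to match the LoRA definition, which is an assumption of this example.

```python
import numpy as np

def switch_pair(W, B, A, k, new_b, new_a, scale):
    """Replace column k of B and row k of A, compensating W for the change."""
    old_term = np.outer(B[:, k], A[k, :])
    new_term = np.outer(new_b, new_a)
    # Fold the difference into W so that W + scale * B @ A stays the same.
    W_new = W + scale * (old_term - new_term)
    B_new, A_new = B.copy(), A.copy()
    B_new[:, k] = new_b
    A_new[k, :] = new_a
    return W_new, B_new, A_new

rng = np.random.default_rng(0)
m, n, r, alpha = 6, 4, 2, 4.0          # toy sizes (assumed)
scale = alpha / r
W = rng.normal(size=(m, n))
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))

before = W + scale * (B @ A)
W2, B2, A2 = switch_pair(
    W, B, A, k=0,
    new_b=rng.normal(size=m),          # stand-in for a vector from C(B)
    new_a=rng.normal(size=n),          # stand-in for a vector from C(A^T)
    scale=scale,
)
after = W2 + scale * (B2 @ A2)
```

Because `before` and `after` agree, the switch changes which directions are trainable without causing a jump in the loss.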

Switching Frequency Design

Employs an exponential decay function, $\text{frequency} = C e^{-\theta \cdot \text{step}}$, where $C$ is the initial frequency and $\theta$ is the decay rate, reflecting the natural evolution of the model from full-rank to low-rank.
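The decay schedule is simple enough to tabulate directly. In the sketch below the values of `c` and `theta` are hypothetical placeholders, not the paper's settings, and the half-life helper is just the standard property of exponential decay.

```python
import math

def switch_frequency(step, c, theta):
    """Exponentially decaying switching frequency: c * exp(-theta * step)."""
    return c * math.exp(-theta * step)

def half_life(theta):
    """Number of steps after which the frequency has halved."""
    return math.log(2) / theta

c, theta = 0.5, 1e-4                   # hypothetical values for illustration
schedule = {step: switch_frequency(step, c, theta) for step in (0, 10_000, 40_000)}
steps_to_halve = half_life(theta)
```

With these placeholder values the frequency halves roughly every 6,900 steps, so most switching happens early in training, when the summary describes the network as closest to full-rank.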

Technical Innovations

1. Minimizing Optimizer State Impact

When $a_k$ is switched, reset the optimizer state of $b_k$
When $b_k$ is switched, reset the optimizer state of $a_k$
Temporarily freeze corresponding parameters for N steps after reset (N=5)
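The reset-and-freeze rule can be sketched with a minimal per-vector Adam. This is an illustration under stated assumptions, not the authors' code: the class name `VectorAdam` is hypothetical, the learning rate is a placeholder, and the sketch assumes that a frozen vector skips both the parameter update and the moment accumulation, a detail this summary does not specify.

```python
import numpy as np

class VectorAdam:
    """Minimal Adam for one vector, with state reset and temporary freezing."""

    def __init__(self, size, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(size)        # first-moment estimate
        self.v = np.zeros(size)        # second-moment estimate
        self.t = 0                     # step counter used for bias correction
        self.frozen_steps = 0

    def reset(self, freeze_for):
        """Clear the Adam state and freeze the vector for `freeze_for` steps."""
        self.m[:] = 0.0
        self.v[:] = 0.0
        self.t = 0
        self.frozen_steps = freeze_for

    def step(self, param, grad):
        if self.frozen_steps > 0:
            # Assumption: while frozen, neither the parameter nor the moments change.
            self.frozen_steps -= 1
            return param
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

opt_a = VectorAdam(size=4)
grad = np.full(4, 0.5)
after_first = opt_a.step(np.ones(4), grad)      # ordinary Adam update of a_k

opt_a.reset(freeze_for=5)                        # b_k was switched: reset a_k's state
during_freeze = [opt_a.step(after_first, grad) for _ in range(5)]
after_freeze = opt_a.step(after_first, grad)     # updates resume on the sixth call
```

Resetting the counterpart's state avoids applying moments that were accumulated against the old vector, and the short freeze keeps the counterpart fixed while the newly switched vector starts training.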

2. Improved Initialization Strategy

Based on Xavier and Kaiming initialization principles, designs new standard deviations: $std[B] = std[b] = \left(\frac{r}{\sqrt{mn}}\right)^{\frac{1}{4}} gain^{\frac{1}{2}}$ and $std[A] = std[a] = \left(\frac{\sqrt{mr}}{\sqrt{nn}}\right)^{\frac{1}{4}} gain^{\frac{1}{2}}$

3. Memory Optimization

Offloads spare candidate vectors to CPU, using non-blocking transfers to parallelize the switching process.

Experimental Setup

Datasets

Pretraining: C4 dataset, using the first 46M training samples and complete validation set
Evaluation: Computes validation loss on 10M tokens every 1,000 steps
Fine-tuning: Multiple tasks from GLUE benchmark

Model Configuration

Experiments cover LLaMA models of various scales:

130M (768 dimensions, 12 heads, 12 layers)
250M (768 dimensions, 16 heads, 24 layers)
350M (1024 dimensions, 16 heads, 24 layers)
1.3B (2048 dimensions, 32 heads, 24 layers)

Evaluation Metrics

Pretraining: Perplexity
Fine-tuning: Accuracy, Pearson correlation coefficient, Matthews correlation coefficient
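For readers less familiar with the pretraining metric, perplexity is the exponential of the mean per-token negative log-likelihood. The short sketch below uses made-up loss values purely to show the relationship; the function name `perplexity` is an assumption of the example.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model whose average loss is log(15.01) nats per token has perplexity 15.01.
example = perplexity([math.log(15.01)] * 4)

# The reported gap between 15.23 and 15.01 corresponds to this loss difference.
loss_gap = math.log(15.23) - math.log(15.01)
```

The loss gap works out to roughly 0.015 nats per token, which helps put perplexity differences of this size in context.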

Comparison Methods

Full-rank training
Standard LoRA
ReLoRA
GaLore

Implementation Details

Optimizer: Adam (β₁=0.9, β₂=0.999)
Learning rate schedule: Cosine annealing with 100-step warmup
Total training steps: 40,000
Hardware: 8×NVIDIA A800 80GB PCIe GPUs
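The learning rate schedule listed above can be sketched as follows. The warmup length and total step count come from the list; the linear warmup shape, the peak learning rate, and the floor `min_lr` are assumptions, since this summary does not state them.

```python
import math

def lr_at(step, peak_lr, warmup_steps=100, total_steps=40_000, min_lr=0.0):
    """Linear warmup followed by cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Assumed linear ramp reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

peak = 1e-3                            # hypothetical peak learning rate
samples = {s: lr_at(s, peak) for s in (0, 99, 100, 20_050, 40_000)}
```

With a 100-step warmup out of 40,000 steps, almost the entire run is spent on the cosine portion of the curve.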

Experimental Results

Main Results

Pretraining Performance Comparison

Perplexity results on the 1.3B model:

Full-rank: 15.23
SwitchLoRA (rank=512): 15.01 (surpassing full-rank training)
SwitchLoRA (rank=256): 15.89

Resource Consumption Comparison

Using the 1.3B model as an example:

Memory Usage: 13% reduction compared to full-rank training (36.1GB → 31.9GB)
Communication Overhead: 54% reduction (trainable parameters from 1339M to 610M)
Training Time: Essentially equivalent (21.6s vs 22.5s)
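The communication figure can be checked with one line of arithmetic, assuming (as the list above implies) that communication volume scales with the number of trainable parameters whose gradients must be synchronized. The helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before

# Trainable parameters in millions, as reported for the 1.3B model.
comm_reduction = relative_reduction(1339, 610)
comm_percent = round(comm_reduction * 100)
```

The result rounds to 54%, which matches the reported reduction in communication overhead.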

Comparison with Existing Methods

vs ReLoRA

On the 250M model, when ReLoRA uses 5000 steps of full-rank pretraining while SwitchLoRA uses only 200 steps, SwitchLoRA still performs better. Under the same 1000-step full-rank pretraining condition, SwitchLoRA significantly outperforms ReLoRA.

vs GaLore

On the 350M model:

GaLore: 20.29 perplexity
SwitchLoRA: 19.58 perplexity

Under lower-rank settings, SwitchLoRA's advantages are even more pronounced, demonstrating the importance of covering all update directions.

Ablation Studies

Impact of Switching Frequency

Experiments show that both the initial frequency and the decay rate need to be set to moderate values; settings that are too high or too low both reduce performance.

Impact of Freezing Steps

The choice of freezing steps N affects training effectiveness, with N=5 being the optimal setting.

Initialization Strategy Validation

The new initialization method significantly improves convergence speed compared to traditional LoRA initialization.

Inference Capability Verification

GLUE Benchmark Results

On the 350M model:

SwitchLoRA pretrained model averages 3.0 points higher than the GaLore pretrained model
SwitchLoRA pretrained model averages 0.3 points higher than the full-rank pretrained model

On the 1.3B model:

SwitchLoRA pretrained model averages approximately 1.0 point higher than full-rank pretrained model

Experimental Findings

Rank Distribution Analysis: SwitchLoRA's singular value distribution is closer to that of full-rank training, while standard LoRA shows a pathological distribution
Scale Effects: As model scale increases, SwitchLoRA's advantages over standard LoRA become more pronounced
Generalization Ability: Models pretrained with SwitchLoRA demonstrate stronger inference and generalization capabilities on downstream tasks
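The intuition behind the rank finding can be illustrated with a toy experiment: a single rank-r product has rank r, but the sum of several independent rank-r products can reach a much higher rank. This is not the paper's analysis. Each random product below merely stands in for the contribution of one set of active adapter vectors, which is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 16, 12, 2                    # toy sizes (assumed)

def random_low_rank_update(rank):
    """A random m x n matrix of the given rank, standing in for one B @ A."""
    return rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

# One fixed low-rank adapter versus four different ones accumulated over time.
single = random_low_rank_update(r)
accumulated = sum(random_low_rank_update(r) for _ in range(4))

rank_single = np.linalg.matrix_rank(single)
rank_accumulated = np.linalg.matrix_rank(accumulated)
singular_values = np.linalg.svd(accumulated, compute_uv=False)
```

In this toy setting the accumulated update has rank 4r = 8 while each individual product has rank 2, which is the sense in which switching lets a rank-r adapter cover more directions than r over the course of training.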

Related Work

Low-Rank Decomposition Methods

Early work achieves low-rank approximation of weight matrices through methods like SVD, primarily applied to CNNs and small-scale language models.

LoRA Variants

Parameter Merging: Chain of LoRA and ReLoRA increase effective rank through periodic parameter merging
Initialization Improvement: Improve initialization strategies and learning rate settings for B and A matrices
Structural Modification: Modify LoRA's training process and parameter update mechanisms

Other Compression Methods

These include quantization, pruning, gradient compression, and other techniques, with GaLore achieving memory-efficient training through gradient projection.

Conclusions and Discussion

Main Conclusions

Performance Breakthrough: In the reported experiments, SwitchLoRA with rank 512 surpasses full-rank training during the pretraining stage on the 1.3B model (perplexity 15.01 vs 15.23)
Resource Efficiency: Significantly reduces memory usage and communication overhead while maintaining comparable training time
Enhanced Generalization: Pretrained models demonstrate stronger inference capabilities on downstream tasks

Limitations

Hyperparameter Sensitivity: Hyperparameters such as switching frequency require careful tuning
Rank Selection: Still requires relatively large LoRA rank to achieve full-rank training accuracy
Candidate Vector Selection: Currently uses random or sequential selection, leaving room for improvement

Future Directions

Adaptive Frequency: Develop more intelligent switching frequency adjustment strategies
Layer-wise Optimization: Design differentiated switching strategies for different layer types (Q, K, V matrices)
Candidate Vector Optimization: Research more effective candidate vector selection and update strategies

In-Depth Evaluation

Strengths

Theoretical Innovation: Proposes a novel approach for incrementally updating low-rank subspaces, effectively solving the low-rank training problem during pretraining
Engineering Implementation: Carefully considers practical issues such as optimizer state management and memory optimization, demonstrating strong practical utility
Comprehensive Experiments: Validates method effectiveness from multiple perspectives, including pretraining performance, resource consumption, and inference capabilities
Theoretical Analysis: Provides theoretical explanations for the independence of vector updates and the rationale for resetting optimizer states

Weaknesses

Increased Complexity: Adds implementation complexity compared to standard LoRA, requiring additional candidate vector management
Hyperparameter Tuning: Multiple hyperparameters (switching frequency, decay rate, freezing steps) require careful tuning
Scale Verification: Although multiple model scales are tested, the largest is only 1.3B, and applicability to larger models remains to be verified
Theoretical Completeness: While providing some theoretical analysis, lacks in-depth theoretical explanation for why it surpasses full-rank training

Impact

Academic Value: Provides new insights for parameter-efficient training during pretraining, potentially inspiring further related research
Practical Value: Significantly reduces resource consumption while maintaining performance, which matters for practical large-scale model training
Reproducibility: Paper provides detailed implementation details and hyperparameter settings, facilitating reproduction and application

Applicable Scenarios

Large Model Pretraining: Particularly suitable for resource-constrained scenarios requiring high-quality pretraining
Distributed Training: Can significantly reduce communication overhead in multi-node training
Incremental Training: Suitable for scenarios requiring continued training on top of pretraining

References

The paper cites extensive related work, primarily including:

Hu et al. 2022: Original LoRA paper
Lialin et al. 2023: ReLoRA method
Zhao et al. 2024: GaLore method
Vaswani et al. 2017: Transformer architecture
Rajbhandari et al. 2020: ZeRO optimizer

Overall Assessment: This is a high-quality research paper demonstrating excellence in theoretical innovation, experimental validation, and practical value. The SwitchLoRA method cleverly solves the low-rank training problem during pretraining, not only maintaining training effectiveness but also achieving significant resource savings. While some limitations exist, its contributions are sufficient to advance the field.