SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information
Zhou, Wang, Xu
In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of updates to preserve optimizer state consistency, hindering its ability to closely approximate full-rank training behavior. Meanwhile, GaLore relies on Singular Value Decomposition (SVD) to approximate the full-rank space, which introduces accuracy loss during the approximation process. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, thereby enhancing accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA actually surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model, while also cutting communication overhead by 54% and memory usage by 13%. Furthermore, after fully fine-tuning both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model.
This paper proposes SwitchLoRA to address the challenges of parameter-efficient training during the pretraining phase of large language models. While traditional low-rank adaptation techniques such as LoRA excel in the fine-tuning stage, their direct application to pretraining results in significant performance degradation. Existing methods like ReLoRA and GaLore attempt to address this by updating low-rank subspaces but still fail to achieve full-rank training accuracy. SwitchLoRA incrementally updates the low-rank subspace by frequently and smoothly replacing trainable parameters in LoRA adapters, updating only a small number of dimensions at each step to minimize impact on optimizer states. Experimental results demonstrate that SwitchLoRA reduces perplexity on the LLaMA 1.3B model from 15.23 to 15.01, surpassing full-rank training while reducing communication overhead by 54% and memory usage by 13%.
With the rise of the Transformer architecture, the scale of large language models has grown dramatically, and distributed training of trillion-parameter models faces enormous inter-node communication overhead. While parameter-efficient techniques such as LoRA demonstrate excellent performance during fine-tuning, their direct application to the pretraining stage results in significant performance degradation.
Neural networks exhibit full-rank characteristics in early training stages, with internal rank gradually decreasing as training progresses. Therefore, a method is needed that can train large numbers of parameters during pretraining while selectively updating a subset of parameters to reduce memory usage and communication overhead.
Proposes SwitchLoRA method: Frequently and smoothly adjusts trainable parameters in LoRA matrices, reducing memory usage and communication overhead while maintaining full-rank training accuracy
Optimizer state management strategy: Designs optimizer state reset and temporary freezing mechanisms during parameter switching to reduce the impact of state inconsistency
Improved initialization rules: Provides new initialization strategies for LoRA adapter parameters and their candidate vectors, improving training efficiency
Comprehensive experimental validation: Validates the method's effectiveness across LLaMA models of various scales and verifies inference capabilities through GLUE benchmark tests
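The optimizer state management idea in the list above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the class name `RowwiseAdam`, the per-row step counter, and the choice to keep accumulating moments while a row is frozen are all assumptions made for illustration. The sketch only shows the general pattern of resetting stale Adam moments for one row and withholding its parameter updates for a few steps.

```python
import numpy as np


class RowwiseAdam:
    """Minimal Adam for one matrix with per-row state reset and freezing.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)                     # first-moment estimate
        self.v = np.zeros(shape)                     # second-moment estimate
        self.t = np.zeros(shape[0], dtype=int)       # per-row step count
        self.frozen = np.zeros(shape[0], dtype=int)  # remaining frozen steps

    def reset_and_freeze(self, row, freeze_steps):
        """Discard stale moments for `row` and withhold its updates."""
        self.m[row] = 0.0
        self.v[row] = 0.0
        self.t[row] = 0
        self.frozen[row] = freeze_steps

    def step(self, param, grad):
        # Assumption: moments are refreshed for every row, frozen or not,
        # so a frozen row has fresh statistics once it is released.
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        t = self.t[:, None]
        m_hat = self.m / (1 - self.beta1 ** t)
        v_hat = self.v / (1 - self.beta2 ** t)
        update = self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        active = self.frozen == 0
        param[active] -= update[active]   # only unfrozen rows move
        self.frozen[~active] -= 1         # count down the freeze
        return param


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
opt = RowwiseAdam(A.shape)
opt.step(A, rng.normal(size=A.shape))

# Pretend the vector paired with row 1 was just switched.
opt.reset_and_freeze(row=1, freeze_steps=2)
snapshot = A.copy()
for _ in range(2):
    opt.step(A, rng.normal(size=A.shape))
row_unchanged_while_frozen = np.array_equal(A[1], snapshot[1])

opt.step(A, rng.normal(size=A.shape))
row_moves_again = not np.array_equal(A[1], snapshot[1])
print(row_unchanged_while_frozen, row_moves_again)
```

Because only one row's state is reset at a time, the remaining rows keep their accumulated moments, which is consistent with the summary's point that touching few dimensions per switch limits the disturbance to optimizer states.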
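Given a weight matrix $W \in \mathbb{R}^{m \times n}$ of a pretrained model, standard LoRA reparameterizes it as $W + \frac{\alpha}{r} BA$, where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$. SwitchLoRA dynamically switches vectors in $B$ and $A$ to increase the effective rank.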
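A minimal NumPy sketch of one way such a switch could be carried out follows. The candidate pool `B_candidates`, the function names, and the compensation step that folds the difference into the base weight are assumptions for illustration; the summary does not spell out the exact procedure. The point of the sketch is that a column of B can be exchanged for a candidate vector without changing the effective weight W + (alpha / r) BA at the moment of the switch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, alpha = 6, 5, 2, 4.0   # toy sizes; in practice r << min(m, n)
scale = alpha / r

W = rng.normal(size=(m, n))               # base weight, not trained directly
B = rng.normal(size=(m, r))               # trainable LoRA factor
A = rng.normal(size=(r, n))               # trainable LoRA factor
B_candidates = rng.normal(size=(m, 8))    # hypothetical pool of candidate columns


def effective_weight(W, B, A):
    """Weight used in the forward pass: W + (alpha / r) * B @ A."""
    return W + scale * B @ A


def switch_column_of_B(W, B, A, candidates, k, j):
    """Swap column k of B with candidate j, folding the difference into W."""
    old, new = B[:, k].copy(), candidates[:, j].copy()
    # Compensate so that W + scale * B @ A is identical before and after.
    W = W + scale * np.outer(old - new, A[k, :])
    B, candidates = B.copy(), candidates.copy()
    B[:, k], candidates[:, j] = new, old
    return W, B, candidates


before = effective_weight(W, B, A)
W2, B2, cands2 = switch_column_of_B(W, B, A, B_candidates, k=0, j=3)
after = effective_weight(W2, B2, A)
print(np.allclose(before, after))
```

Since the forward pass is unchanged at the switch, training can continue from the same loss value while subsequent gradient steps explore a new direction. Switching a row of A would be handled symmetrically in this sketch.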
Building on Xavier and Kaiming initialization principles, the paper derives new standard deviations for the LoRA matrices $B$, $A$ and their candidate vectors $b$, $a$:
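$$\mathrm{std}[B] = \mathrm{std}[b] = \left(\frac{r}{\sqrt{mn}}\right)^{1/4} \mathrm{gain}^{1/2}, \qquad \mathrm{std}[A] = \mathrm{std}[a] = \left(\frac{\sqrt{mr}}{n\sqrt{n}}\right)^{1/4} \mathrm{gain}^{1/2}$$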
On the 250M model, SwitchLoRA still performs better even when ReLoRA is given 5000 steps of full-rank pretraining and SwitchLoRA only 200. With both methods given the same 1000 steps of full-rank pretraining, SwitchLoRA significantly outperforms ReLoRA.
Rank Distribution Analysis: SwitchLoRA's singular value distribution is closer to full-rank training, while standard LoRA shows pathological distribution
Scale Effects: As model scale increases, SwitchLoRA's advantages over standard LoRA become more pronounced
Generalization Ability: Models pretrained with SwitchLoRA demonstrate stronger inference and generalization capabilities on downstream tasks
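The rank argument behind these findings can be checked numerically. The sketch below uses synthetic random matrices, not the paper's data or models, and the helper `numerical_rank` is a hypothetical name. It shows that a single fixed low-rank product has rank at most r, while a sum of low-rank updates drawn from changing subspaces can reach a much higher rank; fresh random factors stand in for switched vectors here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 24, 4


def numerical_rank(M, tol=1e-8):
    """Count singular values above tol relative to the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))


# One fixed LoRA-style update B @ A: its rank can never exceed r.
single = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# Accumulated updates from eight different rank-r subspaces.
accumulated = sum(
    rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) for _ in range(8)
)

rank_single = numerical_rank(single)
rank_accumulated = numerical_rank(accumulated)
print(rank_single, rank_accumulated)
```

Inspecting the full singular value array returned by `np.linalg.svd` in the same way is how one would compare the spectrum of a trained weight against its full-rank counterpart.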
Other memory-efficient training approaches include quantization, pruning, gradient compression, and related techniques; GaLore, for example, achieves memory-efficient training through gradient projection.
Theoretical Innovation: Proposes a novel approach for incrementally updating low-rank subspaces, effectively solving the low-rank training problem during pretraining
Engineering Implementation: Carefully considers practical issues such as optimizer state management and memory optimization, demonstrating strong practical utility
Comprehensive Experiments: Validates method effectiveness from multiple perspectives, including pretraining performance, resource consumption, and inference capabilities
Theoretical Analysis: Provides theoretical explanations for vector update independence and optimizer state reset rationality
Academic Value: Provides new insights for parameter-efficient training during pretraining, potentially inspiring further related research
Practical Value: Significantly reduces resource consumption while maintaining performance, holding important significance for practical large-scale model training
Reproducibility: Paper provides detailed implementation details and hyperparameter settings, facilitating reproduction and application
The paper cites extensive related work, primarily including:
Hu et al. 2022: Original LoRA paper
Lialin et al. 2023: ReLoRA method
Zhao et al. 2024: GaLore method
Vaswani et al. 2017: Transformer architecture
Rajbhandari et al. 2020: ZeRO optimizer
Overall Assessment: This is a high-quality research paper demonstrating excellence in theoretical innovation, experimental validation, and practical value. The SwitchLoRA method cleverly solves the low-rank training problem during pretraining, not only maintaining training effectiveness but also achieving significant resource savings. While some limitations exist, its contributions are sufficient to advance the field.