SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

Zhou, Wang, Xu

In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of updates to preserve optimizer state consistency, hindering its ability to closely approximate full-rank training behavior. Meanwhile, GaLore relies on Singular Value Decomposition (SVD) to approximate the full-rank space, which introduces accuracy loss during the approximation process. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, thereby enhancing accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA actually surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model, while also cutting communication overhead by 54% and memory usage by 13%. Furthermore, after full fine-tuning of both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model.

Basic Information

Paper ID: 2406.06564v3
Title: SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information
Authors: Kaiye Zhou, Shucheng Wang, Jun Xu (China Mobile (Suzhou) Software Technology Co. Ltd.)
Categories: cs.LG, cs.AI, cs.CL
Published: January 2, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2406.06564v3

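Summary

This paper addresses the challenge of parameter-efficient training during the pre-training phase of large language models and proposes the SwitchLoRA method. Low-rank adaptation techniques such as LoRA perform well during fine-tuning, but applying them directly to pre-training causes a significant drop in performance. The existing methods ReLoRA and GaLore try to solve this by updating the low-rank subspace, yet they still fall short of the accuracy of full-rank training. SwitchLoRA frequently and smoothly replaces the trainable parameters of the LoRA adapters, updating the low-rank subspace incrementally and touching only a few dimensions at a time to minimize the impact on optimizer states. In the experiments on the LLaMA 1.3B model, SwitchLoRA lowers perplexity from 15.23 (full-rank training) to 15.01, surpassing full-rank training while reducing communication overhead by 54% and memory usage by 13%.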

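Background and Motivation

Core Problem

With the rise of the Transformer architecture, large language models have grown rapidly in size, and distributed training of trillion-parameter models incurs heavy inter-node communication overhead. Parameter-efficient techniques such as LoRA perform well during fine-tuning, but applying them directly to pre-training leads to a significant drop in performance.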

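Limitations of Existing Methods

ReLoRA: limits the update frequency in order to keep optimizer states consistent, so it cannot closely approximate full-rank training behavior
GaLore: relies on SVD to approximate the full-rank space, which introduces accuracy loss during the approximation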

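Research Motivation

Neural networks exhibit full-rank behavior early in training, and their intrinsic rank gradually decreases as training proceeds. A method is therefore needed that can train a large number of parameters during pre-training while selectively updating only part of them at a time, reducing memory usage and communication overhead.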

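Core Contributions

The SwitchLoRA method: frequently and smoothly adjusts the trainable parameters of the LoRA matrices, reducing memory usage and communication overhead while maintaining the accuracy of full-rank training
Optimizer state management strategy: resets optimizer states and temporarily freezes parameters when a switch occurs, reducing the impact of inconsistent states
Improved initialization rule: provides a new initialization strategy for the LoRA adapter parameters and their candidate vectors, improving training efficiency
Comprehensive experimental validation: verifies the method's effectiveness on LLaMA models of several sizes and assesses reasoning ability with the GLUE benchmark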

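Method Details

Task Definition

Given a weight matrix $W \in \mathbb{R}^{m \times n}$ of a pre-trained model, standard LoRA replaces it with $W + \frac{\alpha}{r}BA$, where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m,n)$. SwitchLoRA builds on this by dynamically switching the vectors in $B$ and $A$ to increase the effective rank.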

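Model Architecture

Core Switching Mechanism

Vector decomposition: matrix $B$ is split into column vectors $b_k \in \mathbb{R}^{m \times 1}$ and matrix $A$ into row vectors $a_k^T \in \mathbb{R}^{1 \times n}$
Candidate vector sets: candidate sets $C(B)$ and $C(A^T)$ of $\min(m,n)$ vectors are maintained
Dynamic replacement: during training steps, $b_k$ and $a_k$ are replaced by candidate vectors $b_k' \in C(B)$ and $a_k' \in C(A^T)$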

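Weight Adjustment Strategy

When vectors are replaced, the weight matrix is adjusted to compensate, so that the contribution of the outgoing pair is absorbed into $W$: $W \leftarrow W + b_k a_k^T - b_k' {a_k'}^T$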
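The compensation step can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: the helper names (`matmul`, `effective_weight`, `switch_pair`, `demo`), the toy sizes, and the random candidates are all hypothetical. The `scale` argument is an assumption as well: with `scale=1.0` the code applies the update exactly as written above, while passing $\alpha / r$ would be one way to account for the LoRA scaling factor.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A), the weight the layer effectively applies."""
    BA = matmul(B, A)
    return [[w + scale * p for w, p in zip(w_row, p_row)]
            for w_row, p_row in zip(W, BA)]


def switch_pair(W, B, A, k, new_b, new_a, scale=1.0):
    """Replace column k of B and row k of A in place, compensating in W.

    Applies W <- W + scale * (b_k a_k^T - b_k' a_k'^T).
    """
    for i in range(len(W)):
        for j in range(len(W[0])):
            old = B[i][k] * A[k][j]
            new = new_b[i] * new_a[j]
            W[i][j] += scale * (old - new)
    for i in range(len(B)):
        B[i][k] = new_b[i]
    A[k] = list(new_a)


def demo(seed=0):
    """Return the largest change in the effective weight caused by one switch."""
    rng = random.Random(seed)
    m, n, r = 4, 3, 2  # hypothetical toy sizes

    def rand(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    W, B, A = rand(m, n), rand(m, r), rand(r, n)
    before = effective_weight(W, B, A)
    # Candidates are drawn at random here; in the text they come from
    # the candidate sets C(B) and C(A^T).
    new_b = [rng.gauss(0.0, 1.0) for _ in range(m)]
    new_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    switch_pair(W, B, A, k=1, new_b=new_b, new_a=new_a)
    after = effective_weight(W, B, A)
    return max(abs(x - y)
               for row_b, row_a in zip(before, after)
               for x, y in zip(row_b, row_a))
```

Calling `demo()` should return a value at the level of floating-point rounding error, which indicates that the effective weight is the same before and after the switch even though one column of $B$ and one row of $A$ were replaced.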

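Switching Frequency Design

An exponentially decaying function is used: $\mathrm{frequency} = C e^{-\theta \cdot \mathrm{step}}$, which reflects the model's natural evolution from full rank to low rank.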
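As an illustration of how such a schedule might drive the switching, here is a small sketch. Only the formula $C e^{-\theta \cdot \mathrm{step}}$ comes from the text. The function names and the sampling rule, in which each vector index switches independently with probability $\min(1, \mathrm{frequency})$, are assumptions made for this example, and any values passed for `c` and `theta` are placeholders rather than settings from the paper.

```python
import math
import random


def switch_frequency(step, c, theta):
    """Exponentially decaying switching frequency C * exp(-theta * step)."""
    return c * math.exp(-theta * step)


def sample_switch_indices(step, rank, c, theta, rng):
    """Pick which of the `rank` vector indices to switch at this step.

    Assumption: each index switches independently with probability
    min(1, frequency). The source does not specify this sampling rule.
    """
    p = min(1.0, switch_frequency(step, c, theta))
    return [k for k in range(rank) if rng.random() < p]
```

For example, `sample_switch_indices(step, rank=8, c=0.5, theta=1e-3, rng=random.Random(0))` returns the indices chosen at a given step. Because the frequency decays with `step`, switches become rarer as training proceeds.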

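Technical Innovations

1. Minimizing the Impact on Optimizer States

When $a_k$ is switched, the optimizer state of $b_k$ is reset
When $b_k$ is switched, the optimizer state of $a_k$ is reset
After a reset, the corresponding parameters are temporarily frozen for N steps (N=5)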
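The bookkeeping behind these rules can be sketched as follows. This is a simplified illustration and not the authors' code. The class name `SwitchStateTracker`, the Adam-style `m`/`v` entries kept as scalars, and the reading that the frozen vector is the one whose state was just reset are all assumptions.

```python
class SwitchStateTracker:
    """Track per-vector optimizer state and freeze windows (illustrative only)."""

    def __init__(self, rank, freeze_steps=5):
        self.freeze_steps = freeze_steps
        keys = [(name, k) for name in ("a", "b") for k in range(rank)]
        # Adam-style first/second moments, kept as scalars for illustration.
        self.state = {key: {"m": 0.0, "v": 0.0} for key in keys}
        # A vector is frozen while step < frozen_until[key].
        self.frozen_until = {key: 0 for key in keys}

    def on_switch(self, which, k, step):
        """Handle a switch of a_k (which="a") or b_k (which="b") at `step`.

        The counterpart vector's optimizer state is reset and, by assumption,
        that counterpart is frozen for the next `freeze_steps` steps.
        """
        partner = ("b", k) if which == "a" else ("a", k)
        self.state[partner] = {"m": 0.0, "v": 0.0}
        self.frozen_until[partner] = step + self.freeze_steps
        return partner

    def is_trainable(self, which, k, step):
        """Return True if the vector may receive an optimizer update at `step`."""
        return step >= self.frozen_until[(which, k)]
```

In a training loop, `on_switch` would be called whenever a vector is replaced, and `is_trainable` would gate the parameter update for each vector at each step.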

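2. Improved Initialization Strategy

Building on the ideas behind Xavier and Kaiming initialization, new standard deviations are defined: $\mathrm{std}[B] = \mathrm{std}[b] = \left(\frac{r}{\sqrt{mn}}\right)^{\frac{1}{4}} \mathrm{gain}^{\frac{1}{2}}$ and $\mathrm{std}[A] = \mathrm{std}[a] = \left(\frac{\sqrt{mr}}{\sqrt{nn}}\right)^{\frac{1}{4}} \mathrm{gain}^{\frac{1}{2}}$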
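To make the two expressions concrete, the sketch below evaluates them for given dimensions. It transcribes the formulas exactly as they are written in this summary, so it is worth checking them against the paper before relying on the values. The function name `switchlora_init_std` and the default `gain=1.0` are assumptions made for illustration.

```python
import math


def switchlora_init_std(m, n, r, gain=1.0):
    """Return (std_B, std_A) following the expressions as stated in this summary.

    m, n: dimensions of the weight matrix W (m x n); r: LoRA rank.
    """
    std_b = (r / math.sqrt(m * n)) ** 0.25 * math.sqrt(gain)
    std_a = (math.sqrt(m * r) / math.sqrt(n * n)) ** 0.25 * math.sqrt(gain)
    return std_b, std_a
```

The same standard deviation would then be used to draw both the active vectors in $B$ (or $A$) and their candidate vectors, since the text sets $\mathrm{std}[B] = \mathrm{std}[b]$ and $\mathrm{std}[A] = \mathrm{std}[a]$.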