2025-11-11T15:58:09.452987

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

Zhang, Yang, Cai et al.

As large language models (LLMs) continue to scale in size, computational overhead has become a major bottleneck for task-specific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution addresses these limitations by progressively accumulating a high-rank weight update from consecutive low-rank increments. Specifically, the per-update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To enable efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.

Basic Information

Paper ID: 2510.23818
Title: ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
Authors: Yilang Zhang, Xiaodong Yang, Yiwei Cai, Georgios B. Giannakis
Affiliations: University of Minnesota - Twin Cities, Visa Research
Category: cs.LG
Submitted: October 27, 2025
Paper link: https://arxiv.org/abs/2510.23818v1

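Abstract

As large language models (LLMs) keep growing, computational overhead has become the main bottleneck for task-specific fine-tuning. Low-rank adaptation (LoRA) lowers this cost by restricting weight updates to a low-dimensional subspace, but the restriction can hurt effectiveness and slow convergence. This work addresses both limitations by progressively accumulating consecutive low-rank increments into a high-rank weight update. Specifically, the optimal low-rank matrix for each update is identified so as to minimize the loss function and closely approximate full fine-tuning. To keep optimization efficient and seamless, with no restarts, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees show that the optimal scaling can be found analytically. Extensive numerical tests on popular LLMs with up to 12 billion parameters show consistent performance gains and fast convergence relative to state-of-the-art LoRA variants across natural language understanding, commonsense reasoning, and mathematical problem solving.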

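Background and Motivation

Problem Definition

As large language models grow rapidly in size, conventional full fine-tuning is becoming increasingly impractical because of its computational burden. For example, even Scout, the smallest Llama 4 variant, has 109 billion parameters, and full fine-tuning still requires more than 1 TB of GPU memory and a substantial amount of time even in half precision.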

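Limitations of Existing Methods

Limitations of LoRA: LoRA cuts computational cost by parameterizing the weight update as the product of two tall, thin matrices, but confining the update to a fixed low-dimensional subspace degrades performance and slows convergence.
Challenges of high-rank updates: Existing high-rank methods each carry a cost. ReLoRA requires restarting the optimization, MoRA relies on carefully designed nonlinear mappings, and the Hadamard product in HiRA has high computational complexity.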

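Research Motivation

The paper aims to overcome the limitations of LoRA by dynamically identifying the optimal low-rank adapter and stacking successive low-rank increments into a high-rank weight update, while preserving computational efficiency.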

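Core Contributions

Theoretical analysis: Proves a necessary and sufficient condition for the optimal low-rank adapter and shows that attaining it requires a truncated SVD, whose computational overhead is prohibitive.
ScaLoRA method: Restricts the new adapter to a column-scaling transformation of the current one. Under this restriction, the globally optimal adapter can be provably identified in closed form, together with tractable moment estimators.
Experimental validation: Comprehensive tests on DeBERTaV3-base, LLaMA-2-7B, LLaMA-3-8B, and Gemma-3-12B-pt corroborate the theoretical analysis and confirm the superior performance and faster convergence of ScaLoRA.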

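Method in Detail

Task Definition

Consider a generic weight matrix $W \in \mathbb{R}^{m \times n}$ of a large model. LoRA decomposes it as $W = W^{pt} + W^{ft}$ , where $W^{pt}$ is the frozen pre-trained weight and $W^{ft} := AB^T$ is the learnable fine-tuning update, with $A \in \mathbb{R}^{m \times r}$ , $B \in \mathbb{R}^{n \times r}$ , and $r \ll m,n$ .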
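
The parameterization above can be sketched in a few lines of NumPy. The sizes, the random initialization of `A`, and the zero initialization of `B` are illustrative assumptions (zero-initializing one factor is a common LoRA convention so that training starts from the pre-trained weight); none of them are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # illustrative sizes only, with r << m, n

W_pt = rng.standard_normal((m, n))       # frozen pre-trained weight
A = 0.01 * rng.standard_normal((m, r))   # trainable factor (assumed init)
B = np.zeros((n, r))                     # trainable factor (assumed zero init)

W_ft = A @ B.T        # low-rank fine-tuning update, rank at most r
W = W_pt + W_ft       # effective weight used in the forward pass

full_params = m * n          # trainable entries under full fine-tuning
lora_params = (m + n) * r    # trainable entries under LoRA
```

With these sizes LoRA trains 448 entries instead of 3072, and the gap widens as $m$ and $n$ grow while $r$ stays fixed.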

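Core Idea: Dynamically Optimal Low-Rank Adapters

Whereas LoRA keeps optimizing the same adapter $A_tB_t^T$ , the key idea of ScaLoRA is to dynamically identify, at every iteration, the "optimal" low-rank adapter that maximizes the loss decrease:

$W_t = W^{pt} + A_tB_t^T = \underbrace{(W^{pt} + A_tB_t^T - \tilde{A}_t\tilde{B}_t^T)}_{\text{merged and frozen}} + \underbrace{\tilde{A}_t\tilde{B}_t^T}_{\text{learnable}}$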
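
The decomposition above is an exact identity: swapping in a new adapter changes which part of the weight is trainable, not the weight itself. The sketch below checks this numerically. The replacement adapter is an arbitrary random column rescaling chosen purely for illustration, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 32, 24, 3  # illustrative sizes

W_pt = rng.standard_normal((m, n))
A_t = rng.standard_normal((m, r))
B_t = rng.standard_normal((n, r))

# Hypothetical replacement adapter: rescale each column of A_t and B_t.
alpha = rng.uniform(0.5, 1.5, size=r)
beta = rng.uniform(0.5, 1.5, size=r)
A_new = A_t * alpha   # broadcasting scales columns, i.e. A_t @ diag(alpha)
B_new = B_t * beta    # B_t @ diag(beta)

W_t = W_pt + A_t @ B_t.T              # weight before the swap
W_frozen = W_t - A_new @ B_new.T      # merged and frozen part
W_after = W_frozen + A_new @ B_new.T  # frozen part plus new learnable adapter
```

Because `W_after` equals `W_t`, the forward pass is unchanged at the moment of the swap, which is why no optimization restart is needed.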

Theoretical Analysis of the Optimal Low-Rank Adapter

Theorem 1 (optimality condition): Consider the SVD $\nabla\ell(W_t) = U_t\Sigma_tV_t^T$ . If $\text{rank}(\nabla\ell(W_t)) \geq 2r, \forall t$ and the Lipschitz smoothness assumption holds, then $(\tilde{A}_t^*, \tilde{B}_t^*)$ minimizes the upper bound on the loss if and only if:

$\tilde{A}_t^* = \frac{1}{\sqrt{L\eta}}[U_t]_{\mathcal{A}_t}P_t, \quad \tilde{B}_t^* = \frac{1}{\sqrt{L\eta}}[V_t]_{\mathcal{B}_t}Q_t$

where $\mathcal{A}_t \cup \mathcal{B}_t = \{1,\ldots,2r\}$ , $|\mathcal{A}_t| = |\mathcal{B}_t| = r$ , and $P_t, Q_t \in O(r)$ .
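
To see why this condition ties the optimal adapter to a truncated SVD, the following sketch builds one adapter of the stated form and inspects the resulting weight change. It assumes one plain gradient-descent step on both factors and keeps only the first-order term, $-\eta(\nabla\ell\,\tilde{B}\tilde{B}^T + \tilde{A}\tilde{A}^T\nabla\ell)$; that update rule, the random stand-in gradient `G`, the constants `L` and `eta`, and the particular index split are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 20, 16, 3
L, eta = 2.0, 0.1  # hypothetical smoothness constant and learning rate

G = rng.standard_normal((m, n))  # stand-in for the gradient of the loss at W_t
U, s, Vt = np.linalg.svd(G, full_matrices=False)
V = Vt.T

# One valid split of the top-2r singular indices (0-based) into two sets of size r.
idx_A = np.array([0, 2, 4])
idx_B = np.array([1, 3, 5])

def random_orthogonal(k, rng):
    """Draw a k x k orthogonal matrix via QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

P = random_orthogonal(r, rng)
Q = random_orthogonal(r, rng)

A_opt = U[:, idx_A] @ P / np.sqrt(L * eta)
B_opt = V[:, idx_B] @ Q / np.sqrt(L * eta)

# First-order weight change after one gradient step on both factors (assumed rule).
delta_W = -eta * (G @ B_opt @ B_opt.T + A_opt @ A_opt.T @ G)

# Rank-2r truncated SVD of the gradient, for comparison.
top = np.arange(2 * r)
G_trunc = (U[:, top] * s[top]) @ Vt[top, :]
```

Under these assumptions `delta_W` equals `-G_trunc / L`, a step of size $1/L$ along the best rank-$2r$ approximation of the gradient, regardless of how the indices are split or which orthogonal matrices are used.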

Optimal Scalar Scaling

To avoid the computational overhead of the SVD, ScaLoRA restricts the new adapter to $\tilde{A}_t = \alpha_t A_t$ , $\tilde{B}_t = \beta_t B_t$ .

Theorem 3 (optimal scalar scaling): Under Assumptions 1-2, the global minimum of the objective is attained at:

$(\alpha_t^*, \beta_t^*) = \begin{cases} \left(\pm\frac{\|A_t^T\nabla\ell(W_t)\|_F}{\sqrt{L\eta}\,\|A_tA_t^T\nabla\ell(W_t)\|_F}, 0\right) & \text{if } C_t^A > 0, C_t^B \leq 0 \\ \left(0, \pm\frac{\|\nabla\ell(W_t)B_t\|_F}{\sqrt{L\eta}\,\|\nabla\ell(W_t)B_tB_t^T\|_F}\right) & \text{if } C_t^A \leq 0, C_t^B > 0 \\ \left(\pm\sqrt{\frac{C_t^A}{L\eta C_t}}, \pm\sqrt{\frac{C_t^B}{L\eta C_t}}\right) & \text{if } C_t^A \geq 0, C_t^B \geq 0, C_t > 0 \end{cases}$
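
The one-sided case ($\beta_t = 0$) can be checked numerically. The sketch assumes the objective is the usual smoothness upper bound $\langle\nabla\ell, \Delta W\rangle + \frac{L}{2}\|\Delta W\|_F^2$ with $\Delta W = -\eta\alpha^2 A_tA_t^T\nabla\ell(W_t)$, which is what one gradient step yields when $\tilde{B}_t = 0$. The exact objective is not spelled out in this summary, so that form, the random stand-ins `G` and `A`, and the constants are assumptions. Under them the minimizer is $\alpha^* = \|A_t^T\nabla\ell\|_F / (\sqrt{L\eta}\,\|A_tA_t^T\nabla\ell\|_F)$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 30, 20, 4
L, eta = 5.0, 0.05  # hypothetical smoothness constant and learning rate

G = rng.standard_normal((m, n))  # stand-in for the gradient
A = rng.standard_normal((m, r))  # current adapter factor A_t

a = np.linalg.norm(A.T @ G) ** 2      # ||A^T G||_F^2
b = np.linalg.norm(A @ A.T @ G) ** 2  # ||A A^T G||_F^2

def bound(alpha):
    """Assumed upper bound on the loss change when beta = 0."""
    return -eta * alpha**2 * a + 0.5 * L * eta**2 * alpha**4 * b

# Closed-form minimizer of the assumed bound.
alpha_star = np.sqrt(a) / (np.sqrt(L * eta) * np.sqrt(b))

# Brute-force check on a fine grid around the closed-form value.
grid = np.linspace(0.0, 3.0 * alpha_star, 20001)
alpha_grid = grid[np.argmin(bound(grid))]
```

The grid minimizer agrees with `alpha_star` up to the grid resolution, and the bound is negative there, so the step decreases the loss bound.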

Optimal Column Scaling

To improve fitting capacity, ScaLoRA further considers column scaling, $\tilde{A}_t = A_t\text{diag}(\alpha_t)$ , $\tilde{B}_t = B_t\text{diag}(\beta_t)$ .

Theorem 5 (optimal column scaling): If the linear system $[(S_t^{A\top}S_t^A) \odot (S_t^{B\top}S_t^B)]v_t = \lambda_t$ admits a non-negative solution $v_t \in \mathbb{R}_+^{2r}$ , then the global minimum is attained at:

$\begin{bmatrix} \alpha_t^* \\ \beta_t^* \end{bmatrix} = \pm\frac{1}{\sqrt{L\eta}}v_t^{\circ\frac{1}{2}}$

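ScaLoRA Algorithm Flow

ScaLoRA adopts a hybrid scaling strategy:

When the linear system admits a non-negative solution, use column scaling.
Otherwise, fall back to scalar scaling.
Update the moment estimators according to the corresponding lemmas.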
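
The selection between the two scaling modes can be sketched as follows. This summary does not define $S_t^A$, $S_t^B$, or $\lambda_t$, so the function treats them as given inputs (`S_A` and `S_B` with $2r$ columns, `lam` of length $2r$); the function names, the least-squares solve, and the choice of the positive sign are assumptions, and the moment-estimator update is omitted.

```python
import numpy as np

def column_scaling(S_A, S_B, lam, L, eta):
    """Return (alpha, beta) if the linear system has a non-negative solution, else None."""
    M = (S_A.T @ S_A) * (S_B.T @ S_B)  # Hadamard product of the two Gram matrices
    v, *_ = np.linalg.lstsq(M, lam, rcond=None)
    if not np.allclose(M @ v, lam) or np.any(v < 0):
        return None
    scale = np.sqrt(v) / np.sqrt(L * eta)  # elementwise square root, positive sign
    r = v.size // 2
    return scale[:r], scale[r:]            # stacked as [alpha; beta]

def choose_scaling(S_A, S_B, lam, L, eta, scalar_fallback):
    """Hybrid rule: column scaling when available, otherwise the scalar solution."""
    result = column_scaling(S_A, S_B, lam, L, eta)
    if result is not None:
        return "column", result
    return "scalar", scalar_fallback()

# Toy usage with random placeholders for the undefined quantities.
rng = np.random.default_rng(4)
r, L, eta = 2, 2.0, 0.1
S_A = rng.standard_normal((10, 2 * r))
S_B = rng.standard_normal((10, 2 * r))
M = (S_A.T @ S_A) * (S_B.T @ S_B)
v_true = rng.uniform(0.1, 1.0, size=2 * r)
mode, (alpha, beta) = choose_scaling(S_A, S_B, M @ v_true, L, eta, lambda: (1.0, 1.0))
```

In the toy call the right-hand side is built from a positive `v_true`, so the column-scaling branch is taken; a right-hand side whose solution has a negative entry would trigger the scalar fallback.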

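Complexity Analysis

Time complexity: $O(mnr + (m+n+r)r^2)$
Space complexity: $O((m+n+r)r)$
ScaLoRA-I variant: the scaling step runs once every $I$ iterations, so its amortized time complexity is $O((mnr+(m+n+r)r^2)/I)$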
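
To get a feel for these expressions, the sketch below evaluates them for one hypothetical layer size. The counts ignore the constants hidden by the $O(\cdot)$ notation, and the values of `m`, `n`, `r`, and the interval are illustrative assumptions, not settings from the paper.

```python
def scalora_step_cost(m, n, r):
    """Operation count of one scaling step, up to constant factors."""
    return m * n * r + (m + n + r) * r**2

def scalora_extra_memory(m, n, r):
    """Extra stored entries, up to constant factors."""
    return (m + n + r) * r

def amortized_cost(m, n, r, interval):
    """Per-iteration cost when the scaling step runs once every `interval` iterations."""
    return scalora_step_cost(m, n, r) / interval

m, n, r = 4096, 4096, 8  # hypothetical layer size and rank
per_step = scalora_step_cost(m, n, r)      # dominated by the m*n*r term
memory = scalora_extra_memory(m, n, r)
per_iter = amortized_cost(m, n, r, interval=10)
```

For these sizes the $mnr$ term accounts for almost the entire per-step count, so running the step every $I$ iterations reduces the overhead almost proportionally.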

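Experimental Setup

Datasets

GLUE benchmark: 8 natural language understanding tasks
Commonsense reasoning: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OpenBookQA
Mathematical problem solving: MetaMathQA (training), GSM8K and MATH (testing)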