2025-11-20T19:52:15.672703

A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices

Ringoot, Alomairy, Edelman

The reduction of a banded matrix to bidiagonal form is a crucial step in the Singular Value Decomposition (SVD), a cornerstone of scientific computing and AI. Despite being a highly parallel algorithm, it was previously believed to be unsuitable for GPU computation because it is memory bandwidth-bound. Recent developments in GPU hardware, including larger L1 memory per Streaming Multiprocessor/Compute Unit, have changed that. We present the first GPU algorithm for reducing a banded matrix to bidiagonal form as part of the NextLA.jl open-source software package. Our algorithm is based on previous CPU-based multicore parallel cache-efficient bulge-chasing algorithms and is adapted to optimize for GPU throughput. We leverage the Julia language's array abstractions and KernelAbstractions to implement a single hardware- and data-precision-agnostic function on NVIDIA, AMD, Intel, and Apple Metal GPUs for half, single, and double precision, and examine performance optimization across hardware architectures and data precisions. We also develop a hardware-aware performance model and identify key hyperparameters, such as inner tile width and block concurrency, that govern optimal GPU execution for bandwidth-bound workloads. We demonstrate that a highly parallel bandwidth-bound algorithm on the GPU can outperform CPU-based implementations: the GPU algorithm outperforms the multithreaded CPU high-performance libraries PLASMA and SLATE starting at matrix size 1024 x 1024, and by a factor of over 100 for matrices of size 32k x 32k. In addition, the performance of the algorithm increases linearly with matrix bandwidth, making faster reduction of matrices with larger bandwidths possible as well. With this work, we break memory bandwidth barriers as well as matrix bandwidth barriers, resulting in orders-of-magnitude faster algorithms for the reduction of banded matrices to bidiagonal form on the GPU.



Basic Information

Paper ID: 2510.12705
Title: A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices
Authors: Evelyne Ringoot, Rabab Alomairy, Alan Edelman (MIT Computer Science & AI Laboratory)
Categories: cs.DC (Distributed, Parallel, and Cluster Computing), cs.MS (Mathematical Software)
Published: October 14, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.12705

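Summary

The paper presents the first GPU-resident, memory-aware algorithm for accelerating the reduction of banded matrices to bidiagonal form, a key step in the singular value decomposition (SVD). Although the algorithm is highly parallel, it was previously considered unsuitable for GPUs because it is memory bandwidth-bound. Advances in GPU hardware, in particular larger L1 memory per streaming multiprocessor/compute unit, have changed this. The authors build on earlier cache-efficient bulge-chasing algorithms for multicore CPUs and optimize them for GPU throughput. The algorithm is implemented as a single hardware- and precision-agnostic function that runs on NVIDIA, AMD, Intel, and Apple Metal GPUs and supports half, single, and double precision. Experiments show that the GPU algorithm overtakes the multithreaded CPU high-performance libraries PLASMA and SLATE starting at a matrix size of 1024×1024, and is more than 100 times faster on 32k×32k matrices.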

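Background and Motivation

Problem Definition

The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing, machine learning, and data analysis, with wide use in principal component analysis, latent semantic indexing, low-rank approximation, and matrix completion. On modern large-scale hardware, the SVD is usually computed in three stages:

Stage 1: reduction of the dense matrix to a banded matrix
Stage 2: reduction of the banded matrix to a bidiagonal matrix (bulge chasing)
Stage 3: reduction of the bidiagonal matrix to a diagonal matrix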
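
To make the matrix shapes in the three stages concrete, the following minimal Python sketch measures the upper bandwidth of a small dense matrix stored as a list of rows. The helper names `upper_bandwidth` and `is_upper_bidiagonal` are illustrative only and are not part of NextLA.jl. Counting only the nonzero superdiagonals is an assumed convention here; the paper may count bandwidth differently.

```python
def upper_bandwidth(A, tol=0.0):
    """Number of nonzero superdiagonals: largest j - i with |A[i][j]| > tol."""
    bw = 0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            if j > i and abs(a) > tol:
                bw = max(bw, j - i)
    return bw


def is_upper_bidiagonal(A, tol=0.0):
    """True if nothing is stored below the diagonal and at most one superdiagonal is nonzero."""
    has_lower = any(
        abs(a) > tol
        for i, row in enumerate(A)
        for j, a in enumerate(row)
        if j < i
    )
    return not has_lower and upper_bandwidth(A, tol) <= 1


# Output of stage 1 (banded, two superdiagonals in this toy example).
banded = [
    [1.0, 2.0, 3.0, 0.0],
    [0.0, 4.0, 5.0, 6.0],
    [0.0, 0.0, 7.0, 8.0],
    [0.0, 0.0, 0.0, 9.0],
]

# Shape that stage 2 must produce (one superdiagonal).
bidiagonal = [
    [1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0],
    [0.0, 0.0, 5.0],
]
```

In these terms, stage 2 takes a matrix like `banded` and must return one for which `is_upper_bidiagonal` holds while preserving the singular values.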

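Motivation

GPU implementations of the first and third stages have been studied extensively, but the second stage remains underexplored on modern GPUs. Dongarra et al. observed in 2014 that accelerators perform poorly on memory-bound, fine-grained computational tasks such as bulge chasing, which limits the potential benefit of a GPU implementation of the second stage.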

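Technical Opportunity

Recent advances in GPU architecture include, in particular:

Larger L1 cache per streaming multiprocessor
Higher on-chip bandwidth
More flexible memory access patterns
Better cache reuse and a larger memory hierarchy

These improvements have significantly shifted the balance between memory and compute, creating new opportunities to redesign memory-bound algorithms.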

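Core Contributions

First GPU-resident, memory-aware bulge-chasing algorithm: the first GPU-native algorithm for reducing a banded matrix to bidiagonal form. It exploits the superior memory characteristics of the latest generation of high-performance GPUs and runs 10-100x faster than optimized CPU implementations.
Cache-efficient bulge-chasing strategy for large bandwidths: a new cache-efficient strategy that reduces the bandwidth in successive steps lets the GPU algorithm handle larger bandwidths than before, which shifts the optimal bandwidth trade-off between the first and second stages of a large SVD.
Open-source, hardware- and precision-agnostic implementation: an open-source library built on Julia language abstractions, in which a single function definition supports half, single, and double precision on NVIDIA, AMD, Intel, and Apple GPUs.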

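Algorithm 1: Reduction of a banded matrix to bidiagonal form using Householder (HH) vectors
Input: bandwidth BW, inner tile width TW, matrix size n
1:  for bandwidth reduction i = (BW-1)/TW → 1 do
2:    target bandwidth TBW = 1 + i·TW
3:    for in parallel: each row R = 1 → n do
4:      row bulge k = R
5:      for each row bulge: j = 0, j += 1 do
6:        if 3(R-1) < j and k ≤ n then
7:          compute the HH vector of row k to annihilate TW elements and apply it to the rows below
8:          compute the HH vector of the leftmost column of the resulting bulge and apply it to the columns to the right
9:          update k
10:       end if
11:     end for
12:   end for
13: end for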
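
Both Householder steps in Algorithm 1 rely on the same building block. As a reminder, the standard textbook construction of a reflector is given below; the symbols x, v, α, β, and H are generic notation and not necessarily the paper's. For a vector x of length m = TW + 1:

```latex
\[
\alpha = -\operatorname{sign}(x_1)\,\lVert x \rVert_2, \qquad
v = x - \alpha e_1, \qquad
\beta = \frac{2}{v^{\mathsf T} v}, \qquad
H = I - \beta\, v v^{\mathsf T}.
\]
Since $v^{\mathsf T} v = 2\lVert x \rVert_2^2 - 2\alpha x_1 = 2\, v^{\mathsf T} x$, it follows that $\beta\, v^{\mathsf T} x = 1$ and
\[
H x = x - \beta\, v\,(v^{\mathsf T} x) = x - v = \alpha e_1 .
\]
```

Because H is symmetric, applying it from the right to a row segment x^T gives α e_1^T, so one entry survives and the remaining TW entries become zero. Applying H from the left to a column segment does the same for the column bulge.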

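Memory-Aware GPU Kernel Implementation

Algorithm 2: Memory-aware GPU kernel with TPB threads per block
1:  thread memory: Ai (TW+1)
2:  block memory: X (TW+1)
3:  all threads in the block cooperatively: X ← A[k,..]
4:  synchronize threads
5:  all threads in the block cooperatively: HH(X)
6:  all threads in the block cooperatively: A[k,..] ← X
7:  synchronize threads
8:  for l: 0 → (CBW + TW)/TPB - 1 do
9:    thread i: compute row r = k + l·CPB + i
10:   Ai ← A[r, ...]
11:   HH(X, Ai)
12:   A[r, ...] ← Ai
13: end for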
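
The data flow of Algorithm 2 can be imitated sequentially in a few lines of plain Python. This is only a sketch under stated assumptions, not the NextLA.jl kernel: the matrix is a dense list of rows, the loop over `i` stands in for the TPB threads of one block, the row stride (written CPB in the listing) is assumed to equal TPB, updated rows are assumed to start at row k + 1, and the names `householder` and `kernel_step` as well as the parameters `c0` and `n_rows` are hypothetical.

```python
import math


def householder(x):
    """Return (v, beta, alpha) such that (I - beta*v*v^T) x = alpha*e1."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return list(x), 0.0, 0.0            # nothing to annihilate
    alpha = -math.copysign(norm, x[0])      # sign choice avoids cancellation
    v = list(x)
    v[0] -= alpha
    beta = 2.0 / sum(t * t for t in v)
    return v, beta, alpha


def kernel_step(A, k, c0, tw, n_rows, tpb):
    """Sequential stand-in for one block of Algorithm 2 (modifies A in place).

    Annihilates up to tw entries of row k in columns c0+1..c0+tw and applies
    the same reflector to the next n_rows rows (assumed to start at k + 1).
    """
    cols = slice(c0, c0 + tw + 1)
    X = A[k][cols]                          # "block memory": pivot row segment
    v, beta, alpha = householder(X)
    A[k][cols] = [alpha] + [0.0] * (len(X) - 1)
    n_iters = -(-n_rows // tpb)             # ceil(n_rows / tpb)
    for l in range(n_iters):
        for i in range(tpb):                # each i plays the role of one thread
            r = k + 1 + l * tpb + i         # assumed stride: TPB rows per pass
            if r > k + n_rows or r >= len(A):
                continue                    # this "thread" has no row to update
            Ai = A[r][cols]                 # "thread memory": one row segment
            dot = sum(a * b for a, b in zip(Ai, v))
            A[r][cols] = [a - beta * dot * b for a, b in zip(Ai, v)]


A = [
    [4.0, 3.0, 4.0, 0.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 0.0, 5.0, 6.0],
    [0.0, 0.0, 0.0, 7.0],
]
kernel_step(A, k=0, c0=1, tw=1, n_rows=2, tpb=2)
```

After the call, row 0 holds a single nonzero in the processed segment, and rows 1 and 2 have been transformed by the same reflector, so the Euclidean norm of each updated segment is unchanged. The point of the layout is that the pivot segment is read once into shared storage while every other row is touched by exactly one thread, which is what keeps memory traffic low.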