A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices
Ringoot, Alomairy, Edelman
The reduction of a banded matrix to a bidiagonal form is a crucial step in the Singular Value Decomposition (SVD), a cornerstone of scientific computing and AI. Despite being a highly parallel algorithm, it was previously believed to be unsuitable for GPU computation because it is memory bandwidth-bound. Recent developments in GPU hardware, including larger L1 memory per Streaming Multiprocessor/Compute Unit, have changed that. We present the first GPU algorithm for reducing a banded matrix to bidiagonal form as part of the NextLA.jl open-source software package. Our algorithm is based on previous CPU-based multicore parallel cache-efficient bulge chasing algorithms and adapted to optimize for GPU throughput. We leverage Julia Language's Array abstractions and KernelAbstractions to implement a single hardware- and data precision-agnostic function on NVIDIA, AMD, Intel, and Apple Metal GPUs for half, single, and double precision, and examine performance optimization across hardware architectures and data precision. We also develop a hardware-aware performance model and identify key hyperparameters, such as inner tilewidth and block concurrency, that govern optimal GPU execution for bandwidth-bound workloads. We demonstrate highly parallel bandwidth-bound algorithm on the GPU can outperform CPU-based implementations: the GPU algorithm outperforms multithreaded CPU High-Performance libraries PLASMA and SLATE as of matrix size 1024 x 1024 and by a factor over 100 for matrices of 32k x 32k. In addition, the performance of the algorithm increases linearly with matrix bandwidth size, making faster reduction of larger matrix bandwidths now also possible. With this work, we break memory bandwidth barriers, as well as matrix bandwidth barriers, resulting in orders-of-magnitude faster algorithms for the reduction of banded matrices to bidiagonal form on the GPU.
academic
A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices
本文提出了首个GPU常驻的内存感知算法,用于加速带状矩阵到双对角矩阵的约简,这是奇异值分解(SVD)中的关键步骤。尽管该算法具有高度并行性,但由于其内存带宽约束特性,此前被认为不适合GPU计算。随着GPU硬件的发展,特别是每个流多处理器/计算单元更大的L1内存,这一情况得以改变。作者基于先前的CPU多核并行缓存高效的bulge-chasing算法,并针对GPU吞吐量进行优化。该算法在NVIDIA、AMD、Intel和Apple Metal GPU上实现了硬件和数据精度无关的单一函数,支持半精度、单精度和双精度计算。实验表明,该GPU算法在1024×1024矩阵规模开始超越多线程CPU高性能库PLASMA和SLATE,在32k×32k矩阵上性能提升超过100倍。
Algorithm 1: 使用Householder向量的带状矩阵到双对角矩阵约简
输入: 带宽BW,内部瓦片宽度TW,矩阵大小n
1: for 带宽约简 i = (BW-1)/TW → 1 do
2: 目标带宽 TBW = 1 + i·TW
3: for 并行:每行 R = 1→n do
4: 行bulge k = R
5: for 每个行bulge: j = 0, j+=1 do
6: if 3(R-1) < j and k ≤ n then
7: 计算第k行的HH向量以消除TW个元素并应用到下方行
8: 计算最左生成列bulge的HH向量并应用到右侧列
9: 更新k值
10: end if
11: end for
12: end for
13: end for
Algorithm 2: 内存感知GPU内核,每块TPB个线程
1: 线程内存: Ai (TW+1)
2: 块内存: X (TW+1)
3: 块内所有线程协作: X ← A[k,..]
4: 同步线程
5: 块内所有线程协作: HH(X)
6: 块内所有线程协作: A[k,..]← X
7: 同步线程
8: for l: 0→(CBW + TW)/TPB - 1 do
9: 线程i: 计算行 r = k + l·CPB + i
10: Ai ← A[r, ...]
11: HH(X, Ai)
12: A[r, ...]← Ai
13: end for