Optimal Quantization for Matrix Multiplication

Ordentlich, Polyanskiy

Recent work in the machine learning community has proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (the main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices $A,B$, an encoder (compressor) is applied to each of them independently, producing descriptions with $R$ bits per entry. These representations are subsequently used by the decoder to estimate the matrix product $A^\top B$. In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate $R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only the Frobenius norms $\|\bar{A}\|_F, \|\bar{B}\|_F$ and $\|\bar{A}^\top \bar{B}\|_F$, where $\bar{A},\bar{B}$ are versions of $A,B$ with zero-centered columns, respectively. For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In addition, we derive the rate-distortion function for matrix multiplication of iid Gaussian matrices, which exhibits an interesting phase transition at $R\approx 0.906$ bit/entry, showing the necessity of Johnson-Lindenstrauss dimensionality reduction (sketching) in the low-rate regime.

Optimal Quantization for Matrix Multiplication

Basic Information

Paper ID: 2410.13780
Title: Optimal Quantization for Matrix Multiplication
Authors: Or Ordentlich (Hebrew University of Jerusalem), Yury Polyanskiy (MIT)
Categories: cs.IT cs.AI cs.CL cs.LG math.IT
Published: October 2024 (arXiv preprint)
Paper link: https://arxiv.org/abs/2410.13780

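Summary

This paper studies the quantization problem for large-scale matrix multiplication. Unlike conventional vector quantization, the goal is to approximate not the matrices themselves but their product. Given two real matrices A and B, an encoder compresses each matrix independently, describing each entry with R bits, and a decoder then uses these compressed representations to estimate the product A⊤B. For matrices with iid Gaussian entries, the paper provides a non-asymptotic lower bound on the mean squared approximation error, constructs a universal quantizer based on nested lattices, and finds an interesting phase transition at R≈0.906 bits/entry. This transition suggests that Johnson-Lindenstrauss dimensionality reduction is necessary at low rates.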

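Research Background and Motivation

Problem Definition

With the rise of deep neural networks and large language models, matrix multiplication has become the main computational bottleneck. Modern computing hardware is often limited by memory bandwidth rather than by compute capability. Lossy compression of matrices to reduce memory transfers has therefore become an important problem.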

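Practical Needs

For large language models, the authors estimate the required quantization rates:

In the generation phase, a quantization rate of 1 to 3 bits/entry is needed for CPUs to fully utilize their compute resources
In the prefill phase, for small LLMs running on fast GPUs, a quantization rate of about 11.7 bits/entry is needed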

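Limitations of Existing Methods

Classical vector quantization: quantizing A and B directly and independently, then multiplying the quantized matrices, incurs an error of O(n²)
Sketching methods: can provide unbiased estimates, but the variance is still O(n²)
Deterministic quantizers: an Ω(n²) lower bound exists for vectors on the sphere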

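Core Contributions

Theoretical lower bound: provides a non-asymptotic lower bound for matrix multiplication quantization of matrices with iid Gaussian entries
Universal quantizer: constructs a universal quantizer based on nested lattices, with an explicit error guarantee for arbitrary matrices
Asymptotic optimality: proves that the proposed quantizer achieves the lower bound for iid Gaussian matrices and is therefore asymptotically optimal
Phase transition: finds a phase transition at R≈0.906 bits/entry, revealing the need for dimensionality reduction at low rates
Practical algorithm: provides a low-complexity implementation with performance close to optimal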

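Method Details

Task Definition

Given matrices A ∈ R^(n×a) and B ∈ R^(n×b), the goal is to design encoders f₁: R^(n×a) → [2^(naR)] and f₂: R^(n×b) → [2^(nbR)] (each encoder outputs one of 2^(naR) or 2^(nbR) messages, respectively) and a decoder g that minimize:

$\mathbb{E}\|A^\top B - g(f_1(A), f_2(B))\|_F^2$

Here each matrix entry is described with R bits.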
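
To make the objective concrete, the following minimal sketch measures the squared Frobenius error of the product estimate for a naive baseline: each matrix is quantized independently with a uniform scalar quantizer at R bits per entry, and the decoder multiplies the reconstructions. This is not the nested-lattice quantizer from the paper. The function names, the clipping range `clip=4.0`, and the matrix sizes are assumptions chosen only for illustration.

```python
import numpy as np

def uniform_quantize(x, bits, clip=4.0):
    """Map each entry to the midpoint of one of 2**bits uniform cells on [-clip, clip].

    Hypothetical baseline quantizer (an assumption for illustration),
    not the scheme proposed in the paper.
    """
    levels = 2 ** bits
    step = 2 * clip / levels
    # Encoder: one integer index in {0, ..., levels - 1} per entry (bits per entry).
    idx = np.clip(np.floor((x + clip) / step), 0, levels - 1)
    # Decoder: reconstruct each entry as the midpoint of its cell.
    return (idx + 0.5) * step - clip

def product_distortion(A, B, bits):
    """Squared Frobenius error between A^T B and the product of the reconstructions."""
    A_hat = uniform_quantize(A, bits)
    B_hat = uniform_quantize(B, bits)
    err = A.T @ B - A_hat.T @ B_hat
    return float(np.sum(err ** 2))

rng = np.random.default_rng(0)
n, a, b = 256, 8, 8                      # illustrative sizes
A = rng.standard_normal((n, a))          # iid Gaussian entries
B = rng.standard_normal((n, b))

d2 = product_distortion(A, B, bits=2)    # coarse: 2 bits per entry
d6 = product_distortion(A, B, bits=6)    # fine: 6 bits per entry
print(f"R=2: {d2:.2f}   R=6: {d6:.2f}")
```

Averaging `product_distortion` over many random draws of `A` and `B` would approximate the expectation in the objective for this baseline.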

Core Function Γ(R)

The paper defines an important rate-distortion function:

$\Gamma(R) = \begin{cases} 1 - \left(1 - (2 \cdot 2^{-2R^*} - 2^{-4R^*})\right) \frac{R}{R^*} & R < R^* \\ 2 \cdot 2^{-2R} - 2^{-4R} & R \geq R^* \end{cases}$
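
The piecewise definition can be evaluated numerically with a short sketch. The value `R_STAR = 0.906` is an assumption: it identifies R* with the approximate phase-transition rate quoted in the summary. The helper names `phi` and `gamma` are also chosen here for illustration and are not taken from the paper.

```python
R_STAR = 0.906  # assumed: approximate phase-transition rate quoted in the summary

def phi(r):
    """High-rate branch of the formula: 2 * 2^(-2r) - 2^(-4r)."""
    return 2 * 2 ** (-2 * r) - 2 ** (-4 * r)

def gamma(r, r_star=R_STAR):
    """Evaluate Gamma(R) from the piecewise formula above."""
    if r < r_star:
        # Low-rate branch: the straight line joining (0, 1) and (r_star, phi(r_star)).
        return 1 - (1 - phi(r_star)) * r / r_star
    return phi(r)

for rate in (0.0, 0.5, R_STAR, 1.0, 2.0, 4.0):
    print(f"R = {rate:5.3f}  Gamma(R) = {gamma(rate):.4f}")
```

Both branches agree at R = R*, so the function is continuous there. Below R* it decreases linearly from Γ(0) = 1, which matches the summary's description of a distinct low-rate regime.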