Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication
Shan, Guo, Wei et al.
The rapid scaling of large language models demands more efficient hardware. Quantization offers a promising trade-off between efficiency and performance. With ultra-low-bit quantization, there are abundant opportunities for results reuse, and thus it can be boosted with lookup tables (LUTs) based acceleration. However, existing LUT-based methods suffer from computation and hardware overheads for LUT construction, and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for integer weight mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6x, 4.09x, and 2.15x speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4x, 3.23x, and 20.9x, all within a 0.96mm² chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.
The rapid expansion of large language models imposes stringent requirements on hardware efficiency. Quantization techniques offer promising trade-offs between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, which can be accelerated through lookup tables (LUTs). However, existing LUT methods incur significant computational and hardware overhead in LUT construction and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. This paper proposes Platinum, a lightweight ASIC accelerator for mixed-precision integer-weight matrix multiplication (mpGEMM). Platinum reduces LUT construction overhead through offline-generated construction paths and supports both generic bit-serial and optimized ternary-weight execution via adaptive path switching. On BitNet b1.58-3B, Platinum achieves speedups of up to 73.6×, 4.09×, and 2.15× over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, with energy reductions of 32.4×, 3.23×, and 20.9×, while occupying only 0.96mm² of chip area.
With the rapid growth of deep neural networks, particularly large language models (LLMs), energy consumption and computational latency have become major deployment challenges. General matrix multiplication (GEMM) dominates fully connected and attention layers, with computational burden scaling proportionally with model size.
Ultra-low-bit quantization (e.g., ternary weights {-1,0,1} in BitNet-b1.58) significantly improves efficiency while maintaining accuracy
Low-bit quantization enables LUT-based acceleration strategies through result precomputation and reuse
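The precompute-and-reuse idea can be sketched in a few lines of Python. This is a minimal illustration, not Platinum's design: it assumes 1-bit weights in {0, 1}, a hypothetical group size k = 4, and the helper names `build_binary_lut` and `lut_matvec_1bit` are invented for this example. Each group of k activations gets one table of all 2^k subset sums, and every weight row then replaces k multiply-accumulates with a single lookup.

```python
def build_binary_lut(acts):
    """All 2^k subset sums of one activation group.

    Entry i holds the sum of acts[j] over every bit j that is set in i.
    """
    lut = [0] * (1 << len(acts))
    for i in range(1, len(lut)):
        low = i & -i                                          # lowest set bit of i
        lut[i] = lut[i ^ low] + acts[low.bit_length() - 1]    # reuse a smaller entry: one add
    return lut


def lut_matvec_1bit(weight_rows, acts, k=4):
    """y[r] = sum_j weight_rows[r][j] * acts[j] for weights in {0, 1}."""
    n = len(acts)
    assert n % k == 0, "activation length must be a multiple of the group size"
    # Tables are built once per activation group and shared by all weight rows.
    luts = [build_binary_lut(acts[g:g + k]) for g in range(0, n, k)]
    out = []
    for row in weight_rows:
        total = 0
        for gi, g in enumerate(range(0, n, k)):
            idx = sum(bit << j for j, bit in enumerate(row[g:g + k]))
            total += luts[gi][idx]                            # one lookup replaces k MACs
        out.append(total)
    return out


acts = [3, -1, 4, 1, -5, 9, 2, -6]
weights = [
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
result = lut_matvec_1bit(weights, acts)
reference = [sum(w * a for w, a in zip(row, acts)) for row in weights]
print(result, reference)  # the two lists match
```

The table cost is paid once per activation group, so the benefit grows with the number of weight rows that share it.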
Problems with Existing LUT Methods:
Prosperity and Similar Methods: Dynamic LUT construction path scheduling incurs high hardware overhead (24% chip area, 32.3% power for scheduling modules)
Inefficiency of Bit-Serial Computation: Using 2-bit encoding for ternary weights exceeds the theoretical optimum of 1.58 bits (log₂3), with additional overhead from partial sum merging
Infeasibility of Precomputation: Offline precomputation of all LUT entries requires enormous storage (4GB for 8-bit activations with k=2)
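The encoding gap behind the bit-serial inefficiency can be made explicit with a short calculation. The byte-packing line is a standard illustration added here, not a scheme taken from the paper.

```latex
\begin{align*}
\text{information per ternary weight} &= \log_2 3 \approx 1.585 \text{ bits} \\
\text{cost of a 2-bit encoding} &= \frac{2}{\log_2 3} \approx 1.262
  \quad (\text{about } 26\% \text{ more bits than the optimum}) \\
\text{5 weights packed in one byte} &= \tfrac{8}{5} = 1.6 \text{ bits per weight}
  \quad (3^5 = 243 \le 2^8 = 256)
\end{align*}
```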
For models like BitNet with uniform weight distribution, most LUT entries are utilized (only 1.16% unused), making dynamic scheduling overhead unnecessary
Ternary LUTs directly represent final results, with experiments showing 1.3× or greater performance improvement compared to binary LUTs
There is a need for a lightweight, energy-efficient specialized accelerator that simultaneously supports generic integer weights and optimized execution for specific bit widths
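The difference between a ternary LUT and a bit-serial binary LUT can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the bit-serial variant assumes a 2-bit two's-complement encoding of each weight (w = b0 - 2·b1), the ternary index uses a hypothetical base-3 digit mapping (0 → 0, 1 → +1, 2 → -1), and all function names are invented for this example.

```python
from itertools import product


def ternary_index(ws):
    """Base-3 code of a weight group; digits 0, 1, 2 stand for weights 0, +1, -1."""
    idx = 0
    for w in reversed(ws):
        idx = idx * 3 + (w % 3)      # in Python, -1 % 3 == 2
    return idx


def build_ternary_lut(acts):
    """3^k entries; each entry is already the final dot product for one weight group."""
    k = len(acts)
    lut = [0] * (3 ** k)
    for idx in range(3 ** k):
        rem, total = idx, 0
        for j in range(k):
            d = rem % 3
            total += (d if d < 2 else -1) * acts[j]
            rem //= 3
        lut[idx] = total
    return lut


def build_binary_lut(acts):
    """2^k subset sums, as used by one bit plane in bit-serial execution."""
    lut = [0] * (1 << len(acts))
    for i in range(1, len(lut)):
        low = i & -i
        lut[i] = lut[i ^ low] + acts[low.bit_length() - 1]
    return lut


def dot_ternary(ws, acts):
    # Single lookup, no merge. (A real design builds the table once and reuses it.)
    return build_ternary_lut(acts)[ternary_index(ws)]


def dot_bit_serial(ws, acts):
    # Two lookups, one per bit plane, followed by a shift-and-subtract merge.
    lut = build_binary_lut(acts)
    b0 = sum((w & 1) << j for j, w in enumerate(ws))          # low bit
    b1 = sum(((w >> 1) & 1) << j for j, w in enumerate(ws))   # sign bit
    return lut[b0] - 2 * lut[b1]


acts = [5, -2, 7]
ws = (-1, 0, 1)
print(dot_ternary(ws, acts), dot_bit_serial(ws, acts))  # both equal -5 + 0 + 7 = 2
```

The ternary table is larger (3^k versus 2^k entries) but returns the final value in one access, whereas the bit-serial route needs one access per bit plane plus the merge step.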
Platinum Accelerator Architecture: Designs a novel LUT-based mpGEMM accelerator with a decoupled path-based LUT construction framework that reduces LUT generation costs and minimizes hardware overhead
Path-Adaptive Execution: Through path switching, supports both generic bit-serial execution for integer weights and optimized execution for specific precisions (e.g., ternary weights)
System-Level Optimization Design:
Architecture optimized for parallelism and data flow
Lightweight modular design suitable for edge deployment
Chip area of only 0.96mm²
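Performance Results: On BitNet b1.58-3B, Platinum achieves:
Speedups of up to 73.6×, 4.09×, and 2.15× over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively
Energy reductions of 32.4×, 3.23×, and 20.9× against the same three baselines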
Demonstrates the potential of LUT-based ASIC as a scalable solution for ultra-low-bit neural networks on edge platforms
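The decoupled, path-based construction and the path-switching idea from the contributions above can be sketched in Python. This summary does not describe Platinum's actual path format or generation algorithm, so everything here is an assumption made for illustration: a path is modeled as a list of hypothetical `(dst, src, j, sign)` steps, each meaning `lut[dst] = lut[src] + sign * acts[j]`, and the function names are invented. The point is that the path depends only on the group size, so it can be generated offline, while the runtime only replays it with one add or subtract per entry.

```python
def generate_ternary_path(k):
    """Offline: build order for a 3^k-entry ternary LUT (digits 0, 1, 2 = weights 0, +1, -1)."""
    path = []
    for dst in range(1, 3 ** k):
        j, d, rem = 0, 0, dst
        for pos in range(k):             # locate the highest nonzero base-3 digit
            digit = rem % 3
            if digit:
                j, d = pos, digit
            rem //= 3
        src = dst - d * 3 ** j           # same entry with that digit cleared; always < dst
        path.append((dst, src, j, 1 if d == 1 else -1))
    return path


def generate_binary_path(k):
    """Offline: build order for a 2^k-entry binary LUT used in bit-serial mode."""
    path = []
    for dst in range(1, 1 << k):
        low = dst & -dst
        path.append((dst, dst ^ low, low.bit_length() - 1, 1))
    return path


def replay_path(path, acts, size):
    """Runtime: the same replay loop serves either mode; only the loaded path differs."""
    lut = [0] * size
    for dst, src, j, sign in path:
        lut[dst] = lut[src] + sign * acts[j]   # one add or subtract per entry
    return lut


acts = [5, -2, 7]
ternary_lut = replay_path(generate_ternary_path(3), acts, 3 ** 3)
binary_lut = replay_path(generate_binary_path(3), acts, 1 << 3)
print(len(ternary_lut), len(binary_lut))  # 27 8
```

Switching between execution modes in this sketch amounts to loading a different precomputed path; no scheduling decision is made at runtime.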
Platinum represents significant progress in LUT-based neural network accelerator design. By moving construction-path generation offline and combining it with adaptive execution modes, it balances hardware overhead, performance, and energy efficiency. The speedup of up to 73.6× over SpikingEyeriss (4.09× over Prosperity) and the compact 0.96mm² design make it a compelling solution for edge LLM inference.
However, the work has notable limitations: dependence on specific models (BitNet), limited generality, and absence of open-source implementation. Future research can enhance adaptability while maintaining low overhead, extending to broader quantization schemes and model architectures.
Overall, this is a high-quality computer architecture paper with solid technical innovation and comprehensive experimental evaluation, providing a new design paradigm for low-bit neural network acceleration. Recommended for researchers and engineers working on neural network accelerators, quantized inference, and edge AI chip design.