Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication

Shan, Guo, Wei et al.

The rapid scaling of large language models demands more efficient hardware. Quantization offers a promising trade-off between efficiency and performance. With ultra-low-bit quantization, there are abundant opportunities for results reuse, and thus it can be boosted with lookup tables (LUTs) based acceleration. However, existing LUT-based methods suffer from computation and hardware overheads for LUT construction, and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for integer weight mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6x, 4.09x, and 2.15x speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4x, 3.23x, and 20.9x, all within a 0.96mm² chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.

Basic Information

Paper ID: 2511.21910
Title: Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication
Authors: Haoxuan Shan, Cong Guo, Chiyue Wei, Feng Cheng, Junyao Zhang, Hai (Helen) Li, Yiran Chen
Affiliation: Duke University, Department of Electrical and Computer Engineering
Category: cs.AR (Computer Architecture)
Submission Date: November 26, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.21910

Abstract

The rapid expansion of large language models imposes stringent requirements on hardware efficiency. Quantization techniques offer promising trade-offs between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, which can be accelerated through lookup tables (LUTs). However, existing LUT methods incur significant computational and hardware overhead in LUT construction and rely solely on bit-serial computation, which is suboptimal for ternary weight networks. This paper proposes Platinum, a lightweight ASIC accelerator for mixed-precision integer weight matrix multiplication (mpGEMM). Platinum reduces LUT construction overhead through offline-generated construction paths and supports both generic bit-serial and optimized ternary weight execution via adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6×, 4.09×, and 2.15× speedups compared to SpikingEyeriss, Prosperity, and 16-thread T-MAC respectively, with 32.4×, 3.23×, and 20.9× energy reduction, while occupying only 0.96mm² of chip area.

Research Background and Motivation

1. Core Problem to Address

With the rapid growth of deep neural networks, particularly large language models (LLMs), energy consumption and computational latency have become major deployment challenges. General matrix multiplication (GEMM) dominates fully connected and attention layers, with computational burden scaling proportionally with model size.

2. Problem Significance

Energy Efficiency Requirements: LLM inference must run efficiently on edge devices
Real-time Constraints: Reducing computational latency is critical for user experience
Hardware Cost: Achieving high performance within limited chip area and power budgets

3. Limitations of Existing Approaches

Opportunities in Quantization:

Ultra-low-bit quantization (e.g., ternary weights {-1,0,1} in BitNet-b1.58) significantly improves efficiency while maintaining accuracy
Low-bit quantization enables LUT-based acceleration strategies through result precomputation and reuse

Problems with Existing LUT Methods:

Prosperity and Similar Methods: Dynamic LUT construction path scheduling incurs high hardware overhead (24% chip area, 32.3% power for scheduling modules)
Inefficiency of Bit-Serial Computation: Using 2-bit encoding for ternary weights exceeds the theoretical optimum of 1.58 bits (log₂3), with additional overhead from partial sum merging
Infeasibility of Precomputation: Offline precomputation of all LUT entries requires enormous storage (4GB for 8-bit activations with k=2)
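
To make the bit-serial overhead concrete, the sketch below computes one dot product plane by plane and then merges the partial sums. It assumes a 2-bit two's-complement encoding of the ternary weights; the encoding used by any particular accelerator may differ, and the function names are illustrative only.

```python
def bit_planes(w, bits=2):
    """Split a two's-complement integer weight into its bits, LSB first."""
    u = w & ((1 << bits) - 1)
    return [(u >> b) & 1 for b in range(bits)]

def bit_serial_dot(weights, acts, bits=2):
    """Dot product computed one weight bit plane at a time."""
    total = 0
    for b in range(bits):
        plane = [bit_planes(w, bits)[b] for w in weights]
        # In hardware this binary partial sum would be one binary-LUT lookup.
        partial = sum(p * a for p, a in zip(plane, acts))
        # Two's complement: the most significant plane carries a negative weight.
        scale = -(1 << b) if b == bits - 1 else (1 << b)
        total += scale * partial          # extra merge step per plane
    return total

weights = [1, -1, 0, 1, -1]
acts = [3, 5, 7, 2, 4]
print(bit_serial_dot(weights, acts))      # -4, same as the direct dot product
```

Each ternary chunk needs two binary lookups plus a merge, whereas a ternary LUT would return the final partial sum with a single lookup.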

4. Research Motivation

For models like BitNet with uniform weight distribution, most LUT entries are utilized (only 1.16% unused), making dynamic scheduling overhead unnecessary
Ternary LUTs directly represent final results, with experiments showing 1.3× or greater performance improvement compared to binary LUTs
There is a need for a lightweight, energy-efficient specialized accelerator that simultaneously supports generic integer weights and optimized execution for specific bit widths

Core Contributions

Platinum Accelerator Architecture: Designs a novel LUT-based mpGEMM accelerator with a decoupled path-based LUT construction framework that reduces LUT generation costs and minimizes hardware overhead
Path-Adaptive Execution: Through path switching, supports both generic bit-serial execution for integer weights and optimized execution for specific precisions (e.g., ternary weights)
System-Level Optimization Design:
- Architecture optimized for parallelism and data flow
- Lightweight modular design suitable for edge deployment
- Chip area of only 0.96mm²
Outstanding Performance Results: On BitNet b1.58-3B achieves:
- Up to 73.6× speedup (vs. SpikingEyeriss; 4.09× vs. Prosperity and 2.15× vs. 16-thread T-MAC)
- Up to 32.4× energy reduction (vs. SpikingEyeriss; 3.23× vs. Prosperity and 20.9× vs. T-MAC)
- Demonstrates the potential of LUT-based ASICs as a scalable solution for ultra-low-bit neural networks on edge platforms

Methodology Details

Task Definition

Mixed-Precision GEMM (mpGEMM):

Input: Weight matrix W (m×k, low-bit integer), activation matrix X (k×n, 8-bit integer)
Output: Result matrix Y (m×n)
Objective: Efficiently compute Y = W·X, with particular optimization for ternary weight scenarios
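
As a functional reference for the task, the following sketch computes one output column of Y = W·X with chunked ternary lookup tables. It builds each table naively (every ternary pattern is summed from scratch), so it illustrates only the lookup idea, not Platinum's construction path or hardware dataflow; the function name is hypothetical.

```python
from itertools import product

def ternary_lut_gemv(W, x, c=5):
    """y = W @ x for ternary W (list of rows) and integer vector x, via chunked LUTs."""
    y = [0] * len(W)
    for k0 in range(0, len(x), c):
        a = x[k0:k0 + c]
        # Naive LUT for this chunk: every ternary weight pattern -> partial sum.
        lut = {w: sum(wi * ai for wi, ai in zip(w, a))
               for w in product((-1, 0, 1), repeat=len(a))}
        # All rows reuse the same table: one lookup per row per chunk.
        for m, row in enumerate(W):
            y[m] += lut[tuple(row[k0:k0 + c])]
    return y

W = [[1, 0, -1, 1, 0, -1, 1],
     [0, 0, 1, -1, -1, 1, 0]]
x = [1, 2, 3, 4, 5, 6, 7]
print(ternary_lut_gemv(W, x))   # [3, 0]
```

The table is built once per chunk and shared by all m rows, which is where the result reuse comes from.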

Overall Architecture Design

Platinum Processor Composition (Figure 3):

L Platinum Processing Elements (PPEs): Each containing a controller, adder, and dedicated LUT buffer
Aggregator: Shares adders across PPEs, combined with additional adders forming a pipelined adder tree
High-Bandwidth On-Chip Buffers: Including weight, input, output, and construction path buffers
Special Function Unit (SFU): Supporting operations beyond GEMM (e.g., vector multiplication, activation functions)

Key Parameters:

L = 52 PPEs
8-bit per LUT entry (aligned with BitNet's 8-bit activations)
Chunk size c = 5 for ternary weights (⌈3^5/2⌉ = 122 entries, held in a 128-entry LUT addressed by 7 index bits)
Each PPE processes ncols = 8 input columns

LUT Construction Method Innovation

1. Offline Path Generation (Based on Minimum Spanning Tree, MST)

Problem Formulation:

Formalizes LUT construction as a directed hypergraph
Each node represents a LUT entry
Each hyperedge represents a computational operation

MST Algorithm Application:

Source node: lut[0] = 0
Operation constraint: Only addition/subtraction of input elements
Goal: Find a minimum-cost set of operations (a spanning tree rooted at lut[0]) that reaches all nodes

Advantages:

Exploits symmetry to reduce LUT size to ⌈3^c/2⌉
For c=5, reduces addition operations by ~10× compared to naive construction
Ensures correct data dependencies (topological sort)
Shortest read-after-write (RAW) dependency distance exceeds pipeline stages, eliminating need for additional hazard handling
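
A simplified sketch of offline path generation is shown below. It assumes a plain graph in which every edge (adding or subtracting one input element) has unit cost, so a breadth-first spanning tree is already a minimum spanning tree; the paper's hypergraph formulation and its RAW-distance guarantee are not reproduced here. The "first nonzero weight is +1" rule used for mirror consolidation and all function names are assumptions made for illustration.

```python
from collections import deque

def canonical(w):
    """True if w is all zeros or its first nonzero weight is +1 (one entry per mirror pair)."""
    for x in w:
        if x != 0:
            return x == 1
    return True

def generate_path(c):
    """Return (entries, path); each path item is (dst, src, j, flip)."""
    zero = (0,) * c
    index = {zero: 0}                 # weight pattern -> LUT index, in construction order
    path = []
    queue = deque([zero])
    while queue:
        src = queue.popleft()
        for j in range(c):
            if src[j] != 0:
                continue
            for delta in (1, -1):     # try lut[src] + a[j], then lut[src] - a[j]
                dst = src[:j] + (delta,) + src[j + 1:]
                if dst in index or not canonical(dst):
                    continue
                index[dst] = len(index)
                path.append((index[dst], index[src], j, delta == -1))
                queue.append(dst)
    entries = sorted(index, key=index.get)
    return entries, path

entries, path = generate_path(5)
print(len(entries), len(path))        # 122 121
```

Every entry after lut[0] costs exactly one addition or subtraction, and each source index is smaller than its destination index, so replaying the path in order respects data dependencies.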

2. Four-Stage Construction Pipeline (Figure 4)

Stage 1: Load construction path (dst, src, j, sign)
Stage 2: LUT read + input access
Stage 3: Adder computation lut[src] ± a[j]
Stage 4: LUT write-back

Path Format:

(dst, src, j, flip) represents lut[dst] = lut[src] ± a[j], with the last field selecting between addition and subtraction
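
The path format can be exercised with a small software model of the four stages. This is a sketch only: it assumes that flip = True means subtraction, processes one path entry at a time rather than pipelining, and uses a hand-written path for c = 2.

```python
def build_lut(path, a, size):
    """Replay a construction path; each step costs exactly one add or subtract."""
    lut = [0] * size                              # lut[0] = 0 is the source node
    for dst, src, j, flip in path:                # Stage 1: load path entry
        base, x = lut[src], a[j]                  # Stage 2: LUT read + input access
        value = base - x if flip else base + x    # Stage 3: adder
        lut[dst] = value                          # Stage 4: LUT write-back
    return lut

# Hand-written path for c = 2; entries are 0, a0, a1, a0 + a1, a0 - a1.
path_c2 = [(1, 0, 0, False), (2, 0, 1, False), (3, 1, 1, False), (4, 1, 1, True)]
print(build_lut(path_c2, a=[3, 5], size=5))       # [0, 3, 5, 8, -2]
```

The five entries match ⌈3^2/2⌉ = 5; the mirrored patterns (for example -a0 - a1) would be obtained by negating a stored entry at query time.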

Ternary Weight Optimization

1. Computational Complexity Analysis

Bit-Serial Method (Equation 1):

#add_bs = [⌈K/c⌉·c·2^c + M·⌈K/c⌉ + M(⌈K/c⌉-1)]·N

Ternary LUT Method (Equation 2):

#add_ter = [⌈K/c⌉·c·3^c + M(⌈K/c⌉-1)]·N

Platinum Optimized Method (Equation 3):

#add_platinum = [⌈K/c⌉·⌈3^c/2⌉ + M(⌈K/c⌉-1)]·N

Exploits symmetry through mirror consolidation to reduce LUT size and construction cost.

2. Compact Weight Encoding

Problem:

2-bit encoding: Far exceeds theoretical optimum of 1.58 bits
Byte storage: Extremely redundant

Solution:

Pack every c ternary weights into a base-3 integer
Requires ⌈log₂(3^c)⌉ = ⌈c·log₂3⌉ bits
Further divided into 1 sign bit and ⌈log₂(3^c)⌉-1 index bits to maintain symmetry
At c=5 the group needs 8 bits (1 sign bit + 7 index bits), i.e., 1.6 bits/weight, fitting exactly in one byte (Figure 6)

Index Reordering:

Reorder indices based on construction path
Ensures sequential LUT entry access
Eliminates need for hazard detection hardware
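
One possible realization of the sign-plus-index byte layout is sketched below. It assumes the mirror representative is the pattern whose first nonzero weight is +1 and numbers the representatives in plain enumeration order; Platinum instead orders indices by construction path, which this sketch does not attempt. All names are hypothetical.

```python
from itertools import product

C = 5  # ternary weights per byte

def _canonical(w):
    """True if w is all zeros or its first nonzero weight is +1."""
    for x in w:
        if x != 0:
            return x == 1
    return True

# The 122 representative patterns and their 7-bit indices.
CANON = [w for w in product((0, 1, -1), repeat=C) if _canonical(w)]
INDEX = {w: i for i, w in enumerate(CANON)}

def encode(w):
    """Pack C ternary weights into one byte: bit 7 = sign, bits 0-6 = LUT index."""
    w = tuple(w)
    sign = 0
    if not _canonical(w):
        w = tuple(-x for x in w)   # store the mirrored pattern instead
        sign = 1
    return (sign << 7) | INDEX[w]

def decode(byte):
    """Inverse of encode."""
    w = CANON[byte & 0x7F]
    return tuple(-x for x in w) if byte >> 7 else w

w = (-1, 0, 1, 1, 0)
print(decode(encode(w)) == w)      # True
```

At query time the index bits would address the LUT and the sign bit would select whether the stored value is negated.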

System-Level Optimization

1. Parallelism Design

N-Dimension Parallelism:

Each PPE processes ncols=8 input columns
LUT construction proceeds in blocks of ncols LUTs
Each query returns ncols partial sums
CACTI 7.0 analysis shows area efficiency decreases for ncols>8

K and N-Dimension Parallelism:

L=52 PPEs process L·c × ncols inputs in parallel
Partial sums directly stream to accumulators, reducing output buffer pressure

2. Utilization Improvement

Resource Imbalance Issue:

Construction phase: 1 adder + 2 LUT ports
Query phase: 2 adders + 2 LUT ports

Solution:

Configure additional adders to fully support reduction phase
LUT port theoretical utilization approaches 100%
Average adder utilization 90.5%

3. Data Tiling and Residency Strategy

Tiling Configuration (Design space exploration, Figure 7):

m_tiled = 1080
k_tiled = 520
n_tiled = 32
mnk-stationary strategy

On-Chip Storage:

272KB for weight/output/input buffers
52KB for LUT
Total 324KB on-chip SRAM
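
The LUT storage figure can be cross-checked from the key parameters listed earlier. The accounting below (one 128-entry, 8-bit LUT per input column per PPE) is an assumption that happens to reproduce the reported 52KB; the paper may organize the buffers differently.

```python
num_ppes = 52          # L
lut_entries = 128      # entries per LUT (7 index bits)
ncols = 8              # input columns per PPE
entry_bytes = 1        # 8-bit LUT entries

lut_bytes = num_ppes * lut_entries * ncols * entry_bytes
print(lut_bytes, lut_bytes / 1024)        # 53248 52.0

buffer_kb = 272        # weight/output/input buffers, as reported
total_kb = buffer_kb + lut_bytes // 1024
print(total_kb)                           # 324
```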

Experimental Setup

Datasets and Models

BitNet-b1.58 Model Suite:

b1.58-l: 700M parameters
b1.58-xl: 1.3B parameters
b1.58-3B: 3B parameters

Workloads:

Prefill Phase: N=1024 (batch size × sequence length)
Decode Phase: N=8
M and K dimensions extracted from BitLinear layers

Hardware Modeling Methodology

RTL Implementation:

SystemVerilog implementation of PPE
Synopsys Design Compiler synthesis
ARM standard cell library
28nm technology node
500 MHz frequency

Memory Modeling:

On-Chip SRAM: Modeled with CACTI 7.0
Off-Chip DRAM: Modeled with DRAMsim3
- 64GB DDR4 2133R
- Maximum bandwidth 64GB/s

Simulator:

Extended open-source Prosperity simulator
Cycle-accurate simulation
Captures computational cycles, memory accesses, PE activity

Comparison Baselines

Accelerator      Type   Frequency   Technology   PEs   Area        Throughput
SpikingEyeriss   ASIC   500 MHz     28 nm        168   1.07 mm²    20.8 GOP/s
Prosperity       ASIC   500 MHz     28 nm        256   1.06 mm²    375 GOP/s
T-MAC            CPU    3490 MHz    5 nm         -     289 mm²     715 GOP/s
Platinum         ASIC   500 MHz     28 nm        416   0.955 mm²   1534 GOP/s
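
As a quick sanity check, the throughput column can be turned into ratios and compared with the reported speedups. This is simple arithmetic on the table values, not a reproduction of the paper's simulation; small differences from the reported figures are expected because the table entries are rounded.

```python
# Throughput in GOP/s, copied from the table above.
throughput = {
    "SpikingEyeriss": 20.8,
    "Prosperity": 375.0,
    "T-MAC": 715.0,
    "Platinum": 1534.0,
}

def speedup_over(baseline):
    """Ratio of Platinum's throughput to the baseline's."""
    return throughput["Platinum"] / throughput[baseline]

for name in ("SpikingEyeriss", "Prosperity", "T-MAC"):
    print(f"{name}: {speedup_over(name):.2f}x")
```

The Prosperity and T-MAC ratios round to the reported 4.09× and 2.15×; the SpikingEyeriss ratio comes out slightly above the reported 73.6×.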

Evaluation Metrics

Performance: Latency (ms), throughput (GOP/s)
Energy Efficiency: Total energy (mJ), energy efficiency ratio
Hardware Cost: Chip area (mm²), power (W)

Experimental Results

Chip Area and Power Decomposition

Area Distribution (Total 0.96mm²):

Storage (all buffers, including LUT): 83.3%
- of which weight and activation buffers: 65% of total area
Aggregator and PPE (core computation): 15%
Other: 1.7%

Power Distribution (b1.58-3B prefill, 3.2W):

DRAM access: 53.5%
Weight buffer access: 31.6%
Other (including the LUT buffers, whose share is relatively low): 14.9%

Key Insights:

Storage dominates chip area, highlighting the area efficiency of LUT methods
DRAM and weight access are energy bottlenecks, emphasizing the importance of compact weight encoding
LUT power overhead is low, validating the efficiency of the LUT computation paradigm

Core-Level Performance Comparison

b1.58-3B Model Performance Improvement (Figures 8, 9):

Prefill Phase (N=1024):

vs SpikingEyeriss: 73.6× speedup, 32.4× energy reduction
vs Prosperity: 4.09× speedup, 3.23× energy reduction
vs T-MAC (16-thread): 2.15× speedup, 20.9× energy reduction
vs Platinum-bs (self bit-serial): 1.4× speedup, 1.34× energy reduction

Decode Phase (N=8):

vs SpikingEyeriss: 47.6× speedup, 18.4× energy reduction
vs Prosperity: 28.4× speedup, 15.3× energy reduction
vs T-MAC: 1.75× speedup, 15.0× energy reduction
vs Platinum-bs: 1.3× speedup, 1.31× energy reduction

Performance Advantage Source Analysis

1. Advantages of Offline Path Generation

Eliminates runtime scheduling hardware overhead (24% area + 32.3% power in Prosperity)
More area available for PEs, increasing throughput
Particularly effective for models with uniform weight distribution like BitNet

2. High PE Utilization

ncols=8 design ensures utilization under low-N workloads
Replicated adders fully utilize LUT ports
Prosperity shows insufficient PE utilization under decode loads

3. Ternary Weight Specialized Optimization

Additional 1.3-1.4× speedup compared to bit-serial mode
1.6 bits/weight compact encoding
Direct table lookup avoids partial sum merging overhead

4. High K-Dimension Parallelism

Reduces output data DRAM access frequency
Partial sums stream to accumulators

Cross-Model Consistency

Average Improvements Across Three Models (Figure 10):

b1.58-l, b1.58-xl, b1.58-3B show consistent performance
Significant improvements over baselines in both prefill and decode phases
Demonstrates method generality and scalability

Addition Count Optimization Effect

Figure 5 Analysis:

Addition count comparison across different LUT sizes (16-128 entries)
Platinum achieves minimum addition count across all chunk sizes
Advantage most pronounced at c=5 (combined with ternary LUT and mirror consolidation)

Encoding Efficiency

Figure 6 Analysis:

Pack size c=5 achieves optimal 1.6 bits/parameter
Approaches theoretical optimum of 1.58 bits
Far superior to 2-bit encoding (T-MAC, etc.)
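
The choice of c=5 can be checked with a few lines of arithmetic. The sketch below computes ⌈log₂(3^c)⌉ / c for small pack sizes using integer operations only; the range of c values examined is an arbitrary choice for illustration.

```python
def bits_per_weight(c):
    """Bits per ternary weight when c weights are packed into one base-3 integer."""
    total_bits = (3**c - 1).bit_length()   # equals ceil(log2(3**c)) since 3**c is never a power of 2
    return total_bits / c

for c in range(1, 9):
    print(c, round(bits_per_weight(c), 3))

best = min(range(1, 9), key=bits_per_weight)
print(best, bits_per_weight(best))         # 5 1.6
```

Among pack sizes up to 8, c=5 gives the lowest cost and is the only one that fills a byte exactly.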

Related Work

1. Quantization Techniques

Low-Bit Quantization: ANT, Olive, FP8-LM exploring aggressive quantization
Weight-Specific Quantization: AWQ, GPTQ, BitNet series
BitNet-b1.58: Ternary weights {-1,0,1} balancing efficiency and accuracy

2. LUT-Based Acceleration

BIQGEMM: Dynamic programming approach for binary weights
Prosperity: Dynamic "shortcut" detection, but high hardware overhead
T-MAC: Table lookup method on CPU
LUT-GEMM, LUT Tensor Core: Exploring LUT application in low-bit LLMs
Bitnet.cpp: CPU implementation with similar weight encoding strategy

Advantages of This Work:

First ASIC design decoupling path generation to offline
Simultaneously supports generic and precision-specific optimization
Lowest hardware overhead, optimal performance

3. Neural Network Accelerators

Eyeriss: Energy-efficient DNN accelerator
SpinalFlow: Spiking neural network dataflow
BitMod: Mixed data-type bit-serial accelerator

This Work's Position: Focuses on LUT-based ASIC for ultra-low-bit weight networks, targeting edge LLM inference

Conclusions and Discussion

Main Conclusions

Platinum Successfully Achieves Efficient LUT-Based Acceleration:
- Eliminates runtime scheduling overhead through offline path generation
- Achieves 1534 GOP/s throughput within 0.96mm² chip area
- Up to 73.6× speedup and 32.4× energy reduction (both vs. SpikingEyeriss, prefill phase)
Effectiveness of Path-Adaptive Design:
- Supports both generic bit-serial and ternary-optimized modes
- Ternary optimization provides additional 1.3-1.4× performance improvement
- Good balance between flexibility and specialization
Edge Deployment Potential:
- Lightweight modular design
- High energy efficiency suitable for edge platforms
- Provides scalable solution for ultra-low-bit neural networks

Limitations

1. Model Applicability Scope

Primarily Targets BitNet-Class Models: Uniform weight distribution with most LUT entries utilized
Non-Uniform Distribution Limitations: Offline paths may be suboptimal for sparse or non-uniform weight distributions
Fixed Chunk Size: c=5 optimized for ternary weights; other bit widths may require adjustment

2. Precision Support

Current 8-Bit Activation Limitation: While LUT entries are scalable, higher precision not thoroughly explored
Integer Quantization Assumption: Does not support floating-point or mixed-precision activations

3. Memory Bandwidth Bottleneck

DRAM Access Consumes 53.5% Power: Room for optimization remains
Weight Buffer Access 31.6% Power: Large models may face on-chip storage pressure

4. Generality Trade-offs

Limited SFU Role: The design focuses on GEMM, with only limited support for other operations
Requires Offline Encoding: Deployment process adds preprocessing steps

Future Directions

1. Extension to More Models

Explore adaptive path generation for non-uniform weight distributions
Support more quantization schemes (e.g., 4-bit, mixed-precision)

2. System-Level Optimization

Investigate more efficient memory hierarchy structures
Explore on-chip compression techniques to further reduce bandwidth requirements

3. Hybrid Dynamic-Static Approaches

Introduce lightweight dynamic adjustment while maintaining low overhead
Adaptively select paths based on layer characteristics

4. Extension to Other Operations

Fully leverage SFU to support complete LLM inference
Explore LUT methods in attention mechanisms

In-Depth Evaluation

Strengths

1. Method Innovation ⭐⭐⭐⭐⭐

Clear Core Innovation: Combination of offline path generation + adaptive execution is original
Solid Theoretical Foundation: MST modeling of LUT construction problem is mathematically elegant
Clever Engineering Implementation:
- Mirror consolidation exploits symmetry
- Compact encoding approaches theoretical optimum
- Four-stage pipeline avoids hazards

2. Experimental Comprehensiveness ⭐⭐⭐⭐⭐

Comprehensive Baseline Comparison: ASIC (SpikingEyeriss, Prosperity) and CPU (T-MAC)
Multi-Model Validation: Three BitNet models of different scales
Multi-Scenario Evaluation: Prefill and decode phases
Detailed Hardware Modeling: RTL synthesis + CACTI + DRAMsim3
Ablation Studies: Platinum vs Platinum-bs validates ternary optimization

3. Result Convincingness ⭐⭐⭐⭐⭐

Significant Performance Gains: Up to 73.6× speedup (vs. SpikingEyeriss) is far from a marginal improvement
Clear Energy Efficiency Advantage: 32.4× energy reduction critical for edge deployment
Reasonable Hardware Cost: 0.96mm² very compact for 28nm technology
Data Transparency: Detailed area and power decomposition provided

4. Writing Clarity ⭐⭐⭐⭐

Logical Structure: Background → Method → Experiments flow clearly
Rich Figures: Numerous figures (Figures 3-10 are cited above) effectively support arguments
Complete Technical Details: Algorithm pseudocode and formula derivations comprehensive
Slightly Dense: Some sections information-heavy, requiring careful reading

Weaknesses

1. Method Limitations

Rigidity of Offline Paths: Cannot adapt to runtime changes; potentially suboptimal for non-uniform distributions
Fixed Chunk Size: c=5 optimized for ternary; limited exploration of other configurations
Generalization Insufficiently Verified: Tested only on BitNet; effectiveness on other low-bit models (e.g., 4-bit) unknown

2. Experimental Setup

Baseline Fairness Issues:
- Prosperity scaled to match area, potentially affecting optimal configuration
- T-MAC on 5nm technology; large technology node difference
- SpikingEyeriss derives from Eyeriss, an older design (2016)
Missing GPU Comparison: No comparison with modern GPUs (A100, H100)
Single Power Test Scenario: Only reports 3.2W for prefill; decode power not detailed

3. Analysis Depth

Adder Utilization: Claims 90.5% on average but lacks detailed analysis
Memory Access Patterns: Insufficient exploration of DRAM bandwidth utilization
Scalability: L=52 choice lacks sufficient justification; larger system performance unknown
Temperature and Reliability: No discussion of thermal design and long-term reliability

4. Practical Considerations

Deployment Complexity: Offline encoding and path generation add deployment steps
Model Adaptation: Requires path regeneration for different models
Open-Source Plans: No mention of code or hardware design open-sourcing; reproducibility questionable

Impact Assessment

1. Academic Contribution ⭐⭐⭐⭐

Pioneering Work: First systematic solution to LUT construction overhead in ASIC design
Methodological Value: MST modeling can inspire other accelerator designs
Citation Potential: Expected high citation in LUT-based acceleration and low-bit inference domains

2. Practical Value ⭐⭐⭐⭐

Edge Deployment: 0.96mm² and high energy efficiency ideal for edge AI chips
Commercialization Potential: BitNet and ternary models' popularity creates real application scenarios
Technology Maturity: Based on mature 28nm technology, can quickly tape out for verification
Limitation: Model-specific dependency; generality needs improvement

3. Reproducibility ⭐⭐⭐

Sufficient Hardware Details: RTL implementation, synthesis parameters, storage configuration detailed
Clear Algorithms: Pseudocode and formulas complete
Explicit Tool Chain: Synopsys DC, CACTI 7.0, DRAMsim3 specified
Missing Elements:
- No open-source code or RTL provided
- Weight encoding implementation details insufficient
- Complete path generation algorithm not publicly available

Applicable Scenarios

Ideal Scenarios ✅

BitNet-Class Ternary Weight Model Inference: Optimal performance
Edge Device LLM Deployment: Strict area and power constraints
Batch Inference Tasks: Prefill phase advantages pronounced
Uniform Weight Distribution Models: High LUT utilization

Suitable Scenarios ⚠️

Generic Low-Bit (2-4 bit) Integer Weight Models: Supported via bit-serial mode
Medium-Scale Models (1-3B): Within experimental validation range
Fixed Model Inference: Offline optimization fully leveraged

Unsuitable Scenarios ❌

Floating-Point or Mixed-Precision Models: Current design unsupported
Dynamic Weights or Online Learning: Offline paths cannot adapt
Extremely Large Models (>10B): On-chip storage potentially insufficient
Highly Sparse or Non-Uniform Weight Distributions: Low LUT utilization

Insights for the Field

Hardware-Software Co-Design: Balance between offline optimization and runtime execution
Specialization vs. Generality Trade-off: Path switching achieves flexibility
Storage-Centric Design: Importance of memory architecture in LUT methods
Quantization-Hardware Matching: Natural fit between ternary weights and LUTs

Selected References

BitNet-b1.58 [13]: Ma et al., "The era of 1-bit LLMs: All large language models are in 1.58 bits"
T-MAC [14]: Wei et al., "T-MAC: CPU renaissance via table lookup for low-bit LLM deployment on edge"
Prosperity [24]: Wei et al., "Prosperity: Accelerating spiking neural networks via product sparsity"
BIQGEMM [18]: Jeon et al., "BiQGEMM: Matrix multiplication with lookup table for binary-coding-based quantized DNNs"
Eyeriss [27]: Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks"

Summary

Platinum represents significant progress in LUT-based neural network accelerator design. By decoupling path generation to offline processing and combining adaptive execution modes, it achieves an excellent balance among hardware overhead, performance, and energy efficiency. The speedup of up to 73.6× (over SpikingEyeriss) and the compact 0.96mm² design make it a compelling solution for edge LLM inference.

However, the work has notable limitations: dependence on specific models (BitNet), limited generality, and absence of open-source implementation. Future research can enhance adaptability while maintaining low overhead, extending to broader quantization schemes and model architectures.

Overall, this is a high-quality computer architecture paper with solid technical innovation and comprehensive experimental evaluation, providing a new design paradigm for low-bit neural network acceleration. Recommended for researchers and engineers working on neural network accelerators, quantized inference, and edge AI chip design.