Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication

Shan, Guo, Wei et al.
The rapid scaling of large language models demands more efficient hardware. Quantization offers a promising trade-off between efficiency and performance. With ultra-low-bit quantization, there are abundant opportunities for results reuse, and thus it can be boosted with lookup tables (LUTs) based acceleration. However, existing LUT-based methods suffer from computation and hardware overheads for LUT construction, and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for integer weight mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6x, 4.09x, and 2.15x speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4x, 3.23x, and 20.9x, all within a 0.96mm2 chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.

Basic Information

  • Paper ID: 2511.21910
  • Title: Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication
  • Authors: Haoxuan Shan, Cong Guo, Chiyue Wei, Feng Cheng, Junyao Zhang, Hai (Helen) Li, Yiran Chen
  • Affiliation: Duke University, Department of Electrical and Computer Engineering
  • Category: cs.AR (Computer Architecture)
  • Submission Date: November 26, 2025 (arXiv)
  • Paper Link: https://arxiv.org/abs/2511.21910

Abstract

The rapid expansion of large language models imposes stringent requirements on hardware efficiency. Quantization techniques offer promising trade-offs between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, which can be accelerated through lookup tables (LUTs). However, existing LUT methods incur significant computational and hardware overhead in LUT construction and rely solely on bit-serial computation, which is suboptimal for ternary weight networks. This paper proposes Platinum, a lightweight ASIC accelerator for mixed-precision integer weight matrix multiplication (mpGEMM). Platinum reduces LUT construction overhead through offline-generated construction paths and supports both generic bit-serial and optimized ternary weight execution via adaptive path switching. On BitNet b1.58-3B, Platinum achieves 73.6×, 4.09×, and 2.15× speedups compared to SpikingEyeriss, Prosperity, and 16-thread T-MAC respectively, with 32.4×, 3.23×, and 20.9× energy reduction, while occupying only 0.96mm² of chip area.

Research Background and Motivation

1. Core Problem to Address

With the rapid growth of deep neural networks, particularly large language models (LLMs), energy consumption and computational latency have become major deployment challenges. General matrix multiplication (GEMM) dominates fully connected and attention layers, with computational burden scaling proportionally with model size.

2. Problem Significance

  • Energy Efficiency Requirements: LLM inference must run efficiently on edge devices
  • Real-time Constraints: Reducing computational latency is critical for user experience
  • Hardware Cost: Achieving high performance within limited chip area and power budgets

3. Limitations of Existing Approaches

Opportunities in Quantization:

  • Ultra-low-bit quantization (e.g., ternary weights {-1,0,1} in BitNet-b1.58) significantly improves efficiency while maintaining accuracy
  • Low-bit quantization enables LUT-based acceleration strategies through result precomputation and reuse

Problems with Existing LUT Methods:

  • Prosperity and Similar Methods: Dynamic LUT construction path scheduling incurs high hardware overhead (24% chip area, 32.3% power for scheduling modules)
  • Inefficiency of Bit-Serial Computation: Using 2-bit encoding for ternary weights exceeds the theoretical optimum of 1.58 bits (log₂3), with additional overhead from partial sum merging
  • Infeasibility of Precomputation: Offline precomputation of all LUT entries requires enormous storage (4GB for 8-bit activations with k=2)

4. Research Motivation

  • For models like BitNet with uniform weight distribution, most LUT entries are utilized (only 1.16% unused), making dynamic scheduling overhead unnecessary
  • Ternary LUTs directly represent final results, with experiments showing 1.3× or greater performance improvement compared to binary LUTs
  • There is a need for a lightweight, energy-efficient specialized accelerator that simultaneously supports generic integer weights and optimized execution for specific bit widths

Core Contributions

  1. Platinum Accelerator Architecture: Designs a novel LUT-based mpGEMM accelerator with a decoupled path-based LUT construction framework that reduces LUT generation costs and minimizes hardware overhead
  2. Path-Adaptive Execution: Through path switching, supports both generic bit-serial execution for integer weights and optimized execution for specific precisions (e.g., ternary weights)
  3. System-Level Optimization Design:
    • Architecture optimized for parallelism and data flow
    • Lightweight modular design suitable for edge deployment
    • Chip area of only 0.96mm²
  4. Outstanding Performance Results: On BitNet b1.58-3B achieves:
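    • Up to 73.6× speedup over SpikingEyeriss (4.09× over Prosperity, 2.15× over 16-thread T-MAC)
    • Up to 32.4× energy reduction over SpikingEyeriss (3.23× over Prosperity, 20.9× over T-MAC)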
    • Demonstrates the potential of LUT-based ASIC as a scalable solution for ultra-low-bit neural networks on edge platforms

Methodology Details

Task Definition

Mixed-Precision GEMM (mpGEMM):

  • Input: Weight matrix W (m×k, low-bit integer), activation matrix X (k×n, 8-bit integer)
  • Output: Result matrix Y (m×n)
  • Objective: Efficiently compute Y = W·X, with particular optimization for ternary weight scenarios
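As a point of reference for the task definition above, the following is a minimal sketch of the plain (non-LUT) computation that a LUT-based design must reproduce. The function name mpgemm_reference and the tiny matrices are illustrative assumptions, not taken from the paper.

```python
def mpgemm_reference(W, X):
    """Compute Y = W @ X for a low-bit integer W (m x k) and an int8 X (k x n).

    Plain nested loops; zero weights are skipped, which is the only
    "optimization" here. This is a correctness baseline, not the paper's method.
    """
    m, k, n = len(W), len(X), len(X[0])
    assert all(len(row) == k for row in W), "inner dimensions must match"
    Y = [[0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            w = W[i][p]
            if w == 0:
                continue
            for j in range(n):
                Y[i][j] += w * X[p][j]
    return Y


# Hypothetical example: ternary weights, small integer activations.
W = [[1, 0, -1], [-1, 1, 1]]
X = [[10, -3], [7, 5], [-2, 8]]
Y = mpgemm_reference(W, X)
```

With ternary weights every product is an addition, a subtraction, or nothing, which is why precomputing sums of activation chunks in a LUT can replace the inner loop.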

Overall Architecture Design

Platinum Processor Composition (Figure 3):

  1. L Platinum Processing Elements (PPEs): Each containing a controller, adder, and dedicated LUT buffer
  2. Aggregator: Shares adders across PPEs, combined with additional adders forming a pipelined adder tree
  3. High-Bandwidth On-Chip Buffers: Including weight, input, output, and construction path buffers
  4. Special Function Unit (SFU): Supporting operations beyond GEMM (e.g., vector multiplication, activation functions)

Key Parameters:

  • L = 52 PPEs
  • 8-bit per LUT entry (aligned with BitNet's 8-bit activations)
  • Chunk size c = 5 for ternary weights (⌈3^5/2⌉ = 122 entries, held in a 128-entry LUT)
  • Each PPE processes ncols = 8 input columns

LUT Construction Method Innovation

1. Offline Path Generation (Based on Minimum Spanning Tree MST)

Problem Formulation:

  • Formalizes LUT construction as a directed hypergraph
  • Each node represents a LUT entry
  • Each hyperedge represents a computational operation

MST Algorithm Application:

Source node: lut[0] = 0
Operation constraint: Only addition/subtraction of input elements
Goal: Find minimum-cost path connecting all nodes

Advantages:

  • Exploits symmetry to reduce LUT size to ⌈3^c/2⌉
  • For c=5, reduces addition operations by ~10× compared to naive construction
  • Ensures correct data dependencies (topological sort)
  • Shortest read-after-write (RAW) dependency distance exceeds pipeline stages, eliminating need for additional hazard handling
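The offline path generation described above can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: it assumes every add or subtract costs the same, in which case a breadth-first spanning tree from the zero entry is one minimum spanning tree, and it works on an ordinary graph rather than the directed hypergraph formulation. The names is_canonical and generate_path, and the choice of "first nonzero weight is +1" as the stored half of the table, are assumptions for illustration.

```python
from collections import deque


def is_canonical(pattern):
    """Keep the zero pattern and patterns whose first nonzero weight is +1.

    The mirrored patterns (first nonzero weight is -1) are recovered by
    negating a stored entry, so they need no LUT slot of their own.
    """
    first = next((w for w in pattern if w != 0), 0)
    return first >= 0


def generate_path(c):
    """Return (index, path) for chunk size c.

    index maps each stored weight pattern to its LUT slot.
    path is a list of (dst, src, j, flip) steps meaning
    lut[dst] = lut[src] - a[j] if flip else lut[src] + a[j].
    """
    zero = (0,) * c
    index = {zero: 0}          # lut[0] = 0 is the source node
    path = []
    queue = deque([zero])
    while queue:
        src = queue.popleft()
        for j in range(c):
            for flip, delta in ((0, 1), (1, -1)):
                w = src[j] + delta
                if w not in (-1, 0, 1):
                    continue
                dst = src[:j] + (w,) + src[j + 1:]
                if is_canonical(dst) and dst not in index:
                    index[dst] = len(index)   # slots follow construction order
                    path.append((index[dst], index[src], j, flip))
                    queue.append(dst)
    return index, path


index, path = generate_path(5)
naive_adds = 5 * 3**5          # c additions for each of the 3^c entries
print(len(index), len(path), naive_adds / len(path))
```

For c = 5 the sketch stores ⌈3^5/2⌉ = 122 entries and needs one addition for each entry other than the source, which is roughly a tenth of the naive count. Because slots are numbered in discovery order, every step reads a slot that was written earlier, so the path is already topologically sorted.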

2. Four-Stage Construction Pipeline (Figure 4)

Stage 1: Load construction path (dst, src, j, sign)
Stage 2: LUT read + input access
Stage 3: Adder computation lut[src] ± a[j]
Stage 4: LUT write-back

Path Format:

(dst, src, j, flip) represents lut[dst] = lut[src] ± a[j], where the last field (listed as sign in Stage 1 above) selects addition or subtraction
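A functional model of what the construction pipeline computes can be written in a few lines. It is a sketch under assumptions: it ignores pipeline timing entirely, it treats flip = 1 as "subtract" (the summary does not spell out the encoding), and the c = 2 path below is hand-written for illustration rather than taken from the paper.

```python
def build_lut(path, a, size):
    """Apply construction steps in order; each step is one add or subtract."""
    lut = [0] * size                      # lut[0] = 0 is the source entry
    for dst, src, j, flip in path:        # Stage 1: load one path record
        base = lut[src]                   # Stage 2: LUT read + input access
        value = base - a[j] if flip else base + a[j]   # Stage 3: adder
        lut[dst] = value                  # Stage 4: LUT write-back
    return lut


# Stored weight patterns for c = 2 (mirrored patterns are omitted).
PATTERNS = [(0, 0), (1, 0), (0, 1), (1, 1), (1, -1)]
# Hand-written path: four steps build the four non-zero entries.
PATH = [(1, 0, 0, 0), (2, 0, 1, 0), (3, 1, 1, 0), (4, 1, 1, 1)]

a = [7, -3]                               # hypothetical activation chunk
lut = build_lut(PATH, a, len(PATTERNS))
```

Each entry ends up equal to the dot product of its weight pattern with the activation chunk, for example lut[4] holds a[0] - a[1] for the pattern (1, -1).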

Ternary Weight Optimization

1. Computational Complexity Analysis

Bit-Serial Method (Equation 1):

#add_bs = [⌈K/c⌉·c·2^c + M·⌈K/c⌉ + M(⌈K/c⌉-1)]·N

Ternary LUT Method (Equation 2):

#add_ter = [⌈K/c⌉·c·3^c + M(⌈K/c⌉-1)]·N

Platinum Optimized Method (Equation 3):

#add_platinum = [⌈K/c⌉·⌈3^c/2⌉ + M(⌈K/c⌉-1)]·N

Exploits symmetry through mirror consolidation to reduce LUT size and construction cost.
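To get a feel for the three formulas, the sketch below evaluates them exactly as transcribed in this summary, with M, K, N taken to be the m, k, n dimensions from the task definition. The function names are assumptions, and the tile-sized example shape is chosen only for illustration; it also applies the same chunk size c to the bit-serial formula, which the paper may not do.

```python
def _chunks(K, c):
    """Integer form of ceil(K / c)."""
    return -(-K // c)


def adds_bit_serial(M, K, N, c):
    """Equation 1 as transcribed: binary LUT build, bit-plane merge, chunk reduction."""
    g = _chunks(K, c)
    return (g * c * 2**c + M * g + M * (g - 1)) * N


def adds_ternary(M, K, N, c):
    """Equation 2 as transcribed: naive ternary LUT build, chunk reduction."""
    g = _chunks(K, c)
    return (g * c * 3**c + M * (g - 1)) * N


def adds_platinum(M, K, N, c):
    """Equation 3 as transcribed: half-size LUT built with one add per entry."""
    g = _chunks(K, c)
    half_lut = (3**c + 1) // 2            # equals ceil(3^c / 2) since 3^c is odd
    return (g * half_lut + M * (g - 1)) * N


# Illustrative shape only (matches the tile sizes quoted later in this summary).
M, K, N, c = 1080, 520, 32, 5
for name, fn in [("bit-serial", adds_bit_serial),
                 ("ternary", adds_ternary),
                 ("platinum", adds_platinum)]:
    print(f"{name:>10}: {fn(M, K, N, c):,} additions")
```

For this shape the construction term dominates the naive ternary count, while in Equation 3 the reduction term M(⌈K/c⌉-1) becomes the larger share, which is consistent with the claim that path-based construction removes most of the LUT build cost.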

2. Compact Weight Encoding

Problem:

  • 2-bit encoding: Far exceeds theoretical optimum of 1.58 bits
  • Byte storage: Extremely redundant

Solution:

  • Pack every c ternary weights into a base-3 integer
  • Requires ⌈log₂(3^c)⌉ bits
  • Further divided into 1 sign bit and ⌈log₂(3^c)⌉-1 index bits to maintain symmetry
  • Achieves optimality at c=5: 1.6 bits/weight, fitting exactly in one byte (Figure 6)

Index Reordering:

  • Reorder indices based on construction path
  • Ensures sequential LUT entry access
  • Eliminates need for hazard detection hardware
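The sign-plus-index encoding can be illustrated with a small round-trip sketch. It is an assumption-laden stand-in: the index here is simply the position of the pattern in an enumeration, not the construction-path ordering the paper uses, and the names pack_chunk and unpack_chunk are hypothetical.

```python
from itertools import product

C = 5  # chunk size: five ternary weights per byte


def _is_canonical(pattern):
    """Zero pattern, or first nonzero weight is +1."""
    first = next((w for w in pattern if w != 0), 0)
    return first >= 0


# 122 stored patterns for C = 5, so an index fits in 7 bits.
CANONICAL = [p for p in product((0, 1, -1), repeat=C) if _is_canonical(p)]
INDEX_OF = {p: i for i, p in enumerate(CANONICAL)}


def pack_chunk(weights):
    """Encode C ternary weights as one byte: bit 7 = sign, bits 0-6 = index."""
    p = tuple(weights)
    sign = 0
    if not _is_canonical(p):
        p = tuple(-w for w in p)      # store the mirrored pattern instead
        sign = 1
    return (sign << 7) | INDEX_OF[p]


def unpack_chunk(byte):
    """Decode a byte back into C ternary weights."""
    p = CANONICAL[byte & 0x7F]
    return tuple(-w for w in p) if byte >> 7 else p


code = pack_chunk((-1, 0, 1, 1, 0))
assert unpack_chunk(code) == (-1, 0, 1, 1, 0)
print(f"{8 / C} bits per weight")
```

At query time the index bits select a LUT entry and the sign bit decides whether the entry is negated, so the mirrored half of the table never has to be stored.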

System-Level Optimization

1. Parallelism Design

N-Dimension Parallelism:

  • Each PPE processes ncols=8 input columns
  • LUTs are constructed in blocks of ncols, one LUT per input column
  • Each query returns ncols partial sums
  • CACTI 7.0 analysis shows area efficiency decreases for ncols>8

K and N-Dimension Parallelism:

  • L=52 PPEs process L·c × ncols inputs in parallel
  • Partial sums directly stream to accumulators, reducing output buffer pressure

2. Utilization Improvement

Resource Imbalance Issue:

  • Construction phase: 1 adder + 2 LUT ports
  • Query phase: 2 adders + 2 LUT ports

Solution:

  • Configure additional adders to fully support reduction phase
  • LUT port theoretical utilization approaches 100%
  • Average adder utilization 90.5%

3. Data Tiling and Residency Strategy

Tiling Configuration (Design space exploration, Figure 7):

  • m_tiled = 1080
  • k_tiled = 520
  • n_tiled = 32
  • mnk-stationary strategy

On-Chip Storage:

  • 272KB for weight/output/input buffers
  • 52KB for LUT
  • Total 324KB on-chip SRAM
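The storage figures can be cross-checked against the key parameters with a little arithmetic. This is one plausible reading rather than a statement of the paper's floorplan: it assumes each PPE holds ncols tables of 128 one-byte entries, that KB means 1024 bytes, and that ternary weights are stored at one byte per five weights. Output-buffer sizing is left out because the summary does not give the accumulator width.

```python
L_PPES = 52          # number of PPEs
LUT_ENTRIES = 128    # entries per LUT (7 index bits)
NCOLS = 8            # input columns handled per PPE
ENTRY_BYTES = 1      # 8-bit LUT entries

# Assumed layout: one 128-entry table per input column in every PPE.
lut_bytes = L_PPES * LUT_ENTRIES * NCOLS * ENTRY_BYTES
lut_kib = lut_bytes / 1024

M_TILED, K_TILED, N_TILED = 1080, 520, 32
weight_tile_bytes = M_TILED * (K_TILED // 5)   # one byte per 5 ternary weights
input_tile_bytes = K_TILED * N_TILED           # 8-bit activations

print(f"LUT storage: {lut_kib:.0f} KiB")
print(f"weight tile: {weight_tile_bytes} bytes, input tile: {input_tile_bytes} bytes")
```

Under these assumptions the LUT total comes out at exactly 52 KiB, matching the figure quoted above, and k_tiled = 520 is a whole multiple of L·c = 260, so a tile splits evenly across the PPEs.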

Experimental Setup

Datasets and Models

BitNet-b1.58 Model Suite:

  • b1.58-l: 700M parameters
  • b1.58-xl: 1.3B parameters
  • b1.58-3B: 3B parameters

Workloads:

  • Prefill Phase: N=1024 (batch size × sequence length)
  • Decode Phase: N=8
  • M and K dimensions extracted from BitLinear layers

Hardware Modeling Methodology

RTL Implementation:

  • SystemVerilog implementation of PPE
  • Synopsys Design Compiler synthesis
  • ARM standard cell library
  • 28nm technology node
  • 500 MHz frequency

Memory Modeling:

  • On-Chip SRAM: Modeled with CACTI 7.0
  • Off-Chip DRAM: Modeled with DRAMsim3
    • 64GB DDR4 2133R
    • Maximum bandwidth 64GB/s

Simulator:

  • Extended open-source Prosperity simulator
  • Cycle-accurate simulation
  • Captures computational cycles, memory accesses, PE activity

Comparison Baselines

Accelerator     Type  Frequency  Technology  PEs  Area      Throughput
SpikingEyeriss  ASIC  500MHz     28nm        168  1.07mm²   20.8 GOP/s
Prosperity      ASIC  500MHz     28nm        256  1.06mm²   375 GOP/s
T-MAC           CPU   3490MHz    5nm         -    289mm²    715 GOP/s
Platinum        ASIC  500MHz     28nm        416  0.955mm²  1534 GOP/s
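Two quantities can be derived from the baseline table as a sanity check: the throughput ratio of Platinum to each baseline, and throughput per unit area. The sketch below uses only the numbers in the table; the dictionary name and the derived metrics are illustrative, and the area-normalized comparison with T-MAC is not like-for-like because of the 5nm versus 28nm technology gap.

```python
# (throughput in GOP/s, area in mm^2), copied from the table above
THROUGHPUT_AND_AREA = {
    "SpikingEyeriss": (20.8, 1.07),
    "Prosperity": (375.0, 1.06),
    "T-MAC": (715.0, 289.0),
    "Platinum": (1534.0, 0.955),
}

platinum_tput = THROUGHPUT_AND_AREA["Platinum"][0]

# Ratio of Platinum's throughput to each baseline's throughput.
ratio = {name: platinum_tput / tput
         for name, (tput, _) in THROUGHPUT_AND_AREA.items()
         if name != "Platinum"}

# Throughput per mm^2 for every design.
density = {name: tput / area
           for name, (tput, area) in THROUGHPUT_AND_AREA.items()}

for name, r in ratio.items():
    print(f"Platinum vs {name}: {r:.2f}x throughput")
for name, d in density.items():
    print(f"{name}: {d:.1f} GOP/s per mm^2")
```

The throughput ratios land close to the prefill speedups reported in the results (73.6×, 4.09×, 2.15×), which suggests the table's throughput column and the speedup figures describe the same workload.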

Evaluation Metrics

  • Performance: Latency (ms), throughput (GOP/s)
  • Energy Efficiency: Total energy (mJ), energy efficiency ratio
  • Hardware Cost: Chip area (mm²), power (W)

Experimental Results

Chip Area and Power Decomposition

Area Distribution (Total 0.96mm²):

  • On-chip storage, including LUT buffers: 83.3% (weight and activation buffers alone: 65%)
  • Aggregator and PPE (core computation): 15%
  • Other: 1.7%

Power Distribution (b1.58-3B prefill, 3.2W):

  • DRAM access: 53.5%
  • Weight buffer access: 31.6%
  • LUT buffer: Relatively low
  • Other: 14.9%

Key Insights:

  • Storage dominates chip area, highlighting the area efficiency of LUT methods
  • DRAM and weight access are energy bottlenecks, emphasizing the importance of compact weight encoding
  • LUT power overhead is low, validating the efficiency of the LUT computation paradigm

Core-Level Performance Comparison

b1.58-3B Model Performance Improvement (Figures 8, 9):

Prefill Phase (N=1024):

  • vs SpikingEyeriss: 73.6× speedup, 32.4× energy reduction
  • vs Prosperity: 4.09× speedup, 3.23× energy reduction
  • vs T-MAC (16-thread): 2.15× speedup, 20.9× energy reduction
  • vs Platinum-bs (Platinum's own bit-serial mode): 1.4× speedup, 1.34× energy reduction

Decode Phase (N=8):

  • vs SpikingEyeriss: 47.6× speedup, 18.4× energy reduction
  • vs Prosperity: 28.4× speedup, 15.3× energy reduction
  • vs T-MAC: 1.75× speedup, 15.0× energy reduction
  • vs Platinum-bs: 1.3× speedup, 1.31× energy reduction

Performance Advantage Source Analysis

1. Advantages of Offline Path Generation

  • Eliminates runtime scheduling hardware overhead (24% area + 32.3% power in Prosperity)
  • More area available for PEs, increasing throughput
  • Particularly effective for models with uniform weight distribution like BitNet

2. High PE Utilization

  • ncols=8 design ensures utilization under low-N workloads
  • Replicated adders fully utilize LUT ports
  • Prosperity shows insufficient PE utilization under decode loads

3. Ternary Weight Specialized Optimization

  • Additional 1.3-1.4× speedup compared to bit-serial mode
  • 1.6 bits/weight compact encoding
  • Direct table lookup avoids partial sum merging overhead
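The difference between the two execution modes can be made concrete with a small sketch. It assumes a 2-bit two's-complement encoding for the bit-serial mode (so w = b0 - 2·b1) and builds the ternary table naively for clarity; neither detail is confirmed by the summary, and all function names are hypothetical.

```python
from itertools import product


def binary_lut(a):
    """lut[mask] = sum of a[j] over set bits j, one addition per entry."""
    lut = [0] * (1 << len(a))
    for mask in range(1, len(lut)):
        low = mask & -mask                       # lowest set bit
        lut[mask] = lut[mask ^ low] + a[low.bit_length() - 1]
    return lut


def dot_bit_serial(w, a):
    """Two binary lookups (one per bit plane) plus a shift-and-add merge."""
    lut = binary_lut(a)
    plane0 = sum((wi & 1) << j for j, wi in enumerate(w))
    plane1 = sum(((wi >> 1) & 1) << j for j, wi in enumerate(w))
    return lut[plane0] - 2 * lut[plane1]         # merge step


def ternary_half_lut(a):
    """Dot products for patterns whose first nonzero weight is +1 (or all zero)."""
    table = {}
    for p in product((0, 1, -1), repeat=len(a)):
        if next((x for x in p if x), 0) >= 0:
            table[p] = sum(x * y for x, y in zip(p, a))
    return table


def dot_ternary(w, a):
    """One lookup; a mirrored pattern is the negated stored entry."""
    table = ternary_half_lut(a)
    p = tuple(w)
    return table[p] if p in table else -table[tuple(-x for x in p)]


a = [3, -7, 11, 5, -2]                           # hypothetical activations
w = (1, -1, 0, -1, 1)
assert dot_bit_serial(w, a) == dot_ternary(w, a) == sum(x * y for x, y in zip(w, a))
```

Both routes give the same result, but the bit-serial one needs two lookups and a merge per chunk, whereas the ternary one needs a single lookup and at most a negation.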

4. High K-Dimension Parallelism

  • Reduces output data DRAM access frequency
  • Partial sums stream to accumulators

Cross-Model Consistency

Average Improvements Across Three Models (Figure 10):

  • b1.58-l, b1.58-xl, b1.58-3B show consistent performance
  • Significant improvements over baselines in both prefill and decode phases
  • Demonstrates method generality and scalability

Addition Count Optimization Effect

Figure 5 Analysis:

  • Addition count comparison across different LUT sizes (16-128 entries)
  • Platinum achieves minimum addition count across all chunk sizes
  • Advantage most pronounced at c=5 (combined with ternary LUT and mirror consolidation)

Encoding Efficiency

Figure 6 Analysis:

  • Pack size c=5 achieves optimal 1.6 bits/parameter
  • Approaches theoretical optimum of 1.58 bits
  • Far superior to 2-bit encoding (T-MAC, etc.)

Related Work

1. Quantization Techniques

  • Low-Bit Quantization: ANT, Olive, FP8-LM exploring aggressive quantization
  • Weight-Specific Quantization: AWQ, GPTQ, BitNet series
  • BitNet-b1.58: Ternary weights {-1,0,1} balancing efficiency and accuracy

2. LUT-Based Acceleration

  • BIQGEMM: Dynamic programming approach for binary weights
  • Prosperity: Dynamic "shortcut" detection, but high hardware overhead
  • T-MAC: Table lookup method on CPU
  • LUT-GEMM, LUT Tensor Core: Exploring LUT application in low-bit LLMs
  • Bitnet.cpp: CPU implementation with similar weight encoding strategy

Advantages of This Work:

  • First ASIC design decoupling path generation to offline
  • Simultaneously supports generic and precision-specific optimization
  • Lowest hardware overhead, optimal performance

3. Neural Network Accelerators

  • Eyeriss: Energy-efficient DNN accelerator
  • SpinalFlow: Spiking neural network dataflow
  • BitMod: Mixed data-type bit-serial accelerator

This Work's Position: Focuses on LUT-based ASIC for ultra-low-bit weight networks, targeting edge LLM inference

Conclusions and Discussion

Main Conclusions

  1. Platinum Successfully Achieves Efficient LUT-Based Acceleration:
    • Eliminates runtime scheduling overhead through offline path generation
    • Achieves 1534 GOP/s throughput within 0.96mm² chip area
    • Up to 73.6× speedup and 32.4× energy reduction, both measured against SpikingEyeriss; gains over Prosperity (4.09×, 3.23×) and T-MAC (2.15×, 20.9×) are smaller
  2. Effectiveness of Path-Adaptive Design:
    • Supports both generic bit-serial and ternary-optimized modes
    • Ternary optimization provides additional 1.3-1.4× performance improvement
    • Good balance between flexibility and specialization
  3. Edge Deployment Potential:
    • Lightweight modular design
    • High energy efficiency suitable for edge platforms
    • Provides scalable solution for ultra-low-bit neural networks

Limitations

1. Model Applicability Scope

  • Primarily Targets BitNet-Class Models: Uniform weight distribution with most LUT entries utilized
  • Non-Uniform Distribution Limitations: Offline paths may be suboptimal for sparse or non-uniform weight distributions
  • Fixed Chunk Size: c=5 optimized for ternary weights; other bit widths may require adjustment

2. Precision Support

  • Current 8-Bit Activation Limitation: While LUT entries are scalable, higher precision not thoroughly explored
  • Integer Quantization Assumption: Does not support floating-point or mixed-precision activations

3. Memory Bandwidth Bottleneck

  • DRAM Access Consumes 53.5% Power: Room for optimization remains
  • Weight Buffer Access 31.6% Power: Large models may face on-chip storage pressure

4. Generality Trade-offs

  • SFU Is Only Auxiliary: The design focuses on GEMM, with limited support for other operations
  • Requires Offline Encoding: Deployment process adds preprocessing steps

Future Directions

1. Extension to More Models

  • Explore adaptive path generation for non-uniform weight distributions
  • Support more quantization schemes (e.g., 4-bit, mixed-precision)

2. System-Level Optimization

  • Investigate more efficient memory hierarchy structures
  • Explore on-chip compression techniques to further reduce bandwidth requirements

3. Hybrid Dynamic-Static Approaches

  • Introduce lightweight dynamic adjustment while maintaining low overhead
  • Adaptively select paths based on layer characteristics

4. Extension to Other Operations

  • Fully leverage SFU to support complete LLM inference
  • Explore LUT methods in attention mechanisms

In-Depth Evaluation

Strengths

1. Method Innovation ⭐⭐⭐⭐⭐

  • Clear Core Innovation: Combination of offline path generation + adaptive execution is original
  • Solid Theoretical Foundation: MST modeling of LUT construction problem is mathematically elegant
  • Clever Engineering Implementation:
    • Mirror consolidation exploits symmetry
    • Compact encoding approaches theoretical optimum
    • Four-stage pipeline avoids hazards

2. Experimental Comprehensiveness ⭐⭐⭐⭐⭐

  • Comprehensive Baseline Comparison: ASIC (SpikingEyeriss, Prosperity) and CPU (T-MAC)
  • Multi-Model Validation: Three BitNet models of different scales
  • Multi-Scenario Evaluation: Prefill and decode phases
  • Detailed Hardware Modeling: RTL synthesis + CACTI + DRAMsim3
  • Ablation Studies: Platinum vs Platinum-bs validates ternary optimization

3. Result Convincingness ⭐⭐⭐⭐⭐

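  • Significant Performance Gains: Speedups from 2.15× (T-MAC) to 73.6× (SpikingEyeriss) in prefill are well beyond marginal improvement
  • Clear Energy Efficiency Advantage: Energy reductions from 3.23× (Prosperity) to 32.4× (SpikingEyeriss) are critical for edge deployment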
  • Reasonable Hardware Cost: 0.96mm² very compact for 28nm technology
  • Data Transparency: Detailed area and power decomposition provided

4. Writing Clarity ⭐⭐⭐⭐

  • Logical Structure: Background → Method → Experiments flow clearly
  • Rich Figures: At least 10 figures (Figure 10 is referenced above) effectively support arguments
  • Complete Technical Details: Algorithm pseudocode and formula derivations comprehensive
  • Slightly Dense: Some sections information-heavy, requiring careful reading

Weaknesses

1. Method Limitations

  • Rigidity of Offline Paths: Cannot adapt to runtime changes; potentially suboptimal for non-uniform distributions
  • Fixed Chunk Size: c=5 optimized for ternary; limited exploration of other configurations
  • Generalization Insufficiently Verified: Tested only on BitNet; effectiveness on other low-bit models (e.g., 4-bit) unknown

2. Experimental Setup

  • Baseline Fairness Issues:
    • Prosperity scaled to match area, potentially affecting optimal configuration
    • T-MAC on 5nm technology; large technology node difference
    • SpikingEyeriss derives from Eyeriss, a design from an earlier era (2016)
  • Missing GPU Comparison: No comparison with modern GPUs (A100, H100)
  • Single Power Test Scenario: Only reports 3.2W for prefill; decode power not detailed

3. Analysis Depth

  • Adder Utilization: Reports a 90.5% average but lacks detailed analysis
  • Memory Access Patterns: Insufficient exploration of DRAM bandwidth utilization
  • Scalability: L=52 choice lacks sufficient justification; larger system performance unknown
  • Temperature and Reliability: No discussion of thermal design and long-term reliability

4. Practical Considerations

  • Deployment Complexity: Offline encoding and path generation add deployment steps
  • Model Adaptation: Requires path regeneration for different models
  • Open-Source Plans: No mention of code or hardware design open-sourcing; reproducibility questionable

Impact Assessment

1. Academic Contribution ⭐⭐⭐⭐

  • Pioneering Work: First systematic solution to LUT construction overhead in ASIC design
  • Methodological Value: MST modeling can inspire other accelerator designs
  • Citation Potential: Expected high citation in LUT-based acceleration and low-bit inference domains

2. Practical Value ⭐⭐⭐⭐

  • Edge Deployment: 0.96mm² and high energy efficiency ideal for edge AI chips
  • Commercialization Potential: BitNet and ternary models' popularity creates real application scenarios
  • Technology Maturity: Based on mature 28nm technology, can quickly tape out for verification
  • Limitation: Model-specific dependency; generality needs improvement

3. Reproducibility ⭐⭐⭐

  • Sufficient Hardware Details: RTL implementation, synthesis parameters, storage configuration detailed
  • Clear Algorithms: Pseudocode and formulas complete
  • Explicit Tool Chain: Synopsys DC, CACTI 7.0, DRAMsim3 specified
  • Missing Elements:
    • No open-source code or RTL provided
    • Weight encoding implementation details insufficient
    • Complete path generation algorithm not publicly available

Applicable Scenarios

Ideal Scenarios ✅

  1. BitNet-Class Ternary Weight Model Inference: Optimal performance
  2. Edge Device LLM Deployment: Strict area and power constraints
  3. Batch Inference Tasks: Prefill phase advantages pronounced
  4. Uniform Weight Distribution Models: High LUT utilization

Suitable Scenarios ⚠️

  1. Generic Low-Bit (2-4 bit) Integer Weight Models: Supported via bit-serial mode
  2. Medium-Scale Models (1-3B): Within experimental validation range
  3. Fixed Model Inference: Offline optimization fully leveraged

Unsuitable Scenarios ❌

  1. Floating-Point or Mixed-Precision Models: Current design unsupported
  2. Dynamic Weights or Online Learning: Offline paths cannot adapt
  3. Extremely Large Models (>10B): On-chip storage potentially insufficient
  4. Highly Sparse or Non-Uniform Weight Distributions: Low LUT utilization

Insights for the Field

  1. Hardware-Software Co-Design: Balance between offline optimization and runtime execution
  2. Specialization vs. Generality Trade-off: Path switching achieves flexibility
  3. Storage-Centric Design: Importance of memory architecture in LUT methods
  4. Quantization-Hardware Matching: Natural fit between ternary weights and LUTs

Selected References

  1. BitNet-b1.58 [13]: Ma et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"
  2. T-MAC [14]: Wei et al., "T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge"
  3. Prosperity [24]: Wei et al., "Prosperity: Accelerating Spiking Neural Networks via Product Sparsity"
  4. BIQGEMM [18]: Jeon et al., "BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs"
  5. Eyeriss [27]: Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks"

Summary

Platinum represents significant progress in LUT-based neural network accelerator design. By moving path generation offline and combining it with adaptive execution modes, it balances hardware overhead, performance, and energy efficiency well. Speedups of up to 73.6× (over SpikingEyeriss; 4.09× over Prosperity) within a compact 0.96mm² design make it a compelling solution for edge LLM inference.

However, the work has notable limitations: dependence on specific models (BitNet), limited generality, and absence of open-source implementation. Future research can enhance adaptability while maintaining low overhead, extending to broader quantization schemes and model architectures.

Overall, this is a high-quality computer architecture paper with solid technical innovation and comprehensive experimental evaluation, providing a new design paradigm for low-bit neural network acceleration. Recommended for researchers and engineers working on neural network accelerators, quantized inference, and edge AI chip design.