
Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference

Nair, Vellaisamy, Lin et al.

General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform the bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, and extends beyond earlier works by performing rigorous post-synthesis evaluation against a binary MAC design across multiple bitwidths and clock frequencies using the TSMC N5 process node to assess commercial implementation potential. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that 8-bit OzMAC improves all three metrics of area, power, and energy significantly by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). When the 8-bit OzMAC's frequency is scaled to normalize throughput, it still achieves a 30% improvement in both power and energy.


Basic Information

Paper ID: 2402.19376
Title: Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference
Authors: Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
Institutions: Carnegie Mellon University, MediaTek USA Inc.
Category: cs.AR (Computer Architecture)
Publication Date: February 2024
Paper Link: https://arxiv.org/abs/2402.19376

Abstract

This paper presents OzMAC (Omit-zero-MAC), a modified re-implementation of the Bit-Pragmatic (PRA) MAC design that exploits bit sparsity in deep learning inference. Unlike previous work, it conducts rigorous post-synthesis evaluation on the commercial TSMC N5 process across multiple bit-widths and clock frequencies. The study demonstrates high bit sparsity in eight pre-trained INT8 deep learning workloads, with 8-bit OzMAC reducing area, power, and energy by 21%, 70%, and 28%, respectively, relative to a conventional binary MAC.

Research Background and Motivation

Problem Definition

Computational Bottleneck: The multiply-accumulate (MAC) array in general matrix multiplication (GEMM) units is the core computational structure of deep learning accelerators, with its efficiency directly impacting overall performance
Precision Trends: Industry standards are transitioning from 32-bit floating-point (FP32) to 16-bit floating-point (FP16), 8-bit integer (INT8), and even lower precision formats
Energy Efficiency Requirements: Edge inference applications impose strict constraints on area, power consumption, and energy efficiency

Research Motivation

Deep learning models contain substantial bit sparsity, i.e., numerous '0' bits in binary representations
While existing Bit-Pragmatic (PRA) designs propose the concept of exploiting bit sparsity, they lack rigorous evaluation using commercial-grade processes
There is a need to assess the feasibility and benefits of zero-skipping MAC designs in practical commercial implementations

Core Contributions

OzMAC Design: An improved zero-skipping MAC architecture based on PRA that dynamically exploits bit sparsity by skipping zero bits in binary values
Commercial-Grade Evaluation: Rigorous power-performance-area (PPA) evaluation using TSMC N5 (5nm) process technology and commercial design tools
Multi-Dimensional Analysis: Comprehensive evaluation across multiple data precisions (4-bit, 8-bit, 16-bit) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz)
Sparsity Verification: Validation of high bit sparsity across eight deep learning models and demonstration of how to leverage power reduction to increase throughput

Methodology Details

OzMAC Microarchitecture Design

OzMAC comprises three core functional modules:

Oz-encoder (Zero Encoder):
- Finite state machine that tracks the current and next positions of '1' bits in input bit patterns
- Outputs one-hot encoded values capturing the position of '1' bits each clock cycle
- Example: Input '0101₂' is encoded as two one-hot values across two clock cycles: first cycle '0100₂', next cycle '0001₂'
Shifter:
- Determines the shift amount for the second input based on Oz-encoder output
- Unlike PRA's binary shift values, OzMAC employs one-hot representation to simplify shifter hardware
Accumulator:
- Adds the appropriately shifted second input to the accumulator value
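
The three modules above can be sketched as a small behavioral model. This is an illustrative sketch only, not the authors' RTL: the function names `oz_encode` and `ozmac_step` are hypothetical, the first input is assumed to be unsigned, and the encoder is assumed to emit the most significant '1' bit first (consistent with the '0101₂' example). Sign handling and the cost of an all-zero input are not described in this summary, so the model simply spends zero cycles on a zero input.

```python
def oz_encode(x: int, width: int = 8):
    """Yield one one-hot value per '1' bit of x, most significant bit first.

    Assumption: x is an unsigned value that fits in `width` bits.
    """
    for pos in range(width - 1, -1, -1):
        if (x >> pos) & 1:
            yield 1 << pos


def ozmac_step(acc: int, a: int, b: int, width: int = 8):
    """Accumulate a * b into acc, spending one cycle per '1' bit of a.

    Returns (new accumulator value, number of cycles used).
    """
    cycles = 0
    for one_hot in oz_encode(a, width):
        shift = one_hot.bit_length() - 1  # position of the single set bit
        acc += b << shift                 # shifter output added to the accumulator
        cycles += 1
    return acc, cycles


# The example from the text: 0101 is processed as 0100, then 0001.
print([format(v, "04b") for v in oz_encode(0b0101, width=4)])
print(ozmac_step(0, 0b0101, 7, width=4))
```

In this model the cycle count equals the number of '1' bits in the first input, which is why latency (and therefore energy) tracks bit sparsity.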

Technical Innovations

Zero-Skipping Mechanism: Performs computation only on '1' bits while skipping '0' bits, reducing computational cycles
Shifter Optimization: Uses one-hot encoded input to simplify shifter gate complexity
Serial Computation: Trades latency for lower area and power consumption

Experimental Setup

Evaluation Framework

Process Technology: TSMC N5 (5nm) commercial process
Design Tools: Synopsys VCS, SpyGlass, Design Compiler, PrimeTime PX
Verification Method: SystemVerilog RTL design, gate-level netlist simulation, SAIF dump for precise power calculation

Datasets and Models

Eight pre-trained quantized INT8 models from PyTorch Torchvision library:

MobileNetV2, MobileNetV3
InceptionV3, ShuffleNetV2
GoogleNet, ResNet18, ResNet50, ResNeXt101

Evaluation Metrics

Area: Chip area (μm²)
Power: Dynamic power consumption (mW)
Latency: Computational latency (ns)
Energy: Energy per operation (pJ)

Test Configurations

Precision Configurations: 4×4, 4×8, 8×8, 8×16, 16×16 bits
Frequency Range: 500 MHz, 1 GHz, 1.5 GHz
Baseline Comparison: Traditional bit-parallel bMAC design

Experimental Results

Bit Sparsity Analysis

Model	Average '1' Bits	Bit Sparsity Percentage
MobileNetV2	2.334	70.83%
MobileNetV3	1.711	78.61%
InceptionV3	2.430	69.62%
ShuffleNetV2	2.583	67.71%
GoogleNet	2.461	69.24%
ResNet18	2.398	70.02%
ResNet50	2.495	68.81%
ResNeXt101	2.289	71.39%

Bit sparsity ranges from 67.71% (ShuffleNetV2) to 78.61% (MobileNetV3), with most models close to 70%.
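
The two columns of the table are related by sparsity = 1 - (average '1' bits / 8). A minimal sketch of how such a measurement might be taken is shown below. The helper `bit_sparsity` is hypothetical, and it counts set bits in whatever bit pattern it is given; how the paper represents signed INT8 values before counting is not stated in this summary, so the example uses non-negative values only.

```python
def bit_sparsity(values, width=8):
    """Return (average '1' bits per value, bit sparsity in percent).

    Assumption: each value is interpreted as a raw `width`-bit pattern.
    """
    mask = (1 << width) - 1
    ones = [bin(v & mask).count("1") for v in values]
    avg_ones = sum(ones) / len(ones)
    return avg_ones, 100.0 * (1.0 - avg_ones / width)


# Toy input, not data from the paper.
avg_ones, sparsity = bit_sparsity([0b00000101, 0, 0b11111111, 0b00000001])
print(avg_ones, sparsity)

# Consistency check against one table row: 2.334 average '1' bits for MobileNetV2.
print(100.0 * (1.0 - 2.334 / 8))
```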

Primary PPA Results (8-bit, 500 MHz)

MAC Hardware	Area (μm²)	Power (mW)	Latency (ns)	Energy (pJ)
bMAC	25.361	0.084	2	0.167
OzMAC	19.996	0.025	4.76	0.120
Improvement Percentage	21.2%	69.7%	-	28.0%
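
The table entries can be cross-checked with simple arithmetic, since energy per operation is power multiplied by latency (mW × ns = pJ). The sketch below recomputes the figures from the rounded values shown in the table; the dictionary and function names are illustrative. Recomputing from rounded inputs gives a power improvement of about 70.2% rather than the reported 69.7%, a gap that presumably comes from rounding in the tabulated power values.

```python
# Values copied from the 8-bit, 500 MHz table above.
BMAC = {"area": 25.361, "power": 0.084, "latency": 2.0, "energy": 0.167}
OZMAC = {"area": 19.996, "power": 0.025, "latency": 4.76, "energy": 0.120}


def improvement(baseline: float, candidate: float) -> float:
    """Percentage reduction of candidate relative to baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Energy (pJ) recomputed as power (mW) x latency (ns).
bmac_energy = BMAC["power"] * BMAC["latency"]
ozmac_energy = OZMAC["power"] * OZMAC["latency"]

for metric in ("area", "power", "energy"):
    print(metric, round(improvement(BMAC[metric], OZMAC[metric]), 1))
print("recomputed energy (pJ):", round(bmac_energy, 3), round(ozmac_energy, 3))
```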

Precision Scaling Analysis

Results across different precision configurations show:

Best Area Improvement: 31.7% achieved in 8×16 configuration
Best Energy Improvement: 45% achieved in mixed-precision 4×8 and 8×16 configurations
Critical Point: Energy improvement disappears in 16×16 configuration (-1.2%)

Frequency Scaling Analysis

Iso-Frequency Evaluation: Across the 500 MHz to 1.5 GHz range, OzMAC consistently maintains approximately 70% power improvement and 29% energy improvement
Iso-Latency Evaluation: After frequency scaling to match throughput, OzMAC still achieves:
- INT4 designs: 29% power/energy improvement
- INT8 designs: 30% power/energy improvement
- Mixed-precision designs: up to 46% improvement
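
The INT8 iso-latency figure can be approximated from the 8-bit, 500 MHz results with a first-order model. The sketch below assumes that power grows linearly with clock frequency, which ignores leakage and any re-synthesis effects at the higher clock, so it is a rough plausibility check and not the paper's measurement procedure. Variable names are illustrative.

```python
# Values from the 8-bit, 500 MHz results (power in mW, latency in ns).
BMAC_POWER, BMAC_LATENCY = 0.084, 2.0
OZMAC_POWER, OZMAC_LATENCY = 0.025, 4.76

# Clock speed-up OzMAC needs in order to match bMAC latency.
freq_scale = OZMAC_LATENCY / BMAC_LATENCY

# Assumption: power scales linearly with frequency.
scaled_power = OZMAC_POWER * freq_scale       # mW at the faster clock
scaled_energy = scaled_power * BMAC_LATENCY   # pJ per MAC at matched latency

bmac_energy = BMAC_POWER * BMAC_LATENCY
power_gain = 100.0 * (BMAC_POWER - scaled_power) / BMAC_POWER
energy_gain = 100.0 * (bmac_energy - scaled_energy) / bmac_energy

print(round(freq_scale, 2), round(power_gain, 1), round(energy_gain, 1))
```

Under this model the power and energy improvements coincide at roughly 29%, close to the 30% reported for INT8 designs, because both designs finish an operation in the same time once latency is matched.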

Key Findings

Energy Efficiency Threshold: OzMAC requires at least 58% bit sparsity to maintain superior energy efficiency compared to bMAC
Practical Sparsity: All tested DL models exceed this threshold
Scaling Characteristics: Power scales linearly with frequency, while energy efficiency remains essentially constant
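
The 58% threshold is consistent with a back-of-the-envelope break-even calculation on the 8-bit, 500 MHz numbers. The sketch below assumes that OzMAC spends one clock cycle per '1' bit and that its power does not vary with sparsity; these are simplifying assumptions for illustration, and the paper may derive the threshold differently.

```python
BMAC_ENERGY_PJ = 0.167   # bMAC energy per MAC (8-bit, 500 MHz)
OZMAC_POWER_MW = 0.025   # OzMAC power (8-bit, 500 MHz)
CLOCK_PERIOD_NS = 2.0    # 500 MHz clock
WIDTH = 8

# Reported bit sparsity (%) of the eight models.
REPORTED_SPARSITY = [70.83, 78.61, 69.62, 67.71, 69.24, 70.02, 68.81, 71.39]


def break_even_sparsity(bmac_energy, ozmac_power, period, width):
    """Sparsity (%) at which OzMAC energy equals bMAC energy.

    Assumption: OzMAC energy = power x period x (number of '1' bits).
    """
    break_even_ones = bmac_energy / (ozmac_power * period)
    return 100.0 * (1.0 - break_even_ones / width)


threshold = break_even_sparsity(BMAC_ENERGY_PJ, OZMAC_POWER_MW, CLOCK_PERIOD_NS, WIDTH)
print(round(threshold, 2))
print(all(s > threshold for s in REPORTED_SPARSITY))
```

The break-even point falls at about 3.34 '1' bits per 8-bit value, and every model in the sparsity table sits above the resulting threshold.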

Related Work

This paper builds upon the following related research:

Bit-Pragmatic (PRA): Original bit-pragmatic deep neural network computation method
Bit-Tactical: Software/hardware approach exploiting value and bit sparsity
STRIPES: Bit-serial deep neural network computation
Bit Fusion: Bit-level dynamically composable architecture

The primary distinction of this work is rigorous evaluation using the latest commercial process technology and extension to multiple precision and frequency configurations.

Conclusions and Discussion

Main Conclusions

Significant Improvements: At 8-bit precision and 500 MHz, OzMAC reduces area, power, and energy by 21%, 70%, and 28%, respectively, relative to the traditional bMAC
Commercial Feasibility: Evaluation using TSMC N5 process demonstrates commercial implementation viability
Scaling Advantages: Maintains advantages across multiple precision and frequency configurations
Throughput Matching: Frequency scaling enables matching or exceeding bMAC throughput while maintaining energy efficiency advantages

Limitations

Latency Overhead: OzMAC's multi-cycle latency may not be suitable for latency-sensitive applications
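Precision Limitations: The energy advantage disappears in the 16×16 configuration (-1.2% energy improvement)
Sparsity Dependency: Latency and energy depend on input bit sparsity; below roughly 58% sparsity, OzMAC loses its energy advantage over bMAC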
Missing System-Level Evaluation: Evaluation at actual DLA system level remains incomplete

Future Directions

System-Level Integration: Evaluate large OzMAC arrays in actual DLA implementations
Adaptive Design: Dynamically adjust configurations based on runtime sparsity
Hybrid Architecture: Combined design incorporating both OzMAC and traditional MAC

In-Depth Evaluation

Strengths

Rigorous Evaluation: Comprehensive evaluation using a commercial-grade process and tools, which lends credibility to the results
Multi-Dimensional Analysis: Systematic analysis across precision and frequency dimensions
Practical Value: Validates bit sparsity existence in actual DL models
Clear Presentation: Technical details are clearly described with complete experimental setup

Weaknesses

Limited Innovation: Primarily engineering implementation and evaluation of existing PRA design with relatively limited technical novelty
Limited Application Scope: Only applicable to workloads with high bit sparsity
Insufficient System Considerations: Lacks consideration of memory bandwidth, data flow, and other system-level factors
Limited Comparisons: Primarily compares against baseline bMAC, lacking comparison with other advanced MAC designs

Impact

Engineering Value: Provides valuable reference data for commercial DLA design
Methodological Contribution: Establishes rigorous MAC design evaluation framework
Practical Guidance: Provides feasible hardware optimization solutions for low-precision inference applications

Applicable Scenarios

Edge Inference: Power and area-constrained edge AI applications
Low-Precision Computing: Deep learning inference at 8-bit precision and below
Sparse Models: Neural network models with high bit sparsity characteristics
Mass Production: Large-scale deployment scenarios requiring commercial-grade process validation

References

Sze, V., et al. "Efficient processing of deep neural networks." Synthesis Lectures on Computer Architecture, 2020.
Albericio, J., et al. "Bit-pragmatic deep neural network computing." MICRO, 2017.
Delmas Lascorz, A., et al. "Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks." ASPLOS, 2019.
Judd, P., et al. "Stripes: Bit-serial deep neural network computing." MICRO, 2016.
Sharma, H., et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." ISCA, 2018.

This paper provides important engineering validation for the commercialization of zero-skipping MAC design. While technical innovation is limited, its rigorous evaluation methodology and practical results hold significant value for advancing the development of low-power AI accelerators.