Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference

Nair, Vellaisamy, Lin et al.
General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform the bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, and extends beyond earlier works by performing rigorous post-synthesis evaluation against a binary MAC design across multiple bitwidths and clock frequencies using the TSMC N5 process node to assess commercial implementation potential. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that 8-bit OzMAC improves all three metrics of area, power, and energy significantly, by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). When the 8-bit OzMAC's frequency is scaled to normalize throughput, it still achieves a 30% improvement in both power and energy.

Basic Information

  • Paper ID: 2402.19376
  • Title: Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference
  • Authors: Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
  • Institutions: Carnegie Mellon University, MediaTek USA Inc.
  • Category: cs.AR (Computer Architecture)
  • Publication Date: February 2024
  • Paper Link: https://arxiv.org/abs/2402.19376

Abstract

This paper proposes OzMAC (Omit-zero-MAC), a modified re-implementation of the Bit-Pragmatic (PRA) MAC design that exploits bit sparsity in deep learning inference. Unlike previous work, this paper conducts rigorous post-synthesis evaluation using commercial-grade TSMC N5 process technology across multiple bit-widths and clock frequencies. The study demonstrates high bit sparsity in eight pre-trained INT8 deep learning workloads, with 8-bit OzMAC improving area, power, and energy by 21%, 70%, and 28%, respectively, relative to a binary MAC baseline.

Research Background and Motivation

Problem Definition

  1. Computational Bottleneck: The multiply-accumulate (MAC) array in general matrix multiplication (GEMM) units is the core computational structure of deep learning accelerators, with its efficiency directly impacting overall performance
  2. Precision Trends: Industry standards are transitioning from 32-bit floating-point (FP32) to 16-bit floating-point (FP16), 8-bit integer (INT8), and even lower precision formats
  3. Energy Efficiency Requirements: Edge inference applications impose strict constraints on area, power consumption, and energy efficiency

Research Motivation

  • Deep learning models contain substantial bit sparsity, i.e., numerous '0' bits in binary representations
  • While existing Bit-Pragmatic (PRA) designs propose the concept of exploiting bit sparsity, they lack rigorous evaluation using commercial-grade processes
  • There is a need to assess the feasibility and benefits of zero-skipping MAC designs in practical commercial implementations

Core Contributions

  1. OzMAC Design: An improved zero-skipping MAC architecture based on PRA that dynamically exploits bit sparsity by skipping zero bits in binary values
  2. Commercial-Grade Evaluation: Rigorous power-performance-area (PPA) evaluation using TSMC N5 (5nm) process technology and commercial design tools
  3. Multi-Dimensional Analysis: Comprehensive evaluation across multiple data precisions (4-bit, 8-bit, 16-bit) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz)
  4. Sparsity Verification: Validation of high bit sparsity across eight deep learning models and demonstration of how to leverage power reduction to increase throughput

Methodology Details

OzMAC Microarchitecture Design

OzMAC comprises three core functional modules:

  1. Oz-encoder (Zero Encoder):
    • Finite state machine that tracks the current and next positions of '1' bits in input bit patterns
    • Outputs one-hot encoded values capturing the position of '1' bits each clock cycle
    • Example: Input '0101₂' is encoded as two one-hot values across two clock cycles: first cycle '0100₂', next cycle '0001₂'
  2. Shifter:
    • Determines the shift amount for the second input based on Oz-encoder output
    • Unlike PRA's binary shift values, OzMAC employs one-hot representation to simplify shifter hardware
  3. Accumulator:
    • Adds the appropriately shifted second input to the accumulator value
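
The three modules above can be sketched as a small behavioral model. This is an illustrative sketch only, not the authors' SystemVerilog RTL: the function names (oz_encode, one_hot_shift, ozmac) are hypothetical, operands are assumed to be unsigned, and the encoder is assumed to emit '1' positions from most significant to least significant, as in the '0101₂' example.

```python
from typing import Iterator, Tuple


def oz_encode(value: int, width: int = 8) -> Iterator[int]:
    """Yield one one-hot word per '1' bit of `value`, most significant first.

    Each yielded word stands for one clock cycle of the (assumed) encoder.
    """
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    for pos in range(width - 1, -1, -1):
        if (value >> pos) & 1:
            yield 1 << pos


def one_hot_shift(operand: int, one_hot: int) -> int:
    """Shift `operand` left by the bit position encoded in a one-hot word."""
    # Multiplying by a one-hot word is the same as shifting by its position.
    return operand * one_hot


def ozmac(acc: int, a: int, b: int, width: int = 8) -> Tuple[int, int]:
    """Accumulate a * b into acc; return (new accumulator, cycles used)."""
    cycles = 0
    for one_hot in oz_encode(a, width):
        acc += one_hot_shift(b, one_hot)  # one shifted add per '1' bit of a
        cycles += 1
    return acc, cycles


# '0101' has two '1' bits, so the product with 7 takes two cycles.
print([format(w, "04b") for w in oz_encode(0b0101, width=4)])
print(ozmac(0, 0b0101, 7, width=4))
```

In this model the cycle count equals the number of '1' bits in the first operand, which is the property the zero-skipping design relies on.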

Technical Innovations

  1. Zero-Skipping Mechanism: Performs computation only on '1' bits while skipping '0' bits, reducing computational cycles
  2. Shifter Optimization: Uses one-hot encoded input to simplify shifter gate complexity
  3. Serial Computation: Trades latency for lower area and power consumption
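
To make the cycle savings concrete, the sketch below compares two simplified cycle models for a short operand stream. Both models are assumptions for illustration: a fixed-width bit-serial unit is taken to spend one cycle per bit position, and a zero-skipping unit one cycle per '1' bit. Real hardware may still spend a cycle on an all-zero operand, which this model ignores. The function names and sample values are hypothetical.

```python
from typing import Sequence


def bit_serial_cycles(values: Sequence[int], width: int = 8) -> int:
    """Cycles if every bit position is processed, zero or not (assumed model)."""
    return width * len(values)


def zero_skipping_cycles(values: Sequence[int]) -> int:
    """Cycles if only '1' bits are processed (assumed model, unsigned values)."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch expects unsigned operands")
    return sum(bin(v).count("1") for v in values)


# Hypothetical 8-bit operands with 2, 1, and 3 set bits.
operands = [0b00000101, 0b00010000, 0b01100001]
print(bit_serial_cycles(operands), zero_skipping_cycles(operands))
```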

Experimental Setup

Evaluation Framework

  • Process Technology: TSMC N5 (5nm) commercial process
  • Design Tools: Synopsys VCS, SpyGlass, Design Compiler, PrimeTime PX
  • Verification Method: SystemVerilog RTL design, gate-level netlist simulation, SAIF dump for precise power calculation

Datasets and Models

Eight pre-trained quantized INT8 models from PyTorch Torchvision library:

  • MobileNetV2, MobileNetV3
  • InceptionV3, ShuffleNetV2
  • GoogleNet, ResNet18, ResNet50, ResNeXt101

Evaluation Metrics

  • Area: Chip area (μm²)
  • Power: Dynamic power consumption (mW)
  • Latency: Computational latency (ns)
  • Energy: Energy per operation (pJ)

Test Configurations

  1. Precision Configurations: 4×4, 4×8, 8×8, 8×16, 16×16 bits
  2. Frequency Range: 500 MHz, 1 GHz, 1.5 GHz
  3. Baseline Comparison: Traditional bit-parallel bMAC design

Experimental Results

Bit Sparsity Analysis

Model          Average '1' Bits    Bit Sparsity Percentage
MobileNetV2    2.334               70.83%
MobileNetV3    1.711               78.61%
InceptionV3    2.430               69.62%
ShuffleNetV2   2.583               67.71%
GoogleNet      2.461               69.24%
ResNet18       2.398               70.02%
ResNet50       2.495               68.81%
ResNeXt101     2.289               71.39%

Bit sparsity ranges from 67.71% (ShuffleNetV2) to 78.61% (MobileNetV3), with most models close to 70%.
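
The two table columns are related by sparsity = 1 - (average '1' bits / bit width). The sketch below shows that relation and one possible way to measure it on raw values. How the paper encodes negative values before counting bits is not stated in this summary, so counting '1' bits in the low 8 bits of each value is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def sparsity_from_avg_ones(avg_ones: float, width: int = 8) -> float:
    """Bit sparsity in percent, given the average number of '1' bits."""
    return 100.0 * (1.0 - avg_ones / width)


def bit_sparsity(values: Iterable[int], width: int = 8) -> Tuple[float, float]:
    """Return (average '1' bits, sparsity %) over raw `width`-bit patterns."""
    mask = (1 << width) - 1
    ones = 0
    count = 0
    for v in values:
        ones += bin(v & mask).count("1")  # assumption: count raw bit pattern
        count += 1
    if count == 0:
        raise ValueError("no values given")
    avg_ones = ones / count
    return avg_ones, sparsity_from_avg_ones(avg_ones, width)


# Reproduces the MobileNetV2 row: 2.334 average '1' bits out of 8.
print(round(sparsity_from_avg_ones(2.334), 3))
```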

Primary PPA Results (8-bit, 500 MHz)

MAC Hardware             Area (μm²)    Power (mW)    Latency (ns)    Energy (pJ)
bMAC                     25.361        0.084         2               0.167
OzMAC                    19.996        0.025         4.76            0.120
Improvement Percentage   21.2%         69.7%         -               28.0%
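
The improvement row can be cross-checked from the other two rows, and energy is consistent with power × latency (mW × ns = pJ). The sketch below redoes that arithmetic from the rounded table values; the variable and function names are hypothetical. Recomputed figures differ slightly from the reported ones (for example, about 70.2% versus 69.7% for power), presumably because the table entries are rounded.

```python
def improvement_pct(baseline: float, candidate: float) -> float:
    """Percentage reduction of `candidate` relative to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


def energy_pj(power_mw: float, latency_ns: float) -> float:
    """Energy in pJ, since 1 mW * 1 ns = 1 pJ."""
    return power_mw * latency_ns


# Rounded values copied from the 8-bit, 500 MHz table.
bmac_ppa = {"area": 25.361, "power": 0.084, "latency": 2.0, "energy": 0.167}
ozmac_ppa = {"area": 19.996, "power": 0.025, "latency": 4.76, "energy": 0.120}

for metric in ("area", "power", "energy"):
    pct = improvement_pct(bmac_ppa[metric], ozmac_ppa[metric])
    print(f"{metric}: {pct:.1f}% improvement")

print(energy_pj(bmac_ppa["power"], bmac_ppa["latency"]))
print(energy_pj(ozmac_ppa["power"], ozmac_ppa["latency"]))
```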

Precision Scaling Analysis

Results across different precision configurations show:

  • Best Area Improvement: 31.7% achieved in 8×16 configuration
  • Best Energy Improvement: 45% achieved in mixed-precision 4×8 and 8×16 configurations
  • Break-Even Point: The energy advantage disappears in the 16×16 configuration, where OzMAC is slightly worse than bMAC (-1.2%)

Frequency Scaling Analysis

  1. Iso-Frequency Evaluation: Across the 500 MHz to 1.5 GHz range, OzMAC consistently maintains approximately 70% power improvement and 29% energy improvement
  2. Iso-Latency Evaluation: After frequency scaling to match throughput, OzMAC still achieves:
    • INT4 designs: 29% power/energy improvement
    • INT8 designs: 30% power/energy improvement
    • Mixed-precision designs: up to 46% improvement
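
The INT8 iso-latency figure can be approximated from the 8-bit, 500 MHz results with a back-of-the-envelope estimate. The sketch assumes that OzMAC power scales linearly with clock frequency and that the frequency is raised just enough for OzMAC's average latency (4.76 ns) to match bMAC's (2 ns). It is an approximation, not the paper's synthesis flow, and the function name is hypothetical.

```python
from typing import Tuple


def iso_latency_estimate(
    oz_power_mw: float,
    oz_latency_ns: float,
    bmac_power_mw: float,
    bmac_latency_ns: float,
) -> Tuple[float, float, float]:
    """Return (frequency scale, scaled OzMAC power in mW, improvement %)."""
    scale = oz_latency_ns / bmac_latency_ns  # speed-up needed to match latency
    scaled_power = oz_power_mw * scale       # assumption: power is linear in f
    improvement = 100.0 * (bmac_power_mw - scaled_power) / bmac_power_mw
    return scale, scaled_power, improvement


scale, power, gain = iso_latency_estimate(0.025, 4.76, 0.084, 2.0)
print(f"scale {scale:.2f}x -> {0.5 * scale:.2f} GHz, "
      f"{power:.4f} mW, {gain:.1f}% improvement")
```

Under these assumptions the estimate lands at roughly 29%, close to the reported 30%. Because latency is equal by construction, the power and energy improvements coincide.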

Key Findings

  1. Energy Efficiency Threshold: OzMAC requires at least 58% bit sparsity to maintain superior energy efficiency compared to bMAC
  2. Practical Sparsity: All tested DL models exceed this threshold
  3. Scaling Characteristics: Power scales linearly with frequency, while energy efficiency remains essentially constant
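
The 58% threshold can be approximately reproduced from the 8-bit, 500 MHz table values. The sketch assumes that OzMAC spends one 2 ns cycle per '1' bit and draws a constant 0.025 mW regardless of sparsity; both are simplifications, and the function name is hypothetical.

```python
def break_even_sparsity(
    bmac_energy_pj: float,
    oz_power_mw: float,
    clock_period_ns: float,
    width: int = 8,
) -> float:
    """Sparsity (%) at which OzMAC energy equals bMAC energy (assumed model)."""
    energy_per_cycle_pj = oz_power_mw * clock_period_ns  # mW * ns = pJ
    max_cycles = bmac_energy_pj / energy_per_cycle_pj    # '1' bits at parity
    return 100.0 * (1.0 - max_cycles / width)


# bMAC: 0.167 pJ per operation; OzMAC: 0.025 mW at a 2 ns clock period.
print(round(break_even_sparsity(0.167, 0.025, 2.0), 2))
```

With these inputs, OzMAC can afford about 3.34 cycles per operation before it matches bMAC's energy, which corresponds to roughly 58% bit sparsity for 8-bit operands.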

Related Work

This paper builds upon the following related research:

  1. Bit-Pragmatic (PRA): Original bit-pragmatic deep neural network computation method
  2. Bit-Tactical: Software/hardware approach exploiting value and bit sparsity
  3. STRIPES: Bit-serial deep neural network computation
  4. Bit Fusion: Bit-level dynamically composable architecture

The primary distinction of this work is rigorous evaluation using the latest commercial process technology and extension to multiple precision and frequency configurations.

Conclusions and Discussion

Main Conclusions

  1. Significant Improvements: OzMAC achieves significant improvements in area, power consumption, and energy efficiency relative to traditional bMAC
  2. Commercial Feasibility: Evaluation using TSMC N5 process demonstrates commercial implementation viability
  3. Scaling Advantages: Maintains advantages across multiple precision and frequency configurations
  4. Throughput Matching: Frequency scaling enables matching or exceeding bMAC throughput while maintaining energy efficiency advantages

Limitations

  1. Latency Overhead: OzMAC's multi-cycle latency may not be suitable for latency-sensitive applications
  2. Precision Limitations: The energy advantage disappears at 16×16 precision (-1.2%), the highest precision evaluated
  3. Sparsity Dependency: Performance heavily depends on input data bit sparsity
  4. Missing System-Level Evaluation: Evaluation at actual DLA system level remains incomplete

Future Directions

  1. System-Level Integration: Evaluate large OzMAC arrays in actual DLA implementations
  2. Adaptive Design: Dynamically adjust configurations based on runtime sparsity
  3. Hybrid Architecture: Combined design incorporating both OzMAC and traditional MAC

In-Depth Evaluation

Strengths

  1. Rigorous Evaluation: Comprehensive evaluation using commercial-grade process and tools with high result credibility
  2. Multi-Dimensional Analysis: Systematic analysis across precision and frequency dimensions
  3. Practical Value: Validates bit sparsity existence in actual DL models
  4. Clear Presentation: Technical details are clearly described with complete experimental setup

Weaknesses

  1. Limited Innovation: Primarily engineering implementation and evaluation of existing PRA design with relatively limited technical novelty
  2. Limited Application Scope: Only applicable to workloads with high bit sparsity
  3. Insufficient System Considerations: Lacks consideration of memory bandwidth, data flow, and other system-level factors
  4. Limited Comparisons: Primarily compares against baseline bMAC, lacking comparison with other advanced MAC designs

Impact

  1. Engineering Value: Provides valuable reference data for commercial DLA design
  2. Methodological Contribution: Establishes rigorous MAC design evaluation framework
  3. Practical Guidance: Provides feasible hardware optimization solutions for low-precision inference applications

Applicable Scenarios

  1. Edge Inference: Power and area-constrained edge AI applications
  2. Low-Precision Computing: Deep learning inference at 8-bit precision and below
  3. Sparse Models: Neural network models with high bit sparsity characteristics
  4. Mass Production: Large-scale deployment scenarios requiring commercial-grade process validation

References

  1. Sze, V., et al. "Efficient processing of deep neural networks." Synthesis Lectures on Computer Architecture, 2020.
  2. Albericio, J., et al. "Bit-pragmatic deep neural network computing." MICRO, 2017.
  3. Delmas Lascorz, A., et al. "Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks." ASPLOS, 2019.
  4. Judd, P., et al. "Stripes: Bit-serial deep neural network computing." MICRO, 2016.
  5. Sharma, H., et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." ISCA, 2018.

This paper provides important engineering validation for the commercialization of zero-skipping MAC design. While technical innovation is limited, its rigorous evaluation methodology and practical results hold significant value for advancing the development of low-power AI accelerators.