General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform the bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, and extends beyond earlier work by performing a rigorous post-synthesis evaluation against a binary MAC design across multiple bitwidths and clock frequencies, using the TSMC N5 process node to assess commercial implementation potential. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that the 8-bit OzMAC improves area, power, and energy by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precision (4, 8, 16 bits) and clock frequency (0.5 GHz, 1 GHz, 1.5 GHz). When the 8-bit OzMAC's frequency is scaled to normalize throughput, it still achieves a 30% improvement in both power and energy.
- Paper ID: 2402.19376
- Title: Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference
- Authors: Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
- Institutions: Carnegie Mellon University, MediaTek USA Inc.
- Category: cs.AR (Computer Architecture)
- Publication Date: February 2024
- Paper Link: https://arxiv.org/abs/2402.19376
This paper proposes OzMAC (Omit-zero-MAC), an improved implementation of the Bit-Pragmatic (PRA) MAC design specifically designed to exploit bit sparsity in deep learning inference. Unlike previous work, this paper conducts rigorous post-synthesis evaluation using commercial-grade TSMC N5 process technology across multiple bit-widths and clock frequencies. The study demonstrates high bit sparsity in eight pre-trained INT8 deep learning workloads, with 8-bit OzMAC achieving significant improvements of 21%, 70%, and 28% in area, power consumption, and energy efficiency, respectively.
- Computational Bottleneck: The multiply-accumulate (MAC) array in general matrix multiplication (GEMM) units is the core computational structure of deep learning accelerators, with its efficiency directly impacting overall performance
- Precision Trends: Industry standards are transitioning from 32-bit floating-point (FP32) to 16-bit floating-point (FP16), 8-bit integer (INT8), and even lower precision formats
- Energy Efficiency Requirements: Edge inference applications impose strict constraints on area, power consumption, and energy efficiency
- Deep learning models contain substantial bit sparsity, i.e., numerous '0' bits in binary representations
- While existing Bit-Pragmatic (PRA) designs propose the concept of exploiting bit sparsity, they lack rigorous evaluation using commercial-grade processes
- There is a need to assess the feasibility and benefits of zero-skipping MAC designs in practical commercial implementations
- OzMAC Design: An improved zero-skipping MAC architecture based on PRA that dynamically exploits bit sparsity by skipping zero bits in binary values
- Commercial-Grade Evaluation: Rigorous power-performance-area (PPA) evaluation using TSMC N5 (5nm) process technology and commercial design tools
- Multi-Dimensional Analysis: Comprehensive evaluation across multiple data precisions (4-bit, 8-bit, 16-bit) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz)
- Sparsity Verification: Validation of high bit sparsity across eight deep learning models and demonstration of how to leverage power reduction to increase throughput
OzMAC comprises three core functional modules:
- Oz-encoder (Zero Encoder):
- Finite state machine that tracks the current and next positions of '1' bits in input bit patterns
- Outputs one-hot encoded values capturing the position of '1' bits each clock cycle
- Example: Input '0101₂' is encoded as two one-hot values across two clock cycles: first cycle '0100₂', next cycle '0001₂'
- Shifter:
- Determines the shift amount for the second input based on Oz-encoder output
- Unlike PRA's binary shift values, OzMAC employs one-hot representation to simplify shifter hardware
- Accumulator:
- Adds the appropriately shifted second input to the accumulator value
- Zero-Skipping Mechanism: Performs computation only on '1' bits while skipping '0' bits, reducing computational cycles
- Shifter Optimization: Uses one-hot encoded input to simplify shifter gate complexity
- Serial Computation: Trades latency for lower area and power consumption
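The encoder, shifter, and accumulator described above can be sketched as a small behavioral model. This is an illustration, not the paper's RTL: the function names `oz_encode` and `oz_mac` are hypothetical, operands are assumed unsigned, and the MSB-first order is taken from the '0101₂' example. In this simplified model a zero operand takes zero cycles, which real hardware may not match.

```python
from typing import Iterator, Tuple


def oz_encode(value: int, width: int) -> Iterator[int]:
    """Yield one one-hot word per '1' bit of `value`, MSB first (one per cycle)."""
    for pos in range(width - 1, -1, -1):
        one_hot = 1 << pos
        if value & one_hot:
            yield one_hot


def oz_mac(acc: int, a: int, b: int, width: int = 8) -> Tuple[int, int]:
    """Accumulate a*b by shift-and-add over the '1' bits of `a` only.

    Returns (new accumulator, cycles used). Unsigned operands are assumed.
    """
    cycles = 0
    for one_hot in oz_encode(a, width):
        shift = one_hot.bit_length() - 1  # bit position selected by the one-hot word
        acc += b << shift                 # shifter output feeds the accumulator
        cycles += 1                       # one cycle per '1' bit; '0' bits are skipped
    return acc, cycles


if __name__ == "__main__":
    print([format(w, "04b") for w in oz_encode(0b0101, 4)])
    print(oz_mac(0, 0b0101, 7, width=4))
```

The cycle count equals the number of '1' bits in the first operand, which is why latency and energy in this style of design depend on the bit sparsity of the data.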
- Process Technology: TSMC N5 (5nm) commercial process
- Design Tools: Synopsys VCS, SpyGlass, Design Compiler, PrimeTime PX
- Verification Method: SystemVerilog RTL design, gate-level netlist simulation, SAIF dump for precise power calculation
Eight pre-trained quantized INT8 models from PyTorch Torchvision library:
- MobileNetV2, MobileNetV3
- InceptionV3, ShuffleNetV2
- GoogleNet, ResNet18, ResNet50, ResNeXt101
- Area: Chip area (μm²)
- Power: Dynamic power consumption (mW)
- Latency: Computational latency (ns)
- Energy: Energy per operation (pJ)
- Precision Configurations: 4×4, 4×8, 8×8, 8×16, 16×16 bits
- Frequency Range: 500 MHz, 1 GHz, 1.5 GHz
- Baseline Comparison: Traditional bit-parallel bMAC design
| Model | Average '1' Bits (per 8-bit value) | Bit Sparsity Percentage |
|---|---|---|
| MobileNetV2 | 2.334 | 70.83% |
| MobileNetV3 | 1.711 | 78.61% |
| InceptionV3 | 2.430 | 69.62% |
| ShuffleNetV2 | 2.583 | 67.71% |
| GoogleNet | 2.461 | 69.24% |
| ResNet18 | 2.398 | 70.02% |
| ResNet50 | 2.495 | 68.81% |
| ResNeXt101 | 2.289 | 71.39% |
All eight models exhibit bit sparsity between 67.71% (ShuffleNetV2) and 78.61% (MobileNetV3), clustering around 70%.
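The two columns of the sparsity table are related by a simple formula: sparsity is one minus the average number of '1' bits divided by the bit width. The sketch below shows that relation and one way to measure it on raw values. The helper names are hypothetical, and masking to `width` bits treats negative integers as two's complement, which is an assumption; this summary does not state how the paper encodes signed values.

```python
def sparsity_from_avg_ones(avg_ones: float, width: int = 8) -> float:
    """Bit sparsity implied by an average count of '1' bits per value."""
    return 1.0 - avg_ones / width


def bit_sparsity(values, width: int = 8) -> float:
    """Fraction of '0' bits across a non-empty sequence of `width`-bit values."""
    mask = (1 << width) - 1
    # Masking maps negative ints to their two's-complement bit pattern (assumption).
    ones = sum(bin(v & mask).count("1") for v in values)
    return 1.0 - ones / (width * len(values))


if __name__ == "__main__":
    # Average '1' bits reported for MobileNetV2 and MobileNetV3.
    for name, avg in [("MobileNetV2", 2.334), ("MobileNetV3", 1.711)]:
        print(name, f"{100 * sparsity_from_avg_ones(avg):.2f}%")
    print(bit_sparsity([0b0101, 0b0000], width=4))
```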
| MAC Hardware | Area (μm²) | Power (mW) | Latency (ns) | Energy (pJ) |
|---|---|---|---|---|
| bMAC | 25.361 | 0.084 | 2 | 0.167 |
| OzMAC | 19.996 | 0.025 | 4.76 | 0.120 |
| Improvement Percentage | 21.2% | 69.7% | - | 28.0% |
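The rows of this table can be cross-checked with two relations: energy is power multiplied by latency (mW × ns gives pJ), and improvement is the relative reduction against bMAC. The sketch below recomputes both from the rounded table values; the dictionary keys and function names are illustrative. Because the inputs are rounded, the recomputed percentages can differ from the reported ones by a fraction of a percentage point.

```python
bmac = {"area_um2": 25.361, "power_mw": 0.084, "latency_ns": 2.0, "energy_pj": 0.167}
ozmac = {"area_um2": 19.996, "power_mw": 0.025, "latency_ns": 4.76, "energy_pj": 0.120}


def improvement_pct(base: float, new: float) -> float:
    """Relative reduction of `new` against `base`, in percent."""
    return 100.0 * (base - new) / base


def energy_pj(power_mw: float, latency_ns: float) -> float:
    """mW * ns = 1e-3 W * 1e-9 s = 1e-12 J, i.e. picojoules."""
    return power_mw * latency_ns


if __name__ == "__main__":
    for key in ("area_um2", "power_mw", "energy_pj"):
        print(key, f"{improvement_pct(bmac[key], ozmac[key]):.1f}%")
    for name, mac in (("bMAC", bmac), ("OzMAC", ozmac)):
        print(name, f"{energy_pj(mac['power_mw'], mac['latency_ns']):.3f} pJ")
```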
Results across different precision configurations show:
- Best Area Improvement: 31.7% achieved in 8×16 configuration
- Best Energy Improvement: 45% achieved in mixed-precision 4×8 and 8×16 configurations
- Critical Point: The energy improvement disappears in the 16×16 configuration (-1.2%, i.e., OzMAC uses slightly more energy than bMAC)
- Iso-Frequency Evaluation: Across the 500 MHz to 1.5 GHz range, OzMAC consistently maintains approximately 70% power improvement and 29% energy improvement
- Iso-Latency Evaluation: After frequency scaling to match throughput, OzMAC still achieves:
- INT4 designs: 29% power/energy improvement
- INT8 designs: 30% power/energy improvement
- Mixed-precision designs: up to 46% improvement
- Energy Efficiency Threshold: OzMAC requires at least 58% bit sparsity to maintain superior energy efficiency compared to bMAC
- Practical Sparsity: All tested DL models exceed this threshold
- Scaling Characteristics: Power scales linearly with frequency, while energy efficiency remains essentially constant
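Both the iso-latency result and the sparsity threshold can be approximated with back-of-envelope arithmetic on the 8-bit power and latency figures. The sketch below is a simplified model, not the paper's stated method: it assumes power scales linearly with frequency (as noted above), that OzMAC spends one cycle per '1' bit while bMAC takes a single cycle, and that OzMAC power does not vary with sparsity. Constant and function names are hypothetical.

```python
P_BMAC_MW = 0.084     # 8-bit bMAC power
P_OZMAC_MW = 0.025    # 8-bit OzMAC power
LAT_BMAC_NS = 2.0     # 8-bit bMAC latency
LAT_OZMAC_NS = 4.76   # 8-bit OzMAC latency


def iso_latency_power_improvement() -> float:
    """Power saving after speeding OzMAC up to match bMAC latency."""
    freq_scale = LAT_OZMAC_NS / LAT_BMAC_NS   # clock speed-up needed (2.38x)
    scaled_power = P_OZMAC_MW * freq_scale    # assumes power is linear in frequency
    return 1.0 - scaled_power / P_BMAC_MW


def break_even_sparsity(width: int = 8) -> float:
    """Sparsity at which OzMAC energy equals bMAC energy in this model.

    OzMAC energy is proportional to P_oz * width * (1 - s); bMAC energy to P_b.
    """
    max_cycles = P_BMAC_MW / P_OZMAC_MW       # cycles OzMAC can spend at equal energy
    return 1.0 - max_cycles / width


if __name__ == "__main__":
    print(f"iso-latency power improvement: {100 * iso_latency_power_improvement():.1f}%")
    print(f"break-even bit sparsity: {100 * break_even_sparsity():.1f}%")
```

Under these assumptions the model gives roughly a 29% iso-latency saving and a break-even sparsity of about 58%, close to the figures reported above. Since latency is matched, the energy saving in this model equals the power saving.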
This paper builds upon the following related research:
- Bit-Pragmatic (PRA): Original bit-pragmatic deep neural network computation method
- Bit-Tactical: Software/hardware approach exploiting value and bit sparsity
- STRIPES: Bit-serial deep neural network computation
- Bit Fusion: Bit-level dynamically composable architecture
The primary distinction of this work is rigorous evaluation using the latest commercial process technology and extension to multiple precision and frequency configurations.
- Significant Improvements: OzMAC achieves significant improvements in area, power consumption, and energy efficiency relative to traditional bMAC
- Commercial Feasibility: Evaluation using TSMC N5 process demonstrates commercial implementation viability
- Scaling Advantages: Maintains advantages across multiple precision and frequency configurations
- Throughput Matching: Frequency scaling enables matching or exceeding bMAC throughput while maintaining energy efficiency advantages
- Latency Overhead: OzMAC's multi-cycle latency may not be suitable for latency-sensitive applications
- Precision Limitations: The energy advantage disappears at the 16×16 configuration (-1.2%), so benefits are concentrated at lower precisions
- Sparsity Dependency: Performance heavily depends on input data bit sparsity
- Missing System-Level Evaluation: Evaluation at actual DLA system level remains incomplete
- System-Level Integration: Evaluate large OzMAC arrays in actual DLA implementations
- Adaptive Design: Dynamically adjust configurations based on runtime sparsity
- Hybrid Architecture: Combined design incorporating both OzMAC and traditional MAC
- Rigorous Evaluation: Comprehensive evaluation using commercial-grade process and tools with high result credibility
- Multi-Dimensional Analysis: Systematic analysis across precision and frequency dimensions
- Practical Value: Validates bit sparsity existence in actual DL models
- Clear Presentation: Technical details are clearly described with complete experimental setup
- Limited Innovation: Primarily engineering implementation and evaluation of existing PRA design with relatively limited technical novelty
- Limited Application Scope: Only applicable to workloads with high bit sparsity
- Insufficient System Considerations: Lacks consideration of memory bandwidth, data flow, and other system-level factors
- Limited Comparisons: Primarily compares against baseline bMAC, lacking comparison with other advanced MAC designs
- Engineering Value: Provides valuable reference data for commercial DLA design
- Methodological Contribution: Establishes rigorous MAC design evaluation framework
- Practical Guidance: Provides feasible hardware optimization solutions for low-precision inference applications
- Edge Inference: Power and area-constrained edge AI applications
- Low-Precision Computing: Deep learning inference at 8-bit precision and below
- Sparse Models: Neural network models with high bit sparsity characteristics
- Mass Production: Large-scale deployment scenarios requiring commercial-grade process validation
- Sze, V., et al. "Efficient processing of deep neural networks." Synthesis Lectures on Computer Architecture, 2020.
- Albericio, J., et al. "Bit-pragmatic deep neural network computing." MICRO, 2017.
- Delmas Lascorz, A., et al. "Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks." ASPLOS, 2019.
- Judd, P., et al. "Stripes: Bit-serial deep neural network computing." MICRO, 2016.
- Sharma, H., et al. "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network." ISCA, 2018.
This paper provides important engineering validation for the commercialization of zero-skipping MAC design. While technical innovation is limited, its rigorous evaluation methodology and practical results hold significant value for advancing the development of low-power AI accelerators.