Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA

Ando, Eto, Takeuchi et al.

The rise of generative AI for tasks like Automatic Speech Recognition (ASR) has created a critical energy consumption challenge. While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper's core computational kernel on IMAX, a general-purpose Coarse-Grained Linear Array (CGLA) accelerator. To our knowledge, this is the first work to execute a Whisper kernel on a CGRA and compare its performance against CPUs and GPUs. Using hardware/software co-design, we evaluate our system via an FPGA prototype and project performance for a 28 nm ASIC. Our results demonstrate superior energy efficiency. The projected ASIC is 1.90x more energy-efficient than the NVIDIA Jetson AGX Orin and 9.83x more than an NVIDIA RTX 4090 for the Q8_0 model. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.

Basic Information

Paper ID: 2511.02269
Title: Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA
Authors: Takuto ANDO, Yu ETO, Ayumu TAKEUCHI, Yasuhiko NAKASHIMA (Nara Institute of Science and Technology)
Classification: cs.AR (Computer Architecture)
Publication Date: November 4, 2025 (arXiv submission)
Paper Link: https://arxiv.org/abs/2511.02269

Abstract

The rise of generative AI in automatic speech recognition (ASR) and other tasks presents severe energy consumption challenges. While ASICs offer high efficiency, they lack the programmability to adapt to algorithm evolution. To address this trade-off, this paper implements and evaluates Whisper's core computational kernels on IMAX, a general-purpose coarse-grained linear array (CGLA) accelerator. To the authors' knowledge, this is the first work executing Whisper kernels on a CGRA with performance comparisons against CPUs and GPUs. Through hardware/software co-design, the authors evaluate the system via FPGA prototyping and predict 28nm ASIC performance. Results demonstrate superior energy efficiency: for the Q8_0 model, the predicted ASIC achieves 1.90× better energy efficiency than NVIDIA Jetson AGX Orin and 9.83× better than NVIDIA RTX 4090. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.

Research Background and Motivation

1. Problem Statement

This research addresses the energy consumption crisis facing AI-driven automatic speech recognition systems. As advanced ASR models like Whisper are widely deployed in smart assistants, real-time transcription, and medical applications, their computational demands have driven sharp increases in data center energy consumption. The International Energy Agency predicts that data center electricity consumption may double to about 945 TWh by 2030, slightly exceeding Japan's annual total electricity consumption.

2. Problem Significance

Energy Sustainability Crisis: AI infrastructure relies heavily on high-power GPGPUs, and the poor energy efficiency of depending on a single general-purpose architecture is unsustainable
Edge Device Requirements: Power-constrained edge devices (smartphones, IoT devices) require highly energy-efficient ASR solutions
Rapid Algorithm Evolution: AI algorithms continuously evolve, requiring hardware platforms that balance both efficiency and flexibility

3. Limitations of Existing Approaches

ASIC Specialized Accelerators: While extremely energy-efficient, they lack programmability and struggle to adapt to rapidly evolving algorithms, causing hardware to become obsolete
FPGA Solutions: Optimized for specific models (CNNs, Transformers) with strong specialization and poor portability
GPU Solutions: Provide high performance and flexibility but consume excessive power, unsuitable for edge devices

4. Research Motivation

The authors propose using the IMAX accelerator based on CGLA (coarse-grained linear array) architecture to find the optimal balance between ASIC efficiency and GPGPU programmability. IMAX, through linearly arranged processing elements (PEs) and local memory modules (LMMs), can absorb irregular memory access patterns while maintaining high throughput and energy efficiency.

Core Contributions

First Implementation: First implementation and evaluation of Whisper ASR kernels on CGRA architecture, establishing hardware/software co-design principles for handling dynamic variable-length workloads
Superior Energy Efficiency: Based on FPGA prototype estimation, the optimized 28nm ASIC configuration achieves excellent energy efficiency on the Q8_0 quantized model, 1.90× better than Jetson AGX Orin and 9.83× better than RTX 4090
Architecture Optimization Analysis: Systematic analysis of the trade-off between LMM size and overall energy efficiency, showing that a 32KB LMM configuration best balances maximizing kernel coverage against minimizing static power overhead
Scalability Analysis: Analyzes kernel coverage for larger Whisper models (base, small), indicating the architecture's potential to scale

Methodology Details

Task Definition

Objective: Efficiently execute Whisper ASR model's core computational kernels (primarily dot product operations) on the IMAX CGLA accelerator

Input: Audio file of approximately 10 seconds (jfk.wav)

Output: Text transcription result

Constraints:

Power-constrained edge device scenarios
Need to handle variable-length vectors
Need to balance energy efficiency and performance

Model Architecture

1. IMAX3 System Architecture

As shown in Figure 2, IMAX3 is implemented as an 8-channel configuration deployed on an AMD Versal VPK180 FPGA:

Processing System (PS): ARM Cortex-A72 dual-core CPU
Programmable Logic (PL): Hosts CGLA core
Interconnect: Connected via on-chip network (NoC)
Memory: 8GB DDR4 for OS buffering, 4GB DDR4 for DMA buffering

2. IMAX Channel Internal Structure (Figure 3)

Each IMAX channel contains:

Processing Elements (PEs): Pipelined ALU and local memory modules (LMMs)
Linear Array Structure: PEs and LMMs strategically interleaved
Data Paths: Separated execution and memory data paths
DMA Interface: AXI DMA read/write interfaces

3. Whisper Processing Pipeline (Figure 1)

Feature Extraction: Mel spectrogram generation
Encoder: Multi-head attention and feed-forward networks (primary computational load)
Decoder: Autoregressive text generation
Acceleration Focus: Dot product kernels (computational core of encoder and decoder)

Technical Innovations

1. Kernel-Level Co-Design

FP16 Dot Product Kernel Optimization:

Inline Type Conversion: Leveraging IMAX's programmability, executing FP16 to FP32 conversion through PE's bit manipulation capabilities, avoiding specialized hardware
SIMD Operations: Applying SIMD on FMA units, concurrently executing two 32-bit operations on a single 64-bit data path
Column-wise Multithreading: Employing column-wise multithreading to time-multiplex four logical FMA operations onto a single physical FPU, hiding FPU latency
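
The inline type conversion in the first item can be illustrated in software. The following is a minimal Python sketch of FP16-to-FP32 widening using only shifts, masks, and integer adds, the kind of bit manipulation a PE could perform without a dedicated converter. It is not the paper's IMAX instruction sequence: the function names are hypothetical, and the handling of subnormals, infinities, and NaNs is an assumption about what a complete conversion needs.

```python
import struct

def fp16_bits_to_fp32_bits(h):
    """Widen a 16-bit IEEE 754 half-precision pattern to a 32-bit single pattern."""
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF

    if exp == 0x1F:
        # Infinity or NaN: saturate the exponent, keep the payload bits.
        exp32, frac32 = 0xFF, frac << 13
    elif exp != 0:
        # Normal number: re-bias the exponent (bias 15 -> bias 127).
        exp32, frac32 = exp + 112, frac << 13
    elif frac == 0:
        # Signed zero.
        exp32, frac32 = 0, 0
    else:
        # Subnormal half: shift until the implicit leading 1 appears,
        # which always yields a normal single-precision value.
        shift = 0
        while not (frac & 0x400):
            frac <<= 1
            shift += 1
        exp32, frac32 = 113 - shift, (frac & 0x3FF) << 13

    return (sign << 31) | (exp32 << 23) | frac32

def fp16_bits_to_float(h):
    """Reinterpret the widened bit pattern as a Python float (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<I", fp16_bits_to_fp32_bits(h)))[0]
```

For example, the half-precision pattern 0x3C00 (1.0) widens to the single-precision pattern 0x3F800000. For normal numbers the whole conversion is one add on the exponent field and one shift on the fraction, which is why it maps well onto general-purpose ALU operations.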

Hybrid Execution Strategy (handling variable-length vectors):

Dividing each vector into two segments: main segment (multiple of burst length) processed on IMAX; residual segment concurrently processed on host CPU
Burst length selection of 16 elements (based on Whisper vector length distribution analysis)
CPU residual processing accounts for only ~5% of total computation
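
The split described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: both segments run sequentially here, whereas the paper processes them concurrently on IMAX and the host CPU, and the function names are hypothetical. Only the burst length of 16 comes from the summary.

```python
BURST_LEN = 16  # burst length reported in the summary

def split_for_hybrid(n, burst=BURST_LEN):
    """Return (main_len, residual_len) for a vector of n elements.

    main_len is the largest multiple of `burst` not exceeding n.
    """
    main_len = (n // burst) * burst
    return main_len, n - main_len

def hybrid_dot(x, y, burst=BURST_LEN):
    """Dot product computed as a burst-aligned main part plus a residual part."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    main_len, _ = split_for_hybrid(len(x), burst)
    # Stand-in for the accelerator: the burst-aligned main segment.
    acc_main = sum(a * b for a, b in zip(x[:main_len], y[:main_len]))
    # Stand-in for the host CPU: the short residual segment.
    acc_residual = sum(a * b for a, b in zip(x[main_len:], y[main_len:]))
    return acc_main + acc_residual
```

With a 100-element vector, `split_for_hybrid(100)` gives a 96-element main segment and a 4-element residual, so the accelerator only ever sees whole bursts and the residual never exceeds 15 elements.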

Q8_0 Kernel: Reuses quantization kernel implementation from prior work

2. Data Processing and LMM Configuration Optimization

Padding Elimination Technique:

FP16 tensors in whisper.cpp contain substantial padding to satisfy 32-byte alignment requirements
Host CPU strips all padding before DMA transfer and packs data densely
Effect: As shown in Table I, for the FP16 model the baseline configuration with a 32KB LMM accommodates only 1.39% of kernels; with padding elimination, coverage rises to 93.80%
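
To make the padding elimination step concrete, here is a small Python sketch. It assumes a simplified layout in which every row of FP16 values is padded up to a 32-byte boundary; the actual whisper.cpp tensor layout may differ, and the helper names are hypothetical.

```python
ALIGN_BYTES = 32   # alignment requirement mentioned in the summary
FP16_BYTES = 2     # size of one FP16 element

def padded_row_bytes(n_elems):
    """Bytes occupied by one row once padded to the alignment boundary (assumed layout)."""
    raw = n_elems * FP16_BYTES
    return -(-raw // ALIGN_BYTES) * ALIGN_BYTES  # ceiling to a multiple of ALIGN_BYTES

def pack_dense(buf, n_rows, n_elems):
    """Copy only the payload bytes of each padded row into one dense buffer."""
    stride = padded_row_bytes(n_elems)
    row = n_elems * FP16_BYTES
    return b"".join(buf[r * stride : r * stride + row] for r in range(n_rows))
```

In this toy layout, three rows of 20 elements occupy 192 bytes padded but 120 bytes once packed. The dense form is what would be sent over DMA, so more kernels fit within a fixed LMM capacity.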

LMM Size Selection (Table II):

Based on power estimation from logic synthesis (Synopsys Design Compiler, TSMC 28nm process)
FP16 kernel power: 0.665W with a 16KB LMM, 0.675W with a 32KB LMM (negligible increase)
Kernel coverage: 16KB covers 66.35%, 32KB covers 93.80%
Optimal Choice: 32KB LMM achieves best balance between performance improvement and power increase
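
The trade-off behind this choice is simple arithmetic on the figures quoted above. The short Python sketch below only restates those numbers; the variable and function names are hypothetical.

```python
# FP16 kernel figures quoted above (Table II in the paper, as summarized here).
lmm = {
    16: {"power_w": 0.665, "coverage_pct": 66.35},
    32: {"power_w": 0.675, "coverage_pct": 93.80},
}

def relative_increase(old, new):
    """Fractional increase from old to new."""
    return (new - old) / old

power_up = relative_increase(lmm[16]["power_w"], lmm[32]["power_w"])
coverage_gain = lmm[32]["coverage_pct"] - lmm[16]["coverage_pct"]
print(f"power: +{power_up:.1%}, coverage: +{coverage_gain:.2f} points")
```

Doubling the LMM from 16KB to 32KB costs about 1.5% more power while adding 27.45 percentage points of kernel coverage.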

3. Hardware/Software Co-Design Objectives

Maximize Computational Throughput: Fully utilize IMAX parallel processing capability
Maximize Data Transfer Efficiency: Improve effective memory bandwidth, efficiently utilize LMM

Experimental Setup

Dataset

Audio File: whisper.cpp standard test file jfk.wav (~10 seconds)
Models: Whisper-tiny.en model (78MB)
- FP16 version
- Q8_0 quantized version

Evaluation Metrics

End-to-End Latency: Measured using gettimeofday function (microsecond precision)
Power Consumption:
- IMAX: Logic synthesis estimation
- CPU: Estimated value
- GPU: Nominal thermal design power (TDP)
Power-Delay Product (PDP): PDP = execution time × power
- Key metric for comprehensive energy efficiency evaluation
- Lower values indicate better energy efficiency
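
As a worked example of the metric, the following Python sketch computes PDP from latency and power. The function name is hypothetical; the inputs are the ARM Cortex-A72 and Jetson AGX Orin figures reported elsewhere in this summary.

```python
def pdp_joules(latency_s, power_w):
    """Power-delay product: energy in joules consumed by one run."""
    return latency_s * power_w

# ARM Cortex-A72, FP16 model: 24.4 s at 0.6485 W
arm_fp16 = pdp_joules(24.4, 0.6485)
# Jetson AGX Orin: 1.6 s at 15 W
orin = pdp_joules(1.6, 15.0)
```

Rounded to one decimal place these give 15.8 J and 24.0 J, matching the reported PDP values for those two platforms.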

Comparison Methods

As shown in Table III, comparison platforms include:

ARM Cortex-A72 (embedded CPU)
- 2 cores, 1400 MHz
- Power: 0.6485W
NVIDIA Jetson AGX Orin 32GB (edge GPU)
- 1792 CUDA cores, 930 MHz
- Power: 15W (minimum power mode)
NVIDIA GeForce RTX 4090 (high-end GPU)
- 16384 CUDA cores, 2520 MHz
- Power: 450W (TDP)
IMAX3 (FPGA Prototype)
- 64 PE, 140 MHz
- Power: 180W (entire FPGA system)
IMAX3 (28nm ASIC Prediction)
- 64 PE, 840 MHz (6× frequency increase)
- Power: 0.647W (FP16) / 1.32W (Q8_0), single channel 32KB LMM configuration

Implementation Details

FPGA Tool: Vivado 2024.1
Synthesis Tool: Synopsys Design Compiler
Process Library: TSMC 28nm
FPGA Frequency: 140 MHz
ASIC Predicted Frequency: 840 MHz (verified through static timing analysis)
Evaluation Configuration: 1-channel and 2-channel configurations
Host Thread Count: 1 or 2 threads

Experimental Results

Main Results

1. End-to-End Latency Comparison (Figure 4)

FP16 Model (2-thread execution):

ARM Cortex-A72: 24.4 seconds
IMAX (FPGA 2-lane): ~21 seconds
IMAX (28nm ASIC 2-lane): 13.5 seconds
Jetson AGX Orin: 1.6 seconds
RTX 4090: 0.49 seconds

Q8_0 Model (2-thread execution):

ARM Cortex-A72: 19.6 seconds
IMAX (FPGA 2-lane): ~17 seconds
IMAX (28nm ASIC 2-lane): 11.1 seconds
Jetson AGX Orin: 1.6 seconds
RTX 4090: 0.50 seconds

Analysis: The projected IMAX ASIC is roughly 1.8× faster than the embedded CPU (13.5 s vs 24.4 s for FP16, 11.1 s vs 19.6 s for Q8_0), but its absolute latency remains well behind the GPUs, which have massively parallel computing resources

2. Energy Efficiency Comparison (PDP, Figure 5)

FP16 Model (2-thread execution):

ARM Cortex-A72: 15.8 J
IMAX (28nm ASIC 2-lane): 13.6 J
Jetson AGX Orin: 24.0 J
RTX 4090: 120.1 J

Q8_0 Model (2-thread execution):

ARM Cortex-A72: 12.7 J
IMAX (28nm ASIC 2-lane): 12.6 J ✓ Best
Jetson AGX Orin: 24.0 J
RTX 4090: 123.8 J

Key Findings:

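For the Q8_0 model, IMAX (28nm ASIC) is 1.90× more energy-efficient than Jetson AGX Orin (12.6 J vs 24.0 J)
For the same model, it is 9.83× more energy-efficient than RTX 4090 (12.6 J vs 123.8 J)
On IMAX, Q8_0 quantization further improves energy efficiency compared to the FP16 model (12.6 J vs 13.6 J)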
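
The headline ratios follow directly from the Q8_0 PDP values listed above. A minimal Python check, with hypothetical names:

```python
# Q8_0 PDP values in joules, as listed above.
pdp_q8_0 = {
    "IMAX 28nm ASIC": 12.6,
    "Jetson AGX Orin": 24.0,
    "RTX 4090": 123.8,
}

def efficiency_ratio(baseline_j, candidate_j):
    """How many times less energy the candidate uses than the baseline."""
    return baseline_j / candidate_j

vs_orin = efficiency_ratio(pdp_q8_0["Jetson AGX Orin"], pdp_q8_0["IMAX 28nm ASIC"])
vs_4090 = efficiency_ratio(pdp_q8_0["RTX 4090"], pdp_q8_0["IMAX 28nm ASIC"])
```

Rounded to two decimal places, the ratios are 1.90 and 9.83, in agreement with the figures reported in the abstract.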

Ablation Studies

1. LMM Size Optimization (Figure 6)

FP16 Model PDP (2-thread):

16KB LMM: ~15 J
32KB LMM: 13.6 J ✓ Optimal
64KB LMM: ~14 J
128KB LMM: ~15 J

Q8_0 Model PDP (2-thread):

16KB LMM: ~14 J
32KB LMM: 12.6 J ✓ Optimal
64KB LMM: ~13.5 J
128KB LMM: ~15 J

Analysis:

16KB: Poor latency and PDP (kernels that do not fit in the LMM must be handled by the host CPU)
32KB: Achieves minimum PDP (optimal balance point)
64KB/128KB: Slight latency improvement but increased static power, PDP actually worsens

Conclusion: 32KB LMM is the energy-optimal configuration, validating the correctness of design choices

2. Computational Efficiency Verification (Figure 7)

Execution Time Decomposition:

EXEC (PE pure computation): 60.89% for FP16, 74.70% for Q8_0
LOAD/DRAIN (DRAM to LMM data transfer): Relatively small
CONF/REGV/RANGE/REFILL (IMAX configuration): Relatively small

Key Insights:

High EXEC ratio indicates IMAX is compute-bound rather than memory-bound
Data movement overhead (LOAD/DRAIN) is successfully mitigated
Most of the execution time goes to PE computation, exploiting IMAX's high throughput potential

Scalability Analysis (Table IV)

Kernel Coverage Rate for Larger Models (optimized):

Model	Size	Operations	32KB Coverage	64KB Coverage
tiny	78MB	477,153	93.80%	93.80%
base	148MB	644,690	66.54%	94.17%
small	488MB	1,920,955	66.52%	94.36%

Findings:

Although computational load increases significantly, memory footprint per operation does not scale proportionally
64KB LMM can cover over 94% of kernels for base and small models
Demonstrates good scalability of architecture to larger models
Requires trade-off between increased static power and performance improvement
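
The coverage table can be turned into a simple sizing rule. The Python sketch below picks the smallest LMM that reaches a target coverage. The 90% threshold is an illustrative assumption, not a criterion stated in the paper, and the names are hypothetical.

```python
# Kernel coverage (%) by model and LMM size in KB, from the table above.
coverage_pct = {
    "tiny":  {32: 93.80, 64: 93.80},
    "base":  {32: 66.54, 64: 94.17},
    "small": {32: 66.52, 64: 94.36},
}

def smallest_lmm_kb(model, min_coverage_pct=90.0):
    """Smallest listed LMM size meeting the coverage target, or None if none does."""
    for size_kb in sorted(coverage_pct[model]):
        if coverage_pct[model][size_kb] >= min_coverage_pct:
            return size_kb
    return None
```

Under this rule the tiny model is served by a 32KB LMM, while base and small both need 64KB, which is where the static power trade-off noted above comes into play.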

Related Work

1. AI Hardware Accelerators

Specialization Approaches (ASIC/FPGA):

Park et al.: Hybrid CNN and smartphone language model FPGA system
Hu et al.: GCNN model-specific FPGA accelerator
Yamini et al.: End-to-end Transformer ASR acceleration using systolic arrays
Limitations: Model-specific optimization, poor flexibility, difficulty adapting to algorithm evolution

This Paper's Advantage: IMAX is a general-purpose architecture not bound to specific AI tasks, capable of rapidly adapting to algorithm changes

2. CGRA Architecture Evolution

Traditional CGRA Challenges:

Scalability issues
Long compilation times

IMAX Innovation:

Evolution based on CGLA (coarse-grained linear array)
PE and LMM linearly interleaved arrangement
Effectively hides irregular memory access latency

Prior IMAX Applications:

Compute-intensive kernels: SpGEMM, FFT
Modern AI workloads: CNN, LLM, approximate k-NN search (RAG)
This Paper's Extension: First application to ASR task dot product operations

3. Whisper Hardware Implementation

To the authors' knowledge, this is the first hardware implementation and evaluation of Whisper on CGRA, filling an important gap in the field.

Conclusions and Discussion

Main Conclusions

First Implementation: Successfully implemented Whisper ASR kernels on CGLA architecture, establishing hardware/software co-design methodology
Energy Efficiency Advantage: 28nm ASIC prediction shows PDP of 12.6J on Q8_0 model, 1.90× better energy efficiency than edge GPU (Jetson AGX Orin), 9.83× better than high-end GPU (RTX 4090)
Design Trade-offs: While absolute latency lags behind GPUs, in power-constrained edge applications, energy efficiency is more critical than low latency
Architecture Insights: 32KB LMM configuration achieves optimal balance between kernel coverage and static power overhead
Scalability: Demonstrates applicability to larger Whisper models (base, small)

Limitations

Power Evaluation Methodology:
- GPUs use nominal TDP rather than measured average power
- TDP represents peak power rather than workload average power
- Results should be viewed as architecture potential indicators rather than definitive advantage measurements
- Requires measured average power for precise comparison
Absolute Performance:
- IMAX latency far exceeds that of the GPUs (FP16: ASIC prediction 13.5s vs RTX 4090 0.49s)
- Unsuitable for latency-sensitive real-time applications
Model Scope:
- Only evaluated Whisper-tiny.en model
- Larger models (base, small) only theoretically analyzed, not actually implemented
ASIC Implementation:
- 28nm ASIC performance based on synthesis estimation and frequency speculation
- No actual tape-out verification
Single Workload:
- Only tested 10-second audio file
- No robustness evaluation across different lengths, languages, noise environments

Future Directions

Extension to Larger Models: Implement and evaluate Whisper base and small models, optimize power-performance balance
Further Kernel Optimization: Adjust architecture parameters such as computation unit count
Actual ASIC Tape-Out: Verify accuracy of 28nm ASIC predictions
Precise Power Measurement: Use measured average power rather than TDP for fair comparison
Diverse Workloads: Evaluate performance across different audio lengths, multiple languages, noisy environments

In-Depth Evaluation

Strengths

Strong Novelty:
- First mapping of Whisper ASR to CGRA architecture
- Fills important gap in ASR hardware acceleration field
- Proposes hybrid execution strategy for handling variable-length vectors
Systematic Methodology:
- Complete hardware/software co-design process
- Comprehensive consideration from kernel optimization to data processing to architecture parameter tuning
- Padding elimination technique significantly improves LMM utilization (1.39%→93.80%)
Comprehensive Experiments:
- Multi-platform comparison (CPU, edge GPU, high-end GPU, FPGA, ASIC prediction)
- Detailed ablation studies (LMM size, execution time decomposition)
- Scalability analysis (theoretical verification for larger models)
High Practical Value:
- Energy efficiency optimization for edge devices has important real-world significance
- Clear advantages in battery life and thermal management-critical scenarios
- CGLA's generality ensures adaptability to algorithm evolution
Clear Technical Details:
- Detailed description of FP16 kernel SIMD and multithreading optimization
- Hybrid execution strategy burst length selection supported by data
- Clear architecture and data flow diagrams

Weaknesses

Unfair Power Comparison:
- Using GPU TDP rather than measured power is a significant methodological flaw
- Undermines credibility of energy efficiency advantage claims
- Should supplement with measured power data
Significant Performance Gap:
- ASIC predicted latency is still about 28× that of the RTX 4090 (13.5s vs 0.49s, FP16)
- Limits practical application scenarios (unsuitable for real-time interaction)
- Insufficient discussion of how to apply in latency-sensitive scenarios
Insufficient ASIC Verification:
- 840MHz frequency based on synthesis estimation, unverified by physical design
- Reasonableness of 6× frequency increase needs more support
- Lacks post-layout actual power and timing data
Limited Evaluation Scope:
- Only tested single 10-second audio file
- Lacks robustness evaluation across different scenarios (noise, accents, long audio)
- No model accuracy evaluation (only performance and efficiency)
Reproducibility Challenges:
- IMAX3 is proprietary architecture, difficult for external researchers to reproduce
- Insufficient detail in FPGA implementation specifics
- Code and models not publicly available
Insufficient Theoretical Analysis:
- Lacks theoretical upper bound analysis of efficiency advantages
- Insufficient analysis of why CGLA particularly suits ASR tasks
- Lacks theoretical derivation of 5% residual processing overhead in hybrid execution

Impact

Academic Contribution:
- Pioneering research direction for Whisper on CGRA
- Provides new architecture option for ASR hardware acceleration
- Hardware/software co-design methodology has reference value
Practical Value:
- Important reference for edge AI device manufacturers
- High potential in IoT, wearable devices and other power-constrained scenarios
- Provides technical pathway for sustainable AI
Limitations:
- IMAX proprietary architecture limits broad application
- Performance gap makes it difficult to replace GPU as mainstream solution
- Requires actual tape-out to verify commercial viability

Applicable Scenarios

Most Suitable:

Power-constrained edge devices (smartwatches, hearing aids, IoT devices)
Applications with high latency tolerance but extreme efficiency requirements
Offline ASR scenarios where battery life is critical
Embedded systems with strict thermal management

Unsuitable:

Real-time interactive applications (voice assistants)
Latency-sensitive scenarios (requiring millisecond-level response)
Data center scenarios with abundant power supply
Batch processing tasks requiring ultra-long audio handling

References

The paper cites 27 references. Key references include:

Whisper Original Paper: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)
whisper.cpp Implementation: Gerganov, GitHub open-source project (2023)
IMAX Architecture: Akabe et al., "IMAX: A Power-Efficient Multilevel Pipelined CGLA and Applications" IEEE Access (2025)
CGRA Architecture: Torng et al., "Ultra-Elastic CGRAs for Irregular Loop Specialization" HPCA (2021)
Energy Prediction: IEA, "Energy and AI" (2025)

Summary

This paper is a novel contribution to ASR hardware acceleration and the first to explore running the Whisper model on a CGLA architecture. Through systematic hardware/software co-design, the authors show that the projected IMAX ASIC has a substantial energy efficiency advantage over GPUs (for the Q8_0 model, 9.83× better than the RTX 4090). Its limitations include a power evaluation that relies on nominal GPU TDP rather than measured power, and absolute latency well behind that of GPUs. Even so, the approach has clear practical value and research significance for power-constrained edge devices. The choice of a 32KB LMM, the increase in kernel coverage from 1.39% to 93.80% through padding elimination, and the scalability analysis for larger models all reflect careful engineering. An actual ASIC tape-out and measured power data would further strengthen the work's credibility and impact.