The rise of generative AI for tasks like Automatic Speech Recognition (ASR) has created a critical energy consumption challenge. While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper's core computational kernel on the IMAX, a general-purpose Coarse-Grained Linear Arrays (CGLAs) accelerator. To our knowledge, this is the first work to execute a Whisper kernel on a CGRA and compare its performance against CPUs and GPUs. Using hardware/software co-design, we evaluate our system via an FPGA prototype and project performance for a 28 nm ASIC. Our results demonstrate superior energy efficiency. The projected ASIC is 1.90x more energy-efficient than the NVIDIA Jetson AGX Orin and 9.83x more than an NVIDIA RTX 4090 for the Q8_0 model. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.
- Paper ID: 2511.02269
- Title: Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA
- Authors: Takuto ANDO, Yu ETO, Ayumu TAKEUCHI, Yasuhiko NAKASHIMA (Nara Institute of Science and Technology)
- Classification: cs.AR (Computer Architecture)
- Publication Date: November 4, 2025 (arXiv submission)
- Paper Link: https://arxiv.org/abs/2511.02269
The rise of generative AI in automatic speech recognition (ASR) and other tasks presents severe energy consumption challenges. While ASICs offer high efficiency, they lack the programmability to adapt to algorithm evolution. To address this trade-off, this paper implements and evaluates Whisper's core computational kernels on IMAX, a general-purpose coarse-grained linear array (CGLA) accelerator. To the authors' knowledge, this is the first work executing Whisper kernels on a CGRA with performance comparisons against CPUs and GPUs. Through hardware/software co-design, the authors evaluate the system via FPGA prototyping and predict 28nm ASIC performance. Results demonstrate superior energy efficiency: for the Q8_0 model, the predicted ASIC achieves 1.90× better energy efficiency than NVIDIA Jetson AGX Orin and 9.83× better than NVIDIA RTX 4090. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.
This research addresses the energy consumption crisis facing AI-driven automatic speech recognition systems. With the widespread deployment of advanced ASR models like Whisper (smart assistants, real-time transcription, medical applications), their computational demands have led to dramatic increases in data center energy consumption. The International Energy Agency predicts that data center electricity consumption may double by 2030 to 945 TWh, slightly exceeding Japan's annual total electricity consumption.
- Energy Sustainability Crisis: AI infrastructure relies heavily on high-power GPGPUs, and the poor energy efficiency of depending on a single general-purpose architecture is unsustainable
- Edge Device Requirements: Power-constrained edge devices (smartphones, IoT devices) require highly energy-efficient ASR solutions
- Rapid Algorithm Evolution: AI algorithms continuously evolve, requiring hardware platforms that balance both efficiency and flexibility
- ASIC Specialized Accelerators: While extremely energy-efficient, they lack programmability and struggle to adapt to rapidly evolving algorithms, causing hardware to become obsolete
- FPGA Solutions: Optimized for specific models (CNNs, Transformers) with strong specialization and poor portability
- GPU Solutions: Provide high performance and flexibility but consume excessive power, unsuitable for edge devices
The authors propose using the IMAX accelerator based on CGLA (coarse-grained linear array) architecture to find the optimal balance between ASIC efficiency and GPGPU programmability. IMAX, through linearly arranged processing elements (PEs) and local memory modules (LMMs), can absorb irregular memory access patterns while maintaining high throughput and energy efficiency.
- First Implementation: First implementation and evaluation of Whisper ASR kernels on CGRA architecture, establishing hardware/software co-design principles for handling dynamic variable-length workloads
- Superior Energy Efficiency: Based on FPGA prototype estimation, the optimized 28nm ASIC configuration achieves excellent energy efficiency on the Q8_0 quantized model, 1.90× better than Jetson AGX Orin and 9.83× better than RTX 4090
- Architecture Optimization Analysis: Systematic analysis of LMM size trade-offs with overall energy efficiency, proving that 32KB LMM configuration achieves optimal balance between maximizing kernel coverage and minimizing static power overhead
- Scalability Analysis: Kernel-coverage analysis indicates that the approach is applicable to larger Whisper models (base, small), suggesting the architecture's scalability potential
Objective: Efficiently execute Whisper ASR model's core computational kernels (primarily dot product operations) on the IMAX CGLA accelerator
Input: Audio file of approximately 10 seconds (jfk.wav)
Output: Text transcription result
Constraints:
- Power-constrained edge device scenarios
- Need to handle variable-length vectors
- Need to balance energy efficiency and performance
As shown in Figure 2, IMAX3 is implemented as an 8-channel configuration deployed on an AMD Versal VPK180 FPGA:
- Processing System (PS): ARM Cortex-A72 dual-core CPU
- Programmable Logic (PL): Hosts CGLA core
- Interconnect: Connected via on-chip network (NoC)
- Memory: 8GB DDR4 for OS buffering, 4GB DDR4 for DMA buffering
Each IMAX channel contains:
- Processing Elements (PEs): Pipelined ALUs paired with local memory modules (LMMs)
- Linear Array Structure: PEs and LMMs strategically interleaved
- Data Paths: Separated execution and memory data paths
- DMA Interface: AXI DMA read/write interfaces
- Feature Extraction: Mel spectrogram generation
- Encoder: Multi-head attention and feed-forward networks (primary computational load)
- Decoder: Autoregressive text generation
- Acceleration Focus: Dot product kernels (computational core of encoder and decoder)
FP16 Dot Product Kernel Optimization:
- Inline Type Conversion: Leveraging IMAX's programmability, executing FP16 to FP32 conversion through PE's bit manipulation capabilities, avoiding specialized hardware
- SIMD Operations: Applying SIMD on FMA units, concurrently executing two 32-bit operations on a single 64-bit data path
- Column-wise Multithreading: Employing column-wise multithreading to time-multiplex four logical FMA operations onto a single physical FPU, hiding FPU latency
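The inline type conversion can be illustrated in software. The following Python sketch is not the authors' IMAX kernel; it only shows how an IEEE 754 half-precision bit pattern can be widened to single precision using shifts, masks, and integer adds, which is the kind of bit manipulation the text describes. The function names `fp16_bits_to_fp32_bits` and `fp16_to_float` are hypothetical.

```python
import struct


def fp16_bits_to_fp32_bits(h: int) -> int:
    """Widen a 16-bit half-precision pattern to a 32-bit single-precision pattern."""
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF

    if exp == 0x1F:
        # Infinity or NaN: all-ones exponent, payload moved to the top of the fraction.
        exp32, frac32 = 0xFF, frac << 13
    elif exp != 0:
        # Normal number: rebias the exponent from 15 to 127.
        exp32, frac32 = exp + (127 - 15), frac << 13
    elif frac == 0:
        # Signed zero.
        exp32, frac32 = 0, 0
    else:
        # Subnormal half: shift until the implicit leading 1 appears,
        # which yields a normal single-precision number.
        shift = 0
        while (frac & 0x400) == 0:
            frac <<= 1
            shift += 1
        frac &= 0x3FF
        exp32 = (127 - 15 + 1) - shift
        frac32 = frac << 13

    return (sign << 31) | (exp32 << 23) | frac32


def fp16_to_float(h: int) -> float:
    """Reinterpret the widened bit pattern as a Python float."""
    return struct.unpack("<f", struct.pack("<I", fp16_bits_to_fp32_bits(h)))[0]
```

Every half-precision value is exactly representable in single precision, so the conversion is lossless and needs no rounding logic.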
Hybrid Execution Strategy (handling variable-length vectors):
- Dividing each vector into two segments: main segment (multiple of burst length) processed on IMAX; residual segment concurrently processed on host CPU
- Burst length selection of 16 elements (based on Whisper vector length distribution analysis)
- CPU residual processing accounts for only ~5% of total computation
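The split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the accelerator offload is replaced by a plain Python loop, the two segments run sequentially rather than concurrently, and the names `split_lengths` and `hybrid_dot` are hypothetical. Only the burst length of 16 comes from the text.

```python
BURST = 16  # burst length reported in the text


def split_lengths(n: int, burst: int = BURST) -> tuple[int, int]:
    """Return (main, residual): main is the largest multiple of burst that fits in n."""
    main = (n // burst) * burst
    return main, n - main


def hybrid_dot(x, y, burst: int = BURST):
    """Dot product computed as a main segment plus a residual segment."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    main, _ = split_lengths(len(x), burst)

    # Main segment: stand-in for the part that would be offloaded to the accelerator.
    acc_main = sum(a * b for a, b in zip(x[:main], y[:main]))

    # Residual segment: the remaining elements (fewer than burst) handled on the host CPU.
    acc_residual = sum(a * b for a, b in zip(x[main:], y[main:]))

    return acc_main + acc_residual
```

For example, `split_lengths(100)` gives a main segment of 96 elements and a residual of 4, so the residual is always shorter than one burst.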
Q8_0 Kernel: Reuses quantization kernel implementation from prior work
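For readers unfamiliar with the format, here is a rough Python sketch of a Q8_0-style dot product. It assumes the standard ggml layout of 32 signed 8-bit values per block sharing one scale; it is not the IMAX kernel from the prior work. The scale is kept as a Python float here, whereas ggml stores it in FP16, and Python's `round` breaks exact ties to even rather than away from zero. The function names are hypothetical.

```python
QK8_0 = 32  # elements per block in ggml's Q8_0 format


def quantize_q8_0(x):
    """Quantize floats into a list of (scale, int8_values) blocks."""
    if len(x) % QK8_0 != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for i in range(0, len(x), QK8_0):
        chunk = x[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                  # per-block scale
        inv = 1.0 / d if d else 0.0       # avoid dividing by zero for all-zero blocks
        qs = [round(v * inv) for v in chunk]  # values land in [-127, 127]
        blocks.append((d, qs))
    return blocks


def dot_q8_0(a_blocks, b_blocks):
    """Dot product of two Q8_0 vectors: integer sum per block, then one scaled add."""
    total = 0.0
    for (da, qa), (db, qb) in zip(a_blocks, b_blocks):
        isum = sum(p * q for p, q in zip(qa, qb))  # integer multiply-accumulate
        total += da * db * isum
    return total
```

The inner loop works on integers only, and the floating-point scales are applied once per block, which is what makes the format attractive for low-power hardware.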
Padding Elimination Technique:
- FP16 tensors in whisper.cpp contain substantial padding to satisfy 32-byte alignment requirements
- Host CPU strips all padding before DMA transfer and packs data densely
- Significant effect: As shown in Table I, for FP16 model, baseline configuration with 32KB LMM only accommodates 1.39% of kernels; optimization improves coverage to 93.80%
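The packing step can be sketched in Python as below. The memory layout is an assumption made for illustration (each row padded up to a 32-byte stride); the summary does not specify how whisper.cpp lays out its padding. Treating a kernel as covered when its packed operand fits in one LMM is also a simplification. All function names are hypothetical.

```python
def padded_stride(row_bytes: int, align: int = 32) -> int:
    """Round a row size up to the next multiple of the alignment."""
    return -(-row_bytes // align) * align


def strip_padding(buf: bytes, n_rows: int, row_bytes: int, stride: int) -> bytes:
    """Copy only the payload bytes of each row into one dense buffer."""
    return b"".join(
        buf[r * stride: r * stride + row_bytes] for r in range(n_rows)
    )


def fits_in_lmm(n_bytes: int, lmm_bytes: int = 32 * 1024) -> bool:
    """Simplified coverage check: does the operand fit in a single LMM?"""
    return n_bytes <= lmm_bytes
```

With this layout, a row of 40 payload bytes occupies a 64-byte stride, so stripping the padding shrinks the buffer transferred by DMA from 64 to 40 bytes per row.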
LMM Size Selection (Table II):
- Based on power estimation from logic synthesis (Synopsys Design Compiler, TSMC 28nm process)
- FP16 kernel: 16KB LMM power 0.665W, 32KB is 0.675W (negligible increase)
- Kernel coverage: 16KB covers 66.35%, 32KB covers 93.80%
- Optimal Choice: 32KB LMM achieves best balance between performance improvement and power increase
- Maximize Computational Throughput: Fully utilize IMAX parallel processing capability
- Maximize Data Transfer Efficiency: Improve effective memory bandwidth, efficiently utilize LMM
- Audio File: whisper.cpp standard test file jfk.wav (~10 seconds)
- Models: Whisper-tiny.en model (78MB)
- FP16 version
- Q8_0 quantized version
- End-to-End Latency: Measured using gettimeofday function (microsecond precision)
- Power Consumption:
- IMAX: Logic synthesis estimation
- CPU: Estimated value
- GPU: Nominal thermal design power (TDP)
- Power-Delay Product (PDP): PDP = execution time × power
- Key metric for comprehensive energy efficiency evaluation
- Lower values indicate better energy efficiency
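As a worked example of the metric, the short Python sketch below computes PDP from latency and power. The helper name `pdp_joules` is hypothetical; the input figures are the Cortex-A72 and Jetson AGX Orin latency and power values reported in this summary.

```python
def pdp_joules(latency_s: float, power_w: float) -> float:
    """Power-delay product: energy in joules for one run."""
    return latency_s * power_w


# (platform, latency in seconds, power in watts) as reported in this summary
runs = [
    ("Cortex-A72, FP16", 24.4, 0.6485),
    ("Cortex-A72, Q8_0", 19.6, 0.6485),
    ("Jetson AGX Orin", 1.6, 15.0),
]

for name, latency_s, power_w in runs:
    print(f"{name}: {pdp_joules(latency_s, power_w):.1f} J")
```

The Cortex-A72 runs come out at about 15.8 J and 12.7 J, and the Jetson run at 24.0 J, which shows how a slow but low-power device can still score well on PDP.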
As shown in Table III, comparison platforms include:
- ARM Cortex-A72 (embedded CPU)
- 2 cores, 1400 MHz
- Power: 0.6485W
- NVIDIA Jetson AGX Orin 32GB (edge GPU)
- 1792 CUDA cores, 930 MHz
- Power: 15W (minimum power mode)
- NVIDIA GeForce RTX 4090 (high-end GPU)
- 16384 CUDA cores, 2520 MHz
- Power: 450W (TDP)
- IMAX3 (FPGA Prototype)
- 64 PE, 140 MHz
- Power: 180W (entire FPGA system)
- IMAX3 (28nm ASIC Prediction)
- 64 PE, 840 MHz (6× frequency increase)
- Power: 0.647W (FP16) / 1.32W (Q8_0), single channel 32KB LMM configuration
- FPGA Tool: Vivado 2024.1
- Synthesis Tool: Synopsys Design Compiler
- Process Library: TSMC 28nm
- FPGA Frequency: 140 MHz
- ASIC Predicted Frequency: 840 MHz (verified through static timing analysis)
- Evaluation Configuration: 1-channel and 2-channel configurations
- Host Thread Count: 1-2 thread variations
FP16 Model (2-thread execution):
- ARM Cortex-A72: 24.4 seconds
- IMAX (FPGA 2-lane): ~21 seconds
- IMAX (28nm ASIC 2-lane): 13.5 seconds
- Jetson AGX Orin: 1.6 seconds
- RTX 4090: 0.49 seconds
Q8_0 Model (2-thread execution):
- ARM Cortex-A72: 19.6 seconds
- IMAX (FPGA 2-lane): ~17 seconds
- IMAX (28nm ASIC 2-lane): 11.1 seconds
- Jetson AGX Orin: 1.6 seconds
- RTX 4090: 0.50 seconds
Analysis: The projected IMAX ASIC is roughly 1.8× faster than the embedded CPU (FP16: 13.5 s vs. 24.4 s; Q8_0: 11.1 s vs. 19.6 s), but its absolute latency remains well behind the GPUs, which possess massive parallel computing resources
FP16 Model (2-thread execution):
- ARM Cortex-A72: 15.8 J
- IMAX (28nm ASIC 2-lane): 13.6 J
- Jetson AGX Orin: 24.0 J
- RTX 4090: 120.1 J
Q8_0 Model (2-thread execution):
- ARM Cortex-A72: 12.7 J
- IMAX (28nm ASIC 2-lane): 12.6 J ✓ Best
- Jetson AGX Orin: 24.0 J
- RTX 4090: 123.8 J
Key Findings:
- IMAX (28nm ASIC) Q8_0 model energy efficiency is 1.90× better than Jetson AGX Orin
- 9.83× better than RTX 4090
- Q8_0 quantization further improves energy efficiency compared to FP16 model
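The headline ratios follow directly from the Q8_0 PDP values listed above. A small Python check, with the hypothetical helper `efficiency_ratio`:

```python
def efficiency_ratio(baseline_pdp_j: float, candidate_pdp_j: float) -> float:
    """How many times less energy the candidate uses than the baseline."""
    return baseline_pdp_j / candidate_pdp_j


IMAX_ASIC_Q8_0_J = 12.6  # projected 28nm ASIC, Q8_0 model

print(f"vs Jetson AGX Orin: {efficiency_ratio(24.0, IMAX_ASIC_Q8_0_J):.2f}x")
print(f"vs RTX 4090:        {efficiency_ratio(123.8, IMAX_ASIC_Q8_0_J):.2f}x")
```

This reproduces the reported 1.90× and 9.83× figures.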
FP16 Model PDP (2-thread):
- 16KB LMM: ~15 J
- 32KB LMM: 13.6 J ✓ Optimal
- 64KB LMM: ~14 J
- 128KB LMM: ~15 J
Q8_0 Model PDP (2-thread):
- 16KB LMM: ~14 J
- 32KB LMM: 12.6 J ✓ Optimal
- 64KB LMM: ~13.5 J
- 128KB LMM: ~15 J
Analysis:
- 16KB: Poor latency and PDP (CPU must handle unsuitable kernels)
- 32KB: Achieves minimum PDP (optimal balance point)
- 64KB/128KB: Slight latency improvement but increased static power, PDP actually worsens
Conclusion: 32KB LMM is the energy-optimal configuration, validating the correctness of design choices
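The selection amounts to taking the minimum over the measured configurations. A minimal Python sketch, using the PDP values listed above (several of them are approximate readings) and a hypothetical helper `best_lmm_kb`:

```python
# PDP in joules per LMM size in KB, 2-thread execution, as listed above
PDP_J = {
    "FP16": {16: 15.0, 32: 13.6, 64: 14.0, 128: 15.0},
    "Q8_0": {16: 14.0, 32: 12.6, 64: 13.5, 128: 15.0},
}


def best_lmm_kb(model: str) -> int:
    """Return the LMM size with the lowest PDP for the given model."""
    table = PDP_J[model]
    return min(table, key=table.get)
```

Both models select the 32KB configuration, matching the stated conclusion.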
Execution Time Decomposition:
- EXEC (PE pure computation): 60.89% for FP16, 74.70% for Q8_0
- LOAD/DRAIN (DRAM to LMM data transfer): Relatively small
- CONF/REGV/RANGE/REFILL (IMAX configuration): Relatively small
Key Insights:
- The high EXEC ratio indicates that IMAX is compute-bound rather than memory-bound
- Successfully mitigated data movement overhead
- Effectively unleashed IMAX's high throughput potential
Kernel Coverage Rate for Larger Models (optimized):
| Model | Size | Operations | 32KB Coverage | 64KB Coverage |
|---|---|---|---|---|
| tiny | 78MB | 477,153 | 93.80% | 93.80% |
| base | 148MB | 644,690 | 66.54% | 94.17% |
| small | 488MB | 1,920,955 | 66.52% | 94.36% |
Findings:
- Although computational load increases significantly, memory footprint per operation does not scale proportionally
- 64KB LMM can cover over 94% of kernels for base and small models
- Demonstrates good scalability of architecture to larger models
- Requires trade-off between increased static power and performance improvement
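The table can be read as a lookup for the smallest LMM that reaches a target coverage. The Python sketch below encodes the table values; the helper `smallest_lmm_kb` and the idea of a coverage target are illustrative assumptions, not part of the paper.

```python
# Kernel coverage in percent per LMM size in KB, from the table above
COVERAGE_PCT = {
    "tiny": {32: 93.80, 64: 93.80},
    "base": {32: 66.54, 64: 94.17},
    "small": {32: 66.52, 64: 94.36},
}


def smallest_lmm_kb(model: str, target_pct: float):
    """Return the smallest listed LMM size meeting the target, or None if none does."""
    for size_kb in sorted(COVERAGE_PCT[model]):
        if COVERAGE_PCT[model][size_kb] >= target_pct:
            return size_kb
    return None
```

With a 90% target, the tiny model is served by 32KB, while base and small need 64KB, which is where the static power trade-off noted above comes in.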
Specialization Approaches (ASIC/FPGA):
- Park et al.: Hybrid CNN and smartphone language model FPGA system
- Hu et al.: GCNN model-specific FPGA accelerator
- Yamini et al.: End-to-end Transformer ASR acceleration using systolic arrays
- Limitations: Model-specific optimization, poor flexibility, difficulty adapting to algorithm evolution
This Paper's Advantage: IMAX is a general-purpose architecture not bound to specific AI tasks, capable of rapidly adapting to algorithm changes
Traditional CGRA Challenges:
- Scalability issues
- Long compilation times
IMAX Innovation:
- Evolution based on CGLA (coarse-grained linear array)
- PE and LMM linearly interleaved arrangement
- Effectively hides irregular memory access latency
Prior IMAX Applications:
- Compute-intensive kernels: SpGEMM, FFT
- Modern AI workloads: CNN, LLM, approximate k-NN search (RAG)
- This Paper's Extension: First application to ASR task dot product operations
To the authors' knowledge, this is the first hardware implementation and evaluation of Whisper on CGRA, filling an important gap in the field.
- First Implementation: Successfully implemented Whisper ASR kernels on CGLA architecture, establishing hardware/software co-design methodology
- Energy Efficiency Advantage: 28nm ASIC prediction shows PDP of 12.6J on Q8_0 model, 1.90× better energy efficiency than edge GPU (Jetson AGX Orin), 9.83× better than high-end GPU (RTX 4090)
- Design Trade-offs: While absolute latency lags behind GPUs, in power-constrained edge applications, energy efficiency is more critical than low latency
- Architecture Insights: 32KB LMM configuration achieves optimal balance between kernel coverage and static power overhead
- Scalability: Demonstrates applicability to larger Whisper models (base, small)
- Power Evaluation Methodology:
- GPUs use nominal TDP rather than measured average power
- TDP represents peak power rather than workload average power
- Results should be viewed as architecture potential indicators rather than definitive advantage measurements
- Requires measured average power for precise comparison
- Absolute Performance:
- IMAX latency significantly exceeds GPU (ASIC prediction 13.5s vs GPU 0.49s)
- Unsuitable for latency-sensitive real-time applications
- Model Scope:
- Only evaluated Whisper-tiny.en model
- Larger models (base, small) only theoretically analyzed, not actually implemented
- ASIC Implementation:
- 28nm ASIC performance based on synthesis estimation and frequency speculation
- No actual tape-out verification
- Single Workload:
- Only tested 10-second audio file
- No robustness evaluation across different lengths, languages, noise environments
- Extension to Larger Models: Implement and evaluate Whisper base and small models, optimize power-performance balance
- Further Kernel Optimization: Adjust architecture parameters such as computation unit count
- Actual ASIC Tape-Out: Verify accuracy of 28nm ASIC predictions
- Precise Power Measurement: Use measured average power rather than TDP for fair comparison
- Diverse Workloads: Evaluate performance across different audio lengths, multiple languages, noisy environments
- Strong Novelty:
- First mapping of Whisper ASR to CGRA architecture
- Fills important gap in ASR hardware acceleration field
- Proposes hybrid execution strategy for handling variable-length vectors
- Systematic Methodology:
- Complete hardware/software co-design process
- Comprehensive consideration from kernel optimization to data processing to architecture parameter tuning
- Padding elimination technique raises kernel coverage with a 32KB LMM from 1.39% to 93.80%
- Comprehensive Experiments:
- Multi-platform comparison (CPU, edge GPU, high-end GPU, FPGA, ASIC prediction)
- Detailed ablation studies (LMM size, execution time decomposition)
- Scalability analysis (theoretical verification for larger models)
- High Practical Value:
- Energy efficiency optimization for edge devices has important real-world significance
- Clear advantages in battery life and thermal management-critical scenarios
- CGLA's generality ensures adaptability to algorithm evolution
- Clear Technical Details:
- Detailed description of FP16 kernel SIMD and multithreading optimization
- Hybrid execution strategy burst length selection supported by data
- Clear architecture and data flow diagrams
- Unfair Power Comparison:
- Using GPU TDP rather than measured power is a significant methodological flaw
- Undermines credibility of energy efficiency advantage claims
- Should supplement with measured power data
- Significant Performance Gap:
- ASIC predicted latency is still roughly 28× that of the RTX 4090 (13.5s vs 0.49s)
- Limits practical application scenarios (unsuitable for real-time interaction)
- Insufficient discussion of how to apply in latency-sensitive scenarios
- Insufficient ASIC Verification:
- 840MHz frequency based on synthesis estimation, unverified by physical design
- Reasonableness of 6× frequency increase needs more support
- Lacks post-layout actual power and timing data
- Limited Evaluation Scope:
- Only tested single 10-second audio file
- Lacks robustness evaluation across different scenarios (noise, accents, long audio)
- No model accuracy evaluation (only performance and efficiency)
- Reproducibility Challenges:
- IMAX3 is proprietary architecture, difficult for external researchers to reproduce
- Insufficient detail in FPGA implementation specifics
- Code and models not publicly available
- Insufficient Theoretical Analysis:
- Lacks theoretical upper bound analysis of efficiency advantages
- Insufficient analysis of why CGLA particularly suits ASR tasks
- Lacks theoretical derivation of 5% residual processing overhead in hybrid execution
- Academic Contribution:
- Pioneering research direction for Whisper on CGRA
- Provides new architecture option for ASR hardware acceleration
- Hardware/software co-design methodology has reference value
- Practical Value:
- Important reference for edge AI device manufacturers
- High potential in IoT, wearable devices and other power-constrained scenarios
- Provides technical pathway for sustainable AI
- Limitations:
- IMAX proprietary architecture limits broad application
- Performance gap makes it difficult to replace GPU as mainstream solution
- Requires actual tape-out to verify commercial viability
Most Suitable:
- Power-constrained edge devices (smartwatches, hearing aids, IoT devices)
- Applications with high latency tolerance but extreme efficiency requirements
- Offline ASR scenarios where battery life is critical
- Embedded systems with strict thermal management
Unsuitable:
- Real-time interactive applications (voice assistants)
- Latency-sensitive scenarios (requiring millisecond-level response)
- Data center scenarios with abundant power supply
- Batch processing tasks requiring ultra-long audio handling
The paper cites 27 references; key ones include:
- Whisper Original Paper: Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision" (2022)
- whisper.cpp Implementation: Gerganov, GitHub open-source project (2023)
- IMAX Architecture: Akabe et al., "IMAX: A Power-Efficient Multilevel Pipelined CGLA and Applications" IEEE Access (2025)
- CGRA Survey: Torng et al., "Ultra-Elastic CGRAs for Irregular Loop Specialization" HPCA (2021)
- Energy Prediction: IEA, "Energy and AI" (2025)
This paper is an innovative contribution to ASR hardware acceleration and the first to explore running the Whisper model on a CGLA architecture. Through systematic hardware/software co-design, the authors show that the projected IMAX ASIC is markedly more energy-efficient than GPUs (for the Q8_0 model, 9.83× better than the RTX 4090). Despite limitations such as a power evaluation methodology that is not fully rigorous and absolute performance below that of GPUs, the approach has practical value and research significance for power-constrained edge devices. The choice of the 32KB LMM configuration, the increase in kernel coverage to 93.80% through padding elimination, and the scalability analysis for larger models all reflect careful engineering. Actual ASIC tape-out and measured power data would further strengthen the work's persuasiveness and impact.