Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks

Athanasiadis, Tampouratzis, Papaefstathiou

The growing demand for real-time processing in artificial intelligence applications, particularly those involving Convolutional Neural Networks (CNNs), has highlighted the need for efficient computational solutions. Conventional processors, very often, fall short in balancing performance, power consumption, and latency, especially in embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance with energy efficiency and reconfigurability. The presented framework addresses the complex and demanding computations of CNNs on FPGAs maintaining full precision in all neural network parameters. Specifically, our framework is based on Darknet which is very widely used for the design of CNNs and allows the designer, by using a similar input to that given to Darknet, to efficiently implement a CNN in a heterogeneous system comprising of CPUs and FPGAs. When compared with the FPGA frameworks that support quantization, our solution aims to offer similar performance and/or energy efficiency without any degradation on the NN accuracy.

Basic Information

Paper ID: 2510.13362
Title: Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks
Authors: Angelos Athanasiadis¹, Nikolaos Tampouratzis², Ioannis Papaefstathiou¹
Institutions: ¹Aristotle University of Thessaloniki, ²International Hellenic University
Classification: cs.AR (Computer Architecture)
Paper Link: https://arxiv.org/abs/2510.13362

Abstract

With the growing demand for real-time processing in artificial intelligence applications, particularly those involving convolutional neural networks (CNNs), the need for efficient computational solutions has become increasingly prominent. Traditional processors often fall short in balancing performance, power consumption, and latency, especially on embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance, energy efficiency, and reconfigurability. The framework proposed in this paper addresses the complex computational requirements of CNNs on FPGAs while maintaining full precision for all neural network parameters. Based on the widely-used Darknet CNN design framework, this framework allows designers to use Darknet-like inputs for efficient CNN implementation in heterogeneous systems containing both CPUs and FPGAs. Compared to FPGA frameworks supporting quantization, this solution aims to deliver similar performance and/or energy efficiency without sacrificing neural network accuracy.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is how to implement non-quantized convolutional neural networks on FPGAs so that high performance and energy efficiency are achieved while all parameters remain at full precision.

Problem Significance

Growing Real-Time Processing Demands: AI applications, particularly CNN-based applications, increasingly require real-time processing capabilities
Limitations of Traditional Processors: Conventional CPUs fall short in balancing performance, power consumption, and latency
Embedded and Edge Computing Challenges: Resource-constrained devices require more efficient computational solutions

Limitations of Existing Approaches

Accuracy Loss from Quantization: Existing FPGA frameworks primarily focus on quantized models, which reduce resource usage and power consumption but often sacrifice accuracy
Design Complexity: Lack of user-friendly and efficient design workflows
Performance-Precision Trade-offs: Difficulty in achieving high performance and energy efficiency while maintaining full precision

Research Motivation

To develop a framework capable of implementing non-quantized CNNs on FPGAs that maintains high model accuracy while achieving excellent performance and energy efficiency.

Core Contributions

Accuracy Preservation: By avoiding quantization and retaining full precision, the framework aims to maintain CNN model accuracy
High Design Productivity and Flexibility: Based on the widely used Darknet CNN design framework, implemented in pure C/C++, supporting FPGAs ranging from small to large scales
High Performance: Fully leverages the parallelism of any FPGA to accelerate CNN inference, ensuring timely and efficient processing
Energy Efficiency Optimization: Optimizes power efficiency for CNN inference on FPGAs, suitable for power-sensitive applications

Methodology Details

Task Definition

This research focuses on implementing efficient non-quantized CNN inference on FPGAs, with inputs being CNN model configuration files (Darknet-like format) and outputs being high-performance CNN implementations on CPU-FPGA heterogeneous systems.

Framework Architecture

As shown in Figure 1 of the paper, the framework employs the following architectural design:

Input Processing: Import new cfg files into the tool
Preprocessing: Parallel preprocessing using OpenMP
Parser: Parse network structure, identify convolutional layers, deconvolutional layers, and other layer types
Computation Engine: Innovative HLS computation engine as the core component
Parallel Processing: Parallel processing using OpenMP
FPGA Implementation: Final neural network implementation on FPGA
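
The parser stage above can be illustrated with a minimal sketch. It assumes the input follows Darknet's usual cfg layout (bracketed section names followed by key=value lines); the example network, the function names `parse_cfg` and `conv_layers`, and the returned data structure are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical example input in Darknet's cfg style; not taken from the paper.
EXAMPLE_CFG = """
[net]
width=416
height=416
channels=3

# first feature extractor
[convolutional]
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
filters=32
size=3
stride=1
pad=1
activation=leaky
"""

Section = Tuple[str, Dict[str, str]]


def parse_cfg(text: str) -> List[Section]:
    """Split Darknet-style cfg text into (section_name, options) pairs."""
    sections: List[Section] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].strip(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def conv_layers(sections: List[Section]) -> List[Dict[str, str]]:
    """Return the option dicts of the convolutional layers, in file order."""
    return [opts for name, opts in sections if name == "convolutional"]


layers = parse_cfg(EXAMPLE_CFG)
print([name for name, _ in layers])
print([int(c["filters"]) for c in conv_layers(layers)])
```

A real tool would go on to map each convolutional entry to a kernel invocation; that mapping is not described in enough detail in the text to sketch here.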

Innovative HLS Computation Engine

Core Design Philosophy

The innovative computation engine employs High-Level Synthesis (HLS) technology, capable of executing multiple mathematical operations within a single clock cycle, achieving relatively high throughput and performance.

Technical Implementation Details

As shown in Figure 2 of the paper, the HLS FPGA kernel primarily handles matrix multiplication, which forms the foundation of nearly all CNN implementations:

Memory Optimization: Leverages internal BRAM combined with HLS streams to optimize on-chip memory access patterns
Stream Processing Mechanism:
- Implements continuous data flow between processing elements without intermediate storage in BRAM
- Reduces latency and resource overhead
- Supports pipelined execution and enhances parallelism
- Transfers data directly between producer and consumer processes
Multi-Memory Channel Utilization:
- Exploits multiple memory banks and dedicated channels connected to modern FPGAs
- Inserts appropriate HLS directives to distribute data transfers across a parameterizable number of memory banks/channels
- Fully utilizes available bandwidth from each memory interface
High-Bandwidth Data Transfer: Data transfer between CPU and FPGA occurs at full data width (512 bits) per clock cycle, ensuring high-throughput communication between processing elements and memory subsystems
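
To make concrete why a matrix-multiplication kernel can serve convolutional layers, the sketch below lowers a convolution to a single matrix product using the im2col rearrangement, which is how Darknet's CPU path computes convolutions. Whether the paper's kernel consumes exactly this layout is an assumption; stride 1 and no padding are chosen only to keep the example short, and all function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: List[Matrix], ksize: int) -> Matrix:
    """Unroll a C x H x W image into a (C*k*k) x (out_h*out_w) matrix.

    Stride 1, no padding. Row order is (channel, kernel_y, kernel_x).
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = height - ksize + 1, width - ksize + 1
    cols: Matrix = []
    for c in range(channels):
        for ky in range(ksize):
            for kx in range(ksize):
                cols.append([image[c][y + ky][x + kx]
                             for y in range(out_h) for x in range(out_w)])
    return cols


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Plain (M x K) times (K x N) matrix product."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


def conv_direct(image: List[Matrix], weights: Matrix, ksize: int) -> Matrix:
    """Reference convolution with nested loops, used to check the lowering."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = height - ksize + 1, width - ksize + 1
    result: Matrix = []
    for w in weights:  # one flattened filter per row
        row = []
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(channels):
                    for ky in range(ksize):
                        for kx in range(ksize):
                            acc += (w[(c * ksize + ky) * ksize + kx]
                                    * image[c][y + ky][x + kx])
                row.append(acc)
        result.append(row)
    return result


# One 3x3 input channel and two 2x2 filters, flattened row by row.
image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
weights = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

lowered = gemm(weights, im2col(image, 2))
print(lowered)  # [[6.0, 8.0, 12.0, 14.0], [12.0, 16.0, 24.0, 28.0]]
```

In this formulation M is the number of filters, K is channels times kernel area, and N is the number of output positions, so large layers map onto one large GEMM call.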

Technical Innovations

Full-Precision Preservation: Unlike existing quantization methods, this framework maintains full precision for all parameters
Stream Processing Optimization: Innovative stream processing mechanism reduces BRAM dependency and improves resource utilization efficiency
Multi-Channel Memory Access: Fully exploits multi-memory channel characteristics of modern FPGAs
Darknet-Based Design Flow: Provides a familiar and user-friendly design interface
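
The 512-bit transfers and multi-channel distribution described earlier can be modelled in software as follows. This is a host-side illustration only: a 512-bit word holds 16 FP32 values, and the round-robin assignment of words to banks is an assumed policy, since the text says only that transfers are distributed across a parameterizable number of banks/channels. The names `pack_words`, `stripe`, and `unstripe` are hypothetical.

```python
import struct
from typing import List

WORD_BITS = 512                     # interface width stated in the text
FLOATS_PER_WORD = WORD_BITS // 32   # 16 FP32 values per word


def pack_words(values: List[float]) -> List[bytes]:
    """Pack FP32 values into 64-byte words, zero-padding the last word."""
    padded = list(values) + [0.0] * (-len(values) % FLOATS_PER_WORD)
    return [struct.pack(f"<{FLOATS_PER_WORD}f", *padded[i:i + FLOATS_PER_WORD])
            for i in range(0, len(padded), FLOATS_PER_WORD)]


def stripe(words: List[bytes], num_banks: int) -> List[List[bytes]]:
    """Assign words to banks round-robin (assumed policy, not from the paper)."""
    banks: List[List[bytes]] = [[] for _ in range(num_banks)]
    for index, word in enumerate(words):
        banks[index % num_banks].append(word)
    return banks


def unstripe(banks: List[List[bytes]]) -> List[bytes]:
    """Inverse of stripe: interleave the banks back into the original order."""
    total = sum(len(bank) for bank in banks)
    return [banks[i % len(banks)][i // len(banks)] for i in range(total)]


values = [float(i) for i in range(40)]   # 40 floats need 3 words (48 slots)
words = pack_words(values)
banks = stripe(words, num_banks=2)
print(len(words), [len(bank) for bank in banks])  # 3 [2, 1]
```

With this layout, consecutive words live in different banks, so a kernel with one port per bank could fetch several words in the same cycle; the achievable bandwidth on real hardware depends on the memory system and is not modelled here.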

Experimental Setup

Hardware Platforms

High-End FPGA: AMD Alveo U55C
Embedded FPGA: Kria KR260
Reference CPUs: Intel Xeon E5-2620 v4 (8-core) and ARM Cortex-A53 (4-core)
Reference GPU: NVIDIA T4

Test Configuration

Matrix Dimensions: M=2048, K=4096, N=16384
Data Type: FP32 (32-bit floating-point)
Test Purpose: Non-peak performance matrix dimensions selected to demonstrate method flexibility
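
For a sense of scale, the short calculation below derives the operation count and FP32 storage implied by these dimensions. It assumes the conventional GEMM shapes (A is M×K, B is K×N, C is M×N) and counts each multiply-add as two floating-point operations; the summary does not state which convention the paper uses for its GFLOPS figures.

```python
M, K, N = 2048, 4096, 16384   # dimensions from the test configuration
BYTES_PER_FP32 = 4


def gemm_flops(m: int, k: int, n: int) -> int:
    """Each of the m*n outputs takes k multiply-adds, counted as 2 ops each."""
    return 2 * m * k * n


def matrix_mib(rows: int, cols: int) -> float:
    """Storage of a rows x cols FP32 matrix in MiB."""
    return rows * cols * BYTES_PER_FP32 / 2**20


print(gemm_flops(M, K, N))   # 274877906944, i.e. about 274.9 GFLOP
print(matrix_mib(M, K), matrix_mib(K, N), matrix_mib(M, N))  # 32.0 256.0 128.0
```

Under these assumptions the three matrices occupy 416 MiB in total, far more than on-chip BRAM can hold, which is consistent with the emphasis on external memory channels and streaming.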

Evaluation Metrics

Performance: GFLOPS (Giga Floating-Point Operations Per Second)
Energy Efficiency: GFLOPS/Watt
Speedup: Performance improvement relative to reference implementations and CPU parallel implementations
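
The three metrics are related by simple ratios, sketched below. The input numbers in the usage lines are placeholders chosen for easy arithmetic; they are not measurements from the paper.

```python
def gflops(flop_count: int, seconds: float) -> float:
    """Throughput in billions of floating-point operations per second."""
    return flop_count / seconds / 1e9


def gflops_per_watt(gflops_value: float, watts: float) -> float:
    """Energy efficiency: throughput divided by average power draw."""
    return gflops_value / watts


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Placeholder inputs (hypothetical, not taken from the paper).
throughput = gflops(2_000_000_000, seconds=0.5)
print(throughput)                              # 4.0
print(gflops_per_watt(throughput, watts=8.0))  # 0.5
print(speedup(5.0, 0.5))                       # 10.0
```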

Experimental Results

Primary Performance Results

Embedded FPGA (Kria KR260)

Relative to Reference Implementation: Two orders of magnitude performance improvement
Relative to ARM 4-Core CPU: 9× performance improvement
Energy Efficiency Improvement: 9× improvement compared to best CPU parallel implementation

High-End FPGA (Alveo U55C)

Relative to Reference Implementation: Approximately three orders of magnitude performance improvement
Relative to Intel Xeon CPU: 10× performance improvement
Energy Efficiency Improvement: 34× improvement compared to best CPU parallel implementation
Relative to NVIDIA T4 GPU: 3× energy efficiency improvement (even though the T4 is built on a more advanced 12nm process, while the U55C uses 16nm)
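
The speedup and efficiency figures can be read together. Because efficiency is performance divided by power, the efficiency gain equals the speedup multiplied by the ratio of baseline power to accelerator power. The sketch below applies this identity to the reported ratios; it assumes both ratios for a platform refer to the same CPU baseline run, and it inherits the rounding of the reported figures, so the results are rough indications rather than measured power values.

```python
def implied_power_ratio(efficiency_gain: float, speedup: float) -> float:
    """Baseline power divided by accelerator power, from the two ratios.

    efficiency_gain = speedup * (P_baseline / P_accelerator)
    """
    return efficiency_gain / speedup


# Ratios as reported in the results above.
print(implied_power_ratio(efficiency_gain=9, speedup=9))    # 1.0  (Kria KR260 vs ARM)
print(implied_power_ratio(efficiency_gain=34, speedup=10))  # 3.4  (Alveo U55C vs Xeon)
```

Read this way, the embedded FPGA gains its efficiency almost entirely from being faster at similar power, while the Alveo card would be both faster and lower-power than the Xeon baseline.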

Key Findings

Significant Performance Improvements: Achieved orders of magnitude performance improvements across all tested platforms
Excellent Energy Efficiency: Particularly achieved 34× energy efficiency improvement on Alveo U55C
Technical Advantages: Surpassed GPU energy efficiency even with process technology disadvantage
Consistency Verification: Experimental results for other matrix dimensions are reported to be fully consistent with the results shown in Figure 3 of the paper

Related Work

The paper references the following related works:

Xu et al. (2024): FLARE - FPGA-based full-precision low-power CNN accelerator with reconfigurable architecture
Chen et al. (2021): Learning framework for n-bit quantized neural networks toward FPGAs
Latotzke et al. (2022): Design of high-throughput mixed-precision CNN accelerators on FPGA

The main distinction of this paper from related work lies in its focus on non-quantized implementation, achieving high performance and energy efficiency while maintaining full precision.

Conclusions and Discussion

Main Conclusions

Successfully Addresses Key Requirements: This research successfully addresses the critical need for efficient CNN implementation in power-constrained environments
Balances Performance and Energy Efficiency: The proposed non-quantized FPGA CNN framework successfully combines high performance and energy efficiency
Ensures Accuracy: Achieves high accuracy by maintaining full precision of network parameters without compromising resource utilization or power consumption
Experimental Validation: Experimental results validate framework effectiveness, demonstrating significant acceleration of inference processing and substantial reduction in power usage

Limitations

Limited Test Scope: Experiments focus primarily on matrix multiplication; detailed results for complete CNNs are missing
Accuracy Verification: The paper claims accuracy preservation but provides no specific accuracy comparison data
Applicability Range: Framework applicability may be limited by FPGA resources and specific application requirements

Future Directions

The paper does not explicitly state future research directions, but they can be inferred to include:

More extensive CNN network testing and validation
Further energy efficiency optimization
Support for additional types of neural network layers

In-Depth Evaluation

Strengths

Technical Innovation:
- Achieves high-performance FPGA CNN implementation while maintaining full precision
- Innovative HLS computation engine design effectively leverages stream processing and multi-memory channels
Experimental Comprehensiveness:
- Comprehensive testing across multiple hardware platforms
- Includes comparative experiments with CPUs and GPUs
- Detailed measurements of both performance and energy efficiency metrics
Practical Value:
- Based on widely-used Darknet framework, easy to adopt
- Supports FPGAs ranging from small to large scales
- Applicable to power-sensitive application scenarios
Result Convincingness:
- Achieves orders of magnitude performance improvements
- Excellent performance across multiple metrics
- Surpasses GPU energy efficiency even with process technology disadvantage

Weaknesses

Insufficient Completeness Verification:
- Lacks end-to-end test results for complete CNN networks
- No specific accuracy preservation verification data provided
- Testing primarily concentrated at matrix multiplication level
Benchmark Selection:
- Reference implementation may not be sufficiently optimized
- Lacks comparison with other advanced FPGA CNN frameworks
Insufficient Technical Details:
- HLS implementation optimization strategies not described in sufficient detail
- Missing resource utilization data
- Insufficient analysis of memory bandwidth utilization efficiency
Applicability Analysis:
- Insufficient discussion of method limitations and applicable scope
- Insufficient analysis of scalability for different CNN scales

Impact Assessment

Academic Contribution:
- Provides new solution for non-quantized FPGA CNN implementation
- Achieves high performance while maintaining accuracy, possessing important theoretical value
Practical Value:
- Based on mature toolchain, facilitating engineering implementation
- Applicable to edge computing and embedded AI applications
Reproducibility:
- Based on standard HLS tools and open-source Darknet framework
- Relatively clear technical approach, which should make the work reasonably reproducible

Applicable Scenarios

Edge AI Applications: Power-sensitive scenarios with high accuracy requirements
Real-Time Image Processing: Visual processing tasks requiring low latency and high performance
Embedded Systems: Resource-constrained devices requiring AI capabilities
Industrial Automation: Industrial AI applications with high reliability and accuracy requirements

References

1 Xu, Y.; Luo, J.; Sun, W. Flare: An FPGA-Based Full Precision Low Power CNN Accelerator with Reconfigurable Structure. Sensors 2024, 24

2 Chen, J.; Liu, L.; Liu, Y.; Zeng, X. A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs. IEEE Transactions on Neural Networks and Learning Systems 2021, 32, 1067–1081.

3 Latotzke, C.; Ciesielski, T.; Gemmeke, T. Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 2022, pp. 358–365.

Overall Assessment: This is a practically valuable paper in the FPGA CNN accelerator field, proposing an innovative solution that maintains full precision and reports impressive experimental results. However, the paper has room for improvement in end-to-end verification and in the level of technical detail it provides. For AI applications that require high accuracy, the framework has promising application prospects.