The growing demand for real-time processing in artificial intelligence applications, particularly those involving Convolutional Neural Networks (CNNs), has highlighted the need for efficient computational solutions. Conventional processors often fall short in balancing performance, power consumption, and latency, especially in embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance with energy efficiency and reconfigurability. The presented framework addresses the complex and demanding computations of CNNs on FPGAs while maintaining full precision in all neural network parameters. Specifically, our framework is based on Darknet, which is widely used for the design of CNNs, and allows the designer, using an input similar to that given to Darknet, to efficiently implement a CNN in a heterogeneous system comprising CPUs and FPGAs. Compared with FPGA frameworks that support quantization, our solution aims to offer similar performance and/or energy efficiency without any degradation in NN accuracy.
- Paper ID: 2510.13362
- Title: Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks
- Authors: Angelos Athanasiadis¹, Nikolaos Tampouratzis², Ioannis Papaefstathiou¹
- Institutions: ¹Aristotle University of Thessaloniki, ²International Hellenic University
- Classification: cs.AR (Computer Architecture)
- Paper Link: https://arxiv.org/abs/2510.13362
With the growing demand for real-time processing in artificial intelligence applications, particularly those involving convolutional neural networks (CNNs), the need for efficient computational solutions has become increasingly prominent. Traditional processors often fall short in balancing performance, power consumption, and latency, especially on embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance, energy efficiency, and reconfigurability. The framework proposed in this paper addresses the complex computational requirements of CNNs on FPGAs while maintaining full precision for all neural network parameters. Based on the widely-used Darknet CNN design framework, this framework allows designers to use Darknet-like inputs for efficient CNN implementation in heterogeneous systems containing both CPUs and FPGAs. Compared to FPGA frameworks supporting quantization, this solution aims to deliver similar performance and/or energy efficiency without sacrificing neural network accuracy.
The core problem addressed in this research is how to implement non-quantized convolutional neural networks on FPGAs with high performance and energy efficiency while keeping all parameters at full precision.
- Growing Real-Time Processing Demands: AI applications, particularly CNN-based applications, increasingly require real-time processing capabilities
- Limitations of Traditional Processors: Conventional CPUs fall short in balancing performance, power consumption, and latency
- Embedded and Edge Computing Challenges: Resource-constrained devices require more efficient computational solutions
- Accuracy Loss from Quantization: Existing FPGA frameworks primarily focus on quantized models, which reduce resource usage and power consumption but often sacrifice accuracy
- Design Complexity: Lack of user-friendly and efficient design workflows
- Performance-Precision Trade-offs: Difficulty in achieving high performance and energy efficiency while maintaining full precision
The goal is to develop a framework that implements non-quantized CNNs on FPGAs, preserving model accuracy while delivering high performance and energy efficiency.
- Accuracy Preservation: By avoiding quantization and retaining full precision, the framework aims to maintain CNN model accuracy
- High Design Productivity and Flexibility: Based on the widely used Darknet CNN design framework, implemented in pure C/C++, supporting FPGAs ranging from small to large scales
- High Performance: Fully leverages the parallelism of any FPGA to accelerate CNN inference, ensuring timely and efficient processing
- Energy Efficiency Optimization: Optimizes power efficiency for CNN inference on FPGAs, suitable for power-sensitive applications
This research focuses on implementing efficient non-quantized CNN inference on FPGAs, with inputs being CNN model configuration files (Darknet-like format) and outputs being high-performance CNN implementations on CPU-FPGA heterogeneous systems.
As shown in Figure 1, the framework employs the following architectural design:
- Input Processing: Import new cfg files into the tool
- Preprocessing: Parallel preprocessing using OpenMP
- Parser: Parse network structure, identify convolutional layers, deconvolutional layers, and other layer types
- Computation Engine: Innovative HLS computation engine as the core component
- Parallel Processing: Parallel processing using OpenMP
- FPGA Implementation: Final neural network implementation on FPGA
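The parsing step in the flow above can be illustrated with a minimal sketch. It assumes the input follows the public Darknet cfg conventions (bracketed section names that repeat once per layer, `key=value` options, `#` or `;` comments). The function name `parse_cfg` and the sample network are hypothetical and are not taken from the paper's tool.

```python
from typing import Dict, List, Tuple

# One parsed section: (section name, {option: value}). Hypothetical helper type.
Section = Tuple[str, Dict[str, str]]


def parse_cfg(text: str) -> List[Section]:
    """Parse Darknet-style cfg text into an ordered list of sections."""
    sections: List[Section] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            # Section names repeat (one per layer), so keep a list, not a dict.
            sections.append((line[1:-1].strip(), {}))
        elif "=" in line:
            if not sections:
                raise ValueError("option before first section: " + line)
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
        else:
            raise ValueError("unrecognized line: " + line)
    return sections


# Hypothetical example network, used only to exercise the parser.
EXAMPLE_CFG = """
[net]
width=416
height=416
channels=3

[convolutional]
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
filters=32
size=3
stride=1
pad=1
activation=leaky
"""

layers = parse_cfg(EXAMPLE_CFG)
# Layers that a flow like this one would hand to a matrix-multiplication engine.
conv_layers = [opts for name, opts in layers if name == "convolutional"]
print([name for name, _ in layers])
```

Keeping the sections in an ordered list preserves the layer order, which a later stage needs in order to chain layer outputs to layer inputs.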
The innovative computation engine employs High-Level Synthesis (HLS) technology, capable of executing multiple mathematical operations within a single clock cycle, achieving relatively high throughput and performance.
As shown in Figure 2, the HLS FPGA kernel primarily handles matrix multiplication tasks, which form the foundation of nearly all CNN implementations:
- Memory Optimization: Leverages internal BRAM combined with HLS streams to optimize on-chip memory access patterns
- Stream Processing Mechanism:
  - Implements continuous data flow between processing elements without intermediate storage in BRAM
  - Reduces latency and resource overhead
  - Supports pipelined execution and enhances parallelism
  - Transfers data directly between producer and consumer processes
- Multi-Memory Channel Utilization:
  - Exploits multiple memory banks and dedicated channels connected to modern FPGAs
  - Inserts appropriate HLS directives to distribute data transfers across a parameterizable number of memory banks/channels
  - Fully utilizes available bandwidth from each memory interface
- High-Bandwidth Data Transfer: Data transfer between CPU and FPGA occurs at full data width (512 bits) per clock cycle, ensuring high-throughput communication between processing elements and memory subsystems
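To make concrete why a matrix-multiplication kernel can serve convolutional layers, the following sketch lowers a convolution to a single matrix product using the im2col transformation, which is how the open-source Darknet code base handles convolutions on the CPU. This is a plain software illustration under that assumption; it does not reproduce the paper's HLS kernel, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: List[Matrix], ksize: int, stride: int, pad: int) -> Matrix:
    """Unroll a channels x height x width image into a (C*k*k) x (out_h*out_w) matrix."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h = (height + 2 * pad - ksize) // stride + 1
    out_w = (width + 2 * pad - ksize) // stride + 1
    cols: Matrix = []
    for c in range(channels):
        for ky in range(ksize):
            for kx in range(ksize):
                row = []
                for oy in range(out_h):
                    for ox in range(out_w):
                        y = oy * stride + ky - pad
                        x = ox * stride + kx - pad
                        inside = 0 <= y < height and 0 <= x < width
                        # Positions that fall in the padding contribute zero.
                        row.append(image[c][y][x] if inside else 0.0)
                cols.append(row)
    return cols


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Naive C = A x B for an (m x k) matrix A and a (k x n) matrix B."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


# One 3x3 single-channel image and one 2x2 all-ones filter (stride 1, no padding).
image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]
weights = [[1.0, 1.0, 1.0, 1.0]]  # filters x (channels * ksize * ksize)
cols = im2col(image, ksize=2, stride=1, pad=0)
out = gemm(weights, cols)  # filters x (out_h * out_w)
print(out)  # [[12.0, 16.0, 24.0, 28.0]]
```

Each output value is the sum of one 2x2 window of the image, so the convolution has become one matrix product, the kind of operation the kernel described above accelerates.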
- Full-Precision Preservation: Unlike existing quantization methods, this framework maintains full precision for all parameters
- Stream Processing Optimization: Innovative stream processing mechanism reduces BRAM dependency and improves resource utilization efficiency
- Multi-Channel Memory Access: Fully exploits multi-memory channel characteristics of modern FPGAs
- Darknet-Based Design Flow: Provides a familiar and user-friendly design interface
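As a rough software analogy for the streaming idea, the sketch below chains Python generators so that each stage consumes values as soon as the previous stage produces them, without building an intermediate array. The word width follows from the figures in the text: a 512-bit transfer holds 512 / 32 = 16 FP32 values. Everything else (function names, the dot-product workload) is an assumption for illustration and is not the authors' kernel.

```python
from typing import Iterable, Iterator, List

WORD_BITS = 512
FP32_BITS = 32
LANES = WORD_BITS // FP32_BITS  # 16 FP32 values fit in one 512-bit word


def stream_words(values: List[float]) -> Iterator[List[float]]:
    """Producer: emit one LANES-wide word at a time, zero-padding the tail."""
    for start in range(0, len(values), LANES):
        word = values[start:start + LANES]
        yield word + [0.0] * (LANES - len(word))


def multiply_stream(a_words: Iterable[List[float]],
                    b_words: Iterable[List[float]]) -> Iterator[List[float]]:
    """Middle stage: lane-wise products of two word streams."""
    for word_a, word_b in zip(a_words, b_words):
        yield [x * y for x, y in zip(word_a, word_b)]


def accumulate(words: Iterable[List[float]]) -> float:
    """Consumer: reduce each word as it arrives; no word is stored afterwards."""
    total = 0.0
    for word in words:
        total += sum(word)
    return total


def dot_streamed(a: List[float], b: List[float]) -> float:
    """Dot product built from the three streaming stages above."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return accumulate(multiply_stream(stream_words(a), stream_words(b)))


result = dot_streamed([1.0] * 40, [2.0] * 40)
print(LANES, result)  # 16 80.0
```

In hardware the stages run concurrently and are connected by FIFOs; generators only model the data dependency, not the parallel timing.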
- High-End FPGA: AMD Alveo U55C
- Embedded FPGA: Kria KR260
- Reference CPUs: Intel Xeon E5-2620 v4 (8-core) and ARM Cortex-A53 (4-core)
- Reference GPU: NVIDIA T4
- Matrix Dimensions: M=2048, K=4096, N=16384
- Data Type: FP32 (32-bit floating-point)
- Test Purpose: Non-peak performance matrix dimensions selected to demonstrate method flexibility
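The size of this benchmark can be worked out directly from the dimensions. The sketch assumes the conventional layout (A is M x K, B is K x N, C is M x N) and counts one multiply and one add per inner-product term; the paper's exact operand layout is not stated here, so treat the per-matrix figures as an estimate.

```python
M, K, N = 2048, 4096, 16384
BYTES_PER_FP32 = 4


def gemm_flops(m: int, k: int, n: int) -> int:
    """Floating-point operations for C = A x B: one multiply and one add per term."""
    return 2 * m * k * n


def matrix_mib(rows: int, cols: int) -> float:
    """Storage for a dense FP32 matrix, in MiB."""
    return rows * cols * BYTES_PER_FP32 / 2**20


flops = gemm_flops(M, K, N)
a_mib = matrix_mib(M, K)
b_mib = matrix_mib(K, N)
c_mib = matrix_mib(M, N)

print(flops)                # 274877906944, about 274.9 GFLOP per multiplication
print(a_mib, b_mib, c_mib)  # 32.0 256.0 128.0
```

Under these assumptions one multiplication moves 416 MiB of operands and results, which is far more than on-chip memory can hold and is why off-chip bandwidth matters for this workload.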
- Performance: GFLOPS (Giga Floating-Point Operations Per Second)
- Energy Efficiency: GFLOPS/Watt
- Speedup: Performance improvement relative to reference implementations and CPU parallel implementations
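The three metrics relate to measured quantities as sketched below. The timings and power figure in the example are hypothetical placeholders chosen for easy arithmetic; they are not measurements from the paper.

```python
def gflops(flop_count: int, seconds: float) -> float:
    """Throughput in units of 10^9 floating-point operations per second."""
    return flop_count / seconds / 1e9


def gflops_per_watt(throughput_gflops: float, watts: float) -> float:
    """Energy efficiency: throughput divided by average power draw."""
    return throughput_gflops / watts


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Hypothetical inputs (NOT results from the paper).
flop_count = 2 * 2048 * 4096 * 16384  # FP32 GEMM with M=2048, K=4096, N=16384
baseline_seconds = 8.0
accelerated_seconds = 0.5
accelerator_watts = 40.0

achieved = gflops(flop_count, accelerated_seconds)
efficiency = gflops_per_watt(achieved, accelerator_watts)
gain = speedup(baseline_seconds, accelerated_seconds)
print(achieved, efficiency, gain)
```

Because energy efficiency divides by power, a platform can win on GFLOPS/Watt while losing on raw GFLOPS, so the two numbers should be read together.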
- Kria KR260 (embedded FPGA):
  - Relative to Reference Implementation: Two orders of magnitude performance improvement
  - Relative to ARM 4-Core CPU: 9× performance improvement
  - Energy Efficiency Improvement: 9× improvement compared to best CPU parallel implementation
- Alveo U55C (high-end FPGA):
  - Relative to Reference Implementation: Approximately three orders of magnitude performance improvement
  - Relative to Intel Xeon CPU: 10× performance improvement
  - Energy Efficiency Improvement: 34× improvement compared to best CPU parallel implementation
  - Relative to NVIDIA T4 GPU: 3× energy efficiency improvement (despite T4 using more advanced 12nm process while U55C uses 16nm)
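The performance and energy-efficiency ratios can be combined to estimate relative power draw, since efficiency is performance divided by power. The sketch assumes that each pair of ratios was measured against the same CPU baseline on the same run, which this summary does not confirm, so the outputs are a back-of-the-envelope reading rather than reported data.

```python
def implied_power_ratio(performance_ratio: float, efficiency_ratio: float) -> float:
    """Accelerator power relative to the baseline.

    efficiency = performance / power, so
    efficiency_ratio = performance_ratio / power_ratio, which rearranges to
    power_ratio = performance_ratio / efficiency_ratio.
    """
    return performance_ratio / efficiency_ratio


# Ratios quoted in the text: 9x / 9x for the ARM comparison, 10x / 34x for the Xeon one.
arm_case = implied_power_ratio(9.0, 9.0)
xeon_case = implied_power_ratio(10.0, 34.0)
print(arm_case)             # 1.0
print(round(xeon_case, 2))  # 0.29
```

Read this way, the 9×/9× pair would imply roughly equal power to the ARM baseline, while the 10×/34× pair would imply about 0.29 times the power of the Xeon baseline, if the stated assumption holds.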
- Significant Performance Improvements: Achieved orders of magnitude performance improvements across all tested platforms
- Excellent Energy Efficiency: Particularly achieved 34× energy efficiency improvement on Alveo U55C
- Technical Advantages: Surpassed GPU energy efficiency even with process technology disadvantage
- Consistency Verification: Experimental results for other matrix dimensions are consistent with those shown in Figure 3
The paper references the following related works:
- Xu et al. (2024): FLARE - FPGA-based full-precision low-power CNN accelerator with reconfigurable architecture
- Chen et al. (2021): Learning framework for n-bit quantized neural networks toward FPGAs
- Latotzke et al. (2022): Design of high-throughput mixed-precision CNN accelerators on FPGA
The main distinction of this paper from related work lies in its focus on non-quantized implementation, achieving high performance and energy efficiency while maintaining full precision.
- Successfully Addresses Key Requirements: This research successfully addresses the critical need for efficient CNN implementation in power-constrained environments
- Balances Performance and Energy Efficiency: The proposed non-quantized FPGA CNN framework successfully combines high performance and energy efficiency
- Ensures Accuracy: Achieves high accuracy by maintaining full precision of network parameters without compromising resource utilization or power consumption
- Experimental Validation: Experimental results validate framework effectiveness, demonstrating significant acceleration of inference processing and substantial reduction in power usage
- Limited Test Scope: Experiments focus primarily on matrix multiplication operations; detailed results for complete CNNs are incomplete
- Accuracy Verification: The paper claims accuracy preservation but provides no specific accuracy comparison data
- Applicability Range: Framework applicability may be limited by FPGA resources and specific application requirements
The paper does not explicitly state future research directions, but they can be inferred to include:
- More extensive CNN network testing and validation
- Further energy efficiency optimization
- Support for additional types of neural network layers
- Technical Innovation:
  - Achieves high-performance FPGA CNN implementation while maintaining full precision
  - Innovative HLS computation engine design effectively leverages stream processing and multi-memory channels
- Experimental Comprehensiveness:
  - Comprehensive testing across multiple hardware platforms
  - Includes comparative experiments with CPUs and GPUs
  - Detailed measurements of both performance and energy efficiency metrics
- Practical Value:
  - Based on widely-used Darknet framework, easy to adopt
  - Supports FPGAs ranging from small to large scales
  - Applicable to power-sensitive application scenarios
- Result Convincingness:
  - Achieves orders of magnitude performance improvements
  - Excellent performance across multiple metrics
  - Surpasses GPU energy efficiency even with process technology disadvantage
- Insufficient Completeness Verification:
  - Lacks end-to-end test results for complete CNN networks
  - No specific accuracy preservation verification data provided
  - Testing primarily concentrated at matrix multiplication level
- Benchmark Selection:
  - Reference implementation may not be sufficiently optimized
  - Lacks comparison with other advanced FPGA CNN frameworks
- Insufficient Technical Details:
  - HLS implementation optimization strategies not described in sufficient detail
  - Missing resource utilization data
  - Insufficient analysis of memory bandwidth utilization efficiency
- Applicability Analysis:
  - Insufficient discussion of method limitations and applicable scope
  - Insufficient analysis of scalability for different CNN scales
- Academic Contribution:
  - Provides a new solution for non-quantized FPGA CNN implementation
  - Achieves high performance while maintaining accuracy, which gives the work theoretical value
- Practical Value:
  - Based on a mature toolchain, which eases engineering implementation
  - Applicable to edge computing and embedded AI applications
- Reproducibility:
  - Based on standard HLS tools and the open-source Darknet framework
  - The technical approach is described clearly enough to be reasonably reproducible
- Edge AI Applications: Power-sensitive scenarios with high accuracy requirements
- Real-Time Image Processing: Visual processing tasks requiring low latency and high performance
- Embedded Systems: Resource-constrained devices requiring AI capabilities
- Industrial Automation: Industrial AI applications with high reliability and accuracy requirements
1 Xu, Y.; Luo, J.; Sun, W. Flare: An FPGA-Based Full Precision Low Power CNN Accelerator with Reconfigurable Structure. Sensors 2024, 24
2 Chen, J.; Liu, L.; Liu, Y.; Zeng, X. A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs. IEEE Transactions on Neural Networks and Learning Systems 2021, 32, 1067–1081.
3 Latotzke, C.; Ciesielski, T.; Gemmeke, T. Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 2022, pp. 358–365.
Overall Assessment: This is a practically valuable paper in the FPGA CNN accelerator field, proposing an innovative solution that maintains full precision with impressive experimental results. However, the paper has room for improvement in completeness verification and technical detail description. For AI application scenarios requiring high accuracy, this framework has important application prospects.