
Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks

Athanasiadis, Tampouratzis, Papaefstathiou
The growing demand for real-time processing in artificial intelligence applications, particularly those involving Convolutional Neural Networks (CNNs), has highlighted the need for efficient computational solutions. Conventional processors often fall short in balancing performance, power consumption, and latency, especially in embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance with energy efficiency and reconfigurability. The presented framework addresses the complex and demanding computations of CNNs on FPGAs while maintaining full precision in all neural network parameters. Specifically, our framework is based on Darknet, which is widely used for the design of CNNs, and allows the designer, using input similar to that given to Darknet, to efficiently implement a CNN in a heterogeneous system comprising CPUs and FPGAs. When compared with the FPGA frameworks that support quantization, our solution aims to offer similar performance and/or energy efficiency without any degradation in NN accuracy.

Basic Information

  • Paper ID: 2510.13362
  • Title: Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks
  • Authors: Angelos Athanasiadis¹, Nikolaos Tampouratzis², Ioannis Papaefstathiou¹
  • Institutions: ¹Aristotle University of Thessaloniki, ²International Hellenic University
  • Classification: cs.AR (Computer Architecture)
  • Paper Link: https://arxiv.org/abs/2510.13362

Abstract

With the growing demand for real-time processing in artificial intelligence applications, particularly those involving convolutional neural networks (CNNs), the need for efficient computational solutions has become increasingly prominent. Traditional processors often fall short in balancing performance, power consumption, and latency, especially on embedded systems and edge computing platforms. Field-Programmable Gate Arrays (FPGAs) offer a promising alternative, combining high performance, energy efficiency, and reconfigurability. The framework proposed in this paper addresses the complex computational requirements of CNNs on FPGAs while maintaining full precision for all neural network parameters. Based on the widely-used Darknet CNN design framework, this framework allows designers to use Darknet-like inputs for efficient CNN implementation in heterogeneous systems containing both CPUs and FPGAs. Compared to FPGA frameworks supporting quantization, this solution aims to deliver similar performance and/or energy efficiency without sacrificing neural network accuracy.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is how to implement non-quantized convolutional neural networks on FPGAs with high performance and energy efficiency while keeping all parameters at full precision.

Problem Significance

  1. Growing Real-Time Processing Demands: AI applications, particularly CNN-based applications, increasingly require real-time processing capabilities
  2. Limitations of Traditional Processors: Conventional CPUs fall short in balancing performance, power consumption, and latency
  3. Embedded and Edge Computing Challenges: Resource-constrained devices require more efficient computational solutions

Limitations of Existing Approaches

  1. Accuracy Loss from Quantization: Existing FPGA frameworks primarily focus on quantized models, which reduce resource usage and power consumption but often sacrifice accuracy
  2. Design Complexity: Lack of user-friendly and efficient design workflows
  3. Performance-Precision Trade-offs: Difficulty in achieving high performance and energy efficiency while maintaining full precision

Research Motivation

The goal is to develop a framework that implements non-quantized CNNs on FPGAs, maintaining high model accuracy while achieving high performance and energy efficiency.

Core Contributions

  1. Accuracy Preservation: By avoiding quantization and retaining full precision, the framework aims to maintain CNN model accuracy
  2. High Design Productivity and Flexibility: Based on the widely-used Darknet CNN design framework, implemented in pure C/C++, supporting FPGAs ranging from small to large scales
  3. High Performance: Fully leverages the parallelism of any FPGA to accelerate CNN inference, ensuring timely and efficient processing
  4. Energy Efficiency Optimization: Optimizes power efficiency for CNN inference on FPGAs, suitable for power-sensitive applications

Methodology Details

Task Definition

This research focuses on implementing efficient non-quantized CNN inference on FPGAs, with inputs being CNN model configuration files (Darknet-like format) and outputs being high-performance CNN implementations on CPU-FPGA heterogeneous systems.

Framework Architecture

As shown in Figure 1, the framework employs the following architectural design:

  1. Input Processing: Import new cfg files into the tool
  2. Preprocessing: Parallel preprocessing using OpenMP
  3. Parser: Parse network structure, identify convolutional layers, deconvolutional layers, and other layer types
  4. Computation Engine: Innovative HLS computation engine as the core component
  5. Parallel Processing: Parallel processing using OpenMP
  6. FPGA Implementation: Final neural network implementation on FPGA
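
The parsing step above can be illustrated with a minimal sketch. It assumes the standard Darknet cfg layout (bracketed section names followed by key=value lines); the sample network, the function names `parse_cfg` and `layer_histogram`, and all layer values are hypothetical and are not taken from the paper's tool.

```python
from collections import Counter

# Hypothetical Darknet-style configuration; the layer values are made up
# for illustration and do not come from the paper.
SAMPLE_CFG = """
[net]
width=416
height=416
channels=3

[convolutional]
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
filters=32
size=3
stride=1
pad=1
activation=leaky
"""


def parse_cfg(text):
    """Split a Darknet-style cfg into an ordered list of (section, options)."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":      # skip blank and comment lines
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def layer_histogram(sections):
    """Count layer types, skipping the global [net] section."""
    return Counter(name for name, _ in sections if name != "net")


layers = parse_cfg(SAMPLE_CFG)
counts = layer_histogram(layers)
print(counts["convolutional"], counts["maxpool"])
```

Section names repeat in a cfg file and their order defines the network, so the sketch keeps an ordered list rather than a dictionary keyed by section name.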

Innovative HLS Computation Engine

Core Design Philosophy

The computation engine is built with High-Level Synthesis (HLS) and can execute multiple mathematical operations within a single clock cycle, which yields relatively high throughput and performance.

Technical Implementation Details

As shown in Figure 2, the HLS FPGA kernel primarily handles matrix multiplication tasks, which form the foundation of nearly all CNN implementations:

  1. Memory Optimization: Leverages internal BRAM combined with HLS streams to optimize on-chip memory access patterns
  2. Stream Processing Mechanism:
    • Implements continuous data flow between processing elements without intermediate storage in BRAM
    • Reduces latency and resource overhead
    • Supports pipelined execution and enhances parallelism
    • Transfers data directly between producer and consumer processes
  3. Multi-Memory Channel Utilization:
    • Exploits multiple memory banks and dedicated channels connected to modern FPGAs
    • Inserts appropriate HLS directives to distribute data transfers across a parameterizable number of memory banks/channels
    • Fully utilizes available bandwidth from each memory interface
  4. High-Bandwidth Data Transfer: Data transfer between CPU and FPGA occurs at full data width (512 bits) per clock cycle, ensuring high-throughput communication between processing elements and memory subsystems
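
The statement that matrix multiplication underlies CNN layers can be illustrated by lowering a convolution to a single matrix product (the im2col approach used by the CPU version of Darknet). The following is a minimal pure-Python sketch for stride 1 and no padding. Whether the paper's kernel uses exactly this lowering is an assumption, and all function names are illustrative.

```python
def im2col(image, ksize):
    """Unroll every ksize x ksize patch of a C x H x W image into a column.

    Stride 1, no padding. Returns a (C*ksize*ksize) x (out_h*out_w) matrix.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = height - ksize + 1, width - ksize + 1
    cols = []
    for c in range(channels):
        for ky in range(ksize):
            for kx in range(ksize):
                cols.append([image[c][y + ky][x + kx]
                             for y in range(out_h) for x in range(out_w)])
    return cols


def matmul(a, b):
    """Plain (M x K) by (K x N) matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv_direct(image, filters, ksize):
    """Reference convolution (cross-correlation, as used in CNN frameworks)."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = height - ksize + 1, width - ksize + 1
    out = []
    for f in filters:                 # f is a flat list of C*ksize*ksize weights
        row = []
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(channels):
                    for ky in range(ksize):
                        for kx in range(ksize):
                            w = f[(c * ksize + ky) * ksize + kx]
                            acc += w * image[c][y + ky][x + kx]
                row.append(acc)
        out.append(row)
    return out


# Tiny example: one 3x3 input channel, two 2x2 filters.
image = [[[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]]
filters = [[1.0, 0.0, 0.0, 1.0],     # top-left + bottom-right of each patch
           [1.0, 1.0, 1.0, 1.0]]     # sum of each patch

gemm_out = matmul(filters, im2col(image, 2))
direct_out = conv_direct(image, filters, 2)
print(gemm_out == direct_out)
```

In this lowering, M is the number of filters, K is the number of weights per filter, and N is the number of output positions, so one well-optimized matrix-multiplication kernel can serve every convolutional layer.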

Technical Innovations

  1. Full-Precision Preservation: Unlike existing quantization methods, this framework maintains full precision for all parameters
  2. Stream Processing Optimization: Innovative stream processing mechanism reduces BRAM dependency and improves resource utilization efficiency
  3. Multi-Channel Memory Access: Fully exploits multi-memory channel characteristics of modern FPGAs
  4. Darknet-Based Design Flow: Provides a familiar and user-friendly design interface
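
To make the memory-side figures concrete: a 512-bit interface carries 16 FP32 values per transfer. The sketch below packs floats into 64-byte words and spreads the words over a configurable number of banks. The round-robin assignment is an assumed policy chosen for illustration; the source only states that transfers are distributed across a parameterizable number of banks/channels. The names `pack_words` and `split_across_banks` are hypothetical.

```python
import struct

WORD_BITS = 512                      # interface width stated in the text
FP32_BITS = 32
LANES = WORD_BITS // FP32_BITS       # FP32 values per 512-bit transfer


def pack_words(values):
    """Pack FP32 values into 64-byte words, zero-padding the last word."""
    padded = list(values) + [0.0] * (-len(values) % LANES)
    return [struct.pack("<%df" % LANES, *padded[i:i + LANES])
            for i in range(0, len(padded), LANES)]


def split_across_banks(words, num_banks):
    """Assign words to banks round-robin (assumed policy, illustration only)."""
    return [words[b::num_banks] for b in range(num_banks)]


values = [float(i) for i in range(40)]   # 40 floats need 3 words (last padded)
words = pack_words(values)
banks = split_across_banks(words, num_banks=2)
print(LANES, len(words), [len(b) for b in banks])
```

With this layout each bank can be read independently, which is the property that lets the aggregate bandwidth scale with the number of memory interfaces.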

Experimental Setup

Hardware Platforms

  • High-End FPGA: AMD Alveo U55C
  • Embedded FPGA: Kria KR260
  • Reference CPUs: Intel Xeon E5-2620 v4 (8-core) and ARM Cortex-A53 (4-core)
  • Reference GPU: NVIDIA T4

Test Configuration

  • Matrix Dimensions: M=2048, K=4096, N=16384
  • Data Type: FP32 (32-bit floating-point)
  • Test Purpose: The matrix dimensions were deliberately not chosen to maximize peak performance, in order to demonstrate the flexibility of the method
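
The size of this workload can be worked out directly from the dimensions above. The sketch assumes the usual GEMM convention, C = A × B with A of size M×K and B of size K×N, which the source does not state explicitly; the helper names are illustrative.

```python
M, K, N = 2048, 4096, 16384          # dimensions from the test configuration
BYTES_PER_FP32 = 4


def gemm_flops(m, k, n):
    """One multiply and one add per (i, j, k) triple."""
    return 2 * m * k * n


def footprint_mib(rows, cols):
    """Size of a dense FP32 matrix in MiB."""
    return rows * cols * BYTES_PER_FP32 / 2**20


flops = gemm_flops(M, K, N)
sizes = {
    "A (M x K)": footprint_mib(M, K),
    "B (K x N)": footprint_mib(K, N),
    "C (M x N)": footprint_mib(M, N),
}
print(flops)   # 274877906944, i.e. 2**38
print(sizes)
```

Under that convention, one product costs about 275 GFLOP and the three matrices occupy 32, 256, and 128 MiB, so the operands must be streamed from external memory rather than held on chip.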

Evaluation Metrics

  1. Performance: GFLOPS (Giga Floating-Point Operations Per Second)
  2. Energy Efficiency: GFLOPS/Watt
  3. Speedup: Performance improvement relative to reference implementations and CPU parallel implementations
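
The three metrics can be written as small helper functions. The timing and power values below are hypothetical placeholders, not measurements from the paper; only the matrix dimensions come from the test configuration.

```python
def gflops(flop_count, seconds):
    """Throughput in giga floating-point operations per second."""
    return flop_count / seconds / 1e9


def gflops_per_watt(gflops_value, watts):
    """Energy efficiency: throughput divided by average power draw."""
    return gflops_value / watts


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Hypothetical measurements (NOT from the paper), for illustration only.
flop_count = 2 * 2048 * 4096 * 16384     # one GEMM of the tested size
baseline_s = 10.0                        # assumed baseline runtime
accel_s = 0.5                            # assumed accelerated runtime
accel_w = 50.0                           # assumed accelerator power draw

perf = gflops(flop_count, accel_s)
efficiency = gflops_per_watt(perf, accel_w)
gain = speedup(baseline_s, accel_s)
print(perf, efficiency, gain)
```

Reporting GFLOPS/Watt alongside raw GFLOPS matters because a device can be slower in absolute terms and still complete the same work with less energy.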

Experimental Results

Primary Performance Results

Embedded FPGA (Kria KR260)

  • Relative to Reference Implementation: Two orders of magnitude performance improvement
  • Relative to ARM 4-Core CPU: 9× performance improvement
  • Energy Efficiency Improvement: 9× improvement compared to best CPU parallel implementation

High-End FPGA (Alveo U55C)

  • Relative to Reference Implementation: Approximately three orders of magnitude performance improvement
  • Relative to Intel Xeon CPU: 10× performance improvement
  • Energy Efficiency Improvement: 34× improvement compared to best CPU parallel implementation
  • Relative to NVIDIA T4 GPU: 3× energy efficiency improvement (even though the T4 is built on a more advanced 12 nm process, while the U55C uses 16 nm)

Key Findings

  1. Significant Performance Improvements: Achieved orders of magnitude performance improvements across all tested platforms
  2. Excellent Energy Efficiency: Particularly achieved 34× energy efficiency improvement on Alveo U55C
  3. Technical Advantages: Surpassed GPU energy efficiency even with process technology disadvantage
  4. Consistency Verification: Experimental results for different matrix dimensions are fully consistent with the results shown in Figure 3

Related Work

The paper references the following related works:

  1. Xu et al. (2024): FLARE - FPGA-based full-precision low-power CNN accelerator with reconfigurable architecture
  2. Chen et al. (2021): Learning framework for n-bit quantized neural networks toward FPGAs
  3. Latotzke et al. (2022): Design of high-throughput mixed-precision CNN accelerators on FPGA

The main distinction of this paper from related work lies in its focus on non-quantized implementation, achieving high performance and energy efficiency while maintaining full precision.

Conclusions and Discussion

Main Conclusions

  1. Successfully Addresses Key Requirements: This research successfully addresses the critical need for efficient CNN implementation in power-constrained environments
  2. Balances Performance and Energy Efficiency: The proposed non-quantized FPGA CNN framework successfully combines high performance and energy efficiency
  3. Ensures Accuracy: Achieves high accuracy by maintaining full precision of network parameters without compromising resource utilization or power consumption
  4. Experimental Validation: Experimental results validate framework effectiveness, demonstrating significant acceleration of inference processing and substantial reduction in power usage

Limitations

  1. Limited Test Scope: Experiments primarily focus on matrix multiplication operations and do not include detailed results for complete CNNs
  2. Accuracy Verification: The paper claims accuracy preservation but provides no specific accuracy comparison data
  3. Applicability Range: Framework applicability may be limited by FPGA resources and specific application requirements

Future Directions

The paper does not explicitly state future research directions, but they can be inferred to include:

  1. More extensive CNN network testing and validation
  2. Further energy efficiency optimization
  3. Support for additional types of neural network layers

In-Depth Evaluation

Strengths

  1. Technical Innovation:
    • Achieves high-performance FPGA CNN implementation while maintaining full precision
    • Innovative HLS computation engine design effectively leverages stream processing and multi-memory channels
  2. Experimental Comprehensiveness:
    • Comprehensive testing across multiple hardware platforms
    • Includes comparative experiments with CPUs and GPUs
    • Detailed measurements of both performance and energy efficiency metrics
  3. Practical Value:
    • Based on widely-used Darknet framework, easy to adopt
    • Supports FPGAs ranging from small to large scales
    • Applicable to power-sensitive application scenarios
  4. Convincing Results:
    • Achieves orders of magnitude performance improvements
    • Excellent performance across multiple metrics
    • Surpasses GPU energy efficiency even with process technology disadvantage

Weaknesses

  1. Incomplete Validation:
    • Lacks end-to-end test results for complete CNN networks
    • No specific accuracy preservation verification data provided
    • Testing primarily concentrated at matrix multiplication level
  2. Benchmark Selection:
    • Reference implementation may not be sufficiently optimized
    • Lacks comparison with other advanced FPGA CNN frameworks
  3. Insufficient Technical Details:
    • HLS implementation optimization strategies not described in sufficient detail
    • Missing resource utilization data
    • Insufficient analysis of memory bandwidth utilization efficiency
  4. Applicability Analysis:
    • Insufficient discussion of method limitations and applicable scope
    • Insufficient analysis of scalability for different CNN scales

Impact Assessment

  1. Academic Contribution:
    • Provides new solution for non-quantized FPGA CNN implementation
    • Achieves high performance while maintaining accuracy, possessing important theoretical value
  2. Practical Value:
    • Based on mature toolchain, facilitating engineering implementation
    • Applicable to edge computing and embedded AI applications
  3. Reproducibility:
    • Based on standard HLS tools and open-source Darknet framework
    • Relatively clear technical approach, which should make the work reasonably reproducible

Applicable Scenarios

  1. Edge AI Applications: Power-sensitive scenarios with high accuracy requirements
  2. Real-Time Image Processing: Visual processing tasks requiring low latency and high performance
  3. Embedded Systems: Resource-constrained devices requiring AI capabilities
  4. Industrial Automation: Industrial AI applications with high reliability and accuracy requirements

References

1 Xu, Y.; Luo, J.; Sun, W. Flare: An FPGA-Based Full Precision Low Power CNN Accelerator with Reconfigurable Structure. Sensors 2024, 24

2 Chen, J.; Liu, L.; Liu, Y.; Zeng, X. A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs. IEEE Transactions on Neural Networks and Learning Systems 2021, 32, 1067–1081.

3 Latotzke, C.; Ciesielski, T.; Gemmeke, T. Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 2022, pp. 358–365.


Overall Assessment: This is a practically valuable paper in the FPGA CNN accelerator field. It proposes a solution that maintains full precision and reports impressive experimental results. However, the paper would benefit from end-to-end validation and more technical detail. For AI applications that require high accuracy, the framework has promising application prospects.