2025-11-20T07:07:14.857348

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-$2^k$ FFT in Large Size Processing

Zhao, Xiao, Wang et al.
In the field of digital signal processing, the fast Fourier transform (FFT) is a fundamental algorithm, with its processors being implemented using either the pipelined architecture, well-known for high-throughput applications but weak in hardware utilization, or the memory-based architecture, designed for area-constrained scenarios but failing to meet stringent throughput requirements. Therefore, we propose an adaptive hybrid FFT, which leverages the strengths of both pipelined and memory-based architectures. In this paper, we propose an adaptive hybrid FFT processor that combines the advantages of both architectures, and it has the following features. First, a set of radix-$2^k$ multi-path delay commutators (MDC) units are developed to support high-performance large-size processing. Second, a conflict-free memory access scheme is formulated to ensure a continuous data flow without data contention. Third, We demonstrate the existence of a series of bit-dimension permutations for reordering input data, satisfying the generalized constraints of variable-length, high-radix, and any level of parallelism for wide adaptivity. Furthermore, the proposed FFT processor has been implemented on a field-programmable gate array (FPGA). As a result, the proposed work outperforms conventional memory-based FFT processors by requiring fewer computation cycles. It achieves higher hardware utilization than pipelined FFT architectures, making it suitable for highly demanding applications.
academic

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-2k2^k FFT in Large Size Processing

Basic Information

  • Paper ID: 2501.01259
  • Title: Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-2k2^k FFT in Large Size Processing
  • Authors: Fangyu Zhao, Chunhua Xiao, Zhiguo Wang, Xiaohua Du, Bo Dong
  • Classification: cs.AR (Computer Architecture)
  • Publication Date/Venue: Submitted to IEEE, January 2025
  • Paper Link: https://arxiv.org/abs/2501.01259

Abstract

The Fast Fourier Transform (FFT) is a fundamental algorithm in digital signal processing. Processor implementations of FFT typically employ two architectural paradigms: pipeline architecture (suitable for high-throughput applications but with low hardware utilization) and memory-based architecture (suitable for area-constrained scenarios but unable to meet stringent throughput requirements). This paper proposes an adaptive hybrid FFT architecture that combines the advantages of both approaches. The architecture is characterized by: (1) development of a set of radix-2k2^k multi-path delay commutator (MDC) units to support high-performance large-scale processing; (2) formulation of a conflict-free memory access scheme ensuring continuous data flow; (3) proof of the existence of a series of bit-dimension permutations satisfying broad adaptability requirements for variable lengths, high radices, and arbitrary parallelism degrees.

Research Background and Motivation

Problem Definition

  1. Core Problem: Traditional FFT processor architectures suffer from inherent deficiencies
    • Pipeline architecture: High throughput but low hardware utilization, with substantial hardware idleness during small-scale FFT operations
    • Memory-based architecture: High hardware utilization but increased computation cycles, affecting real-time processing performance
  2. Problem Significance:
    • FFT is widely applied in wireless communications, image processing, radar signal processing, and other domains
    • Growing demand for large-scale data processing requires both efficient and flexible FFT processors
    • Existing architectures cannot simultaneously satisfy high throughput and high hardware utilization requirements
  3. Limitations of Existing Methods:
    • Pipeline architecture can have hardware utilization as low as 15% when processing small-scale FFTs
    • Memory-based architecture requires multiple iterations, increasing computational latency
    • Existing conflict avoidance schemes are primarily limited to radix-2 algorithms and do not support high-radix computation
  4. Research Motivation:
    • Combine advantages of both architectures to achieve adaptive reconfiguration
    • Support large-scale FFT processing (up to 512K points)
    • Improve hardware utilization while ensuring high throughput

Core Contributions

  1. Proposed Adaptive Hybrid FFT Processor Architecture: Supports both pipeline and memory-based modes, capable of processing FFTs up to 512K points
  2. Developed Radix-2k2^k Multi-Path Delay Commutator (MDC): Supports radix-252^5 algorithm, significantly reducing the number of computation stages
  3. Designed Conflict-Free Memory Access Technique: Enables continuous-flow FFT computation with fully in-place memory transformation
  4. Constructed Universal Bit-Permutation Method: Adapts to hardware constraints of different FFT lengths, radices, and parallelism degrees

Methodology Details

Task Definition

Design a reconfigurable FFT processor capable of:

  • Input: N-point complex sequence (N = 2^n, maximum 512K)
  • Output: Corresponding frequency-domain representation
  • Constraints: Support radix-2k2^k (k≤5) algorithms, configurable parallelism degree P, achieving conflict-free memory access

Model Architecture

1. Top-Level Architecture Design

Input Data → Data Reordering Module → FFT Core Processor → Output Data
           ↑                          ↑
    Memory Bank Group          MDC Unit Group
    Address Generation Unit    (P parallel units)
    Parallel Branch Permutation Circuit
    Reordering Circuit

2. Key Component Details

Multi-Path Delay Commutator (MDC) Unit:

  • Supports radix-252^5/24/23/22 hybrid computation
  • Employs modified radix-252^5 algorithm, classifying twiddle factors as:
    • Constant (C): Pre-stored in ROM
    • Non-trivial (NT): Requires complex multipliers
    • Trivial (T): Simple ±1, ±j operations

Data Reordering Strategy: Based on bit-dimension permutation implementing three-level transformation: σNs,k,P=σN,3s,k,PσN,2s,k,PσN,1s,k,P\sigma^{s,k,P}_N = \sigma^{s,k,P}_{N,3} \circ \sigma^{s,k,P}_{N,2} \circ \sigma^{s,k,P}_{N,1}

Where:

  • σN,1s,k,P\sigma^{s,k,P}_{N,1}: Serial bit-dimension permutation
  • σN,2s,k,P\sigma^{s,k,P}_{N,2}: Parallel branch exchange
  • σN,3s,k,P\sigma^{s,k,P}_{N,3}: Fine-grained index adjustment

3. Conflict-Free Memory Access Scheme

Pipeline Mode:

  • Employs interleaved addressing pattern: natural order and reversed order
  • Read-write address relationship: σWi=σRi1\sigma^i_W = \sigma^{i-1}_R
  • Ensures continuous data flow without conflicts

Memory-Based Mode:

  • Introduces additional permutation σ~N,1s,k,P\tilde{\sigma}^{s,k,P}_{N,1} for in-place storage
  • Applicable for N ∈ (2^{2k}, 2^{3k}] large-scale processing

Technical Innovations

  1. Unified Radix-2k2^k Architecture: Achieves hardware reuse through algorithm modification, supporting multiple radices with a single hardware set
  2. Adaptive Reconfiguration Capability: Dynamically selects operating mode based on FFT size and performance requirements
  3. Universal Bit-Permutation Theory: Extends existing methods to support arbitrary radices, lengths, and parallelism degrees
  4. Optimized Memory Access Patterns: Designs specialized conflict-free access strategies for different modes

Experimental Setup

Hardware Platform

  • FPGA: Xilinx Virtex UltraScale+ VCU118 (xcvu9p-flga2104-2L-e)
  • Development Tools: Chisel HDL, Xilinx Vivado 2019.2
  • Memory Implementation:
    • Data Storage: Ultra RAMs (URAMs), 256K address × 32-bit per memory
    • Twiddle Factor Storage: Block RAMs (BRAMs)

Evaluation Metrics

  1. Hardware Utilization: Average proportion of active butterfly units
  2. Computation Cycles: Clock cycles required to complete FFT
  3. Processing Time: Number of iterations × cycles per iteration
  4. Resource Consumption: Usage of DSP48E2, LUT, FF, and other hardware resources

Comparison Methods

  1. Memory-Based Architecture: Tsai'11, Kaya'23, Wang'20
  2. Pipeline Architecture: Garrido'13

Experimental Results

Main Results

1. Comparison with Memory-Based Architectures

ArchitectureRadixFFT LengthParallelismIterationsProcessing Time Reduction
Tsai'11radix-2³64~4K2⌈n/3⌉70%+
Kaya'23radix-22K~16K2⌈n/2⌉70%+
Wang'20radix-2³32~32K4⌈n/3⌉70%+
This Workradix-2⁵32~512K8⌈n/5⌉Baseline

2. Comparison with Pipeline Architectures

ConfigurationFFT LengthAverage Hardware UtilizationImprovement
Garrido'13 (P=1)2K~512K75%-
Garrido'13 (P=1)64~1K40%-
Garrido'13 (P=1)2~3215%-
This Work (P=1)2K~512K75%Comparable
This Work (P=2)64~1K80%2× Improvement
This Work (P=4)2~3260%4× Improvement

3. FPGA Implementation Results (N=512K, P=1)

  • DSP48E2: 45,365 units
  • LUT: 76,183 units
  • FF: 1,500 units
  • Block RAMs: 444 units
  • Ultra RAMs: 768 units
  • Operating Frequency: 196.8 MHz

Key Findings

  1. Computational Efficiency Improvement: Through radix-252^5 algorithm, iteration count reduced to ⌈n/5⌉, achieving over 40% reduction compared to traditional methods
  2. Hardware Utilization Optimization: Through adaptive parallelism, hardware utilization for small-scale FFTs improved by 2-4 times
  3. Enhanced Scalability: Supports wide-range FFT processing from 32 points to 512K points

Main Research Directions

  1. Pipeline FFT Architecture: Represented by Groginsky & Works (1970), pursuing high throughput
  2. Memory-Based FFT Architecture: Aimed at reducing hardware resources, suitable for area-constrained applications
  3. High-Radix FFT Algorithms: Radix-2k2^k algorithms balance computational complexity and hardware implementation difficulty

Advantages of This Work

  1. Architecture Fusion: First to achieve adaptive switching between pipeline and memory-based architectures
  2. Radix Extension: Supports up to radix-252^5, exceeding existing radix-232^3 limitations
  3. Theoretical Completeness: Provides a universal bit-permutation theoretical framework

Conclusions and Discussion

Main Conclusions

  1. The adaptive hybrid architecture successfully combines advantages of both pipeline and memory-based architectures
  2. Radix-252^5 MDC design significantly improves computational efficiency for large-scale FFTs
  3. The universal bit-permutation method provides theoretical guarantees for different configurations
  4. Experimental validation demonstrates significant improvements in hardware utilization and computational efficiency

Limitations

  1. Scope Limitation: Memory-based mode only applicable for N ∈ (2^{2k}, 2^{3k}]
  2. Hardware Complexity: Supporting multiple radices increases control logic complexity
  3. Missing Power Analysis: Lacks detailed power consumption comparison analysis

Future Directions

  1. Extend support for even larger-scale FFT processing
  2. Optimize power efficiency
  3. Explore applications in AI accelerators

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose adaptive hybrid FFT architecture, resolving inherent contradictions in traditional architectures
  2. Theoretical Completeness: Provides comprehensive bit-permutation theoretical framework with strong generality
  3. Sufficient Experimentation: Validates methodology effectiveness from theoretical analysis to FPGA implementation
  4. High Practical Value: Supports 512K-point FFT, meeting modern large-scale data processing demands

Weaknesses

  1. Increased Complexity: Adaptive mechanisms increase design complexity and verification difficulty
  2. Incomplete Comparison: Lacks performance comparison with latest commercial FFT IP cores
  3. Missing Power Analysis: Power consumption is an important consideration factor in mobile and embedded applications

Impact

  1. Academic Contribution: Provides new architectural paradigm for FFT processor design
  2. Engineering Value: Directly applicable to 5G communications, radar signal processing, and other domains
  3. Reproducibility: Provides detailed design parameters and implementation details

Applicable Scenarios

  1. High-Performance Computing: Scientific computing applications requiring large-scale FFT processing
  2. Communication Systems: Signal processing units in 5G/6G base stations
  3. Radar Systems: Real-time signal processing and target detection
  4. Image Processing: Frequency-domain analysis of high-resolution images

References

The paper cites 17 relevant references covering FFT algorithms, FPGA implementations, memory access optimization, and other aspects, providing solid theoretical foundation for the research.


Overall Assessment: This is a high-quality computer architecture paper with significant theoretical and practical value in the field of FFT processor design. Through ingenious architectural design and rigorous theoretical analysis, the authors successfully address inherent problems in traditional FFT architectures, providing new insights and directions for the field's development.