2025-11-20T07:07:14.857348

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-$2^k$ FFT in Large Size Processing

Zhao, Xiao, Wang et al.

In the field of digital signal processing, the fast Fourier transform (FFT) is a fundamental algorithm, with its processors being implemented using either the pipelined architecture, well-known for high-throughput applications but weak in hardware utilization, or the memory-based architecture, designed for area-constrained scenarios but failing to meet stringent throughput requirements. Therefore, we propose an adaptive hybrid FFT, which leverages the strengths of both pipelined and memory-based architectures. In this paper, we propose an adaptive hybrid FFT processor that combines the advantages of both architectures, and it has the following features. First, a set of radix-$2^k$ multi-path delay commutators (MDC) units are developed to support high-performance large-size processing. Second, a conflict-free memory access scheme is formulated to ensure a continuous data flow without data contention. Third, We demonstrate the existence of a series of bit-dimension permutations for reordering input data, satisfying the generalized constraints of variable-length, high-radix, and any level of parallelism for wide adaptivity. Furthermore, the proposed FFT processor has been implemented on a field-programmable gate array (FPGA). As a result, the proposed work outperforms conventional memory-based FFT processors by requiring fewer computation cycles. It achieves higher hardware utilization than pipelined FFT architectures, making it suitable for highly demanding applications.

academic

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix- $2^k$ FFT in Large Size Processing

Basic Information

Paper ID: 2501.01259
Title: Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix- $2^k$ FFT in Large Size Processing
Authors: Fangyu Zhao, Chunhua Xiao, Zhiguo Wang, Xiaohua Du, Bo Dong
Classification: cs.AR (Computer Architecture)
Publication Date/Venue: Submitted to IEEE, January 2025
Paper Link: https://arxiv.org/abs/2501.01259

Abstract

The Fast Fourier Transform (FFT) is a fundamental algorithm in digital signal processing. Processor implementations of FFT typically employ two architectural paradigms: pipeline architecture (suitable for high-throughput applications but with low hardware utilization) and memory-based architecture (suitable for area-constrained scenarios but unable to meet stringent throughput requirements). This paper proposes an adaptive hybrid FFT architecture that combines the advantages of both approaches. The architecture is characterized by: (1) development of a set of radix- $2^k$ multi-path delay commutator (MDC) units to support high-performance large-scale processing; (2) formulation of a conflict-free memory access scheme ensuring continuous data flow; (3) proof of the existence of a series of bit-dimension permutations satisfying broad adaptability requirements for variable lengths, high radices, and arbitrary parallelism degrees.

Research Background and Motivation

Problem Definition

Core Problem: Traditional FFT processor architectures suffer from inherent deficiencies
- Pipeline architecture: High throughput but low hardware utilization, with substantial hardware idleness during small-scale FFT operations
- Memory-based architecture: High hardware utilization but increased computation cycles, affecting real-time processing performance
Problem Significance:
- FFT is widely applied in wireless communications, image processing, radar signal processing, and other domains
- Growing demand for large-scale data processing requires both efficient and flexible FFT processors
- Existing architectures cannot simultaneously satisfy high throughput and high hardware utilization requirements
Limitations of Existing Methods:
- Pipeline architecture can have hardware utilization as low as 15% when processing small-scale FFTs
- Memory-based architecture requires multiple iterations, increasing computational latency
- Existing conflict avoidance schemes are primarily limited to radix-2 algorithms and do not support high-radix computation
Research Motivation:
- Combine advantages of both architectures to achieve adaptive reconfiguration
- Support large-scale FFT processing (up to 512K points)
- Improve hardware utilization while ensuring high throughput

Core Contributions

Proposed Adaptive Hybrid FFT Processor Architecture: Supports both pipeline and memory-based modes, capable of processing FFTs up to 512K points
Developed Radix- $2^k$ Multi-Path Delay Commutator (MDC): Supports radix- $2^5$ algorithm, significantly reducing the number of computation stages
Designed Conflict-Free Memory Access Technique: Enables continuous-flow FFT computation with fully in-place memory transformation
Constructed Universal Bit-Permutation Method: Adapts to hardware constraints of different FFT lengths, radices, and parallelism degrees

Methodology Details

Task Definition

Design a reconfigurable FFT processor capable of:

Input: N-point complex sequence (N = 2^n, maximum 512K)
Output: Corresponding frequency-domain representation
Constraints: Support radix- $2^k$ (k≤5) algorithms, configurable parallelism degree P, achieving conflict-free memory access

Model Architecture

1. Top-Level Architecture Design

Input Data → Data Reordering Module → FFT Core Processor → Output Data
           ↑                          ↑
    Memory Bank Group          MDC Unit Group
    Address Generation Unit    (P parallel units)
    Parallel Branch Permutation Circuit
    Reordering Circuit

2. Key Component Details

Multi-Path Delay Commutator (MDC) Unit:

Supports radix- $2^5$ /24/23/22 hybrid computation
Employs modified radix- $2^5$ $2^{5}$ algorithm, classifying twiddle factors as:
- Constant (C): Pre-stored in ROM
- Non-trivial (NT): Requires complex multipliers
- Trivial (T): Simple ±1, ±j operations

Data Reordering Strategy: Based on bit-dimension permutation implementing three-level transformation: $\sigma^{s,k,P}_N = \sigma^{s,k,P}_{N,3} \circ \sigma^{s,k,P}_{N,2} \circ \sigma^{s,k,P}_{N,1}$

Where:

$\sigma^{s,k,P}_{N,1}$ : Serial bit-dimension permutation
$\sigma^{s,k,P}_{N,2}$ : Parallel branch exchange
$\sigma^{s,k,P}_{N,3}$ : Fine-grained index adjustment

3. Conflict-Free Memory Access Scheme

Pipeline Mode:

Employs interleaved addressing pattern: natural order and reversed order
Read-write address relationship: $\sigma^i_W = \sigma^{i-1}_R$
Ensures continuous data flow without conflicts

Memory-Based Mode:

Introduces additional permutation $\tilde{\sigma}^{s,k,P}_{N,1}$ for in-place storage
Applicable for N ∈ (2^{2k}, 2^{3k}] large-scale processing

Technical Innovations

Unified Radix- $2^k$ Architecture: Achieves hardware reuse through algorithm modification, supporting multiple radices with a single hardware set
Adaptive Reconfiguration Capability: Dynamically selects operating mode based on FFT size and performance requirements
Universal Bit-Permutation Theory: Extends existing methods to support arbitrary radices, lengths, and parallelism degrees
Optimized Memory Access Patterns: Designs specialized conflict-free access strategies for different modes

Experimental Setup

Hardware Platform

FPGA: Xilinx Virtex UltraScale+ VCU118 (xcvu9p-flga2104-2L-e)
Development Tools: Chisel HDL, Xilinx Vivado 2019.2
Memory Implementation:
- Data Storage: Ultra RAMs (URAMs), 256K address × 32-bit per memory
- Twiddle Factor Storage: Block RAMs (BRAMs)

Evaluation Metrics

Hardware Utilization: Average proportion of active butterfly units
Computation Cycles: Clock cycles required to complete FFT
Processing Time: Number of iterations × cycles per iteration
Resource Consumption: Usage of DSP48E2, LUT, FF, and other hardware resources

Comparison Methods

Memory-Based Architecture: Tsai'11, Kaya'23, Wang'20
Pipeline Architecture: Garrido'13

Experimental Results

Main Results

1. Comparison with Memory-Based Architectures

Architecture	Radix	FFT Length	Parallelism	Iterations	Processing Time Reduction
Tsai'11	radix-2³	64~4K	2	⌈n/3⌉	70%+
Kaya'23	radix-2	2K~16K	2	⌈n/2⌉	70%+
Wang'20	radix-2³	32~32K	4	⌈n/3⌉	70%+
This Work	radix-2⁵	32~512K	8	⌈n/5⌉	Baseline

2. Comparison with Pipeline Architectures

Configuration	FFT Length	Average Hardware Utilization	Improvement
Garrido'13 (P=1)	2K~512K	75%	-
Garrido'13 (P=1)	64~1K	40%	-
Garrido'13 (P=1)	2~32	15%	-
This Work (P=1)	2K~512K	75%	Comparable
This Work (P=2)	64~1K	80%	2× Improvement
This Work (P=4)	2~32	60%	4× Improvement

3. FPGA Implementation Results (N=512K, P=1)

DSP48E2: 45,365 units
LUT: 76,183 units
FF: 1,500 units
Block RAMs: 444 units
Ultra RAMs: 768 units
Operating Frequency: 196.8 MHz

Key Findings

Computational Efficiency Improvement: Through radix- $2^5$ algorithm, iteration count reduced to ⌈n/5⌉, achieving over 40% reduction compared to traditional methods
Hardware Utilization Optimization: Through adaptive parallelism, hardware utilization for small-scale FFTs improved by 2-4 times
Enhanced Scalability: Supports wide-range FFT processing from 32 points to 512K points

Main Research Directions

Pipeline FFT Architecture: Represented by Groginsky & Works (1970), pursuing high throughput
Memory-Based FFT Architecture: Aimed at reducing hardware resources, suitable for area-constrained applications
High-Radix FFT Algorithms: Radix- $2^k$ algorithms balance computational complexity and hardware implementation difficulty

Advantages of This Work

Architecture Fusion: First to achieve adaptive switching between pipeline and memory-based architectures
Radix Extension: Supports up to radix- $2^5$ , exceeding existing radix- $2^3$ limitations
Theoretical Completeness: Provides a universal bit-permutation theoretical framework

Conclusions and Discussion

Main Conclusions

The adaptive hybrid architecture successfully combines advantages of both pipeline and memory-based architectures
Radix- $2^5$ MDC design significantly improves computational efficiency for large-scale FFTs
The universal bit-permutation method provides theoretical guarantees for different configurations
Experimental validation demonstrates significant improvements in hardware utilization and computational efficiency

Limitations

Scope Limitation: Memory-based mode only applicable for N ∈ (2^{2k}, 2^{3k}]
Hardware Complexity: Supporting multiple radices increases control logic complexity
Missing Power Analysis: Lacks detailed power consumption comparison analysis

Future Directions

Extend support for even larger-scale FFT processing
Optimize power efficiency
Explore applications in AI accelerators

In-Depth Evaluation

Strengths

Strong Innovation: First to propose adaptive hybrid FFT architecture, resolving inherent contradictions in traditional architectures
Theoretical Completeness: Provides comprehensive bit-permutation theoretical framework with strong generality
Sufficient Experimentation: Validates methodology effectiveness from theoretical analysis to FPGA implementation
High Practical Value: Supports 512K-point FFT, meeting modern large-scale data processing demands

Weaknesses

Increased Complexity: Adaptive mechanisms increase design complexity and verification difficulty
Incomplete Comparison: Lacks performance comparison with latest commercial FFT IP cores
Missing Power Analysis: Power consumption is an important consideration factor in mobile and embedded applications

Impact

Academic Contribution: Provides new architectural paradigm for FFT processor design
Engineering Value: Directly applicable to 5G communications, radar signal processing, and other domains
Reproducibility: Provides detailed design parameters and implementation details

Applicable Scenarios

High-Performance Computing: Scientific computing applications requiring large-scale FFT processing
Communication Systems: Signal processing units in 5G/6G base stations
Radar Systems: Real-time signal processing and target detection
Image Processing: Frequency-domain analysis of high-resolution images

References

The paper cites 17 relevant references covering FFT algorithms, FPGA implementations, memory access optimization, and other aspects, providing solid theoretical foundation for the research.

Overall Assessment: This is a high-quality computer architecture paper with significant theoretical and practical value in the field of FFT processor design. Through ingenious architectural design and rigorous theoretical analysis, the authors successfully address inherent problems in traditional FFT architectures, providing new insights and directions for the field's development.

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-$2^k$ FFT in Large Size Processing

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix-2k2^k2k FFT in Large Size Processing

Adaptive Hybrid FFT: A Novel Pipeline and Memory-Based Architecture for Radix- $2^k$ FFT in Large Size Processing