Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 124-135.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.012

doi: 10.11871/jfdc.issn.2096-742X.2025.06.012

• Technology and Application •

Implementation and Optimization of High-Performance FFT Algorithm Library Based on GPU

DU Zhenpeng1,*, XU Jianliang1, ZHANG Xianyi2, HUANG Qiang2

  1. Ocean University of China, Qingdao, Shandong 266000, China
    2. PerfXLab Technologies Co., Ltd, Beijing 100080, China
  • Received: 2025-03-27 Online: 2025-12-20 Published: 2025-12-17
  • Contact: DU Zhenpeng E-mail: dzp@stu.ouc.edu.cn

Abstract:

[Objective] This research aims to address the computational and memory-access performance bottlenecks in GPU-based FFT implementations, optimize the performance of FFT algorithm libraries, and fill the gap in high-performance FFT algorithm libraries for domestic GPUs. [Methods] The optimization strategies adopted in this study are as follows. First, leveraging the GPU's strengths in parallel computing and fully exploiting the mathematical properties of the DFT, we developed a block-processing and hierarchical-computation scheme that optimizes the computational flow of the FFT butterfly operations to achieve high performance. Second, targeting the GPU's hierarchical memory architecture and its memory access patterns, we proposed a novel bit-reversal-free hybrid butterfly network structure that combines depth-first and breadth-first traversal. By optimizing memory access patterns and improving data scheduling, this strategy reduces memory access conflicts, makes better use of the GPU's cache hierarchy and shared memory, and minimizes global memory accesses. Third, for multi-batch data processing, we implemented shared-memory and block-processing strategies that enable the GPU to execute multiple FFT tasks in parallel within the same computational cycle. [Results] Compared with the CPU-based FFTW library, the speedup ratio exceeds 2 for large-scale data. In addition, compared with the industry-leading open-source clFFT library, PerfFFT achieves average speedup ratios of 1.47 and 1.58 on smaller-scale data with power-of-two input sizes, and 3.58 and 4.07 on non-power-of-two input sizes, at batch sizes of 128 and 256, respectively. For large-scale data, the average speedup ratios are 2.04 and 2.38 for power-of-two input sizes, and 5.39 and 5.28 for non-power-of-two input sizes. [Conclusion] The experimental results demonstrate that GPUs achieve substantially higher performance than CPUs for large-scale data processing, and that the PerfFFT library outperforms the clFFT library across input data of various sizes, verifying the effectiveness of the optimization strategies proposed in this study. These findings provide useful insights and a reference for implementing high-performance FFTs on GPUs.
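
For readers unfamiliar with the bit-reversal-free butterfly idea mentioned in the Methods, the sketch below illustrates the general principle with a Stockham-style radix-2 FFT: each stage writes its outputs directly into the natural input order of the next stage by ping-ponging between two buffers, so the bit-reversal permutation of the classic Cooley-Tukey formulation never occurs. This is not PerfFFT code; it is a minimal plain-C illustration (the library itself targets OpenCL kernels), and the function name stockham_fft, the two-buffer layout, and the radix-2 restriction are illustrative assumptions only.

/* Minimal Stockham radix-2 FFT sketch (illustrative, not PerfFFT).
 * Build (assumption): cc -std=c11 fft_sketch.c -lm */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Auto-sorting (bit-reversal-free) FFT: every stage scatters its results
 * into the positions the next stage expects, alternating between x and the
 * scratch buffer y. n must be a power of two; the result ends up in x. */
static void stockham_fft(int n, double complex *x, double complex *y)
{
    double complex *src = x, *dst = y;
    int l = n / 2;   /* number of butterflies per sub-transform in this stage */
    int m = 1;       /* stride between the two inputs of one butterfly        */

    while (l >= 1) {
        for (int j = 0; j < l; j++) {
            /* twiddle factor w = exp(-2*pi*i * j / (2*l)) */
            double complex w = cexp(-I * M_PI * j / l);
            for (int k = 0; k < m; k++) {
                double complex a = src[k + j * m];
                double complex b = src[k + (j + l) * m];
                dst[k + 2 * j * m]       = a + b;        /* upper butterfly output  */
                dst[k + (2 * j + 1) * m] = (a - b) * w;  /* lower output, twiddled  */
            }
        }
        /* ping-pong between buffers instead of reordering in place */
        double complex *tmp = src; src = dst; dst = tmp;
        l /= 2;
        m *= 2;
    }
    if (src != x)  /* with an odd number of stages the result sits in the scratch buffer */
        memcpy(x, src, (size_t)n * sizeof *x);
}

int main(void)
{
    enum { N = 8 };
    double complex x[N], y[N];
    for (int i = 0; i < N; i++)          /* test signal: one cosine period */
        x[i] = cos(2.0 * M_PI * i / N);

    stockham_fft(N, x, y);
    for (int i = 0; i < N; i++)          /* expect peaks of N/2 at bins 1 and N-1 */
        printf("X[%d] = %6.3f %+6.3fi\n", i, creal(x[i]), cimag(x[i]));
    return 0;
}

On a GPU, the j and k loops of each stage map naturally onto independent work-items, which is what makes this auto-sorting formulation attractive for the coalesced, conflict-free access patterns the abstract describes; how PerfFFT actually blocks and schedules these stages is detailed in the full paper.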

Key words: FFT, GPU, OpenCL, parallel computing