Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 124-135.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.012

doi: 10.11871/jfdc.issn.2096-742X.2025.06.012

• Technology and Application •

Implementation and Optimization of High-Performance FFT Algorithm Library Based on GPU

DU Zhenpeng1,*, XU Jianliang1, ZHANG Xianyi2, HUANG Qiang2

  1. Ocean University of China, Qingdao, Shandong 266000, China
    2. PerfXLab Technologies Co., Ltd., Beijing 100080, China
  • Received: 2025-03-27    Online: 2025-12-20    Published: 2025-12-17
  • Corresponding author: DU Zhenpeng
  • About the author: DU Zhenpeng is a Master's candidate at Ocean University of China and a student member of CCF. His research focuses on high-performance computing (HPC). In this paper, he was mainly responsible for the experimental design and implementation, as well as the manuscript writing.
    E-mail: dzp@stu.ouc.edu.cn
  • Supported by:
    Research on Parallel Software Performance Optimization Methods Based on a High-Performance Computing Knowledge Graph (CARCHA202113)



Abstract:

[Objective] This research aims to address the computational and memory-access performance bottlenecks of GPU-based FFT implementations, optimize the performance of FFT algorithm libraries, and fill the gap in high-performance FFT algorithm libraries for domestic GPUs. [Methods] The optimization strategies adopted in this study are as follows. First, leveraging the GPU's parallel computing advantages and fully exploiting the mathematical properties of the DFT, we developed a block-processing and hierarchical computation scheme that optimizes the computational flow of the FFT butterfly operations to achieve high performance. Second, targeting the GPU's hierarchical memory architecture and memory access patterns, we proposed a novel bit-reversal-free butterfly network structure that combines breadth-first and depth-first traversal. By optimizing memory access patterns and improving data scheduling, this strategy reduces memory access conflicts, makes better use of GPU caches and shared memory, and minimizes global memory accesses. Third, for multi-batch data, we implemented shared-memory and block-processing strategies that allow the GPU to execute multiple FFT tasks in parallel within the same computational cycle. [Results] Compared with the CPU-based FFTW library, PerfFFT achieves speedups above 2 on large-scale data. Compared with the industry-leading open-source clFFT library, PerfFFT achieves average speedups of 1.47 and 1.58 on small-scale data with power-of-two sizes, and 3.58 and 4.07 on non-power-of-two sizes, for batch sizes of 128 and 256, respectively. On large-scale data, the average speedups are 2.04 and 2.38 for power-of-two sizes, and 5.39 and 5.28 for non-power-of-two sizes. [Conclusion] The experimental results show that the GPU substantially outperforms the CPU on large-scale data, and that PerfFFT outperforms clFFT across input sizes, verifying the effectiveness of the optimization strategies proposed in this study. These findings provide useful references for achieving high-performance FFT implementations on GPUs.
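The abstract above refers to a bit-reversal-free butterfly network that combines breadth-first and depth-first traversal. As general background only, and not as the paper's actual scheme, the sketch below shows one well-known way to obtain FFT output in natural order without a separate bit-reversal permutation: a Stockham-style radix-2 formulation that alternates between two buffers at every stage. It is a minimal CPU illustration in plain C; the function names stockham_step and fft_forward are hypothetical, and the hybrid traversal, shared-memory blocking, and batching described in the paper are not shown.

/* Illustrative sketch: Stockham-style radix-2 FFT with natural output order
 * and no explicit bit-reversal pass.  Background only; not PerfFFT's scheme. */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* One stage.  n: current sub-transform length, s: stride between elements,
 * eo == 0 means this stage's final result must end up in x, eo == 1 in y. */
static void stockham_step(int n, int s, int eo,
                          double complex *x, double complex *y)
{
    if (n == 1) {
        if (eo)                     /* copy into the buffer designated output */
            for (int q = 0; q < s; q++) y[q] = x[q];
        return;
    }
    const int m = n / 2;
    const double theta0 = 2.0 * M_PI / (double)n;
    for (int p = 0; p < m; p++) {
        /* Twiddle factor exp(-2*pi*i*p/n) for the forward transform. */
        const double complex wp = cos(p * theta0) - sin(p * theta0) * I;
        for (int q = 0; q < s; q++) {
            const double complex a = x[q + s * p];
            const double complex b = x[q + s * (p + m)];
            y[q + s * (2 * p)]     = a + b;         /* butterfly writes its    */
            y[q + s * (2 * p + 1)] = (a - b) * wp;  /* output in natural order */
        }
    }
    stockham_step(m, 2 * s, !eo, y, x);             /* ping-pong the buffers */
}

/* Forward DFT of length n (n must be a power of two); result replaces x. */
static void fft_forward(int n, double complex *x)
{
    double complex *work = malloc(sizeof(double complex) * (size_t)n);
    stockham_step(n, 1, 0, x, work);
    free(work);
}

int main(void)
{
    /* DFT of a length-8 unit impulse: every output bin should be 1 + 0i. */
    double complex x[8] = { 1.0 };
    fft_forward(8, x);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %.3f %+.3fi\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}

Because every stage of such a formulation processes all butterflies before moving on, it maps naturally onto data-parallel GPU kernels, which is one reason bit-reversal-free layouts are attractive in GPU FFT libraries.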

Key words: FFT, GPU, OpenCL, parallel computing