Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 124-135.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.012

doi: 10.11871/jfdc.issn.2096-742X.2025.06.012

• Technology and Application •

Implementation and Optimization of High-Performance FFT Algorithm Library Based on GPU

DU Zhenpeng1,*, XU Jianliang1, ZHANG Xianyi2, HUANG Qiang2

  1. Ocean University of China, Qingdao, Shandong 266000, China
    2. PerfXLab Technologies Co., Ltd., Beijing 100080, China
  • Received: 2025-03-27    Online: 2025-12-20    Published: 2025-12-17
  • Corresponding author: DU Zhenpeng
  • About the author: DU Zhenpeng is a Master's candidate at Ocean University of China and a student member of CCF. His research focuses on high-performance computing (HPC). In this paper, he was mainly responsible for the experimental design and implementation, as well as the manuscript writing.
    E-mail: dzp@stu.ouc.edu.cn
  • Supported by:
    Research on Parallel Software Performance Optimization Methods Based on a High-Performance Computing Knowledge Graph (CARCHA202113)



Abstract:

[Objective] This research aims to address the computational and memory-access performance bottlenecks of GPU-based FFT implementations, optimize the performance of FFT algorithm libraries, and fill the gap in high-performance FFT algorithm libraries for domestic GPUs. [Methods] The optimization strategies adopted in this study are as follows. First, leveraging the GPU's parallel computing advantages and fully exploiting the mathematical properties of the DFT, we developed a block-processing and hierarchical computation scheme that optimizes the computational flow of the FFT butterfly operations to achieve high performance. Second, targeting the GPU's hierarchical memory architecture and memory access patterns, we proposed a novel bit-reversal-free butterfly network structure that combines breadth-first and depth-first traversal. By optimizing memory access patterns and improving data scheduling, this strategy reduces memory access conflicts, makes better use of GPU caches and shared memory, and minimizes global memory accesses. Third, for multi-batch data, we implemented shared-memory and block-processing strategies that allow the GPU to execute multiple FFT tasks in parallel within the same computational cycle. [Results] Compared with the CPU-based FFTW library, PerfFFT achieves speedups above 2 on large-scale data. Compared with the industry-leading open-source clFFT library, PerfFFT achieves average speedups of 1.47 and 1.58 on small-scale data with power-of-two sizes, and 3.58 and 4.07 on non-power-of-two sizes, for batch sizes of 128 and 256, respectively. On large-scale data, the average speedups are 2.04 and 2.38 for power-of-two sizes, and 5.39 and 5.28 for non-power-of-two sizes. [Conclusion] The experimental results show that the GPU substantially outperforms the CPU on large-scale data, and that PerfFFT outperforms clFFT across input sizes, verifying the effectiveness of the optimization strategies proposed in this study. These findings provide useful references for achieving high-performance FFT implementations on GPUs.
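The abstract above refers to a bit-reversal-free butterfly network that combines breadth-first and depth-first traversal. As general background only, and not as the paper's actual scheme, the sketch below shows one well-known way to obtain FFT output in natural order without a separate bit-reversal permutation: a Stockham-style radix-2 formulation that alternates between two buffers at every stage. It is a minimal CPU illustration in plain C; the function names stockham_step and fft_forward are hypothetical, and the hybrid traversal, shared-memory blocking, and batching described in the paper are not shown.

/* Illustrative sketch: Stockham-style radix-2 FFT with natural output order
 * and no explicit bit-reversal pass.  Background only; not PerfFFT's scheme. */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* One stage.  n: current sub-transform length, s: stride between elements,
 * eo == 0 means this stage's final result must end up in x, eo == 1 in y. */
static void stockham_step(int n, int s, int eo,
                          double complex *x, double complex *y)
{
    if (n == 1) {
        if (eo)                     /* copy into the buffer designated output */
            for (int q = 0; q < s; q++) y[q] = x[q];
        return;
    }
    const int m = n / 2;
    const double theta0 = 2.0 * M_PI / (double)n;
    for (int p = 0; p < m; p++) {
        /* Twiddle factor exp(-2*pi*i*p/n) for the forward transform. */
        const double complex wp = cos(p * theta0) - sin(p * theta0) * I;
        for (int q = 0; q < s; q++) {
            const double complex a = x[q + s * p];
            const double complex b = x[q + s * (p + m)];
            y[q + s * (2 * p)]     = a + b;         /* butterfly writes its    */
            y[q + s * (2 * p + 1)] = (a - b) * wp;  /* output in natural order */
        }
    }
    stockham_step(m, 2 * s, !eo, y, x);             /* ping-pong the buffers */
}

/* Forward DFT of length n (n must be a power of two); result replaces x. */
static void fft_forward(int n, double complex *x)
{
    double complex *work = malloc(sizeof(double complex) * (size_t)n);
    stockham_step(n, 1, 0, x, work);
    free(work);
}

int main(void)
{
    /* DFT of a length-8 unit impulse: every output bin should be 1 + 0i. */
    double complex x[8] = { 1.0 };
    fft_forward(8, x);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %.3f %+.3fi\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}

Because every stage of such a formulation processes all butterflies before moving on, it maps naturally onto data-parallel GPU kernels, which is one reason bit-reversal-free layouts are attractive in GPU FFT libraries.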

Key words: FFT, GPU, OpenCL, parallel computing