Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (5): 16-27.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.05.002

doi: 10.11871/jfdc.issn.2096-742X.2025.05.002

• Special Issue: New Domestic Computing Power Empowers the Development of Scientific Computing Applications •

Parallel Implementation of Three-Dimensional Lattice Boltzmann Method on Multi-GPU Platforms

XIANG Xing1, SUN Peijie1,2, ZHANG Huahai1, WANG Limin1,3,*

  1. State Key Laboratory of Mesoscience and Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China
    2. College of Chemical Engineering and Environment, China University of Petroleum Beijing, Beijing 102249, China
    3. School of Chemical Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2025-02-26 Online: 2025-10-20 Published: 2025-10-23
  • Contact: WANG Limin E-mail: xxiang@ipe.ac.cn; lmwang@ipe.ac.cn

Abstract:

[Objective] The shift in computational paradigms driven by large-scale scientific computing problems has propelled the development of general-purpose graphics processing units (GPGPUs). The lattice Boltzmann method, an emerging approach in computational fluid dynamics (CFD), demonstrates significant advantages in computational efficiency and parallel scalability when coupled with advanced physical models. [Methods] This study designs and optimizes a parallel algorithm for the three-dimensional lattice Boltzmann method (D3Q19), considering three-dimensional domain decomposition and distributed data communication. [Results] Numerical verification and accuracy tests were conducted on three-dimensional flow benchmark cases at different grid scales on a domestic heterogeneous acceleration computing platform. High-fidelity transient simulations were achieved, capturing the unsteady evolution of three-dimensional vortex structures at different time steps. In single-GPU performance tests at different grid scales, the impact of data communication on parallel performance was discussed. In strong/weak scalability tests, two sets of control experiments were conducted, single-node single-GPU and single-node four-GPU setups, to investigate the differences between inter-node and intra-node data communication. The single-node single-GPU setup achieved a maximum computational grid scale of approximately 2.15 billion, using a total of 128 GPUs across 128 nodes, with a runtime of 262.119 seconds, parallel performance of 81.927 GLUPS (Giga Lattice Updates Per Second, 1 GLUPS = 10³ MLUPS), and parallel efficiency of 94.76%. The single-node four-GPU setup reached a maximum computational grid scale of approximately 8.59 billion, using 512 GPUs across 128 nodes, with parallel performance of 241.185 GLUPS and parallel efficiency of 69.71%.
[Conclusions] The parallel implementation method proposed in this study achieves linear speedup and good parallel scalability, demonstrating the potential for efficient simulation on exascale supercomputing systems.
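The GLUPS metric reported above relates grid size, time-step count, and runtime directly. As a minimal sketch of that relation: assuming a grid of 2³¹ lattice sites (approximately 2.15 billion) and a hypothetical run of 10,000 time steps (the step count is not stated in the abstract), the reported runtime of 262.119 seconds reproduces a figure close to the stated 81.927 GLUPS.

```python
# Sketch of the lattice-updates-per-second metric used in the abstract.
# Assumptions (not stated in the source): 2**31 lattice sites, 10,000 time steps.
def glups(num_sites: int, num_steps: int, runtime_s: float) -> float:
    """Giga Lattice Updates Per Second (1 GLUPS = 10^3 MLUPS)."""
    return num_sites * num_steps / runtime_s / 1e9

perf = glups(2**31, 10_000, 262.119)
print(f"{perf:.3f} GLUPS")  # close to the reported 81.927 GLUPS
```

Under these assumed inputs the formula yields roughly 81.93 GLUPS, consistent with the single-node single-GPU result quoted in the abstract.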

Key words: graphics processing unit, lattice Boltzmann method, scalability testing, large-scale parallel computing, three-dimensional Taylor-Green vortex flow