Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (5): 16-27.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.05.002

doi: 10.11871/jfdc.issn.2096-742X.2025.05.002

• Special Issue: New Domestic Computing Power Empowering New Development of Scientific Computing Applications •

  • About the authors:
    XIANG Xing is a postdoctoral researcher at the Institute of Process Engineering, Chinese Academy of Sciences. His main research interests include lattice Boltzmann methods, turbulence/transition simulations, and large-scale parallel computing.
    In this paper, he is mainly responsible for parallel algorithm design and optimization, case setup, and performance testing.
    E-mail: xxiang@ipe.ac.cn
    WANG Limin, Ph.D., is a Professor at the Institute of Process Engineering, Chinese Academy of Sciences. He is a Doctoral Supervisor, Director of the Mesoscience Research Department, and a Chair Professor at the University of Chinese Academy of Sciences. His research interests include turbulence and multiphase flow, mesoscale science, and industrial simulation software development.
    In this paper, he is mainly responsible for guiding the optimization and design of computational models.
    E-mail: lmwang@ipe.ac.cn
  • Supported by:
    National Natural Science Foundation of China (52476162); Photosynthesis Fund Class A Project (202302015420); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0390501); Frontier Basic Research Project of the Institute of Process Engineering (QYJC-2023-01); Key Program of the National Natural Science Foundation of China (T2394501)

Parallel Implementation of Three-Dimensional Lattice Boltzmann Method on Multi-GPU Platforms

XIANG Xing1(),SUN Peijie1,2,ZHANG Huahai1,WANG Limin1,3,*()   

  1. State Key Laboratory of Mesoscience and Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China
    2. College of Chemical Engineering and Environment, China University of Petroleum (Beijing), Beijing 102249, China
    3. School of Chemical Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2025-02-26 Online: 2025-10-20 Published: 2025-10-23
  • Contact: WANG Limin


Abstract:

[Objective] The shift in computational paradigms driven by large-scale scientific computing problems has propelled the development of general-purpose graphics processing units (GPGPUs). The emerging lattice Boltzmann method in computational fluid dynamics (CFD) offers significant advantages in computational efficiency and parallel scalability when coupled with advanced physical models. [Methods] This study designs and optimizes a parallel algorithm for the three-dimensional lattice Boltzmann method based on the standard D3Q19 lattice model, considering three-dimensional domain decomposition and distributed data communication. [Results] Numerical verification and accuracy tests were conducted on three-dimensional flow benchmark cases at different grid scales on a domestic heterogeneous acceleration computing platform. High-fidelity transient simulations were achieved, capturing the unsteady evolution of three-dimensional vortex structures at different times. In single-GPU performance tests at different grid scales, building on the correctness verification, the impact of the data-communication stage on parallel performance was discussed, and the speedup of a single GPU over a single CPU core was reported. In strong/weak scalability tests, two sets of control experiments, single-node single-GPU and single-node four-GPU, were conducted to investigate the differences between inter-node and intra-node data communication. The single-node single-GPU group reached a maximum grid scale of approximately 2.15 billion points, using a total of 128 GPUs across 128 nodes, with a runtime of 262.119 seconds, a parallel performance of 81.927 GLUPS (giga lattice updates per second, 1 GLUPS = 10³ MLUPS), and a parallel efficiency of 94.76%. The single-node four-GPU group reached a maximum grid scale of approximately 8.59 billion points, using 512 GPUs across 128 nodes, with a parallel performance of 241.185 GLUPS and a parallel efficiency of 69.71%.
[Conclusions] The parallel implementation method proposed in this study achieves linear speedup and good parallel scalability, demonstrating the potential for efficient simulation on exascale supercomputing systems.
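The performance figures above can be cross-checked from their definitions. The sketch below is not from the paper; the 10,000-time-step count is an assumption inferred from the reported grid size, runtime, and GLUPS, and the single-GPU baseline is backed out from the reported 94.76% efficiency.

```python
# Sanity-check sketch (illustrative, not the authors' code).

def glups(total_cells: float, time_steps: int, seconds: float) -> float:
    """Giga lattice-point updates per second (1 GLUPS = 10^3 MLUPS)."""
    return total_cells * time_steps / seconds / 1e9

def weak_scaling_efficiency(perf_n: float, n_devices: int, perf_1: float) -> float:
    """Measured aggregate performance over ideal linear scaling of the baseline."""
    return perf_n / (n_devices * perf_1)

# Single-node single-GPU group: ~2.15e9 cells total on 128 GPUs, 262.119 s runtime.
perf_128 = glups(2.15e9, 10_000, 262.119)  # assumed 10,000 time steps
print(f"{perf_128:.1f} GLUPS")             # ~82 GLUPS, close to the reported 81.927

# Back out the implied single-GPU baseline, then check the four-GPU group.
perf_1 = 81.927 / (128 * 0.9476)
eff_512 = weak_scaling_efficiency(241.185, 512, perf_1)
print(f"{eff_512:.4f}")                    # ~0.697, matching the reported 69.71%
```

The mutual consistency of the two checks suggests both groups share the same per-GPU workload (about 256³ cells per card).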

Key words: graphics processing unit, lattice Boltzmann method, scalability testing, large scale parallel computing, three-dimensional Taylor-Green vortex flow
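The D3Q19 lattice and the three-dimensional block decomposition named in the abstract can be sketched as follows. This is an illustrative reconstruction under stated assumptions (uniform blocks, one block per GPU), not the authors' implementation; the 2048³/8×8×8 example is chosen because 2048³ ≈ 8.59 billion matches the four-GPU group's maximum grid scale on 512 cards.

```python
# Illustrative sketch: D3Q19 velocity set and a uniform 3D block decomposition.
import itertools

# D3Q19: the rest vector, 6 face neighbors, and 12 edge neighbors,
# i.e. all integer vectors in {-1,0,1}^3 with squared norm <= 2.
velocities = [(x, y, z)
              for x, y, z in itertools.product((-1, 0, 1), repeat=3)
              if x * x + y * y + z * z <= 2]
assert len(velocities) == 19

def decompose(nx: int, ny: int, nz: int, px: int, py: int, pz: int):
    """Split an nx*ny*nz lattice into px*py*pz equal blocks (one per GPU)."""
    assert nx % px == 0 and ny % py == 0 and nz % pz == 0, "uniform blocks assumed"
    return (nx // px, ny // py, nz // pz)

# 512 GPUs arranged as an 8x8x8 process grid over a 2048^3 lattice:
print(decompose(2048, 2048, 2048, 8, 8, 8))  # (256, 256, 256) cells per GPU
```

Each block would then exchange one halo layer of distribution functions with its face neighbors per time step, which is the distributed data communication the abstract refers to.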