Implementation of CCFD-KSSolver Component for GPU Architecture

doi:10.11871/jfdc.issn.2096-742X.2024.01.007

Frontiers of Data and Computing, 2024, Vol. 6, Issue 1: 68-78.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.01.007


ZHANG Haoyuan^1,2, MA Wenpeng^3,*, YUAN Wu^1,2, ZHANG Jian^1,2, LU Zhonghua^1,2

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. Xinyang Normal University, Xinyang, Henan 464000, China

Received: 2022-09-19 Online: 2024-02-20 Published: 2024-02-21
Corresponding author: * MA Wenpeng (E-mail: mawp@xynu.edu.cn)
About the authors:
ZHANG Haoyuan is a Ph.D. candidate at the Computer Network Information Center, Chinese Academy of Sciences. His main research interests include sparse linear solvers and computational fluid dynamics. In this paper, he is mainly responsible for KSSolver software component design and performance tests. E-mail: zhanghaoyuan@cnic.cn
MA Wenpeng, Ph.D., is an associate professor at Xinyang Normal University. His main research interests include parallel numerical computing and high-performance computing. In this paper, he is mainly responsible for guiding KSSolver software component design and algorithm development. E-mail: mawp@xynu.edu.cn
Funding:
National Key Research and Development Program of China (2020YFB1709500); Key Research and Development and Promotion Special Project of Henan Province (222102210162)

Abstract:

[Application Background] In high-performance application fields such as computational fluid dynamics and materials science, the solution of large sparse linear systems directly affects the efficiency and accuracy of the applications. Heterogeneous many-core architectures have become an important feature and development trend of modern supercomputing systems. [Methods] This paper designs and implements the linear solver component CCFD-KSSolver for CPU+GPU heterogeneous supercomputing systems. Targeting the characteristics of heterogeneous architectures, the component implements Krylov subspace solvers and several typical preconditioning methods for block-structured matrices arising from multiphysics problems. It applies optimization techniques such as computation-communication overlap, GPU memory access optimization, and CPU-GPU collaborative computing to improve computational efficiency. [Results] Lid-driven cavity flow experiments show that with 8 subdomains, Block-ISAI achieves speedups of 20.09× and 3.34× over the CPU and cuSPARSE subdomain solvers, respectively, and scales better. For matrices of order one million, the parallel efficiency of KSSolver on 8 GPUs with the three subdomain solvers is 83.8%, 55.7%, and 87.4%, respectively. [Conclusions] The solver and preconditioner components are tested on classical multiphysics applications with block structure. The results show that they are stable and efficient, and that they support high-performance computing applications, represented by numerical simulation in fluid dynamics, on heterogeneous systems.

Keywords: GPU, KSSolver, parallel optimization, preconditioner, high-performance computing

ZHANG Haoyuan, MA Wenpeng, YUAN Wu, ZHANG Jian, LU Zhonghua. Implementation of CCFD-KSSolver Component for GPU Architecture[J]. Frontiers of Data and Computing, 2024, 6(1): 68-78, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.01.007.

Table 1. KSSolver functional-layer interfaces

Function	Interface
Krylov subspace solution of $A x = b$	Block-GMRES(Matrix M)
RAS (Restricted Additive Schwarz) preconditioning	RAS_Preconditioning(Preconditioner M, Vector global_in, Vector global_out)
Block incomplete LU factorization	CPU_PBILU(Matrix M, LowerFactor L, UpperFactor U)
Block incomplete LU factorization	cuSPARSE_PBILU(Preconditioner M, LowerFactor L, UpperFactor U)
Subdomain solver	FB_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
	cuSPARSE_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
	BISAI_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
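
The interface table above pairs a Krylov solver with a restricted additive Schwarz (RAS) preconditioner and interchangeable subdomain solvers. As a rough illustration of how RAS combines local solves, here is a minimal pure-Python sketch on a 1D Poisson matrix. Everything in it is an assumption made for illustration: the function names (`tridiag_matvec`, `local_solve`, `ras_apply`, `ras_richardson`) are hypothetical and not part of KSSolver, an exact tridiagonal solve stands in for the subdomain solvers, and a simple preconditioned Richardson iteration replaces Block-GMRES to keep the example short.

```python
def tridiag_matvec(x):
    """y = A x for the 1D Poisson matrix A = tridiag(-1, 2, -1)."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def local_solve(rhs):
    """Exact subdomain solve A_i z = rhs with the Thomas algorithm.

    A_i is a principal submatrix of A, so it is again tridiag(-1, 2, -1).
    """
    m = len(rhs)
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    z = [0.0] * m
    z[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def ras_apply(r, n_sub, overlap):
    """z = sum_i Rtilde_i^T A_i^{-1} R_i r (restricted additive Schwarz)."""
    n = len(r)
    z = [0.0] * n
    size = n // n_sub
    for s in range(n_sub):
        own_lo = s * size                                  # owned range start
        own_hi = n if s == n_sub - 1 else own_lo + size    # owned range end
        lo = max(0, own_lo - overlap)                      # overlapping range
        hi = min(n, own_hi + overlap)
        local = local_solve(r[lo:hi])                      # restrict, then solve
        for i in range(own_lo, own_hi):                    # write back owned part only
            z[i] = local[i - lo]
    return z


def ras_richardson(b, n_sub=2, overlap=2, tol=1e-10, max_iter=500):
    """Richardson iteration x <- x + M^{-1}(b - A x) with M^{-1} = RAS."""
    x = [0.0] * len(b)
    for it in range(max_iter):
        r = [bi - yi for bi, yi in zip(b, tridiag_matvec(x))]
        if max(abs(v) for v in r) < tol:
            return x, it
        z = ras_apply(r, n_sub, overlap)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter


if __name__ == "__main__":
    n = 16
    b = tridiag_matvec([1.0] * n)  # right-hand side whose exact solution is all ones
    x, iters = ras_richardson(b)
    err = max(abs(v - 1.0) for v in x)
    print(f"converged in {iters} iterations, max error {err:.2e}")
```

The "restricted" part is the write-back loop: each subdomain solves on its overlapping range but contributes only to the entries it owns, so overlapping corrections are never summed twice. In a distributed setting this would also mean the prolongation step needs no communication, although the sketch itself runs in a single process.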



Tesla Product	Tesla V100
GPU	GV100
Core Clock	1530 MHz
Streaming Processors	5120
Memory Size	16 GB
Memory Interface	4096-bit HBM2
Peak FP32 TFLOPS	15.7
Peak FP64 TFLOPS	7.8
Shared Memory Size	Configurable up to 96 KB
Register File Size/SM	256 KB
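
The peak throughput figures in the specification table can be sanity-checked from the core count and clock. The short sketch below assumes one fused multiply-add (2 floating-point operations) per unit per cycle, and it assumes the GV100 has half as many FP64 units as FP32 units (2560), a number that does not appear in the table. The helper name `peak_tflops` is hypothetical.

```python
def peak_tflops(units, clock_mhz, flops_per_cycle=2):
    """Peak throughput in TFLOPS: units x FLOPs per cycle x clock rate."""
    return units * flops_per_cycle * clock_mhz * 1e6 / 1e12


fp32_units = 5120              # "Streaming Processors" from the table
fp64_units = fp32_units // 2   # assumption: 2560 FP64 units, not listed in the table
clock_mhz = 1530               # "Core Clock" from the table

fp32 = peak_tflops(fp32_units, clock_mhz)
fp64 = peak_tflops(fp64_units, clock_mhz)
print(f"FP32 peak: {fp32:.2f} TFLOPS, FP64 peak: {fp64:.2f} TFLOPS")
```

Rounded to one decimal place, the two results agree with the 15.7 and 7.8 TFLOPS listed in the table.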

Processes	CPU (MPI)		CPU+GPU (cuSPARSE preconditioner)		CPU+GPU (Block-ISAI preconditioner)
N	Iterations	Iteration time (ms)	Iterations	Iteration time (ms)	Iterations	Iteration time (ms)
2	423	123,064	423	13,622	738	6,387
4	490	73,574	490	11,124	712	3,148
8	460	36,680	459	6,099	722	1,825
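
The speedup and parallel-efficiency figures quoted in the abstract can be approximately recovered from the iteration times in the table above. The sketch below is only a reading aid under two assumptions that the page does not state: that this table is the lid-driven cavity experiment the abstract refers to, and that parallel efficiency is measured relative to the 2-process run. The names `times_ms`, `speedup`, and `parallel_efficiency` are hypothetical.

```python
# Iteration times in milliseconds, copied from the table (keyed by process count).
times_ms = {
    "CPU (MPI)": {2: 123064, 4: 73574, 8: 36680},
    "cuSPARSE": {2: 13622, 4: 11124, 8: 6099},
    "Block-ISAI": {2: 6387, 4: 3148, 8: 1825},
}


def speedup(baseline, target, procs):
    """How many times faster `target` is than `baseline` at a given process count."""
    return times_ms[baseline][procs] / times_ms[target][procs]


def parallel_efficiency(solver, base_procs, procs):
    """Efficiency relative to the base_procs run (assumed baseline): T_b*b / (T_p*p)."""
    t = times_ms[solver]
    return (t[base_procs] * base_procs) / (t[procs] * procs)


if __name__ == "__main__":
    print(f"Block-ISAI vs CPU, 8 processes: {speedup('CPU (MPI)', 'Block-ISAI', 8):.2f}x")
    print(f"Block-ISAI vs cuSPARSE, 8 processes: {speedup('cuSPARSE', 'Block-ISAI', 8):.2f}x")
    for name in times_ms:
        print(f"{name}: efficiency 2 -> 8 processes = {parallel_efficiency(name, 2, 8):.1%}")
```

With these assumptions the speedups come out at roughly 20.1 and 3.34, and the efficiencies at roughly 84%, 56%, and 87%. These are close to, but not digit-for-digit identical with, the values in the abstract, so the paper may round differently or define efficiency against a different baseline.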
