Implementation of CCFD-KSSolver Component for GPU Architecture

doi:10.11871/jfdc.issn.2096-742X.2024.01.007

Abstract

[Application Background] In high-performance applications such as computational fluid dynamics and materials science, the solution of large sparse linear systems directly affects both efficiency and accuracy. Heterogeneous many-core design has become a defining feature of modern supercomputer architectures, and the trend is expected to continue. [Methods] The linear solver component CCFD-KSSolver is designed and implemented for CPU+GPU heterogeneous supercomputing systems. The component implements Krylov subspace solvers for the block-structured matrices that arise in multi-physics problems, together with a variety of typical preconditioners. Computation-communication overlap, GPU memory access optimization, and CPU-GPU collaborative computing are used to improve its computational efficiency. [Results] Experimental results show that with 8 subdomains, the Block-ISAI subdomain solver achieves speedups of 20.09× and 3.34× over the CPU and cuSPARSE subdomain solvers, respectively, and scales better. For million-scale matrices, the parallel efficiency of the three KSSolver subdomain solvers on eight GPUs is 83.8%, 55.7%, and 87.4%, respectively. [Conclusions] A classical multi-physics application with block structure is used to test the solver and preconditioner components. The results show that the solver is stable and efficient, and that it supports the development of high-performance computing applications on heterogeneous systems.

Key words: GPU, KSSolver, parallel optimization, preconditioner, high-performance computing

ZHANG Haoyuan, MA Wenpeng, YUAN Wu, ZHANG Jian, LU Zhonghua. Implementation of CCFD-KSSolver Component for GPU Architecture[J]. Frontiers of Data and Computing, 2024, 6(1): 68-78, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.01.007.

Table 1 Interface design of KSSolver

Function description	Interface
Solve $A x = b$ with a Krylov subspace method	Block-GMRES(Matrix M)
RAS (Restricted Additive Schwarz) preconditioning	RAS_Preconditioning(Preconditioner M, Vector global_in, Vector global_out)
Block incomplete LU factorization	CPU_PBILU(Matrix M, LowerFactor L, UpperFactor U)
Block incomplete LU factorization	cuSPARSE_PBILU(Preconditioner M, LowerFactor L, UpperFactor U)
Subdomain solver	FB_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
	cuSPARSE_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
	BISAI_SDSolver(Preconditioner local_M, Vector local_in, Vector local_out)
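
The way these interfaces fit together (a Krylov solver, a restricted additive Schwarz preconditioner, and a local subdomain solver) can be illustrated with a small CPU-only sketch. It is not the KSSolver implementation: it assumes SciPy, a scalar tridiagonal test matrix rather than a point-block multi-physics matrix, contiguous index ranges as subdomains, and SciPy's `spilu` as a stand-in for the subdomain solver. The name `build_ras_preconditioner` and the `overlap` parameter are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres, spilu


def build_ras_preconditioner(A, n_sub, overlap=2):
    """Return a LinearOperator applying a restricted additive Schwarz step.

    Assumption: subdomains are contiguous index ranges, each extended by
    `overlap` rows on both sides; each local problem is solved with ILU.
    """
    A = A.tocsr()
    n = A.shape[0]
    bounds = np.linspace(0, n, n_sub + 1).astype(int)
    subdomains = []
    for k in range(n_sub):
        lo, hi = bounds[k], bounds[k + 1]                # owned rows
        ext_lo, ext_hi = max(lo - overlap, 0), min(hi + overlap, n)
        local = A[ext_lo:ext_hi, ext_lo:ext_hi].tocsc()  # overlapping block
        subdomains.append((lo, hi, ext_lo, ext_hi, spilu(local)))

    def apply(r):
        r = np.asarray(r, dtype=float).ravel()
        z = np.zeros_like(r)
        for lo, hi, ext_lo, ext_hi, ilu in subdomains:
            # Solve on the overlapping subdomain ...
            local_z = ilu.solve(r[ext_lo:ext_hi])
            # ... but write back only the owned rows (the "restricted" part).
            z[lo:hi] = local_z[lo - ext_lo:hi - ext_lo]
        return z

    return LinearOperator((n, n), matvec=apply, dtype=float)


# Small diagonally dominant test system (an assumption, chosen for fast convergence).
n = 400
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

M = build_ras_preconditioner(A, n_sub=8, overlap=2)
x, info = gmres(A, b, M=M)
rel_res = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print("gmres info:", info, "relative residual:", rel_res)
```

Because each subdomain writes only its owned rows, the overlapping solves are independent of one another, which is the property that lets each one be assigned to a separate process or GPU.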


Tesla Product	Tesla V100
GPU	GV100
Core Clock	1530MHz
Streaming Processors	5120
Memory Size	16GB
Memory Interface	4096-bit HBM2
Peak FP32 TFLOPS	15.7
Peak FP64 TFLOPS	7.8
Shared Memory Size	Configurable up to 96KB
Register File Size/SM	256KB

Number of processes	CPU (MPI)		CPU+GPU (cuSPARSE preconditioning)		CPU+GPU (Block-ISAI preconditioning)
N	Iterations	Iteration time (ms)	Iterations	Iteration time (ms)	Iterations	Iteration time (ms)
2	423	123,064	423	13,622	738	6,387
4	490	73,574	490	11,124	712	3,148
8	460	36,680	459	6,099	722	1,825
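
The speedups quoted in the abstract can be checked directly against the iteration times in this table. The short sketch below does that arithmetic. It also computes a parallel efficiency relative to the 2-process run; that baseline is an assumption, since the page does not state how the efficiencies in the abstract were computed. The function names are hypothetical.

```python
# Iteration times in ms, copied from the table above, keyed by process count.
times_ms = {
    "CPU (MPI)": {2: 123064, 4: 73574, 8: 36680},
    "cuSPARSE": {2: 13622, 4: 11124, 8: 6099},
    "Block-ISAI": {2: 6387, 4: 3148, 8: 1825},
}


def speedup_over(baseline, target, n_procs):
    """Ratio of baseline time to target time at the same process count."""
    return times_ms[baseline][n_procs] / times_ms[target][n_procs]


def relative_efficiency(solver, n_procs, base_procs=2):
    """Efficiency relative to the base_procs run (assumed baseline)."""
    t = times_ms[solver]
    return (t[base_procs] / t[n_procs]) / (n_procs / base_procs)


speedup_vs_cpu = speedup_over("CPU (MPI)", "Block-ISAI", 8)
speedup_vs_cusparse = speedup_over("cuSPARSE", "Block-ISAI", 8)
efficiency = {name: relative_efficiency(name, 8) for name in times_ms}

print(f"Block-ISAI vs CPU at 8 processes: {speedup_vs_cpu:.2f}x")
print(f"Block-ISAI vs cuSPARSE at 8 processes: {speedup_vs_cusparse:.2f}x")
for name, eff in efficiency.items():
    print(f"{name}: efficiency at 8 processes relative to 2 = {eff:.1%}")
```

Under that baseline assumption the computed efficiencies land within a fraction of a percentage point of the figures in the abstract. Note also that Block-ISAI needs more iterations than the other two variants (about 720 versus about 460 at 8 processes) and is still faster overall, so its cost per iteration is much lower.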