An Effective Implementation of LDLT Decomposition of Sparse Symmetric Matrix on GPU

doi:10.11871/jfdc.issn.2096-742X.2021.03.012

Abstract

[Objective] LDL^T decomposition is an effective tool for solving sparse symmetric linear systems, especially those for which iterative solvers have difficulty converging. However, it is difficult to implement on the GPU because of data dependencies and irregular data access during factorization. [Methods] This paper designs and implements an effective GPU-based LDL^T decomposition method for sparse symmetric matrices, built on Cholesky symbolic decomposition, a right-looking decomposition algorithm, and a level partition of the dependency graph of the sparse matrix. Using controlled kernel launches with CUDA dynamic parallelism, all three loops of the algorithm are parallelized, so the proposed method achieves a higher degree of parallelism. [Results] Experimental results show that the GPU implementation of LDL^T achieves a maximum speedup of 46.2 over UMFPACK on a typical collection of sparse symmetric matrices. [Conclusions] The CUDA implementation of LDL^T can serve as a reference for research on and implementation of high-performance numerical algorithms for sparse matrices on GPU-based heterogeneous platforms.

Key words: LDL^T decomposition, right-looking algorithm, GPU, dynamic parallelism
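
The right-looking scheme and the level partition named in the abstract can be illustrated with a minimal serial sketch. It is not the paper's CUDA implementation: it works on a small dense matrix stored as nested lists, does no pivoting, and the function names `ldlt_right_looking` and `column_levels` are hypothetical. The dependency rule used here (column j depends on column k < j whenever L[j][k] is nonzero) is an assumption about how the dependency graph is defined.

```python
def ldlt_right_looking(A):
    """Right-looking LDL^T of a symmetric matrix, without pivoting.

    Returns (L, d) with L unit lower triangular and d the diagonal of D,
    so that A = L * diag(d) * L^T. Illustrative sketch only.
    """
    n = len(A)
    S = [[float(x) for x in row] for row in A]  # working copy (lower triangle is used)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for k in range(n):                      # loop 1: current pivot column
        d[k] = S[k][k]
        if d[k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / d[k]        # scale column k by the pivot
        for j in range(k + 1, n):           # loop 2: columns to the right of k
            if L[j][k] == 0.0:
                continue                    # column j does not depend on column k
            for i in range(j, n):           # loop 3: rows of column j
                S[i][j] -= L[i][k] * d[k] * L[j][k]
    return L, d


def column_levels(L):
    """Assign each column a level; columns on the same level have no
    dependencies on each other and could be factorized concurrently."""
    n = len(L)
    level = [0] * n
    for j in range(n):
        deps = [level[k] for k in range(j) if L[j][k] != 0.0]
        level[j] = 1 + max(deps) if deps else 0
    return level


# Small symmetric indefinite example with two independent 2x2 couplings.
A = [[4, 0, 2, 0],
     [0, -3, 0, 1],
     [2, 0, 5, 0],
     [0, 1, 0, 2]]
L, d = ldlt_right_looking(A)
levels = column_levels(L)   # [0, 0, 1, 1]: columns 0 and 1 are independent
```

In this example columns 0 and 1 share level 0 and columns 2 and 3 share level 1, so each pair could be processed in parallel. A GPU version would additionally parallelize the two inner loops within each column, which is the role the abstract assigns to dynamic parallelism.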

Chen Xinfeng, Wang Wu. An Effective Implementation of LDLT Decomposition of Sparse Symmetric Matrix on GPU[J]. Frontiers of Data and Computing, 2021, 3(3): 136-147.

matrix	n	nz	nnz	nnz/nz	symbolic	numeric	solve	total
windscreen	22692	752541	5545914	7.370	311.814	973.928	534.656	1820.398
crystk03	24696	887937	9221958	10.386	371.007	693.275	88.267	1152.549
bcsstk37	25503	583240	2996850	5.138	182.664	342.704	29.465	554.833
bcsstk35	30237	740200	3045171	4.114	178.499	354.06	31.053	563.612
t3dh	79171	2215638	45191167	20.396	3311.172	4136.49	440.249	7887.911
TEM152078	152078	3305720	57409931	17.367	4423.406	4893.649	556.43	9873.485
TEM181302	181302	4010156	70510354	17.583	5555.54	6092.029	689.041	12336.61
pwtk	217918	5926171	47124510	7.952	2570.829	2878.245	468.653	5917.727
BenElechi1	245874	6698185	52230259	7.798	2779.673	3193.183	176.464	6095.320

matrix	symbolic	numeric	solve	T(umf)	Sp(ldlt)	Sp(total)
windscreen	603.154	13968.758	94.106	14666.018	14.342	8.056
crystk03	215.40	4951.943	34.575	5201.919	7.143	4.513
bcsstk37	54.54	1447.959	30.678	1533.177	4.224	2.763
bcsstk35	223.709	1661.172	31.031	1915.912	4.692	3.399
t3dh	7468.603	191216.956	443.984	199129.543	46.227	25.245
TEM152078	9288.08	200071.001	566.479	209925.56	40.884	21.262
TEM181302	11430.526	218608.984	288.069	230327.579	35.884	18.670
pwtk	1171.662	21133.883	211.081	22516.626	7.343	3.805
BenElechi1	1129.767	24923.273	211.145	26264.185	7.805	4.309
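
The speedup columns in the second table appear to be simple ratios of the UMFPACK times to the GPU times in the first table: Sp(ldlt) compares the numeric factorization times and Sp(total) compares the total times. This reading is an assumption inferred from the numbers, not a definition given in the tables. The sketch below recomputes the ratios for three rows copied from the tables; the function name `speedups` is hypothetical.

```python
# (matrix, GPU numeric, GPU total, UMFPACK numeric, UMFPACK total T(umf)),
# values copied from the two tables above.
rows = [
    ("windscreen", 973.928, 1820.398, 13968.758, 14666.018),
    ("t3dh", 4136.49, 7887.911, 191216.956, 199129.543),
    ("pwtk", 2878.245, 5917.727, 21133.883, 22516.626),
]


def speedups(gpu_numeric, gpu_total, umf_numeric, umf_total):
    """Return (Sp(ldlt), Sp(total)) under the assumed ratio definitions."""
    return umf_numeric / gpu_numeric, umf_total / gpu_total


results = {name: speedups(*times) for name, *times in rows}

for name, (sp_ldlt, sp_total) in results.items():
    print(f"{name}: Sp(ldlt) = {sp_ldlt:.3f}, Sp(total) = {sp_total:.3f}")
```

For t3dh the numeric ratio comes out at roughly 46.2, which matches the maximum speedup quoted in the abstract.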