An Effective Implementation of LDL^T Decomposition of Sparse Symmetric Matrix on GPU

doi:10.11871/jfdc.issn.2096-742X.2021.03.012

Frontiers of Data and Computing, 2021, Vol. 3, Issue 3: 136-147.

Chen Xinfeng^1,2, Wang Wu^1,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2021-02-01  Online: 2021-06-20  Published: 2021-07-09
Corresponding author: Wang Wu
About the authors: Chen Xinfeng is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research interests are high performance computing and parallel computing. In this paper, he implemented and optimized the LDL^T decomposition of real and complex sparse symmetric matrices on GPU. E-mail: chenxinfeng@cnic.cn
Wang Wu, Ph.D., is an associate researcher at the Computer Network Information Center, Chinese Academy of Sciences. His main research interests are parallel algorithms and high performance computing. In this paper, he provided algorithmic guidance for sparse matrix decomposition on GPU. E-mail: wangwu@sccas.cn
Funding:
National Key Research and Development Program of China, project "Development and Application Demonstration of a High-Performance Application Software System for Complex Electromagnetic Environments" (2017YFB0202502); Chinese Academy of Sciences 13th Five-Year Informatization Program, "Scientific Research Informatization Application Project" (XXH13506-405)

Abstract

[Objective] LDL^T decomposition is an effective tool for solving sparse symmetric linear systems, especially those on which iterative solvers converge poorly. It is hard to implement on a GPU, however, because the factorization involves data dependencies and irregular data access. [Methods] This paper designs and implements a GPU-based LDL^T decomposition for sparse symmetric matrices. It combines Cholesky symbolic decomposition, a right-looking decomposition algorithm, a level partition of the dependency graph of the sparse matrix, and controlled kernel launches through CUDA dynamic parallelism. All three loops of the algorithm are parallelized, which gives the method a higher degree of parallelism. [Results] On a typical test set of sparse symmetric matrices, the GPU implementation of LDL^T decomposition achieves a maximum speedup of 46.2 over UMFPACK. [Conclusions] The CUDA implementation strategy for LDL^T decomposition can serve as a reference for the study and implementation of high-performance numerical algorithms for sparse matrices on GPU-based heterogeneous platforms.

Key words: LDL^T decomposition, right-looking algorithm, GPU, dynamic parallelism
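
To make the "three loops" mentioned in the abstract concrete, here is a minimal CPU sketch of a right-looking LDL^T factorization on a small dense symmetric matrix. It is an illustration only, not the paper's GPU code: it uses dense storage and no pivoting, and the function names `ldlt_right_looking` and `reconstruct` are hypothetical.

```python
def ldlt_right_looking(A):
    """Factor a symmetric matrix A (list of lists) as L * D * L^T, no pivoting.

    Assumption: every pivot D[k] is nonzero (no perturbation is applied here).
    """
    n = len(A)
    # Working copy of the lower triangle; it is updated in place.
    W = [[float(A[i][j]) if j <= i else 0.0 for j in range(n)] for i in range(n)]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for k in range(n):                    # loop 1: current column k
        D[k] = W[k][k]
        for i in range(k + 1, n):         # scale column k by the pivot
            L[i][k] = W[i][k] / D[k]
        for j in range(k + 1, n):         # loop 2: trailing columns to update
            for i in range(j, n):         # loop 3: rows of trailing column j
                W[i][j] -= L[i][k] * D[k] * L[j][k]
    return L, D


def reconstruct(L, D):
    """Multiply L * diag(D) * L^T back together to check the factorization."""
    n = len(D)
    return [[sum(L[i][k] * D[k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

"Right-looking" means that as soon as column k is finished, its contribution is pushed into all columns to its right. In the sparse case only columns j with a nonzero L[j][k] receive an update, which is where the dependency graph comes from.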

Cite this article: Chen Xinfeng, Wang Wu. An Effective Implementation of LDL^T Decomposition of Sparse Symmetric Matrix on GPU[J]. Frontiers of Data and Computing, 2021, 3(3): 136-147.

Figures and tables (19): Figures 1-8, Algorithms 1-4, Pseudocode 1-5, Tables 1-2.

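Timings of the GPU LDL^T implementation per test matrix. n: matrix order; nz: nonzeros of the input matrix; nnz: nonzeros after symbolic decomposition (fill-in included); nnz/nz: fill ratio; total = symbolic + numeric + solve.
matrix	n	nz	nnz	nnz/nz	symbolic	numeric	solve	total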
windscreen	22692	752541	5545914	7.370	311.814	973.928	534.656	1820.398
crystk03	24696	887937	9221958	10.386	371.007	693.275	88.267	1152.549
bcsstk37	25503	583240	2996850	5.138	182.664	342.704	29.465	554.833
bcsstk35	30237	740200	3045171	4.114	178.499	354.06	31.053	563.612
t3dh	79171	2215638	45191167	20.396	3311.172	4136.49	440.249	7887.911
TEM152078	152078	3305720	57409931	17.367	4423.406	4893.649	556.43	9873.485
TEM181302	181302	4010156	70510354	17.583	5555.54	6092.029	689.041	12336.61
pwtk	217918	5926171	47124510	7.952	2570.829	2878.245	468.653	5917.727
BenElechi1	245874	6698185	52230259	7.798	2779.673	3193.183	176.464	6095.320

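UMFPACK timings and speedups of the GPU implementation. T(umf) = symbolic + numeric + solve for UMFPACK; Sp(ldlt) = UMFPACK numeric time / GPU numeric time; Sp(total) = T(umf) / GPU total time.
matrix	symbolic	numeric	solve	T(umf)	Sp(ldlt)	Sp(total)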
windscreen	603.154	13968.758	94.106	14666.018	14.342	8.056
crystk03	215.40	4951.943	34.575	5201.919	7.143	4.513
bcsstk37	54.54	1447.959	30.678	1533.177	4.224	2.763
bcsstk35	223.709	1661.172	31.031	1915.912	4.692	3.399
t3dh	7468.603	191216.956	443.984	199129.543	46.227	25.245
TEM152078	9288.08	200071.001	566.479	209925.56	40.884	21.262
TEM181302	11430.526	218608.984	288.069	230327.579	35.884	18.670
pwtk	1171.662	21133.883	211.081	22516.626	7.343	3.805
BenElechi1	1129.767	24923.273	211.145	26264.185	7.805	4.309
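
The speedup columns can be reproduced from the two tables with simple ratios. The short sketch below does this for the t3dh row, which yields the maximum speedup of 46.2 quoted in the abstract. The helper name `speedup` is hypothetical, and reading Sp(ldlt) as a ratio of numeric-phase times is inferred from the tabulated values rather than stated on the page.

```python
def speedup(baseline_time, accelerated_time):
    """Ratio of baseline time to accelerated time (same time unit for both)."""
    return baseline_time / accelerated_time


# t3dh row: GPU timings from the first table, UMFPACK timings from the second.
gpu_numeric, gpu_total = 4136.49, 7887.911
umf_numeric, umf_total = 191216.956, 199129.543

sp_ldlt = speedup(umf_numeric, gpu_numeric)   # numeric factorization only
sp_total = speedup(umf_total, gpu_total)      # symbolic + numeric + solve

print(f"Sp(ldlt) = {sp_ldlt:.3f}, Sp(total) = {sp_total:.3f}")
```

The end-to-end speedup is lower than the numeric-phase speedup because the symbolic and solve phases speed up far less than the numeric factorization.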

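function dynamic(Lp, Li, Lx, level_p, level_i, tmpMem, tmpMem1, n, level, offset)
    // Each block handles one column k of the current level.
    k = level_i[level_p[level] + offset + blockIdx.x];
    // Pivot: the diagonal entry is stored first in column k.
    d = Lx[Lp[k]];
    // Replace a pivot that is too small by 1e-5.
    if abs(d) < 1e-5 then
        Lx[Lp[k]] = 1e-5;
        d = 1e-5;
    end if
    // Number of sub-diagonal nonzeros in column k.
    subColSize = Lp[k+1] - Lp[k] - 1;
    // Child kernels launched from the device (CUDA dynamic parallelism).
    factorize<<<(subColSize + 1023) / 1024, 1024>>>(Lp, Li, Lx, tmpMem, tmpMem1, d, n, k, blockIdx.x);
    update<<<subColSize, 1024>>>(Lp, Li, Lx, tmpMem, tmpMem1, n, k, blockIdx.x);
    cleartmpMem<<<(subColSize + 1023) / 1024, 1024>>>(Lp, Li, tmpMem, tmpMem1, n, k, blockIdx.x);
end function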
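
The kernel above reads its column index from the arrays `level_p` and `level_i`. The page does not show how they are built. The following CPU sketch is one plausible construction, under two assumptions: L is stored by columns in `Lp`/`Li` with the diagonal entry first (as the pseudocode implies), and `level_p`/`level_i` use a CSR-like layout. The function name `level_partition` is hypothetical.

```python
def level_partition(Lp, Li, n):
    """Group the columns of L into levels with no dependencies inside a level.

    In a right-looking factorization, column k updates every column j > k for
    which L[j][k] is nonzero, so column j must come at least one level after k.
    Returns (level_p, level_i): the columns of level v are
    level_i[level_p[v]:level_p[v + 1]].
    """
    level = [0] * n
    for k in range(n):                          # columns in increasing order
        for p in range(Lp[k] + 1, Lp[k + 1]):   # skip the diagonal entry
            j = Li[p]                           # column j depends on column k
            level[j] = max(level[j], level[k] + 1)

    num_levels = max(level) + 1 if n > 0 else 0
    level_p = [0] * (num_levels + 1)
    for v in level:                             # count columns per level
        level_p[v + 1] += 1
    for v in range(num_levels):                 # prefix sum gives the offsets
        level_p[v + 1] += level_p[v]

    level_i = [0] * n
    next_slot = level_p[:num_levels]            # next free position per level
    for k in range(n):
        level_i[next_slot[level[k]]] = k
        next_slot[level[k]] += 1
    return level_p, level_i
```

All columns in one level can be factorized concurrently, one block per column, and the levels are processed one after another. Visiting the columns in increasing order is sufficient because a column only depends on columns with smaller indices.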