数据与计算发展前沿 ›› 2021, Vol. 3 ›› Issue (3): 136-147.doi: 10.11871/jfdc.issn.2096-742X.2021.03.012

• 技术与应用 • 上一篇    下一篇

稀疏对称矩阵的LDLT分解在GPU上的高效实现

陈鑫峰1,2(),王武1,*()   

  1. 1.中国科学院计算机网络信息中心,北京 100190
    2.中国科学院大学,北京 100049
  • 收稿日期:2021-02-01 出版日期:2021-06-20 发布日期:2021-07-09
  • 通讯作者: 王武
  • 作者简介:陈鑫峰,中国科学院计算机网络信息中心,在读硕士研究生,主要研究方向为高性能计算和并行计算。
    本文承担工作:实/复数稀疏对称矩阵的LDLT分解在GPU上的实现。
    Chen Xinfeng is a master student at Computer Network Information Center, Chinese Academy of Sciences. His main research interests are high performance computing and parallel computing.
    In this paper, he undertakes the following tasks: the implemen-tation and optimization of LDLT decomposition of complex/real sparse symmetric matrix on GPU.
    E-mail: chenxinfeng@cnic.cn|王武,中国科学院计算机网络信息中心,博士,副研究员,研究方向为并行算法、高性能计算。
    本文承担工作为GPU上的稀疏矩阵分解的算法指导。
    Wang Wu, Ph.D., is an associate researcher at Computer Network Information Center, Chinese Academy of Sciences. His main research interests are parallel algorithm and high performance computing. In this paper, he undertakes the following tasks: algorithm director of sparse matrix decomposi-tion on GPU.
    E-mail: wangwu@sccas.cn
  • 基金资助:
    国家重点研发计划项目“复杂电磁环境高性能应用软件系统研制及应用示范”(2017YFB 0202502);中国科学院“十三五”信息化专项“科研信息化应用工程”(XXH13506-405)

An Effective Implementation of LDLT Decomposition of Sparse Symmetric Matrix on GPU

Chen Xinfeng1,2(),Wang Wu1,*()   

  1. 1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2021-02-01 Online:2021-06-20 Published:2021-07-09
  • Contact: Wang Wu

摘要:

【目的】 LDLT分解是求解很多稀疏对称线性系统的有效工具之一,尤其是对于迭代法难以收敛的问题。然而在GPU上实现LDLT分解存在困难,因为分解过程中存在数据依赖和不规则的数据访问。【方法】 本文设计并实现了一个基于GPU的稀疏对称矩阵的LDLT分解,它采用Cholesky的符号分解和右视分解算法、稀疏矩阵依赖图的层次划分,以及CUDA的动态并行核调度技术,算法的所有三层循环都并行化,从而获得更高的并行度。【结果】 实验结果表明,针对稀疏对称矩阵的一个典型的测试集,在GPU上实现的LDLT分解相对于UMFPACK最高加速46.2倍。【结论】 LDLT分解CUDA实现策略可为高性能GPU异构平台上开展稀疏矩阵的高性能数值算法研究与实现提供借鉴。

关键词: LDLT 分解, 右视算法, GPU, 动态并行

Abstract:

[Objective] LDLT decomposition is an effective tool to solve many problems in sparse symmetric linear systems, especially for those problems which are hard to converge using iterative solvers. However, it is difficult to implement LDLT on the GPU for data dependency and irregular data access during the factorization. [Methods] In this paper, an effective GPU-based LDLT decomposition method of sparse symmetric matrix is designed and implemented based on Cholesky symbolic decomposition, right-looking decomposition algorithm and level partition of the dependency graph for the sparse matrix. By using controlled kernel launch for CUDA dynamic parallelism, all three loops of the algorithm are parallelized, so the proposed method can achieve higher parallelism.[Results] Experimental results show that the implementation of LDLT on GPU can achieve a maximum speedup of 46.2 compared to UMFPACK for a typical collection of sparse symmetric matrix. [Conclusions] CUDA implementation of LDLT can give reference to high performance numerical algorithm research and implementation for sparse matrix on GPU-based heterogeneous platforms.

Key words: LDLT decomposition, right-looking algorithm, GPU, dynamic parallelism