Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 136-148.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.013

doi: 10.11871/jfdc.issn.2096-742X.2025.06.013

• Technology and Application •

Porting and Adapting Deep Learning Framework Operators on Domestic Supercomputers

ZHOU Faguo1, LIU Fang2,*, WANG Yangang2, WANG Jue2, YU Miao1, LI Shunde2, ZHOU Chunbao2, WANG Jing2, YANG Qinmeng2

  1. School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing 100083, China
  2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2025-04-23 Online: 2025-12-20 Published: 2025-12-17
  • Contact: LIU Fang E-mail: zhoufaguo@cumtb.edu.cn; liufang@sccas.cn

Abstract:

[Application background] With the rapid development of large-scale deep learning models, the computing resources required for training continue to grow, and a single computing device can no longer meet the training requirements of such models. Supporting deep learning frameworks on supercomputing platforms is therefore of great strategic significance. As an independently developed domestic deep learning framework, MindSpore has become one of the important tools in artificial intelligence research owing to its efficient computing performance, flexible debugging capabilities, and convenient support for distributed training. [Problem] The MindSpore framework does not support Sugon high-performance computers and cannot be directly deployed and run on this supercomputing platform, which severely limits its application in supercomputing environments. [Method] To address the fact that the MindSpore framework cannot run on the Sugon high-performance computer because of the machine's distinct hardware architecture and software environment, this paper presents the work of porting and adapting the MindSpore framework to this platform. The Sugon high-performance computer adopts a heterogeneous architecture composed of Hygon CPUs and Hygon DCUs. The framework's lack of support for this platform manifests as operators that cannot be scheduled and executed on the Hygon DCU. Therefore, based on the framework's original GPU operators, this paper designs an operator migration scheme for the Hygon DCU. [Result] Using this scheme, a total of 278 operators were successfully migrated, enabling the MindSpore framework to run on Sugon high-performance computers. Furthermore, distributed parallel training of the LLaMA model was carried out on the Sugon high-performance computer to verify the good performance of the Hygon DCU operators within the MindSpore framework.
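To illustrate the kind of GPU-to-DCU operator translation the abstract describes, the sketch below shows a minimal element-wise kernel written against the HIP runtime, under the assumption that the Hygon DCU software stack is ROCm/HIP-compatible and that CUDA operator code is ported by mapping CUDA runtime calls and kernel launches to their HIP counterparts. This is a hypothetical example for illustration only, not the actual MindSpore operator code; the kernel name AddKernel and all parameters are made up.

// Hypothetical sketch: porting a CUDA element-wise operator to HIP for a DCU-class accelerator.
// CUDA -> HIP mapping assumed: cudaMallocManaged -> hipMallocManaged,
// cudaDeviceSynchronize -> hipDeviceSynchronize, <<<grid, block>>> -> hipLaunchKernelGGL.
#include <hip/hip_runtime.h>
#include <cstdio>

// The kernel body itself is source-compatible with the CUDA original;
// thread indexing (blockIdx/blockDim/threadIdx) is unchanged.
__global__ void AddKernel(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  float *a, *b, *c;
  // Unified (managed) memory keeps the example short; real operators manage device buffers explicitly.
  hipMallocManaged(&a, n * sizeof(float));
  hipMallocManaged(&b, n * sizeof(float));
  hipMallocManaged(&c, n * sizeof(float));
  for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  // HIP kernel launch replacing the CUDA triple-chevron syntax.
  hipLaunchKernelGGL(AddKernel, dim3((n + 255) / 256), dim3(256), 0, 0, a, b, c, n);
  hipDeviceSynchronize();

  printf("c[0] = %f\n", c[0]);  // expected: 3.000000
  hipFree(a); hipFree(b); hipFree(c);
  return 0;
}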

Key words: deep learning framework, supercomputer, operator migration, distributed parallel training