Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 136-148.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.013

doi: 10.11871/jfdc.issn.2096-742X.2025.06.013

• Technology and Application •

Porting and Adapting Deep Learning Framework Operators on Domestic Supercomputers

ZHOU Faguo1, LIU Fang2,*, WANG Yangang2, WANG Jue2, YU Miao1, LI Shunde2, ZHOU Chunbao2, WANG Jing2, YANG Qinmeng2

  1. School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing 100083, China
  2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2025-04-23 Online: 2025-12-20 Published: 2025-12-17
  • Contact: LIU Fang E-mail: zhoufaguo@cumtb.edu.cn; liufang@sccas.cn

Abstract:

[Application background] With the rapid development of large-scale deep learning models, the computing resources required for training continue to grow, and a single computing device can no longer meet the training requirements of such models. Supporting deep learning frameworks on supercomputing platforms is therefore of great strategic significance. As an independently developed domestic deep learning framework, MindSpore has become one of the important tools in artificial intelligence research owing to its efficient computing performance, flexible debugging capabilities, and convenient support for distributed training. [Problem] The MindSpore framework does not support Sugon high-performance computers and cannot be directly deployed and run on this supercomputing platform, which severely limits its application in supercomputing environments. [Method] To address the fact that the MindSpore framework cannot run on the Sugon high-performance computer because of the machine's distinct hardware architecture and software environment, this paper presents the work of porting and adapting the MindSpore framework to this platform. The Sugon high-performance computer adopts a heterogeneous architecture composed of Hygon CPUs and Hygon DCUs. The framework's lack of support for this platform manifests as operators that cannot be scheduled and executed on the Hygon DCU. Therefore, based on the framework's original GPU operators, this paper designs an operator migration scheme for the Hygon DCU. [Result] Using this scheme, a total of 278 operators were successfully migrated, enabling the MindSpore framework to run on Sugon high-performance computers. Furthermore, distributed parallel training of the LLaMA model was carried out on the Sugon high-performance computer to verify the good performance of the Hygon DCU operators within the MindSpore framework.
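To illustrate the kind of GPU-to-DCU operator translation the abstract describes, the sketch below shows a minimal element-wise kernel written against the HIP runtime, under the assumption that the Hygon DCU software stack is ROCm/HIP-compatible and that CUDA operator code is ported by mapping CUDA runtime calls and kernel launches to their HIP counterparts. This is a hypothetical example for illustration only, not the actual MindSpore operator code; the kernel name AddKernel and all parameters are made up.

// Hypothetical sketch: porting a CUDA element-wise operator to HIP for a DCU-class accelerator.
// CUDA -> HIP mapping assumed: cudaMallocManaged -> hipMallocManaged,
// cudaDeviceSynchronize -> hipDeviceSynchronize, <<<grid, block>>> -> hipLaunchKernelGGL.
#include <hip/hip_runtime.h>
#include <cstdio>

// The kernel body itself is source-compatible with the CUDA original;
// thread indexing (blockIdx/blockDim/threadIdx) is unchanged.
__global__ void AddKernel(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  float *a, *b, *c;
  // Unified (managed) memory keeps the example short; real operators manage device buffers explicitly.
  hipMallocManaged(&a, n * sizeof(float));
  hipMallocManaged(&b, n * sizeof(float));
  hipMallocManaged(&c, n * sizeof(float));
  for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  // HIP kernel launch replacing the CUDA triple-chevron syntax.
  hipLaunchKernelGGL(AddKernel, dim3((n + 255) / 256), dim3(256), 0, 0, a, b, c, n);
  hipDeviceSynchronize();

  printf("c[0] = %f\n", c[0]);  // expected: 3.000000
  hipFree(a); hipFree(b); hipFree(c);
  return 0;
}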

Key words: deep learning framework, supercomputer, operator migration, distributed parallel training