Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (2): 120-129.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.02.012

doi: 10.11871/jfdc.issn.2096-742X.2025.02.012

• Technology and Applications •

Optimization Method for Large Language Models on Domestic Supercomputer System

QU Zhiyong1, WANG Xiaoguang2, ZHOU Chunbao2,*, SHI Yuanxiang1, QIAO Jiawei1

  1. Shanxi Meteorological Information Center, Taiyuan, Shanxi 030006, China
  2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2024-11-04  Online: 2025-04-20  Published: 2025-04-23
  • Corresponding author: ZHOU Chunbao
  • About the authors: QU Zhiyong is a senior engineer at the Shanxi Meteorological Information Center. His main research direction is meteorological information technology.
    In this paper, he is responsible for method research, experimental design, and paper writing.
    E-mail: 153224922@qq.com
    ZHOU Chunbao, Ph.D., is a researcher and master's supervisor at the Computer Network Information Center, Chinese Academy of Sciences. His main research directions include parallel computing as well as basic algorithms and software for artificial intelligence.
    In this paper, he is responsible for method design and experimental guidance.
    E-mail: zhoucb@cnic.cn
  • Funding:
    Shanxi Meteorological Bureau Open Competition ("Jiebang Guashuai") Project (SXKJBGS202409); jointly funded by the Shanxi Provincial Archives Science and Technology Project (2024-SX-002); Key Innovation Team of the National Meteorological Information Center (NMIC-2024-ZD08)

Abstract:

[Objective] To reduce the training cost of large language models on domestic supercomputer systems, we propose an optimization method. [Methods] We build a communication backend based on MPI and UCC that combines rapid process-group construction with low-latency collective communication, and we further introduce a compression-based collective communication optimization. [Results] In large language model training experiments under various configurations on a domestic supercomputer system, the proposed optimization method effectively reduces training cost. [Conclusions] The experimental results demonstrate the effectiveness of the proposed optimization method in reducing the training cost of large models.
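
For readers who want a concrete picture of compression-based collective communication, the sketch below illustrates the general idea using PyTorch's documented DistributedDataParallel communication-hook API: each gradient bucket is cast to FP16 before the all-reduce and cast back afterwards, roughly halving the bytes each collective moves. This is only an illustrative stand-in, not the method of the paper; the MPI/UCC backend and the specific compression scheme described above are not reproduced here, and the hook name and the usage line are assumptions made for the example.

import torch
import torch.distributed as dist

def fp16_compress_allreduce_hook(process_group, bucket):
    """Illustrative DDP communication hook: compress the gradient bucket to
    FP16, all-reduce the compressed buffer, then cast back to the original
    dtype. Not the backend described in the paper."""
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()

    # Compress: cast the flattened FP32 gradients to FP16 and pre-divide by
    # world_size so that the SUM all-reduce yields an average.
    compressed = bucket.buffer().to(torch.float16).div_(world_size)

    # Run the collective asynchronously on the half-size buffer.
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Decompress: cast the reduced result back to the bucket's dtype.
        return fut.value()[0].to(bucket.buffer().dtype)

    return fut.then(decompress)

# Hypothetical usage once a model has been wrapped in DistributedDataParallel:
#   ddp_model.register_comm_hook(state=None, hook=fp16_compress_allreduce_hook)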

Key words: large language model, distributed training, collective communication, data compression