[1] |
BROWN T, MANN B, RYDER N, et al. Language Models are Few-Shot Learners[C]. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020: 1877-1901.
|
[2] |
KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling Laws for Neural Language Models[Z]. ArXiv, 2020: abs/2001.08361.
|
[3] |
DEAN J, CORRADO G, MONGA R, et al. Large Scale Distributed Deep Networks[C]. In Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012: 1223-1231.
|
[4] |
NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM[C]. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021: 1-15.
|
[5] |
YU E, DONG D, LIAO X. Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey[J]. IEEE Transactions on Parallel and Distributed Systems, 2023, 34(12): 3294-3308.
|
[6] |
SEIDE F, FU H, DROPPO J, et al. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs[C]. Interspeech, 2014: 1058-1062.
|
[7] |
ALISTARH D, GRUBIC D, LI J, et al. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding[C]. Advances in Neural Information Processing Systems, 2017: 1707-1718.
|
[8] |
WU J, HUANG W, HUANG J, et al. Error Compensated Quantized SGD and Its Applications to Large-Scale Distributed Optimization[C]. International Conference on Machine Learning, 2018: 5325-5333.
|
[9] |
ZHANG H, LI J, KARA K, et al. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning[C]. International Conference on Machine Learning, 2017: 4035-4043.
|
[10] |
KARIMIREDDY S, REBJOCK Q, STICH S, et al. Error Feedback Fixes SignSGD and Other Gradient Compression Schemes[C]. International Conference on Machine Learning, 2019: 3252-3261.
|
[11] |
HUANG J, DI S, YU X, et al. An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression[C]. IEEE International Parallel and Distributed Processing Symposium, 2024: 752-764.
|
[12] |
FENG H, ZHANG B, YE F, et al. Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression[C]. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2024: 17-22.
|
[13] |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need[C]. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
|
[14] |
DEVLIN J, CHANG M, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 4171-4186.
|
[15] |
ZINKEVICH M, WEIMER M, LI L, et al. Parallelized Stochastic Gradient Descent[C]. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2010: 2595-2603.
|
[16] |
HUANG Y, CHENG Y, BAPNA A, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism[C]. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 103-112.
|
[17] |
NARAYANAN D, HARLAP A, PHANISHAYEE A, et al. PipeDream: Generalized Pipeline Parallelism for DNN Training[C]. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019: 1-15.
|
[18] |
SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism[Z]. ArXiv, 2019: abs/1909.08053.
|
[19] |
SMITH S, PATWARY M, NORICK B, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model[Z]. ArXiv, 2022: abs/2201.11990.
|
[20] |
TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and Efficient Foundation Language Models[Z]. ArXiv, 2023: abs/2302.13971.
|
[21] |
YU X, DI S, ZHAO K, et al. Ultrafast Error-bounded Lossy Compression for Scientific Datasets[C]. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 2022: 159-171.
|
[22] |
ZHOU Q, CHU C, KUMAR N, et al. Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters[C]. IEEE International Parallel and Distributed Processing Symposium, 2021: 444-453.
|