Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (1): 135-151.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.010

doi: 10.11871/jfdc.issn.2096-742X.2025.01.010

• Technology and Application •

Research on Checkpoint-Based Elastic Training Methods for Large Language Models

WANG Zijian1,2, LI Kai1, CAO Rongqiang1, ZHOU Chunbao1,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 101408, China
• Received: 2024-06-30  Online: 2025-02-20  Published: 2025-02-21

Abstract:

[Objective] Training large language models demands enormous computing resources, so an elastic training method within the distributed training framework is crucial to ensure that the model can adjust flexibly when resources change and that training proceeds smoothly. [Methods] To meet this need, this paper proposes an elastic training mechanism for 3D parallel training and memory optimization, and introduces a checkpoint-based elastic training method into the deep learning framework Megatron-DeepSpeed. [Results] The proposed checkpoint-based elastic training method is applicable to model training under different resource allocations. By implementing the 3D parallel training and memory-optimized elastic training strategy on the LLaMA1 model, eight groups of comparative experiments are conducted, and indicators such as the change in loss value, job completion time, and memory allocation are compared after elastic resource changes. [Conclusions] The experimental results show that the loss values under different parallel degrees remain similar after elastic resource changes, confirming the scalability of the model under different resource configurations. The method maintains the continuity of training under resource constraints, and under sufficient resources, increasing computing resources and parallel degrees significantly improves training speed and performance.
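To make the checkpoint-based resume pattern that such elastic training relies on concrete, the following is a minimal sketch, not the paper's implementation, using DeepSpeed's public checkpoint API (deepspeed.initialize, save_checkpoint, load_checkpoint). The names CKPT_DIR, get_batches, and the config values are hypothetical placeholders, and the sketch omits the paper's 3D-parallel and memory-optimization specifics; reloading a ZeRO-sharded checkpoint under a changed parallel degree generally requires additional checkpoint handling, which is part of what an elastic training mechanism must address.

    # A minimal sketch (assumed pattern, not the paper's implementation) of
    # checkpoint-based elastic resume with DeepSpeed.
    import deepspeed

    CKPT_DIR = "checkpoints/llama"   # hypothetical checkpoint directory
    SAVE_INTERVAL = 100              # hypothetical save frequency (steps)

    ds_config = {                    # minimal illustrative DeepSpeed config
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 1},
    }

    def get_batches(start_step=0):
        """Hypothetical data source; yields batches, skipping consumed ones."""
        ...

    def train(model):
        # deepspeed.initialize partitions optimizer state across the current
        # world size (set by the launcher), so the same script can be
        # relaunched with a different number of GPUs after a resource change.
        engine, _, _, _ = deepspeed.initialize(
            model=model, model_parameters=model.parameters(), config=ds_config)

        # Resume from the latest checkpoint if one exists; load_checkpoint
        # returns (None, None) when the directory holds no checkpoint.
        _, client_state = engine.load_checkpoint(CKPT_DIR)
        step = (client_state or {}).get("step", 0)

        for batch in get_batches(start_step=step):
            loss = engine(batch)     # assumes the forward pass returns the loss
            engine.backward(loss)
            engine.step()
            step += 1
            if step % SAVE_INTERVAL == 0:
                # client_state carries training progress across restarts
                engine.save_checkpoint(CKPT_DIR, client_state={"step": step})

The key design point is that periodic full-state checkpoints decouple a training job from any fixed set of nodes: when resources shrink or grow, the job is relaunched with a new parallel configuration and continues from the last checkpoint rather than from scratch, which is the continuity property the experiments above measure.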

Key words: large language model, distributed training, elastic training, checkpoint