基于检查点的大模型弹性训练方法研究

doi:10.11871/jfdc.issn.2096-742X.2025.01.010

数据与计算发展前沿 ›› 2025, Vol. 7 ›› Issue (1): 135-151.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.010

doi: 10.11871/jfdc.issn.2096-742X.2025.01.010

基于检查点的大模型弹性训练方法研究

王子健^1,²(),李凯¹,曹荣强¹,周纯葆^1,^*()

1.中国科学院计算机网络信息中心，北京 100083
2.中国科学院大学，北京 101408

收稿日期:2024-06-30 出版日期:2025-02-20 发布日期:2025-02-21
通讯作者: *周纯葆（E-mail: zhoucb@cnic.cn）
作者简介:王子健，中国科学院计算机网络信息中心，硕士研究生，主要研究方向为并行计算。
本文中主要负责方法研究、论文撰写与实验设计。
WANG Zijian, is a master’s student of the Computer Network Information Center, Chinese Academy of Sciences. His main research direction is parallel computing.
In this paper, he is responsible for method research, paper writing, and experimental design.
E-mail: 1364043791@qq.com|周纯葆，中国科学院计算机网络信息中心，硕士生导师，研究员，博士，主要研究方向为并行计算、人工智能基础算法与软件。
本文中主要负责方法设计和实验指导。
ZHOU Chunbao, Ph.D., is a researcher and master supervisor of the Computer Network Information Center, Chinese Academy of Sciences. His main research directions include parallel computing, basic algorithms and software for artificial intelligence.
In this paper, he is responsible for method design and experimental guidance.
E-mail: zhoucb@cnic.cn
基金资助:
国家电网有限公司总部科技项目(5700-202358842A-4-3-WL)

Research on Checkpoint-Based Elastic Training Methods for Large Language Models

WANG Zijian^1,²(),LI Kai¹,CAO Rongqiang¹,ZHOU Chunbao^1,^*()

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
2. University of Chinese Academy of Sciences, Beijing 101408, China

Received:2024-06-30 Online:2025-02-20 Published:2025-02-21

摘要/Abstract

摘要：

【目的】鉴于大语言模型在训练过程中对计算资源的巨大需求，分布式训练框架的弹性训练方法至关重要，确保模型能够在资源变更时做出弹性调整来保证训练顺利进行。【方法】对于这种迫切需求，本文提出了面向大语言模型的3D并行训练和显存优化弹性训练机制，在深度学习框架Megatron-DeepSpeed中引入了基于检查点技术的弹性训练方法，为框架增添了弹性功能。【结果】本文提出的基于检查点的弹性训练方法适配于不同资源配置的模型训练，通过对LLaMA1模型实施3D并行训练和显存优化的弹性策略，设置8组实验比较资源改变后损失值变化、作业完成时间、显存分配占比等实验指标。【结论】实验结果表明资源变化后不同并行度下的模型损失值相近，验证了模型在不同资源配置下的可扩展性，模型在资源紧张情况下能保持模型训练的连续性，模型在资源充足情况下增加计算资源以及并行度能显著提高训练速度和性能。

关键词: 大语言模型, 分布式训练, 弹性训练, 检查点

Abstract:

[Objective] In view of the huge demand for computing resources in training the large language models, the elastic training method of distributed training framework is crucial to ensure that the model can make elastic adjustments when resources change to ensure smooth training. [Methods] To meet this urgent need, this article proposes an elastic training mechanism for 3D parallel training and memory optimization, and introduces an elastic training method based on checkpoint into the deep learning framework Megatron-DeepSpeed. [Results] The checkpoint-based elastic training method proposed in this paper is suitable for model training with different resource allocations. By implementing 3D parallel training and memory optimized elastic training strategy on the LLaMA1 model, 8 groups of experiments are set for comparison. The experimental indexes such as loss value change, job completion time, and memory allocation are compared after elastic resource change. [Conclusions] The experimental results show that the model loss value is similar under different parallel degrees after elastic resource changes, which confirms the scalability of the model under different resource configurations. The model can maintain the continuity of model training under resource constraints, and the training speed and performance of the model can be significantly improved by increasing computing resources and parallel degrees under sufficient resources.

Key words: large language model, distributed training, elastic training, checkpoint

王子健, 李凯, 曹荣强, 周纯葆. 基于检查点的大模型弹性训练方法研究[J]. 数据与计算发展前沿, 2025, 7(1): 135-151.

WANG Zijian, LI Kai, CAO Rongqiang, ZHOU Chunbao. Research on Checkpoint-Based Elastic Training Methods for Large Language Models[J]. Frontiers of Data and Computing, 2025, 7(1): 135-151, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2025.01.010.

图/表 20

图1

表1

表2

图2

图3

图4

图5

图6

表3

表4

表5

表6

图7

图8

图9

图10

图11

表7

图12

表8

参考文献 14

[1]	TOUVRON H, LAVRIL T, IZACARD G, et al. Llama: Open and efficient foundation language models[EB/OL]. [2023-02-27]. https://arxiv.org/abs/2302.13971.
[2]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// 31st Conference on Neural Information Processing Systems(NIPS 2017), California, USA: Curran Associates, 2017: 5998-6008.
[3]	PASZKE A, GROSS S, MASSA F, et al. Pyto-rch: An imperative style, high-performance dee-p learning library[J]. Advances in neural infor-mation processing systems, 2019, 32: 8026-8037.
[4]	SERGEEV A, DEL BALSO M. Horovod: fast a-nd easy distributed deep learning in TensorFlo-w[EB/OL]. [2020-08-10]. https://arxiv.org/abs/1802.05799.
[5]	BI R, XU T, XU M, et al. PaddlePaddle: A P-roduction-Oriented Deep Learning Platform Fac-ilitating the Competency of Enterprises[C]// 2022IEEE 24th Int Conf on High Performance Co-mputing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on S-mart City; 8th Int Conf on Dependability in S-ensor, Cloud & Big Data Systems & Applicati-on (HPCC/DSS/SmartCity/DependSys), IEEE, 2022: 92-99.
[6]	MAI L, Li G, WAGENLÄNDER M, et al. {K-ungFu}: Making training in distributed machine learning adaptive[C]// 14th USENIX Symposiumon Operating Systems Design and Implementation (OSDI 20), 2020: 937-954.
[7]	QIAO A, CHOE S K, SUBRAMANYA S J, et al. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning[C]// 15th {USENIX} Symposium on Operating Systems Design and Implementation (OSDI’21), 2021:1.
[8]	GUO R B, GUO V, KIM A, et al. Hydrozoa: Dynamic Hybrid-Parallel DNN Training on Serverless Containers[C]// Machine Learning and Systems 4, 2020: 779-794.
[9]	HE C, LI S, SOLTANOLKOTABI M, et al. PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transfor-mers[EB/OL].[2021-02-12]. https://arxiv.org/abs/2102.03161.
[10]	SHOEYBI M, PATWARY M, PURI R, et al. Megatron-lm: Training multi-billion parameter l-anguage models using model parallelism[EB/OL]. [2020-03-13]. https://arxiv.org/abs/1909.08053.
[11]	RAJBHANDARI S, RASLEY J, RUWASE O, et al. Zero: Memory optimizations toward train-ing trillion parameter models[C]//SC20:Interna-tional Conference for High Performance Comp-uting, Networking, Storage and Analysis, IEEE, 2020: 1-16.
[12]	DEAN J, CORRADO G, MONGA R, et al. L-arge scale distributed deep networks[J]. Advan-ces in neural information processing systems, 2012, 25: 1223-1231.
[13]	HUANG Y, CHENG Y, BAPNA A, et al. Gpi-pe: Efficient training of giant neural networks using pipeline parallelism[J]. Advances in neur-al information processing systems, 2019, 32: 103-112.
[14]	Microsoft Research. DeepSpeed: Extreme-scale model training for everyone[EB/OL]. (2020-09-10)[2024-06-30]. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/.

参数	数值	解释
NLAYERS	1或2	隐藏层层数
NHIDDEN	768	隐藏层神经元量
NHEADS	16	多扣注意力头数
SEQ_LEN	2,048	输入序列长度
NUM_KV_HEADS	4	键值头个数
GLOBAL_BATCH_SIZE	8	全局批次大小
LR	6e-4	学习率

参数名	数量	起始位置
word_embeddings.weight(WE)	42467,328	0
2.self_attention.query.weight(2.q)	589,824	42467,328
2.self_attention.key_value.weight(2.kv)	294,912	43057,152
2.self_attention.dense.weight(2.d)	589,824	43352,064
2.mlp.dense_h_to_4h.weight(2.mlp0)	3145,728	43941,888
2.mlp.dense_4h_to_h.weight(2.mlp1)	1572,864	47087,616
2.input_layernorm.weight(2.il)	768	0
2.post_attention_layernorm.weight(2.pl)	768	768
Weight（w）	768	1536

参数列表	数据量
optimizer_state.state.0.exp_avg	torch.Size([48660480])
optimizer_state.state.0.exp_avg_sq	torch.Size([48660480])
optimizer_state.state.1.exp_avg	torch.Size([2304])
optimizer_state.state.1.exp_avg_sq	torch.Size([2304])
single_partition_of_fp32_groups_0	torch.Size([48660480])
single_partition_of_fp32_groups_1	torch.Size([2304])

Fp32参数列表	模型具体层参数
single_partition_of_fp32_groups_0	word_embeddings.weight(WE)
	2.self_attention.query.weight(2.q)
	2.self_attention.key_value.weight(2.kv)
	2.self_attention.dense.weight(2.d)
	2.mlp.dense_h_to_4h.weight(2.mlp0)
	2.mlp.dense_4h_to_h.weight(2.mlp1)
single_partition_of_fp32_groups_1	2.input_layernorm.weight(2.il)
	2.post_attention_layernorm.weight(2.pl)
	Weight（w）

文件名	参数名(简称)	偏移量(存储位置)
ZeRO00	WE	(numel=12165120， start=0)
	2.il	(numel=768， start=0)
ZeRO01	WE	(numel=12165120， start=0)
	2.il	(numel=768， start=0)
ZeRO02	WE	(numel=12165120， start=0)
	3.il	(numel=768， start=0)
	3.pl	(numel=384， start=768)
ZeRO03	WE	(numel=12165120， start=0)
	3.il	(numel=768， start=0)
	3.pl	(numel=384， start=768)
ZeRO10	WE	(numel=9068544， start=0)
	2.q	(numel=294912， start=9068544)
	2.kv	(numel=147456， start=9363456)
	2.d	(numel=294912， start=9510912)
	2.mlp0	(numel=1572864， start=9805824)
	2.mlp1	(numel=786432， start=11378688)
	2.pl	(numel=768， start=0)
ZeRO11	WE	(numel=9068544， start=0)
	2.q	(numel=294912， start=9068544)
	2.kv	(numel=147456， start=9363456)
	2.d	(numel=294912， start=9510912)
	2.mlp0	(numel=1572864， start=9805824)
	2.mlp1	(numel=786432， start=11378688)
	2.pl	(numel=768， start=0)
ZeRO12	WE	(numel=9068544， start=0)
	3.q	(numel=294912， start=9068544)
	3.kv	(numel=147456， start=9363456)
	3.d	(numel=294912， start=9510912)
	3.mlp0	(numel=1572864， start=9805824)
	3.mlp1	(numel=786432， start=11378688)
	3.pl	(numel=384, start=0)
	4.w	(numel=768, start=384)
ZeRO13	WE	(numel=9068544， start=0)
	3.q	(numel=294912， start=9068544)
	3.kv	(numel=147456， start=9363456)
	3.d	(numel=294912， start=9510912)
	3.mlp0	(numel=1572864， start=9805824)
	3.mlp1	(numel=786432， start=11378688)
	3.pl	(numel=384, start=0)
	4.w	(numel=768, start=384)

基于检查点的大模型弹性训练方法研究

Research on Checkpoint-Based Elastic Training Methods for Large Language Models

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 20

参考文献 14

相关文章 4

编辑推荐

Metrics

本文评价

类别	属性	规模/详情
硬件	节点数	1,000
	海光GPU显卡	1,000×4
	CPU显卡	1,000×30
	Home磁盘内存容量/GB	300
	Work磁盘内存容量/TB	100
软件	操作系统	Linux
	深度学习框架	Megatron-Deepspeed/Pytorch框架
	编译器	Vscode
	作业调度系统	Slurm
	MPI并行	mpi4y 3.1.4
	作业最大运行时间/h	24

实验类别	作业完成时间/s	显存分配/MB	第500步训练损失
ZeRO_0，Layer=1	526	1655.14843	6.21089
ZeRO_0，Layer=2	669	1739.98144	6.04959
ZeRO_0+DP， Layer=1	278	1655.14648	6.20906
ZeRO_0+TP， Layer=1	256	834.17968	6.26059
ZeRO_0+PP， Layer=2	535	1229.98618	6.05821
ZeRO_0+3D， Layer=2	196	1229.98618	6.10161
ZeRO_1，Layer=1	514	1655.14843	6.21089
ZeRO_1，Layer=2	664	1738.98144	6.04943
ZeRO_1+DP， Layer=1	276	1375.57080	6.21090
ZeRO_1+TP， Layer=1	249	833.17968	6.23539
ZeRO_1+PP， Layer=2	528	1229.98618	6.04942
ZeRO_1+3D， Layer=2	191	962.96187	6.04502

模型参数量	实验类别	MFU
NHIDDEN=768/NHEADS=16/SEQ_LEN=2048	ZeRO_0，Layer=1	20.73%
	ZeRO_0，Layer=2	19.07%
	ZeRO_0+DP， Layer=1	20.57%
	ZeRO_0+TP， Layer=1	22.47%
	ZeRO_0+PP， Layer=2	12.84%
	ZeRO_0+3D， Layer=2	10.04%
	ZeRO_1，Layer=1	20.21%
	ZeRO_1，Layer=2	18.97%
	ZeRO_1+DP， Layer=1	19.41%
	ZeRO_1+TP， Layer=1	21.87%
	ZeRO_1+PP， Layer=2	11.97%
	ZeRO_1+3D， Layer=2	10.51%
NHIDDEN=4096/NHEADS=32/ SEQ_LEN=2048/Layer=32(7B)	ZeRO_1_DP=64/TP=4/PP=1	19.25%
	ZeRO_1_DP=32/TP=4/PP=1	20.22%
	ZeRO_1_DP=16/TP=4/PP=1	21.90%
	ZeRO_1_DP=8/TP=4/PP=1	22.80%

[1]	韦一金, 陈彦清, 王秀东, 樊景超. 基于大语言模型的《中国小麦品种志》信息提取[J]. 数据与计算发展前沿, 2025, 7(1): 175-185.
[2]	马秋平, 张琪, 赵晓凡. 图表问答研究综述[J]. 数据与计算发展前沿, 2025, 7(1): 19-37.
[3]	裴炳森,李欣,蒋章涛,刘明帅. 基于大语言模型的司法文本摘要生成与评价技术研究[J]. 数据与计算发展前沿, 2024, 6(6): 62-73.
[4]	韦一金, 樊景超. 基于ChatGLM2-6B的农业政策问答系统[J]. 数据与计算发展前沿, 2024, 6(4): 116-127.