Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (1): 135-151.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.010

doi: 10.11871/jfdc.issn.2096-742X.2025.01.010

• Technology and Application •

Research on Checkpoint-Based Elastic Training Methods for Large Language Models

WANG Zijian1,2, LI Kai1, CAO Rongqiang1, ZHOU Chunbao1,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 101408, China
  • Received: 2024-06-30  Online: 2025-02-20  Published: 2025-02-21
  • Corresponding author: *ZHOU Chunbao (E-mail: zhoucb@cnic.cn)
  • About the authors: WANG Zijian is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research direction is parallel computing.
    In this paper, he is responsible for method research, paper writing, and experimental design.
    E-mail: 1364043791@qq.com
    ZHOU Chunbao, Ph.D., is a researcher and master's supervisor at the Computer Network Information Center, Chinese Academy of Sciences. His main research directions include parallel computing and basic algorithms and software for artificial intelligence.
    In this paper, he is responsible for method design and experimental guidance.
    E-mail: zhoucb@cnic.cn
  • Funding: Science and Technology Project of the Headquarters of State Grid Corporation of China (5700-202358842A-4-3-WL)

Abstract:

[Objective] Given the enormous demand for computing resources when training large language models, an elastic training capability in the distributed training framework is crucial: it allows the model to adjust elastically when resources change so that training can proceed smoothly. [Methods] To meet this need, this paper proposes an elastic training mechanism for large language models covering 3D parallel training and GPU memory optimization, and introduces a checkpoint-based elastic training method into the deep learning framework Megatron-DeepSpeed, adding elastic capability to the framework. [Results] The proposed checkpoint-based elastic training method adapts to model training under different resource configurations. By applying the 3D parallel training and memory-optimized elastic strategies to the LLaMA1 model, eight groups of experiments are set up to compare metrics such as the change in loss value, job completion time, and the proportion of allocated GPU memory after an elastic resource change. [Conclusions] The experimental results show that, after resources change, the model loss values under different degrees of parallelism are similar, which verifies the scalability of the model under different resource configurations. The model can maintain training continuity when resources are scarce, and when resources are sufficient, increasing the computing resources and the degree of parallelism significantly improves training speed and performance.

Key words: large language model, distributed training, elastic training, checkpoint
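
The sketch below is intended only to make the checkpoint-based resumption idea in the [Methods] part concrete. It is a minimal, hypothetical example written against the public DeepSpeed engine API (save_checkpoint / load_checkpoint); the model, configuration, paths, and the tag "latest" are placeholders, and none of this is the authors' Megatron-DeepSpeed implementation, which additionally handles the layout of 3D parallel (data, tensor, and pipeline) states when the resource configuration changes.

# Hypothetical sketch of the general pattern behind checkpoint-based elastic
# training: train, periodically save a checkpoint, and let a relaunched job
# (possibly with a different number of GPUs) resume from it. It uses only the
# public DeepSpeed engine API; it is NOT the paper's Megatron-DeepSpeed
# implementation and does not show 3D parallelism or the remapping of states
# across changed parallel degrees.
import torch
import torch.nn as nn
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
model = nn.Linear(1024, 1024)  # placeholder model, not LLaMA1
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

ckpt_dir = "./elastic_ckpt"  # placeholder checkpoint directory
# Try to resume; load_checkpoint returns (None, None) when no checkpoint exists.
load_path, client_state = engine.load_checkpoint(ckpt_dir, tag="latest")
start_step = client_state.get("step", 0) if client_state else 0

for step in range(start_step, start_step + 100):
    x = torch.randn(4, 1024, device=engine.device)
    loss = engine(x).pow(2).mean()  # dummy loss for illustration only
    engine.backward(loss)
    engine.step()
    if (step + 1) % 50 == 0:
        # Periodic checkpoint; a relaunched job resumes from here via the
        # load_checkpoint call above.
        engine.save_checkpoint(ckpt_dir, tag="latest", client_state={"step": step + 1})

Such a script would typically be started with the DeepSpeed launcher, e.g. deepspeed --num_gpus=8 train_sketch.py (a hypothetical file name), and relaunched after a resource change with a different --num_gpus. Whether the saved states can be consumed directly under the new configuration depends on how they were partitioned, which is precisely the gap the paper's checkpoint-based elastic method is designed to close.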