Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (1): 135-151.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.010

doi: 10.11871/jfdc.issn.2096-742X.2025.01.010

• Technology and Application •

Research on Checkpoint-Based Elastic Training Methods for Large Language Models

WANG Zijian1,2, LI Kai1, CAO Rongqiang1, ZHOU Chunbao1,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 101408, China
• Received: 2024-06-30  Online: 2025-02-20  Published: 2025-02-21

Abstract:

[Objective] Training large language models demands enormous computing resources, so an elastic training method within the distributed training framework is crucial to ensure that the model can adjust flexibly when resources change and that training proceeds smoothly. [Methods] To meet this need, this paper proposes an elastic training mechanism for 3D parallel training and memory optimization, and introduces a checkpoint-based elastic training method into the deep learning framework Megatron-DeepSpeed. [Results] The proposed checkpoint-based elastic training method is applicable to model training under different resource allocations. By implementing the 3D parallel training and memory-optimized elastic training strategy on the LLaMA1 model, eight groups of comparative experiments are conducted, and indicators such as the change in loss value, job completion time, and memory allocation are compared after elastic resource changes. [Conclusions] The experimental results show that the loss values under different parallel degrees remain similar after elastic resource changes, confirming the scalability of the model under different resource configurations. The method maintains the continuity of training under resource constraints, and under sufficient resources, increasing computing resources and parallel degrees significantly improves training speed and performance.
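To make the checkpoint-based resume pattern that such elastic training relies on concrete, the following is a minimal sketch, not the paper's implementation, using DeepSpeed's public checkpoint API (deepspeed.initialize, save_checkpoint, load_checkpoint). The names CKPT_DIR, get_batches, and the config values are hypothetical placeholders, and the sketch omits the paper's 3D-parallel and memory-optimization specifics; reloading a ZeRO-sharded checkpoint under a changed parallel degree generally requires additional checkpoint handling, which is part of what an elastic training mechanism must address.

    # A minimal sketch (assumed pattern, not the paper's implementation) of
    # checkpoint-based elastic resume with DeepSpeed.
    import deepspeed

    CKPT_DIR = "checkpoints/llama"   # hypothetical checkpoint directory
    SAVE_INTERVAL = 100              # hypothetical save frequency (steps)

    ds_config = {                    # minimal illustrative DeepSpeed config
        "train_micro_batch_size_per_gpu": 4,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 1},
    }

    def get_batches(start_step=0):
        """Hypothetical data source; yields batches, skipping consumed ones."""
        ...

    def train(model):
        # deepspeed.initialize partitions optimizer state across the current
        # world size (set by the launcher), so the same script can be
        # relaunched with a different number of GPUs after a resource change.
        engine, _, _, _ = deepspeed.initialize(
            model=model, model_parameters=model.parameters(), config=ds_config)

        # Resume from the latest checkpoint if one exists; load_checkpoint
        # returns (None, None) when the directory holds no checkpoint.
        _, client_state = engine.load_checkpoint(CKPT_DIR)
        step = (client_state or {}).get("step", 0)

        for batch in get_batches(start_step=step):
            loss = engine(batch)     # assumes the forward pass returns the loss
            engine.backward(loss)
            engine.step()
            step += 1
            if step % SAVE_INTERVAL == 0:
                # client_state carries training progress across restarts
                engine.save_checkpoint(CKPT_DIR, client_state={"step": step})

The key design point is that periodic full-state checkpoints decouple a training job from any fixed set of nodes: when resources shrink or grow, the job is relaunched with a new parallel configuration and continues from the last checkpoint rather than from scratch, which is the continuity property the experiments above measure.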

Key words: large language model, distributed training, elastic training, checkpoint