Frontiers of Data and Computing ›› 2026, Vol. 8 ›› Issue (3): 203-216.

doi: 10.11871/jfdc.issn.2096-742X.2026.03.017

• Technology and Application •

A Survey of Checkpointing Techniques for Large-Scale Language Models

ZHANG Chao1,2, LI Yanghao1,2, LI Kai1, WANG Zijian1, WANG Yangang1,2, CAO Rongqiang1,2,*

  1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2 University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2025-08-20 Online: 2026-06-20 Published: 2026-06-18
  • Contact: CAO Rongqiang E-mail: czhang@cnic.cn; caorq@sccas.cn

Abstract:

[Objective] This study systematically reviews the state of the art and development trends of checkpointing techniques for large language model (LLM) training, summarizes recent research advances, and outlines future research directions. [Methods] We analyze the full LLM checkpointing lifecycle and review research advances in four core technologies: asynchronous checkpointing, compression strategies, fault-tolerance mechanisms, and compatibility across heterogeneous resources and frameworks. [Results] Current research has established a multi-modal optimization framework based on hierarchical storage architectures and incremental checkpointing, which has substantially improved checkpoint storage efficiency, recovery latency, and system stability. However, existing approaches remain limited in storage efficiency, fault tolerance, I/O overhead, and adaptability to practical large-scale distributed training, and do not yet fully meet the rapidly growing demands of cutting-edge LLM development. [Conclusions] To the best of our knowledge, this work is the first to systematically review the key components and core research advances across the full lifecycle of checkpointing for LLM training. Future research should prioritize flexible and reliable checkpointing solutions, efficient storage optimization, intelligent fault-tolerance mechanisms, and compatibility across heterogeneous resources and frameworks, in order to meet the core demands of large-scale distributed LLM training.

Key words: Large-scale Language Models (LLM), checkpointing techniques, fault tolerance mechanisms, heterogeneous resources and frameworks