Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (6): 123-129.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.06.012

doi: 10.11871/jfdc.issn.2096-742X.2024.06.012

Research on Supercomputer Job Running State Prediction Based on XGBoost Model

JI Peng1,2,*, NIU Tie1, WEI Ting1, PENG Liang1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. School of Computer Science and Technology, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-03-07 Online: 2024-12-20 Published: 2024-12-20
  • Contact: JI Peng, E-mail: jipeng150065@163.com

Abstract:

[Background] In high-performance computing (HPC) systems, jobs may fail or exit abnormally after running for some time, so computing resources are consumed without producing usable results. [Objective] Detecting abnormal job states and issuing early warnings lets users and administrators intervene in advance, reduces wasted resources, and makes it possible to trace and analyze the causes of anomalies sooner. [Methods] Using real monitoring data from large supercomputing clusters, the XGBoost algorithm is applied to detect anomalies in the running state of each type of job and to predict, from the job's running state and characteristics, whether the job will fail. [Results] A comparison of the algorithms shows that XGBoost predicts job failure more accurately. [Conclusions] This work explores a new approach to anomaly detection and early warning for HPC jobs, which can help users make more efficient use of expensive supercomputing resources.

Key words: HPC, job status prediction, machine learning