Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (6): 123-129.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.06.012

doi: 10.11871/jfdc.issn.2096-742X.2024.06.012

Research on Supercomputer Job Running State Prediction Based on XGBoost Model

JI Peng1,2,*, NIU Tie1, WEI Ting1, PENG Liang1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-03-07 Online: 2024-12-20 Published: 2024-12-20
  • Contact: JI Peng
  • About the author: JI Peng is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research interest is data mining.
    In this paper, he is responsible for writing the paper and for the research on HPC job state prediction. E-mail: jipeng150065@163.com
  • Supported by:
    Chinese Academy of Sciences Special Project on Network Security and Informatization (CAS-WX2022GC-0103)

Abstract:

[Background] In high-performance computing (HPC) systems, a job may fail or exit abnormally after running for some time, so that computing resources are consumed without producing usable results. [Objective] Detecting abnormal job states and issuing early warnings lets users and administrators intervene sooner, reduces wasted resources, and makes it easier to trace and analyze the causes of the abnormalities. [Methods] Using real monitoring data from large supercomputing clusters, this paper applies the XGBoost algorithm to job running states and job features in order to detect anomalies in the running state of each type of job and to predict whether a job will fail. [Results] A comparison and analysis of algorithms shows that XGBoost predicts job failure fairly accurately. [Conclusions] This work offers a new approach to anomaly detection and early warning for HPC jobs and can help users make more efficient use of expensive supercomputing resources.

Key words: high-performance computing (HPC), job status prediction, machine learning