Research on Supercomputer Job Running State Prediction Based on XGBoost Model

doi:10.11871/jfdc.issn.2096-742X.2024.06.012

Abstract

[Background] In high-performance computing (HPC) systems, jobs may fail or exit abnormally after running for some time, so the computing resources they consumed yield no usable results. [Objective] Detecting abnormal job states and issuing early warnings allows users and administrators to intervene sooner, reduce wasted resources, and trace the causes of anomalies earlier and more effectively. [Methods] Using real monitoring data from large supercomputing clusters, the XGBoost algorithm is applied to detect anomalies in the running state of each type of job and to predict, from the job's running state and characteristics, whether the job will fail. [Results] A comparative analysis of the algorithms shows that XGBoost predicts job failure more accurately. [Conclusions] This work explores a new approach to anomaly detection and early warning for HPC jobs, which can help users make more efficient use of expensive supercomputing resources.
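
The abstract does not give implementation details. As a rough illustration of the second-order (Newton) boosting idea that XGBoost is built on, the following pure-Python sketch fits depth-1 trees (stumps) with logistic loss. It is neither the XGBoost library nor the authors' pipeline: the function names, the two example features, the toy values, and the hyperparameters (`lam`, `eta`, `rounds`) are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, g, h, lam=1.0):
    """Return the single split with the highest second-order gain, or None."""
    G, H = sum(g), sum(h)
    best = None  # (gain, feature, threshold, left_weight, right_weight)
    for j in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][j])
        GL = HL = 0.0
        for k in range(len(order) - 1):
            i = order[k]
            GL += g[i]
            HL += h[i]
            lo, hi = X[i][j], X[order[k + 1]][j]
            if lo == hi:
                continue  # no threshold fits between equal values
            GR, HR = G - GL, H - HL
            gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
            if best is None or gain > best[0]:
                # Regularized Newton step for each leaf: w = -G / (H + lam)
                best = (gain, j, (lo + hi) / 2, -GL / (HL + lam), -GR / (HR + lam))
    return best


def train(X, y, rounds=20, eta=0.3):
    """Boost stumps on the logistic loss; y holds 0 (normal) or 1 (abnormal)."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        p = [sigmoid(s) for s in scores]
        g = [pi - yi for pi, yi in zip(p, y)]  # first derivative of the loss
        h = [pi * (1.0 - pi) for pi in p]      # second derivative of the loss
        stump = fit_stump(X, g, h)
        if stump is None:
            break
        _, j, thr, wl, wr = stump
        stumps.append((j, thr, eta * wl, eta * wr))
        scores = [s + (eta * wl if x[j] < thr else eta * wr)
                  for s, x in zip(scores, X)]
    return stumps


def predict_proba(stumps, x):
    """Probability that a job with feature vector x ends abnormally."""
    return sigmoid(sum(wl if x[j] < thr else wr for j, thr, wl, wr in stumps))


# Made-up feature vectors [iowait, mem_ratio]; not data from the paper.
X = [[0.02, 0.30], [0.05, 0.35], [0.03, 0.40], [0.04, 0.25],
     [0.40, 0.90], [0.35, 0.85], [0.50, 0.95], [0.45, 0.80]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = train(X, y)
print(predict_proba(model, [0.42, 0.88]))  # high: resembles the abnormal jobs
print(predict_proba(model, [0.03, 0.30]))  # low: resembles the normal jobs
```

A production setup would more likely use the XGBoost library itself, which adds deeper trees, column subsampling, and further regularization on top of this basic update.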

Key words: HPC, job status prediction, machine learning

JI Peng, NIU Tie, WEI Ting, PENG Liang. Research on Supercomputer Job Running State Prediction Based on XGBoost Model[J]. Frontiers of Data and Computing, 2024, 6(6): 123-129, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.06.012.

Figures/Tables (9): Table 1, Table 2, Fig.1 to Fig.7

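Category	Name	Meaning
CPU	iowait	I/O wait time
CPU	idle	Proportion of time the CPU is idle
CPU	Sys	Proportion of time spent in kernel mode
memory	mem_used	Memory in use
memory	mem_ratio	Memory ratio
IB	port_rcv_packets	Receive (download) rate
IB	port_xmit_packets	Packets sent per second
IB	port_xmit_wait	Transmit waits per second
Disk	disk_all_read	Read rate
Disk	disk_all_write	Write rate
Disk	disk_all_svctm	I/O service time per second
Disk	disk_all_await	I/O wait time per second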
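
The feature table above lists per-node monitoring metrics, but this page does not say how the time series of each job is turned into model inputs. One common option, shown here purely as an assumption, is to collapse each metric into summary statistics per job. The `summarize` helper, the choice of mean and maximum, and the sample values are hypothetical; only the metric names come from the table.

```python
from statistics import mean

# Metric names as listed in the feature table.
METRICS = [
    "iowait", "idle", "Sys",
    "mem_used", "mem_ratio",
    "port_rcv_packets", "port_xmit_packets", "port_xmit_wait",
    "disk_all_read", "disk_all_write", "disk_all_svctm", "disk_all_await",
]


def summarize(samples):
    """Collapse a job's per-interval metric readings into mean/max features.

    `samples` is a list of dicts, one per monitoring interval. Metrics that
    never appear in the samples are skipped rather than filled with a guess.
    """
    features = {}
    for name in METRICS:
        values = [s[name] for s in samples if name in s]
        if values:
            features[f"{name}_mean"] = mean(values)
            features[f"{name}_max"] = max(values)
    return features


# Made-up readings for one job over three intervals (not from the paper).
job_samples = [
    {"iowait": 0.25, "mem_used": 4.0, "disk_all_read": 10.0},
    {"iowait": 0.75, "mem_used": 8.0, "disk_all_read": 30.0},
    {"iowait": 0.50, "mem_used": 6.0, "disk_all_read": 20.0},
]

features = summarize(job_samples)
print(features["iowait_mean"], features["mem_used_max"])
```

Each job then becomes one fixed-length row, which is the shape a tree-based classifier such as XGBoost expects.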

Class	Training set	Test set
Completed normally	870	222
Completed abnormally	721	176
Total	1591	398
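
As a quick sanity check on the split table above, the short sketch below recomputes the totals, the share of jobs held out for testing, and the accuracy a trivial "always predict the majority class" baseline would reach on the test set. The counts are taken from the table; the variable names and the choice of baseline are illustrative assumptions, not part of the paper.

```python
# Job counts per class, copied from the split table.
split = {
    "normal": {"train": 870, "test": 222},
    "abnormal": {"train": 721, "test": 176},
}

train_total = sum(c["train"] for c in split.values())  # 1591
test_total = sum(c["test"] for c in split.values())    # 398

# Fraction of all jobs that ended up in the test set.
test_share = test_total / (train_total + test_total)

# Share of abnormal jobs in each partition, to check the split is balanced.
abnormal_rate_train = split["abnormal"]["train"] / train_total
abnormal_rate_test = split["abnormal"]["test"] / test_total

# Accuracy of always predicting the larger class on the test set.
majority_baseline = max(c["test"] for c in split.values()) / test_total

print(f"test share:        {test_share:.3f}")
print(f"abnormal (train):  {abnormal_rate_train:.3f}")
print(f"abnormal (test):   {abnormal_rate_test:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The split is close to 80/20, and the two classes are roughly balanced in both partitions, so a useful classifier needs to clear a test accuracy of about 0.558.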
