Research on Supercomputer Job Running State Prediction Based on XGBoost Model

Frontiers of Data and Computing, 2024, Vol. 6, Issue 6: 123-129.

doi: 10.11871/jfdc.issn.2096-742X.2024.06.012

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.06.012

JI Peng¹,²,*, NIU Tie¹, WEI Ting¹, PENG Liang¹

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2024-03-07 Online: 2024-12-20 Published: 2024-12-20
Corresponding author: JI Peng
About the author: JI Peng is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research interest is data mining. In this paper, he was responsible for writing the paper and for the research on HPC job state prediction. E-mail: jipeng150065@163.com
Funding: Network Security and Informatization Special Project of the Chinese Academy of Sciences (CAS-WX2022GC-0103)

Abstract:

[Background] In high-performance computing (HPC) systems, a job may fail or exit abnormally after running for some time, so computing resources are consumed without producing a useful result. [Objective] Detecting abnormal job states and issuing early warnings lets users and administrators intervene sooner, reduces wasted resources, and makes it easier to trace and analyze the causes of anomalies. [Methods] Using real monitoring data from large supercomputing clusters, this paper applies the XGBoost algorithm to job running states and features in order to detect anomalies in the running state of each type of job and to predict whether a job will fail. [Results] A comparison and analysis of algorithms shows that XGBoost predicts job failure fairly accurately. [Conclusions] This work offers a new line of research for anomaly detection and early warning of HPC jobs and can help users make more efficient use of expensive supercomputing resources.

Key words: high-performance computing (HPC), job status prediction, machine learning
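The abstract describes a boosted-tree classifier that labels a job as normal or abnormal from its monitoring features. The sketch below illustrates the general idea only: it is plain first-order gradient boosting on one-split decision stumps written in standard-library Python, not XGBoost itself (XGBoost additionally uses second-order gradient information and regularization). The two features, the synthetic data generator `make_job`, and all hyperparameters are assumptions made for illustration and do not come from the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lval, rval = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lval) ** 2 for r in left) + sum((r - rval) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lval, rval)
    if best is None:  # every feature is constant: fall back to a no-op stump
        return 0, float("inf"), 0.0, 0.0
    return best[1:]

def train(X, y, rounds=20, lr=0.3):
    """Boost stumps on the negative gradient of the logistic loss, y - p."""
    stumps, scores = [], [0.0] * len(X)
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lval, rval = fit_stump(X, residuals)
        stumps.append((j, thr, lr * lval, lr * rval))
        scores = [s + (lr * lval if row[j] <= thr else lr * rval)
                  for s, row in zip(scores, X)]
    return stumps

def predict_proba(stumps, row):
    """Probability that the job is abnormal (label 1)."""
    return sigmoid(sum(lv if row[j] <= thr else rv for j, thr, lv, rv in stumps))

random.seed(0)

def make_job(abnormal):
    # Synthetic assumption: abnormal jobs tend to show higher iowait.
    iowait = random.uniform(0.3, 0.9) if abnormal else random.uniform(0.0, 0.4)
    mem_ratio = random.uniform(0.2, 0.9)  # uninformative by construction
    return [iowait, mem_ratio], int(abnormal)

jobs = [make_job(i % 2 == 1) for i in range(250)]
train_jobs, test_jobs = jobs[:150], jobs[150:]
model = train([x for x, _ in train_jobs], [y for _, y in train_jobs])
correct = sum((predict_proba(model, x) >= 0.5) == bool(y) for x, y in test_jobs)
accuracy = correct / len(test_jobs)
print(f"held-out accuracy: {accuracy:.2f}")
```

Each round fits a stump to the current residuals and adds a scaled copy of it to the running score, which is the additive structure that boosted-tree methods share. In practice one would use a maintained library rather than this hand-rolled loop.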

Cite this article: JI Peng, NIU Tie, WEI Ting, PENG Liang. Research on Supercomputer Job Running State Prediction Based on XGBoost Model[J]. Frontiers of Data and Computing, 2024, 6(6): 123-129. https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.06.012

Figures/Tables (9): Table 1, Figure 1, Table 2, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7


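Category	Name	Meaning
CPU	iowait	I/O wait time
CPU	idle	Proportion of time the CPU is idle
CPU	Sys	Proportion of time spent in kernel mode
Memory	mem_used	Memory used
Memory	mem_ratio	Memory usage ratio
IB	port_rcv_packets	Packet receive rate (download rate)
IB	port_xmit_packets	Packets sent per second
IB	port_xmit_wait	Transmit waits per second
Disk	disk_all_read	Read rate
Disk	disk_all_write	Write rate
Disk	disk_all_svctm	I/O service time per second
Disk	disk_all_await	I/O wait time per second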
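A classifier needs one fixed-length input per job, while monitoring metrics like those in the table above arrive as repeated readings over the job's lifetime. The following is a minimal sketch of one way to bridge that gap, assuming each job is summarized by the mean and maximum of every metric. The page does not state how the authors built their features, so the function `job_features`, the choice of statistics, and the sample readings are all hypothetical.

```python
from statistics import mean

# Metric names taken from the table above.
METRICS = [
    "iowait", "idle", "Sys",
    "mem_used", "mem_ratio",
    "port_rcv_packets", "port_xmit_packets", "port_xmit_wait",
    "disk_all_read", "disk_all_write", "disk_all_svctm", "disk_all_await",
]

def job_features(samples):
    """Collapse a job's time-ordered samples into one flat feature dict.

    `samples` is a list of dicts mapping metric name -> numeric reading.
    Metrics missing from a sample are skipped; a metric with no readings
    at all yields 0.0 for both statistics.
    """
    features = {}
    for name in METRICS:
        readings = [s[name] for s in samples if name in s]
        features[f"{name}_mean"] = mean(readings) if readings else 0.0
        features[f"{name}_max"] = max(readings) if readings else 0.0
    return features

# Hypothetical readings for one job (values are made up for illustration).
samples = [
    {"iowait": 0.02, "idle": 0.10, "mem_ratio": 0.55},
    {"iowait": 0.30, "idle": 0.60, "mem_ratio": 0.57},
    {"iowait": 0.40, "idle": 0.80, "mem_ratio": 0.58},
]
features = job_features(samples)
print(len(features), features["iowait_mean"], features["idle_max"])
```

Two statistics over twelve metrics give a 24-value vector per job regardless of how long the job ran, which is the shape a tree-based classifier expects.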

Class	Training set	Test set
Completed normally	870	222
Completed abnormally	721	176
Total	1591	398
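The counts in the table above imply roughly an 80/20 train/test split and a moderately balanced pair of classes. The short calculation below derives those proportions, together with the accuracy a trivial classifier would reach on the test set by always predicting the majority class (about 0.56), which is a useful floor when reading any reported accuracy. Only the counts come from the table; the dictionary layout and the function name `split_summary` are illustrative.

```python
# Counts copied from the table above: (training set, test set) per class.
counts = {
    "normal": (870, 222),
    "abnormal": (721, 176),
}

def split_summary(counts):
    """Derive split and class proportions from per-class (train, test) counts."""
    train_total = sum(train for train, _ in counts.values())
    test_total = sum(test for _, test in counts.values())
    return {
        "train_total": train_total,
        "test_total": test_total,
        "test_fraction": test_total / (train_total + test_total),
        "abnormal_share_train": counts["abnormal"][0] / train_total,
        "abnormal_share_test": counts["abnormal"][1] / test_total,
        # Accuracy of always predicting the majority class on the test set.
        "majority_baseline_accuracy": max(t for _, t in counts.values()) / test_total,
    }

summary = split_summary(counts)
for name, value in summary.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```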

