Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis

doi:10.11871/jfdc.issn.2096-742X.2023.06.009

Frontiers of Data and Computing, 2023, Vol. 5, Issue 6: 94-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.06.009

WEI Ting*, PENG Liang, NIU Tie, ZHANG Honghai

Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China

Received: 2022-09-08 Online: 2023-12-20 Published: 2023-12-25
Corresponding author: *WEI Ting (E-mail: weiting@cnic.cn)
About the author: WEI Ting, Ph.D., is a senior engineer at the Computer Network Information Center, Chinese Academy of Sciences. Her main research interests include data analysis, high-performance computing, and cluster monitoring and analysis. In this paper, she is responsible for writing the paper, analyzing the HPC cluster monitoring data, researching the job failure detection algorithm, and the root cause analysis.

Abstract

[Background] In high-performance computing (HPC) systems, detecting abnormal computing jobs and the reasons they exit earlier and faster helps users shorten troubleshooting time and make more effective use of expensive computing resources. [Objective] This work aims to provide early warning of computing job anomalies, quickly locate the root causes of job failures, and improve the user experience. [Methods] Based on the monitoring data of a very large supercomputing cluster, this paper analyzes, for specific applications, the relationship between runtime features and the success or failure of computing jobs. The Isolation Forest algorithm is used to detect anomalies in the running state of the compute nodes on which a job runs and to predict whether the job will fail. A root cause map of HPC job failures is then constructed through feature analysis combined with logs and other fault data. [Results] Numerical analysis of the algorithm shows that Isolation Forest can predict job failures fairly accurately. The root cause map, built from association analysis of application runtime features, integrates well all the factors that influence job execution and resource usage and shows the causal relationships among them. [Conclusions] This research can help administrators and users of high-performance computing systems, especially very large supercomputing systems, detect job anomalies as early as possible and quickly obtain evidence for locating the problem, which is of great significance for reducing wasted computing resources and improving computing efficiency.

Keywords: high-performance computing, feature analysis, machine learning, Isolation Forest, detection, root cause analysis

WEI Ting, PENG Liang, NIU Tie, ZHANG Honghai. Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis[J]. Frontiers of Data and Computing, 2023, 5(6): 94-103, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.06.009.

Figures and tables (10): Table 1, Figure 1, Algorithm 1, Algorithm 2, Figure 2, Figure 3, Figure 4, Table 2, Figure 5, Figure 6


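Category	Application feature	Description
CPU	idle	Share of time idle
CPU	sys	Share of time in kernel mode
CPU	user	Share of time in user mode (utilization)
CPU	iowait	Share of time waiting for I/O
Memory	mem_used	Amount of memory in use
Memory	Swap in	Rate at which data is read from swap into RAM
Memory	Swap out	Rate at which data is written from RAM to swap
IB	port_xmit_packets	Packets sent per second
IB	port_rcv_packets	Packets received per second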
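
The per-node features listed above are the kind of measurements an Isolation Forest can score. The block below is a minimal pure-Python sketch of the Isolation Forest idea, not the authors' implementation. The feature values are synthetic, and the names `fit_forest`, `anomaly_score`, `NORMAL_MEAN`, `NORMAL_SD`, and the `stalled` sample are hypothetical choices made for illustration. Points that random splits isolate after only a few cuts receive scores close to 1.

```python
import math
import random

# Feature order follows the table above; every value below is synthetic.
FEATURES = ["idle", "sys", "user", "iowait", "mem_used",
            "swap_in", "swap_out", "port_xmit_packets", "port_rcv_packets"]
NORMAL_MEAN = [5.0, 3.0, 90.0, 2.0, 60.0, 0.5, 0.5, 1000.0, 1000.0]
NORMAL_SD = [1.0, 0.5, 2.0, 0.5, 3.0, 0.1, 0.1, 50.0, 50.0]


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    # Only features that still vary in this subset can separate points.
    dims = [d for d in range(len(rows[0]))
            if min(r[d] for r in rows) < max(r[d] for r in rows)]
    if not dims:
        return {"size": len(rows)}
    dim = rng.choice(dims)
    split = rng.uniform(min(r[dim] for r in rows), max(r[dim] for r in rows))
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([r for r in rows if r[dim] < split], depth + 1, max_depth, rng),
        "right": build_tree([r for r in rows if r[dim] >= split], depth + 1, max_depth, rng),
    }


def path_length(tree, row, depth=0):
    """Depth at which row reaches a leaf, adjusted for points left unsplit in that leaf."""
    if "dim" not in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if row[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], row, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng) for _ in range(n_trees)]
    return trees, size


def anomaly_score(trees, size, row):
    """Score in (0, 1); shorter average paths give scores closer to 1."""
    mean_path = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))


# 200 synthetic "healthy" node samples plus one made-up stalled node.
rng = random.Random(42)
rows = [[rng.gauss(m, s) for m, s in zip(NORMAL_MEAN, NORMAL_SD)] for _ in range(200)]
stalled = [55.0, 15.0, 5.0, 25.0, 120.0, 10.0, 10.0, 10.0, 10.0]
rows.append(stalled)

trees, size = fit_forest(rows)
scores = [anomaly_score(trees, size, r) for r in rows]
print(f"stalled node score: {scores[-1]:.2f}, highest normal score: {max(scores[:-1]):.2f}")
```

In practice a library implementation would normally be preferred over a hand-written forest; the sketch only makes the isolation mechanism visible. How a score is turned into a failure prediction (the threshold) is a separate choice that the sketch does not make.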

Test set	Class	precision	recall	f1-score	support	confusion matrix
Test set 1	-1	0.60	0.18	0.28	33	[[6 27] [4 4658]]
Test set 1	1	0.99	1.00	1.00	4662	[[6 27] [4 4658]]
Test set 2	-1	0.50	0.89	0.64	9	[[8 1] [8 252]]
Test set 2	1	1.00	0.97	0.98	260	[[8 1] [8 252]]
Test set 3	-1	0.20	1.00	0.33	5	[[5 0] [20 330]]
Test set 3	1	1.00	0.94	0.97	350	[[5 0] [20 330]]
Test set 4	-1	0.17	0.67	0.27	3	[[2 1] [10 4505]]
Test set 4	1	1.00	1.00	1.00	4515	[[2 1] [10 4505]]
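
The precision, recall, and f1-score columns above can be recomputed from the confusion matrices. The sketch below assumes that rows are true classes and columns are predicted classes, both ordered as [-1, 1]; this orientation is inferred from the support column, which equals the row sums (for example 6 + 27 = 33). The function name `class_report` is hypothetical.

```python
def class_report(matrix, labels=(-1, 1)):
    """Per-class (precision, recall, f1, support) from a square confusion matrix.

    matrix[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j] (assumed orientation).
    """
    report = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        support = sum(matrix[k])                   # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), support)
    return report


test_sets = {
    "Test set 1": [[6, 27], [4, 4658]],
    "Test set 2": [[8, 1], [8, 252]],
    "Test set 3": [[5, 0], [20, 330]],
    "Test set 4": [[2, 1], [10, 4505]],
}
for name, matrix in test_sets.items():
    print(name, class_report(matrix))
```

Reading the table this way makes the class imbalance explicit: in test set 1 only 6 of the 33 samples in class -1 are recovered (recall 0.18), while in test set 3 all 5 are recovered at the cost of 20 samples of class 1 being assigned to class -1 (precision 0.20).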
