Frontiers of Data and Computing (数据与计算发展前沿), 2023, Vol. 5, Issue (6): 94-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.06.009

doi: 10.11871/jfdc.issn.2096-742X.2023.06.009


Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis

WEI Ting*, PENG Liang, NIU Tie, ZHANG Honghai

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-09-08 Online: 2023-12-20 Published: 2023-12-25
  • Corresponding author: WEI Ting (E-mail: weiting@cnic.cn)
  • About the author: WEI Ting, Ph.D., is a senior engineer at the Computer Network Information Center, Chinese Academy of Sciences. Her main research interests include data analysis, high-performance computing, and cluster monitoring and analysis.
    In this paper, she is responsible for writing the manuscript, analyzing HPC cluster monitoring data, researching the job failure detection algorithm, and performing the root cause analysis.

Abstract:

[Background] In high-performance computing (HPC) systems, detecting abnormal computing jobs and the reasons they exit earlier and faster helps users shorten troubleshooting time and use expensive computing resources more effectively. [Objective] This work aims to provide early warning of job anomalies, quickly locate the root causes of job failures, and improve the user experience. [Methods] Based on monitoring data from a very large supercomputing cluster, this paper analyzes, for specific applications, the relationship between runtime features and the success or failure of computing jobs. The Isolation Forest algorithm is used to detect anomalies in the running state of the compute nodes on which a job runs and to predict whether the job will fail. Feature analysis is combined with logs and other fault data to construct a root cause map of HPC job failures. [Results] Numerical analysis of the algorithm shows that Isolation Forest predicts job failures fairly accurately. The root cause map, built from association analysis of application runtime features, integrates all the influencing factors of job execution and resource usage and shows the causal relationships among them. [Conclusions] This research can help administrators and users of HPC systems, especially very large supercomputing systems, detect job anomalies as early as possible and quickly obtain evidence for locating problems, which is important for reducing wasted computing resources and improving computing efficiency.

Key words: HPC, feature analysis, machine learning, Isolation Forest, detection, root cause analysis
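
For readers unfamiliar with the Isolation Forest technique named in the abstract, the following is a minimal, from-scratch sketch of the general algorithm applied to per-node monitoring samples. It is not the authors' implementation: the two features (CPU and memory utilization), the synthetic values, and all function names are assumptions chosen purely for illustration. The idea is that a sample separated from the rest by only a few random splits receives a score closer to 1 and is therefore more likely to be anomalous.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # a constant feature cannot be split
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf, adjusted for unsplit leaf contents."""
    if "size" in tree:
        return depth + avg_path_length(tree["size"])
    branch = "left" if x[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Score each row in (0, 1); higher means easier to isolate."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(rows, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = avg_path_length(psi)
    return [
        2.0 ** (-sum(path_length(t, x) for t in trees) / (n_trees * norm))
        for x in rows
    ]


# Hypothetical data: (cpu_util, mem_util) for nodes of healthy jobs,
# plus one node that is nearly idle on CPU while memory is exhausted.
normal = [(0.80 + 0.01 * i, 0.40 + 0.02 * j)
          for i in range(10) for j in range(4)]
suspect = (0.05, 0.98)
samples = normal + [suspect]

scores = anomaly_scores(samples)
most_anomalous = max(range(len(samples)), key=scores.__getitem__)
print("most anomalous sample:", samples[most_anomalous])
```

In practice a maintained library implementation would usually be preferable to hand-written trees, and the choice of runtime features and score threshold would have to come from the cluster's own monitoring data.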