Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (6): 94-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.06.009

doi: 10.11871/jfdc.issn.2096-742X.2023.06.009


Detection and Root Cause Analysis of HPC Failure Jobs Based on Feature Analysis

WEI Ting*, PENG Liang, NIU Tie, ZHANG Honghai

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-09-08  Online: 2023-12-20  Published: 2023-12-25

Abstract:

[Background] In high-performance computing (HPC) systems, detecting job failures and their causes earlier and faster helps users shorten error-correction time and use expensive computing resources more effectively. [Objective] This paper aims to provide early warning of job anomalies, quickly locate the root causes of job failures, and improve the user experience. [Methods] Based on monitoring data from a very large supercomputing cluster, this paper analyzes, for specific applications, the relationship between a job's running features and its success or failure. The Isolation Forest algorithm is used to detect anomalies in the running state of the compute nodes on which a job is running and thereby predict job failure. From feature analysis, logs, and other fault data, a root cause map of HPC job failures can be constructed. [Results] Numerical analysis of the algorithm shows that Isolation Forest can predict job failures accurately. The root cause map, built on application feature association analysis, integrates the factors that influence job execution and resource usage and shows the relationships among them. [Conclusions] This research can help managers and users of HPC systems, especially ultra-large supercomputing systems, find job anomalies as early as possible and locate problems quickly, which is of great significance for reducing wasted computing resources and improving computing efficiency.

Key words: HPC, feature analysis, machine learning, Isolation Forest, detection, root cause analysis
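
The abstract does not give the features, hyperparameters, or software used, so the following is only a minimal from-scratch sketch of the general Isolation Forest idea it names, not the authors' implementation. The two node metrics (CPU utilization and memory use), the synthetic values, and all function names (`c_factor`, `build_tree`, `path_length`, `fit_forest`, `anomaly_score`) are assumptions made for illustration. Points that random axis-aligned splits isolate after only a few cuts receive scores close to 1; points buried in the bulk of the data score lower.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively partition rows with random axis-aligned splits."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    # Only features that still vary within this node can separate its rows.
    ranges = []
    for f in range(len(rows[0])):
        lo = min(r[f] for r in rows)
        hi = max(r[f] for r in rows)
        if lo < hi:
            ranges.append((f, lo, hi))
    if not ranges:
        return {"size": len(rows)}
    feature, lo, hi = rng.choice(ranges)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, row, depth=0):
    """Depth at which row lands in a leaf, plus a correction for unsplit leaf rows."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(child, row, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, row):
    """Score in (0, 1]; values near 1 mean the row is isolated unusually fast."""
    trees, size = forest
    mean_path = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))


# Hypothetical node samples: (CPU utilization in %, memory in GB). Synthetic data.
data_rng = random.Random(1)
rows = [(data_rng.gauss(60.0, 5.0), data_rng.gauss(32.0, 3.0)) for _ in range(200)]
rows.append((5.0, 120.0))  # one node that is nearly idle yet memory-saturated

forest = fit_forest(rows)
outlier_score = anomaly_score(forest, (5.0, 120.0))
typical_score = anomaly_score(forest, (60.0, 32.0))
print("outlier:", round(outlier_score, 3), "typical:", round(typical_score, 3))
```

In a setting like the one the abstract describes, each row would presumably be a vector of monitoring metrics for a compute node while a job runs, and a high score would be treated as an early warning that the job may fail. The threshold separating "anomalous" from "normal" is a tuning choice that this sketch leaves open.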