Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (1): 57-67.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.01.006

doi: 10.11871/jfdc.issn.2096-742X.2024.01.006

• Technology and Application •

A Monitoring and Diagnosis System for CNGrid

ZHAO Yining*, XIAO Haili

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-09-19 Online: 2024-02-20 Published: 2024-02-21

Abstract:

[Objective] This paper proposes a monitoring and diagnosis system for large-scale distributed computing environments. [Context] To improve services and support the stable operation of a high-performance computing environment, and to avoid malfunctions resulting from errors and failures, it is necessary to collect information such as logs from the environment so that program execution can be profiled and anomalies can be detected. However, the analyzed data are usually in the form of text and numbers, which are not easy for humans to interpret. [Methods] This paper presents the monitoring and diagnosis system of CNGrid, which assesses the operating status of the monitored environment through quantification and visualization methods. It gathers data from CNGrid and analyzes them from several angles. [Results] The analysis results are transformed into numerical ratings and visualized figures that enable operators to quickly identify the causes and locations of anomalies. [Conclusions] The major processes are performed automatically by the system, which greatly reduces manual effort and supports operation and maintenance work.

Key words: system diagnosing, data processing, quantification, visualization, HPC environment