Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (1): 97-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.01.009

doi: 10.11871/jfdc.issn.2096-742X.2023.01.009

• Technology and Application •

The Design and Implementation of a Monitoring System for Super-Large Computing Cluster

PENG Liang*, NIU Tie, WEI Baoliang, ZHAO Yi

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-01-27 Online: 2023-02-20 Published: 2023-02-20

Abstract:

[Background] Traditional cluster monitoring systems cannot meet the performance, flexibility, and scalability requirements of multi-cluster environments and super-large-scale clusters with more than 10,000 nodes. [Objective] A new monitoring system is urgently needed to improve the management capability and efficiency of such clusters. [Methods] This paper uses message-oriented middleware, a distributed monitoring architecture, and a REST API to implement a monitoring system for these clusters. [Results] The system supports user-defined metrics, real-time active data sending, and automatic alarms, and it is highly extensible. It has been deployed on several computing clusters and meets the monitoring needs of a cluster with more than 10,000 nodes and devices, collecting more than 200 GB of data per day. [Limitations] Because of the many kinds of monitoring metrics and the large volume of monitoring data, the system's ability to perform correlation analysis on data for specific business scenarios still needs improvement. [Conclusions] The work presented in this paper meets the need for automated management of super-large computing clusters and multi-cluster systems. It can serve as a reference for developing management tools for even larger computing clusters and for exascale computing systems.

Key words: super-large scale, computing, cluster, HPC, monitoring