The Design and Implementation of a Monitoring System for Super-Large Computing Cluster

doi:10.11871/jfdc.issn.2096-742X.2023.01.009

Abstract

[Background] Traditional cluster monitoring systems cannot meet the performance, flexibility, and scalability requirements of multi-cluster environments and super-large-scale clusters with more than 10,000 nodes. [Objective] A new monitoring system is urgently needed to improve the management capability and efficiency of such clusters. [Methods] This paper uses message-oriented middleware, a distributed monitoring architecture, and a REST API to implement a monitoring system for these clusters. [Results] The system supports user-defined metrics, real-time active data pushing, and automatic alarms, and it is highly extensible. It has been deployed on several computing clusters and meets the monitoring needs of a cluster with more than 10,000 nodes and devices, collecting more than 200 GB of data per day. [Limitations] Because of the large number of metric types and the massive volume of monitoring data, the ability to perform correlation analysis for specific business scenarios still needs improvement. [Conclusions] The work presented in this paper meets the need for automated management of super-large computing clusters and multi-cluster systems. It can serve as a reference for developing management tools for even larger computing clusters and for exascale computing systems.

Key words: super-large scale, computing, cluster, HPC, monitoring

PENG Liang, NIU Tie, WEI Baoliang, ZHAO Yi. The Design and Implementation of a Monitoring System for Super-Large Computing Cluster[J]. Frontiers of Data and Computing, 2023, 5(1): 97-103, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.01.009.

Figures/Tables (9): Fig. 1, Fig. 2, Fig. 3, Fig. 4, Table 1, Fig. 5, Fig. 6, Fig. 7, Fig. 8

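Subscribed topic	Metric type	Corresponding table name
R.*.load.cpu	CPU information	load-cpu
R.*.load.meminfo	Memory information	load-meminfo
R.*.load.net	Network information	load-net
R.*.load.dskio	Disk read/write information	load-dskio
R.*.load.df	Disk usage	load-df
R.*.sysinfo.sysinfo	Basic system information	sysinfo-sysinfo
R.*.proc.proclist	Process information	proc-proclist
R.*.load.infiniband	InfiniBand high-speed network	infiniband-info
R...info	User-defined collection items	*-info
R...syslog	Event log information	Distributed database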
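
The topic-to-table mapping in the table above can be illustrated with a minimal Python sketch of how a consumer might route an incoming message to its storage table. This is not the system's actual implementation. It assumes that `*` in a subscribed topic matches exactly one dot-separated token (a common convention in message-oriented middleware, but not confirmed by the source), and the names `ROUTES`, `matches`, `table_for`, and the node identifier `node0001` are hypothetical. The last two rows of the table are omitted because their topic patterns are not fully legible.

```python
from typing import Optional

# Hypothetical routing table built from the first eight rows of the table above.
ROUTES = [
    ("R.*.load.cpu", "load-cpu"),
    ("R.*.load.meminfo", "load-meminfo"),
    ("R.*.load.net", "load-net"),
    ("R.*.load.dskio", "load-dskio"),
    ("R.*.load.df", "load-df"),
    ("R.*.sysinfo.sysinfo", "sysinfo-sysinfo"),
    ("R.*.proc.proclist", "proc-proclist"),
    ("R.*.load.infiniband", "infiniband-info"),
]


def matches(pattern: str, subject: str) -> bool:
    """Return True if subject matches pattern.

    Assumption: "*" stands for exactly one non-empty dot-separated token.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    if len(p_tokens) != len(s_tokens):
        return False
    return all(
        (p == "*" and s != "") or p == s
        for p, s in zip(p_tokens, s_tokens)
    )


def table_for(subject: str) -> Optional[str]:
    """Return the table name for a subject, or None if no topic matches."""
    for pattern, table in ROUTES:
        if matches(pattern, subject):
            return table
    return None


if __name__ == "__main__":
    # "node0001" is a made-up node identifier used only for illustration.
    print(table_for("R.node0001.load.cpu"))
```

Keeping the mapping in a single list means a new metric type could be supported by adding one entry, which is in the spirit of the user-defined metrics the abstract describes.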
