The Design and Implementation of a Monitoring System for Super-Large Computing Cluster

doi:10.11871/jfdc.issn.2096-742X.2023.01.009

Frontiers of Data and Computing, 2023, Vol. 5, Issue 1: 97-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.01.009

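PENG Liang*, NIU Tie, WEI Baoliang, ZHAO Yi

Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China

Received: 2022-01-27 Online: 2023-02-20 Published: 2023-02-20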
Corresponding author: * PENG Liang (E-mail: pengliang@cnic.cn)
About the author: PENG Liang is an engineer in the Science and Technology Cloud Operation and Technology Development Department, Computer Network Information Center, Chinese Academy of Sciences. Her research interests include high-performance computing systems, cluster monitoring, and analysis. In this paper, she was responsible for the development and design of the monitoring system.
Funding:
Strategic Priority Research Program of the Chinese Academy of Sciences (Category A) (XDA19020101)

Abstract

[Background] Traditional cluster monitoring software cannot meet the performance, flexibility, and scalability requirements of monitoring and managing super-large-scale computing clusters with more than 10,000 nodes, or of multi-cluster systems. [Objective] A new cluster monitoring system is urgently needed to improve the operation and management capability and efficiency of such clusters. [Methods] This paper adopts a hierarchical central-and-branch architecture and uses message-oriented middleware, distributed storage, and REST technology to implement a monitoring system for super-large-scale computing clusters. [Results] The system supports user-defined monitoring metrics, active data push, and automatic alerting, and it scales out well. It has been deployed on several computing clusters, meets the monitoring needs of more than 10,000 nodes and devices, and collects more than 200 GB of data per day on average. [Limitations] Because the monitoring metrics are numerous and the data volume is large, the ability to perform data correlation analysis for specific business scenarios still needs improvement. [Conclusions] The work meets the automated operation and management needs of super-large-scale computing clusters and geographically distributed multi-cluster systems, and the approach can serve as a reference for developing operation and management tools for even larger clusters, including exascale computing systems.

Keywords: super-large scale, computing, cluster, HPC, monitoring
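
To put the reported data volume in perspective, the following back-of-envelope sketch converts the figures from the abstract into average ingest rates. It is only an illustration under stated assumptions: exactly 10,000 nodes, exactly 200 GB per day, decimal units (1 GB = 10^9 bytes), and a load spread evenly over nodes and over the day. The function name `ingest_estimate` is hypothetical and does not come from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_estimate(daily_bytes: float, node_count: int) -> dict:
    """Return average ingest rates for a given daily volume and node count.

    Assumes the load is spread evenly across nodes and across the day,
    which real monitoring traffic will only approximate.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return {
        "cluster_bytes_per_second": daily_bytes / SECONDS_PER_DAY,
        "node_bytes_per_day": daily_bytes / node_count,
        "node_bytes_per_second": daily_bytes / node_count / SECONDS_PER_DAY,
    }


# Assumed inputs: 200 GB/day (decimal) and 10,000 nodes, the lower bounds
# quoted in the abstract.
estimate = ingest_estimate(200e9, 10_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the cluster-wide average is roughly 2.3 MB per second and each node contributes about 20 MB per day. Both are lower bounds, because the abstract states "more than" for both the node count and the daily volume.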

Cite this article:

PENG Liang, NIU Tie, WEI Baoliang, ZHAO Yi. The Design and Implementation of a Monitoring System for Super-Large Computing Cluster[J]. Frontiers of Data and Computing, 2023, 5(1): 97-103, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.01.009.

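Table 1

Subscription topic	Metric type	Corresponding table name
R.*.load.cpu	CPU information	load-cpu
R.*.load.meminfo	Memory information	load-meminfo
R.*.load.net	Network information	load-net
R.*.load.dskio	Disk read/write information	load-dskio
R.*.load.df	Disk usage	load-df
R.*.sysinfo.sysinfo	Basic system information	sysinfo-sysinfo
R.*.proc.proclist	Process information	proc-proclist
R.*.load.infiniband	InfiniBand high-speed network	infiniband-info
R...info	Custom collection items	*-info
R...syslog	Event log information	Distributed database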
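
The mapping in the table above can be sketched as a small routing function that picks a storage table for each incoming message topic. This is a minimal illustration, not the authors' implementation. It assumes that `*` matches exactly one dot-separated token, as in common topic-based middleware, and that the second token identifies the sending node. The names `TOPIC_TABLES`, `topic_matches`, `table_for`, and the node name `node001` are hypothetical. The last two rows of the table are left out because their topic patterns are not fully legible in the source.

```python
from typing import Optional

# (subscription pattern, table name) pairs copied from the first eight rows
# of the table; the "R...info" and "R...syslog" rows are intentionally omitted.
TOPIC_TABLES = [
    ("R.*.load.cpu", "load-cpu"),
    ("R.*.load.meminfo", "load-meminfo"),
    ("R.*.load.net", "load-net"),
    ("R.*.load.dskio", "load-dskio"),
    ("R.*.load.df", "load-df"),
    ("R.*.sysinfo.sysinfo", "sysinfo-sysinfo"),
    ("R.*.proc.proclist", "proc-proclist"),
    ("R.*.load.infiniband", "infiniband-info"),
]


def topic_matches(pattern: str, topic: str) -> bool:
    """Assumption: '*' in the pattern matches exactly one dot-separated token."""
    pattern_parts = pattern.split(".")
    topic_parts = topic.split(".")
    if len(pattern_parts) != len(topic_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, topic_parts))


def table_for(topic: str) -> Optional[str]:
    """Return the table name for a concrete topic, or None if nothing matches."""
    for pattern, table in TOPIC_TABLES:
        if topic_matches(pattern, topic):
            return table
    return None


# Hypothetical topic published by a node called "node001".
print(table_for("R.node001.load.cpu"))
```

Keeping the pattern list as data rather than code means a new metric type only needs one more row, which fits the user-defined metrics the abstract describes.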
