Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (1): 97-103.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.01.009

doi: 10.11871/jfdc.issn.2096-742X.2023.01.009

• Technology and Application •

The Design and Implementation of a Monitoring System for Super-Large Computing Cluster

PENG Liang*, NIU Tie, WEI Baoliang, ZHAO Yi

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-01-27 Online: 2023-02-20 Published: 2023-02-20
  • Corresponding author: * PENG Liang (E-mail: pengliang@cnic.cn)
  • About the author: PENG Liang is an engineer in the Science and Technology Cloud Operation and Technology Development Department, Computer Network Information Center, Chinese Academy of Sciences. Her research interests include high-performance computing systems and cluster monitoring and analysis.
    In this paper, she is responsible for the development and design of the monitoring system.
    E-mail: pengliang@cnic.cn
  • Funding:
    Strategic Priority Research Program of the Chinese Academy of Sciences (Category A) (XDA19020101)

Abstract:

[Background] Traditional cluster monitoring software cannot meet the performance, flexibility, and scalability requirements of monitoring and managing multi-cluster systems and super-large-scale computing clusters with more than 10,000 nodes. [Objective] A new cluster monitoring system is urgently needed to improve the operation and management capability and efficiency of such clusters. [Methods] This paper adopts a hierarchical (central-and-branch) distributed monitoring architecture and uses message-oriented middleware, distributed storage, and REST technology to implement a monitoring system for super-large-scale computing clusters. [Results] The system supports user-defined monitoring metrics, real-time active data push, and automatic alerting, and it scales out well. It has been deployed on several computing clusters, meets the monitoring needs of more than 10,000 nodes and devices, and collects more than 200 GB of data per day on average. [Limitations] Because the monitoring metrics are numerous and the monitoring data volume is large, the system's ability to perform correlation analysis for specific business scenarios needs improvement. [Conclusions] This work meets the automated operation and management needs of super-large-scale computing clusters and geographically distributed multi-cluster systems. The approach can serve as a reference for developing operation and management tools for even larger clusters, including exascale computing systems.

Key words: super-large scale, computing, cluster, HPC, monitoring
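
The abstract names three mechanisms: user-defined metrics, active data push through message middleware, and automatic alerting. The following is a minimal sketch of how those pieces could fit together, not the authors' implementation. It assumes an in-process `queue.Queue` as a stand-in for the message middleware, and the `Sample` schema, the function names, the metric names, and the threshold values are all hypothetical.

```python
from __future__ import annotations

import json
import queue
import time
from dataclasses import dataclass, asdict


@dataclass
class Sample:
    """One metric reading pushed by a node-side agent (hypothetical schema)."""
    cluster: str
    node: str
    metric: str   # user-defined metric name
    value: float
    ts: float     # collection time, seconds since the epoch


def push_sample(bus: queue.Queue, sample: Sample) -> None:
    """Agent side: serialize the sample and actively push it onto the bus."""
    bus.put(json.dumps(asdict(sample)))


def drain_and_check(bus: queue.Queue, thresholds: dict[str, float]) -> list[str]:
    """Collector side: consume every queued sample and return an alert message
    for each reading above the threshold configured for its metric."""
    alerts = []
    while True:
        try:
            raw = bus.get_nowait()
        except queue.Empty:
            break
        s = Sample(**json.loads(raw))
        limit = thresholds.get(s.metric)
        # Metrics without a configured threshold are collected but never alert.
        if limit is not None and s.value > limit:
            alerts.append(f"{s.cluster}/{s.node}: {s.metric}={s.value} exceeds {limit}")
    return alerts


# Stand-in for the message middleware (assumption: a real deployment would use
# a networked broker rather than an in-process queue).
bus = queue.Queue()
now = time.time()
push_sample(bus, Sample("cluster-a", "node0001", "cpu_temp_c", 91.5, now))
push_sample(bus, Sample("cluster-a", "node0002", "cpu_temp_c", 63.0, now))
push_sample(bus, Sample("cluster-b", "node0001", "gpu_util_pct", 99.0, now))

alerts = drain_and_check(bus, {"cpu_temp_c": 85.0})
print(alerts)
```

Decoupling agents from the collector through a queue is what lets the collector side scale out independently of the number of nodes. In this sketch only the first sample triggers an alert, because `gpu_util_pct` has no configured threshold.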