Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (1): 57-67.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.01.006

doi: 10.11871/jfdc.issn.2096-742X.2024.01.006

• Technology and Application •

Operation Status Diagnosis System for the National High-Performance Computing Environment

ZHAO Yining*, XIAO Haili

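  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-09-19 Online: 2024-02-20 Published: 2024-02-21
  • Corresponding author: * ZHAO Yining (E-mail: zhaoyn@sccas.cn)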
  • About the author: ZHAO Yining, Ph.D., is a senior engineer at the Computer Network Information Center, Chinese Academy of Sciences. His research interests include distributed systems and big-data analysis.
    In this paper, he is responsible for writing the paper, researching the analysis methods, and developing the system.
    E-mail: zhaoyn@sccas.cn
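  • Funding:
    National Key Research and Development Program of China project “Research on Service Mechanisms and Support Systems for the National High-Performance Computing Environment (Phase II)” (2018YFB0204000)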

A Monitoring and Diagnosis System for CNGrid

ZHAO Yining*, XIAO Haili

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2022-09-19 Online: 2024-02-20 Published: 2024-02-21

Abstract:

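[Objective] This paper describes a method for building an operation status diagnosis system in a large-scale distributed runtime environment. [Application background] To keep a high-performance computing environment running stably, analyzing environment data such as logs is an important way to profile the state of the environment and to discover anomalies. However, the analysis results are usually text and numbers, which give operation and maintenance staff no intuitive impression and are hard to grasp quickly. [Methods] We built the operation status diagnosis system for the national high-performance computing environment. It quantifies and visually assesses the operation status of the target computing environment: it collects and organizes information from the target environment and then analyzes it item by item from different angles. [Results] The individual analysis results are integrated into a unified operation status score for the environment, which is presented from multiple dimensions with visualization methods so that operation and maintenance staff can take in environment information intuitively and locate problems quickly. [Conclusions] Most of the processing and analysis in the whole workflow is done automatically by programs, so the diagnosis system greatly reduces manual work and effectively supports operation and maintenance.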

Keywords: status diagnosis, data processing, quantification, visualization application, high-performance computing environment

Abstract:

[Objective] This paper presents a monitoring and diagnosis system for large-scale distributed computing environments. [Context] To improve services, support the stable operation of a high-performance computing environment, and avoid malfunctions caused by errors and failures, it is necessary to collect information such as logs from the environment so that the environment can be profiled and anomalies can be found. However, the analysis results are usually text and numbers, which are not easy for operators to understand quickly. [Methods] This paper describes the monitoring and diagnosis system of CNGrid, which assesses the operation status of the monitored environment through quantification and visualization. It gathers data from CNGrid and analyzes them from several angles. [Results] The analysis results are transformed into rating scores and visualized figures that enable operators to quickly identify the causes and locations of anomalies. [Conclusions] Most of the processing is performed automatically by the system, which greatly reduces manual effort and effectively supports operation and maintenance work.

Key words: system diagnosing, data processing, quantification, visualization, HPC environment
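
The abstract says that per-aspect analysis results are combined into a single rating for the environment, but it does not say how. The sketch below is only an illustration of that idea, assuming a weighted average over scores normalized to 0-100. The class `ItemScore`, the functions `overall_score` and `weakest_item`, the aspect names, and the weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemScore:
    """One per-aspect analysis result (all names here are hypothetical)."""
    name: str
    score: float   # assumed to be normalized to the range 0-100
    weight: float  # assumed relative importance of this aspect


def overall_score(items: list[ItemScore]) -> float:
    """Combine per-aspect scores into one rating by weighted average.

    The weighted average is an assumption made for illustration; the
    paper's actual integration rule is not given in the abstract.
    """
    if not items:
        raise ValueError("at least one item score is required")
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(item.score * item.weight for item in items) / total_weight


def weakest_item(items: list[ItemScore]) -> ItemScore:
    """Return the lowest-scoring aspect, a natural first place to look."""
    return min(items, key=lambda item: item.score)


if __name__ == "__main__":
    # Made-up example values, not measurements from CNGrid.
    items = [
        ItemScore("log_anomalies", 80.0, 2.0),
        ItemScore("node_availability", 95.0, 1.0),
        ItemScore("job_failures", 50.0, 1.0),
    ]
    print(overall_score(items))       # (160 + 95 + 50) / 4 = 76.25
    print(weakest_item(items).name)   # job_failures
```

A single number is convenient for a dashboard, while the lowest-scoring aspect points an operator toward the likely location of a problem. Other aggregation rules (for example, taking the minimum) would fit the same interface.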