Frontiers of Data and Computing (数据与计算发展前沿), 2022, Vol. 4, Issue (1): 30-41.

doi: 10.11871/jfdc.issn.2096-742X.2022.01.003

• Special issue: Joint Special Issue of the National Science Data Centers

Intelligent Operation and Maintenance System for High Energy Physics Science Data Center

HU Qingbao1,2,*, ZHENG Wei1,2, WANG Jiarong1,2, WANG Lu1,2, YAN Tian1,2

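  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. National High Energy Physics Science Data Center, Beijing 100049, China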
  • Received: 2021-09-28  Online: 2022-02-20  Published: 2022-03-04
  • Corresponding author: HU Qingbao
  • About the authors:
    HU Qingbao, master's degree, is an assistant research fellow at the Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences. His main research interests include data stream processing, container virtualization, system operation and maintenance monitoring, indexing and querying of massive data, data visualization, and cluster authentication and authorization. In this paper, he is responsible for the architecture design and deployment of the intelligent operation and maintenance system, and for its application to data storage and computing services. E-mail: huqb@ihep.ac.cn
    ZHENG Wei, master's degree, is a senior engineer at the Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences. His main research interests include virtualization and container applications, deployment and monitoring of high-performance computing cluster environments, and infrastructure operation management. In this paper, he is responsible for the application of the intelligent operation and maintenance system to computing services. E-mail: zhengw@ihep.ac.cn
    WANG Jiarong, Ph.D., is an engineer at the Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences. She is in charge of the planning and construction of the network security operation and maintenance system of the National Science Data Center. Her main research interest is network security data analysis. In this paper, she is responsible for the design and implementation of the network security operation and maintenance system. E-mail: wangjr@ihep.ac.cn
    WANG Lu, Ph.D., is an associate research fellow at the Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences. She is responsible for the planning, construction, and optimization of the storage systems of the Computing Center. Her research interests include the application of distributed file systems, cloud storage, and machine learning in high energy physics computing environments. In this paper, she is responsible for detecting anomalous access (I/O) behavior in the storage system. E-mail: lu.wang@ihep.ac.cn
    YAN Tian, Ph.D., is an associate research fellow at the Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences. He is in charge of the planning and construction of the network security framework of the National Science Data Center. His main research interest is network security technology. In this paper, he is responsible for the requirements analysis of the network security operation and maintenance system. E-mail: yant@ihep.ac.cn
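  • Supported by:
    National Natural Science Foundation of China, Young Scientists Fund, "Research on Intelligent Operation and Maintenance Technology for High Energy Physics Computing Platforms Based on Multi-dimensional Data Correlation Analysis" (11805226); National Natural Science Foundation of China, General Program, "Research on the Application of Container Virtualization to High Energy Physics Computing" (11775250)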

Abstract:

[Objective] The operation and maintenance (O&M) environment of the High Energy Physics Science Data Center is complex: it relies on many monitoring tools whose functions overlap and whose data cannot be exchanged, which poses major challenges for daily O&M. To use monitoring data efficiently and improve the data center's O&M capability, this paper implements an intelligent O&M system for the High Energy Physics Science Data Center. [Methods] Combining industrial big data technology, machine learning, and the data center's O&M requirements, this paper designs a general-purpose technical architecture for data center O&M. It describes the system's core functions (collection, analysis, storage, sharing, and visualization of monitoring data) and how they are implemented, and reports how the system is applied in the daily O&M of data storage, computing services, and network security. [Results] The O&M framework designed in this paper is in mature use in the daily O&M of the High Energy Physics Science Data Center and has improved its O&M management capability. [Conclusions] Applying the intelligent O&M system at the High Energy Physics Science Data Center has accelerated the evolution of the value of monitoring data, from persistence and unification toward service-oriented and ecosystem-level use, and has realized a data-driven intelligent O&M ecosystem for the data center.

Key words: big data, data center operation and maintenance, intelligent operation and maintenance system