Frontiers of Data and Computing, 2020, Vol. 2, Issue (1): 55-69. doi: 10.11871/jfdc.issn.2096-742X.2020.01.005

• Special Issue: High-Performance and High-Throughput Computing and Applications

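Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing System

Qin Xiaoning1, Wang Jiayao2, Hu Menglong2, Su Yong2, Wan Wei2, Li Bin2, Dai Rong2, Wang Zhipeng3, Ji Qing2,*

  1. Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016, China
  2. Dawning Information Industry (Beijing) Co., Ltd., Beijing 100193, China
  3. The High School Affiliated to Renmin University of China, Beijing 100080, China
  • Received: 2019-11-22  Online: 2020-02-20  Published: 2020-03-28
  • Corresponding author: Ji Qing  E-mail: jiqing@sugon.com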
  • About the authors:
    Qin Xiaoning received her PhD from the Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics. Her main research interests are system architecture and integration, and high-performance computing. In this paper she undertook the design and testing of the system's key nodes and components. E-mail: qxn@sugon.com
    Wang Jiayao is a senior engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interests are the monitoring and operation of large-scale clusters. In this paper he guided the code development of the Gridview framework. E-mail: wangjya@sugon.com
    Hu Menglong is an engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interests are large-scale and high-throughput cluster management and job scheduling. In this paper he undertook the scheduling-related optimization. E-mail: huml1@sugon.com
    Su Yong, PhD, is an engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interests include computer architecture and high-performance interconnection networks. In this paper he undertook the design and implementation of the network architecture. E-mail: sy.pass@163.com
    Wan Wei is the vice general manager of the Sugon HPC product division and a lead engineer. His main research interests are computer networks, operating systems, security, and machine learning. In this paper he undertook the network architecture design. E-mail: wanwei@sugon.com
    Li Bin is the general manager of the HPC products division of Dawning Information Industry Co., Ltd. He has comprehensive knowledge of HPC architecture and the related software and hardware technologies, a deep understanding of HPC applications across many fields and industries, and distinctive insight into the current state and trends of the HPC industry. In this paper he undertook system planning, scheme design, and implementation and optimization. E-mail: libin@sugon.com
    Dai Rong is the chief engineer of the HPC products division of Dawning Information Industry Co., Ltd. His main research interests are multi-domain HPC solutions and construction plans for large-scale HPC centers. In this paper he undertook system planning and scheme design. E-mail: dair@sugon.com
    Wang Zhipeng is a senior teacher at the High School Affiliated to Renmin University of China. His main research interest is neurobiology. In this paper he was responsible for the system co-design discussion. E-mail: wangzhipeng@rdfz.cn
    Ji Qing is the chief scientist of the HPC products division of Dawning Information Industry Co., Ltd. Her main research interests are HPC applications, promotion, and system co-design. In this paper she undertook the project application and summary, and the overall coordination of the paper.
  • Funding:
    National Key Research and Development Program of China (2018YFB0204400)

Abstract:

[Objective] As super-large-scale computing systems become increasingly common, they raise a series of practical challenges: storms of monitoring data, the stability and flexibility of job scheduling, and the complexity and efficiency of the interconnection network. This paper shares experience and solutions from recent real-world projects in these three areas. [Context] As computing systems move from petascale to exascale, the number of nodes can exceed 10 000. The network topology must be chosen at the very start of system design, and once the system is in operation it depends on efficient scheduling and timely monitoring. [Methods] This paper adopts a distributed monitoring architecture based on dynamic load balancing, a cache-based distributed alarm architecture, source-code and configuration optimizations of SLURM, and a simulation-based comparison of nd-Torus network topologies. Together, these techniques largely meet the needs of production use. [Results] The data show that, for a computing system of about 10 000 nodes, the real-time alarm database table can generally be kept within one million records. The optimized SLURM scheduling system meets the system's business-level scheduling requirements. On the network side, the 6D-Torus, owing to its smaller network diameter and shorter average communication distance, improves somewhat on both the Fat-Tree and the 3D-Torus in performance and in network adapter and cable usage, with a saturation throughput exceeding 40%. [Conclusions] The distributed monitoring and alarm architectures effectively solve the monitoring data storm problem. After optimization, SLURM can schedule jobs on super-large-scale computing systems. In terms of the number of cables and switches, the 6D-Torus is more economical than the traditional Fat-Tree network, and it outperforms the 3D-Torus, making it better suited to super-large-scale computing systems.

Key words: computing, monitoring, job scheduling, network
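
The abstract attributes the 6D-Torus advantage to a smaller network diameter and a shorter average communication distance. The following minimal Python sketch illustrates that geometric point only. It is not the simulation used in the paper: the torus sizes (16x16x16 and 4x4x4x4x4x4, both 4096 nodes) are hypothetical values chosen so that the node counts match, and the function names `ring_distance` and `torus_metrics` are illustrative.

```python
from itertools import product
from math import prod
from typing import Tuple


def ring_distance(a: int, b: int, k: int) -> int:
    """Shortest hop count between positions a and b on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)


def torus_metrics(dims: Tuple[int, ...]) -> Tuple[int, float]:
    """Return (diameter, average hop distance) for a torus with the given sizes.

    A torus looks the same from every node, so distances measured from the
    origin are representative. The average is taken over all other nodes,
    excluding the origin itself.
    """
    nodes = prod(dims)
    # The farthest node is half-way around the ring in every dimension.
    diameter = sum(k // 2 for k in dims)
    # Brute-force sum of hop counts from the origin to every coordinate.
    total = sum(
        sum(ring_distance(0, x, k) for x, k in zip(coord, dims))
        for coord in product(*(range(k) for k in dims))
    )
    return diameter, total / (nodes - 1)


# Hypothetical configurations with the same node count (4096 each).
torus_3d = (16, 16, 16)
torus_6d = (4, 4, 4, 4, 4, 4)

for name, dims in (("3D-Torus", torus_3d), ("6D-Torus", torus_6d)):
    diameter, avg = torus_metrics(dims)
    print(f"{name} {dims}: nodes={prod(dims)}, diameter={diameter}, avg hops={avg:.2f}")
```

Under these assumed sizes the 3D-Torus has a diameter of 24 hops and the 6D-Torus a diameter of 12, with the average hop distance roughly halved as well. Hop counts are only one input to performance; figures such as saturation throughput depend on routing, traffic pattern, and link bandwidth, which this sketch does not model.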