Practices on Monitoring, Scheduling, and Interconnection optimization of Super-Large Computing System

doi:10.11871/jfdc.issn.2096-742X.2020.01.005

Frontiers of Data and Computing, 2020, Vol. 2, Issue (1): 55-69.


Special topic: Special Issue on "High-Performance and High-Throughput Computing and Applications"



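Qin Xiaoning¹, Wang Jiayao², Hu Menglong², Su Yong², Wan Wei², Li Bin², Dai Rong², Wang Zhipeng³, Ji Qing²,*

1. Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016
2. Dawning Information Industry (Beijing) Co., Ltd., Beijing 100193
3. The High School Affiliated to Renmin University of China, Beijing 100080

Received: 2019-11-22 Online: 2020-02-20 Published: 2020-03-28
Corresponding author: Ji Qing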
Author biographies:

Qin Xiaoning received her PhD from the Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics. Her main research interests are system architecture and integration, and high-performance computing. In this paper she was responsible for the design and testing of the system's core nodes and components. E-mail: qxn@sugon.com

Wang Jiayao is a senior engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interest is the monitoring and operation of large-scale clusters. In this paper he directed the code development of the Gridview framework. E-mail: wangjya@sugon.com

Hu Menglong is an engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interests are large-scale and high-throughput cluster management and job scheduling. In this paper he was responsible for the scheduling-related optimization. E-mail: huml1@sugon.com

Su Yong, PhD, is an engineer at Dawning Information Industry (Beijing) Co., Ltd. His main research interests are computer architecture and high-performance interconnection networks. In this paper he was responsible for the design and implementation of the network architecture. E-mail: sy.pass@163.com

Wan Wei is deputy general manager and lead engineer of the Sugon HPC product division. His main research interests are computer networks, operating systems, security, and machine learning. In this paper he was responsible for the network architecture design. E-mail: wanwei@sugon.com

Li Bin is general manager of the Sugon HPC product division. He has comprehensive knowledge of HPC architecture and the related software and hardware technologies, a deep understanding of HPC applications across multiple fields and industries, and distinctive insight into the current state and trends of the HPC industry. In this paper he was responsible for system planning, scheme design, and implementation optimization. E-mail: libin@sugon.com

Dai Rong is chief engineer of the HPC product division of Dawning Information Industry Co., Ltd. His main research interests are multi-domain HPC solutions and construction plans for large-scale HPC centers. In this paper he was responsible for system planning and scheme design. E-mail: dair@sugon.com

Wang Zhipeng is a senior teacher at the High School Affiliated to Renmin University of China. His main research interest is neurobiology. In this paper he took part in the system co-design discussions. E-mail: wangzhipeng@rdfz.cn

Ji Qing is chief scientist of the Sugon HPC product division. Her main research interests are HPC applications, HPC promotion, and HPC co-design. In this paper she was responsible for the project application and summary, and for overall coordination of the paper.
Funding:
National Key Research and Development Program of China (2018YFB0204400)

Practices on Monitoring, Scheduling, and Interconnection optimization of Super-Large Computing System

Qin Xiaoning¹, Wang Jiayao², Hu Menglong², Su Yong², Wan Wei², Li Bin², Dai Rong², Wang Zhipeng³, Ji Qing²,*

1. Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016, China
2. Dawning Information Industry Co., Beijing 100193, China
3. The High School Affiliated to Renmin University, Beijing 100080, China

Received: 2019-11-22 Online: 2020-02-20 Published: 2020-03-28
Contact: Ji Qing

Abstract

[Objective] Super-large-scale computing systems bring practical challenges, including storms of monitoring data, the stability and flexibility of job scheduling, and the complexity and efficiency of the interconnection network. This paper shares experience and solutions from recent real-world practice in these three areas. [Context] As computing systems move from petascale toward exascale, the number of nodes can exceed 10 000. The network topology has to be chosen at the very start of system design, and day-to-day operation depends on efficient scheduling and timely monitoring. [Methods] This paper adopts a distributed monitoring architecture based on dynamic load balancing, a distributed alarm architecture based on a high-speed cache, source-code and configuration optimization of SLURM, and a simulation-based comparison of nd-Torus network topologies. Together these largely meet the needs of actual production use. [Results] The data show that, for a computing system of about 10 000 nodes, the real-time alarm database table can be kept to within roughly one million records. The optimized SLURM scheduling system meets the system's business-level scheduling requirements. On the network side, the 6D-Torus has a smaller network diameter and a shorter average communication distance, so it offers some improvement over both the Fat-Tree and the 3D-Torus in performance and in network adapter and cable usage, with a saturation throughput above 40%. [Conclusions] The distributed monitoring and alarm architectures effectively address the monitoring data storm. After optimization, SLURM can schedule jobs on super-large-scale computing systems. In terms of the number of cables and switches, the 6D-Torus is more economical than the traditional Fat-Tree network, and it outperforms the 3D-Torus, making it better suited to super-large-scale computing systems.

Keywords: computing, monitoring, job scheduling, network

Cite this article:

Qin Xiaoning, Wang Jiayao, Hu Menglong, Su Yong, Wan Wei, Li Bin, Dai Rong, Wang Zhipeng, Ji Qing. Practices on Monitoring, Scheduling, and Interconnection optimization of Super-Large Computing System[J]. Frontiers of Data and Computing, 2020, 2(1): 55-69.

Figures and tables (13)


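Table 2. Relationship between average network distance and topology type

Topology	Average network distance H (P = number of processors)
Hypercube	$\frac{1}{2} \log(P)$
2D-Torus	$\frac{1}{2} P^{1/2}$
3D-Torus	$\frac{3}{4} P^{1/3}$
6D-Torus	$\frac{3}{2} \left(P^{1/6} - P^{-1/6}\right)$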
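As a rough worked example of Table 2, the sketch below evaluates each formula for P = 10 000 processors, the system scale mentioned in the abstract. It assumes the logarithm in the Hypercube row is base 2, since the table does not state the base, and the function name `average_distance` is hypothetical.

```python
import math


def average_distance(topology: str, p: int) -> float:
    """Average network distance H for p processors, per the Table 2 formulas."""
    if topology == "Hypercube":
        # Assumption: base-2 logarithm; the table does not state the base.
        return 0.5 * math.log2(p)
    if topology == "2D-Torus":
        return 0.5 * p ** (1 / 2)
    if topology == "3D-Torus":
        return 0.75 * p ** (1 / 3)
    if topology == "6D-Torus":
        return 1.5 * (p ** (1 / 6) - p ** (-1 / 6))
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for name in ("Hypercube", "2D-Torus", "3D-Torus", "6D-Torus"):
        print(f"{name:10s} H = {average_distance(name, 10_000):.2f}")
```

For P = 10 000 the 2D-Torus formula gives an average distance of 50 and the 3D-Torus about 16, while the 6D-Torus stays below 7, which illustrates the shorter average communication distance that the abstract attributes to the 6D-Torus.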


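Table 3. Theoretical comparison of the 3D-Torus and Fat-Tree network topologies, where N is the number of nodes and f is the optimization factor

Topology	3D Torus	6D Torus	Fat Tree
Type	Direct	Direct	Indirect
Number of switches	$N$	$N$	$N \log_f N$
Number of cables	$3N$	$6N$	$f N \log_f N$
Average latency	$\frac{3}{4} N^{1/3}$	$\frac{3}{2} \left(N^{1/6} - N^{-1/6}\right)$	$2(\log_f N - 1) < L < 2 \log_f N$
Bidirectional bandwidth	$2 N^{2/3}$	$2 N^{1/3}$	$f N$
Nearest-neighbor optimization	Yes	Yes	Yes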
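To make the cost rows of Table 3 concrete, the following sketch evaluates the switch and cable formulas for a given node count N. The value f = 32 in the usage lines is purely illustrative and does not come from the paper, and `topology_cost` is a hypothetical helper name.

```python
import math


def topology_cost(n: int, f: int) -> dict:
    """Switch and cable counts from the Table 3 formulas.

    n: number of nodes (N in the table)
    f: Fat Tree optimization factor (f in the table); any value passed
       here is an illustrative assumption, not a figure from the paper.
    """
    log_f_n = math.log(n, f)  # log base f of N
    return {
        "3D Torus": {"switches": n, "cables": 3 * n},
        "6D Torus": {"switches": n, "cables": 6 * n},
        "Fat Tree": {"switches": n * log_f_n, "cables": f * n * log_f_n},
    }


if __name__ == "__main__":
    for name, cost in topology_cost(10_000, 32).items():
        print(f"{name}: {cost['switches']:.0f} switches, {cost['cables']:.0f} cables")
```

Under these formulas the torus cable counts grow linearly in N, whereas the Fat Tree count carries the additional factor f·log_f N. That difference is what underlies the statement that the 6D-Torus is more economical in cables and switches.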


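Architecture	Theoretical number of supported nodes	Tested number of supported nodes
Single-layer monitoring	1000	1098
Two-layer monitoring	10000	12000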
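The two-layer figures above are consistent with a fan-out in which each second-layer collector handles about as many nodes as a single-layer deployment supports. The sketch below shows one way such a layout could spread load, by assigning each node to the currently least-loaded collector. It is an illustration under stated assumptions, not the paper's implementation: the names `assign_nodes` and `collector_capacity`, the per-collector cap, and the greedy least-loaded policy are all hypothetical, and a truly dynamic scheme would also rebalance as load changes.

```python
import heapq


def assign_nodes(node_ids, num_collectors, collector_capacity=1000):
    """Greedily assign each node to the least-loaded collector.

    collector_capacity defaults to 1000, the theoretical single-layer limit
    in the table above; using it as a per-collector cap is an assumption.
    Returns {collector_index: [node_id, ...]}.
    """
    if len(node_ids) > num_collectors * collector_capacity:
        raise ValueError("not enough collector capacity for all nodes")
    # Min-heap of (current_load, collector_index) pairs.
    heap = [(0, c) for c in range(num_collectors)]
    heapq.heapify(heap)
    assignment = {c: [] for c in range(num_collectors)}
    for node in node_ids:
        load, collector = heapq.heappop(heap)
        assignment[collector].append(node)
        heapq.heappush(heap, (load + 1, collector))
    return assignment


if __name__ == "__main__":
    layout = assign_nodes(list(range(10_000)), num_collectors=10)
    print({c: len(nodes) for c, nodes in layout.items()})
```

With 10 collectors and 10 000 nodes, every collector ends up with 1000 nodes, matching the theoretical two-layer figure in the table.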
