Frontiers of Data and Computing ›› 2020, Vol. 2 ›› Issue (1): 55-69.

doi: 10.11871/jfdc.issn.2096-742X.2020.01.005

Special Issue: High-Performance and High-Throughput Computing and Applications

Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing Systems

Qin Xiaoning1, Wang Jiayao2, Hu Menglong2, Su Yong2, Wan Wei2, Li Bin2, Dai Rong2, Wang Zhipeng3, Ji Qing2,*

  1. Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016, China
  2. Dawning Information Industry Co., Beijing 100193, China
  3. The High School Affiliated to Renmin University, Beijing 100080, China
  • Received: 2019-11-22 Online: 2020-02-20 Published: 2020-03-28
  • Contact: Ji Qing E-mail: jiqing@sugon.com

Abstract:

[Objective] As super-large-scale computing systems become increasingly common, a series of challenges has emerged, such as processing massive volumes of monitoring data, keeping job scheduling stable and flexible, and managing the complexity and efficiency of the interconnection fabric. This paper summarizes experience and solutions from recent projects in these three areas. [Context] Computing systems are moving from petascale to exascale, and a single system can easily exceed 10 000 nodes. The network topology must be selected at the start of system design, while efficient scheduling and timely monitoring are non-trivial problems during operation. [Methods] To address these challenges, this paper adopts a distributed monitoring architecture with dynamic load balancing and a cache-sensitive distributed alarm architecture. It also quantitatively simulates the performance of different nD-Torus topologies. [Results] The data show that, for a computing system of about 10 000 nodes, the real-time alarm database table can be kept within one million entries. The optimized SLURM scheduling system meets business-level requirements. For the network, the 6D-Torus topology outperforms the 3D-Torus and fat-tree topologies in the number of switches and cables required and in efficiency, owing to its smaller network diameter and shorter average communication distance. The saturated throughput of the 6D-Torus topology can reach 40%. [Conclusions] The distributed monitoring and alarm architectures effectively solve the problem of processing massive monitoring data. After optimization, SLURM successfully performs job scheduling on a super-large computing system. Compared with the fat-tree and 3D-Torus topologies, the 6D-Torus is a better choice for super-large computing systems.
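
The diameter and average-distance argument can be illustrated with a minimal Python sketch. It is not the simulation used in the paper: the two torus shapes below are hypothetical, chosen only so that both contain 4096 nodes, and the functions count hops on an ideal wrap-around torus without modelling traffic, routing, or throughput.

```python
from math import prod

def torus_diameter(dims):
    """Largest shortest-path hop count between any two nodes of a torus.

    dims holds the ring size of each dimension; with wrap-around links the
    farthest point on a ring of size k is k // 2 hops away.
    """
    return sum(k // 2 for k in dims)

def ring_mean_distance(k):
    """Mean hop distance from one node to every node of a k-node ring (itself included)."""
    return sum(min(d, k - d) for d in range(k)) / k

def torus_mean_distance(dims):
    """Mean hop distance over all ordered node pairs (self-pairs included).

    Torus distance is the sum of independent per-dimension ring distances,
    so the mean is the sum of the per-ring means.
    """
    return sum(ring_mean_distance(k) for k in dims)

# Hypothetical shapes (assumptions, not taken from the paper), 4096 nodes each.
shape_3d = (16, 16, 16)
shape_6d = (4, 4, 4, 4, 4, 4)

for name, shape in (("3D-Torus", shape_3d), ("6D-Torus", shape_6d)):
    print(name, "nodes:", prod(shape),
          "diameter:", torus_diameter(shape),
          "mean distance:", torus_mean_distance(shape))
```

Under these assumed shapes the 6D torus halves both the diameter and the mean hop distance relative to the 3D torus at the same node count, which is the geometric effect the abstract attributes the 6D-Torus advantage to.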

Key words: computing, monitoring, job scheduling, network