This special issue was supported by the Strategic Priority Research Program
of the Chinese Academy of Sciences, Grant No. XDC01000000.
[Context] Theoretical analysis, experimental observation, and computational simulation are the three major methods of scientific research. High-performance computing, as one of the main indicators of a country's comprehensive national strength, is of strategic importance. [Methods] This article covers the development of high-performance computing, the construction of high-performance computer environments, and the development of computing applications in China, and reviews the history and main achievements of high-performance computing over the past 20 years. It also introduces the deployment and progress of China's key R&D projects in high-performance computing during the current "13th Five-Year Plan" period. [Conclusions] Based on an analysis of the current state of high-performance computing, the author proposes and discusses strengthening the construction of new-generation computing infrastructure and applications.
[Objective] Based on China's high-performance computer TOP100 ranking published in November 2019, this paper analyzes in depth the current state of high-performance computers in China in terms of overall performance, manufacturers, industry status, and other aspects. [Results] The average Linpack performance of China's TOP100 remains higher than that of the international TOP500, and the entry threshold of the TOP100 still exceeds that of the TOP500. Most systems on the TOP100 are domestic supercomputers built by Sugon and Lenovo, which tie for first place in the number of systems on the list. The dominance of the three leading vendors, Sugon, Lenovo, and Inspur, is expected to continue and strengthen in the near future. Using the performance data of this eighteenth ranking list, the paper also analyzes and predicts the development trend of high-performance computers in China. [Conclusions] According to the new data, we believe that machines with a peak performance of 1 Exaflops will appear between 2020 and 2021, machines with a peak of 10 Exaflops between 2022 and 2023, and machines with a peak of 100 Exaflops between 2024 and 2025.
[Objective] This paper analyzes the main challenges that the rapid growth in the data scale of AI and big data applications poses to computer systems. In view of the development trends of computer systems, it outlines several research directions for high-efficiency computing for AI and big data. [Coverage] The paper extensively surveys the latest research results and challenges, in China and abroad, of big data and artificial intelligence computing on supercomputing and high-performance computing platforms. [Methods] Big data provides increasingly rich training data sets for artificial intelligence, but it also places higher demands on the computing power of computer systems. In recent years, China's supercomputing technology has been at the forefront of the world, providing a powerful computing platform for large-scale big data and artificial intelligence applications. [Results] At present, high-performance computing platforms represented by supercomputers mostly use heterogeneous parallel systems composed of CPUs and accelerators, whose large number of computing cores can provide powerful computing power for AI and big data applications. [Limitations] However, because of the complex architecture, making full use of this computing power and improving computing efficiency remain major challenges. Parallel efficiency is especially hard to improve in the artificial intelligence and big data domains, whose workloads differ from those of scientific computing. [Conclusions] Research on high-performance computing is therefore needed at every level, from underlying resource management, task scheduling, basic algorithm design, and communication optimization up to model parallelization, so that the computational efficiency of artificial intelligence and big data applications on high-performance computing platforms can be improved.
[Objective] This paper introduces the development and performance optimization, especially for parallel computation, of the Earth System Model of the Chinese Academy of Sciences (CAS-ESM). [Methods] Starting from CAS-ESM1.0, a series of computing optimizations were applied to the atmospheric and oceanic component models, including three-dimensional parallel decomposition, a leap finite-difference algorithm, and a communication-avoiding method, which clearly improved the parallel speed and efficiency of the models. A platform for coupling and applying the Earth System Model was also developed. In addition, CAS-ESM2.0 was set up after many aspects of the parameterizations in the component models were improved. [Results] Compared with CAS-ESM1.0, CAS-ESM2.0 shows a distinct advance in both computational efficiency and climate simulation results. As a fully coupled model, it is also able to reproduce the global surface CO2 distribution and its seasonal variation. CAS-ESM2.0 will take part in the CMIP6 experiments, and its simulation data will be openly shared.
[Objective] As super-large-scale computing systems become more and more common, a series of challenges have emerged, such as processing massive monitoring data, the stability and flexibility of job scheduling, and the complexity and efficiency of the interconnect fabric. This paper summarizes the experience and solutions from recent projects in these three areas. [Context] Computing systems have been moving from petascale to exascale, and system size can easily exceed 10 000 nodes. The network topology must be chosen at the very beginning of system design, while during operation efficient scheduling and timely monitoring are non-trivial issues. [Methods] To address these challenges, this paper adopts a distributed monitoring architecture with dynamic load balancing and a cache-sensitive distributed alarm architecture. It also quantitatively simulates the performance of different n-dimensional torus (nD-Torus) topologies. [Results] The data show that for a computing system of about 10 000 nodes, the real-time alarm database table can be kept within one million entries. The optimized SLURM scheduling system can meet business-level requirements. As for the network, the 6D-Torus topology outperforms the 3D-Torus and fat-tree topologies in terms of the number of switches and cables and in efficiency, owing to its smaller network diameter and shorter average communication distance. As a result, the saturated throughput of the 6D-Torus topology can reach 40%. [Conclusions] The distributed monitoring and alarm architectures can effectively solve the problem of processing massive monitoring data. After optimization, SLURM successfully performs job scheduling on super-large computing systems. Compared with the fat-tree and 3D-Torus topologies, the 6D-Torus is a better choice for super-large computing systems.
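The link between torus dimensionality, network diameter, and average communication distance noted above can be illustrated with a minimal Python sketch. It only counts hops analytically on an idealized torus and is not the simulation used in the paper. The dimension sizes (22x22x22 and 5x5x5x5x4x4) are hypothetical choices that give roughly 10 000 nodes, and the function names are introduced here for illustration.

```python
from math import prod


def ring_mean_distance(k):
    """Mean hop count from one node to a uniformly chosen node on a ring of size k.

    The chosen node may be the source itself (distance 0).
    """
    return sum(min(d, k - d) for d in range(k)) / k


def torus_metrics(dims):
    """Return (node count, diameter, mean hop distance) for a torus with the given dimension sizes."""
    nodes = prod(dims)
    # Along each dimension the farthest node is floor(k / 2) hops away.
    diameter = sum(k // 2 for k in dims)
    # Shortest paths decompose per dimension, so the per-ring means add up.
    mean_distance = sum(ring_mean_distance(k) for k in dims)
    return nodes, diameter, mean_distance


if __name__ == "__main__":
    # Hypothetical shapes with about 10 000 nodes each (assumed, not from the paper).
    for name, dims in [("3D-Torus", (22, 22, 22)), ("6D-Torus", (5, 5, 5, 5, 4, 4))]:
        nodes, diameter, mean_distance = torus_metrics(dims)
        print(f"{name}: nodes={nodes}, diameter={diameter}, mean distance={mean_distance:.2f}")
```

Under these assumed shapes the 6D layout has a markedly smaller diameter and mean distance than the 3D layout at a similar node count, which is the qualitative effect the abstract attributes to the 6D-Torus.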
[Objective] With the rapid growth of new high-throughput applications such as cloud computing, the Internet of Things, and artificial intelligence, the main applications of high-performance computing have gradually shifted from traditional scientific and engineering computing to emerging data processing, which poses huge challenges to traditional processors. High-throughput many-core processors are emerging as a new processor architecture for such applications and have therefore become an important research direction. [Method] To address these problems, this paper analyzes the typical characteristics of high-throughput applications and discusses the key design of high-throughput many-core processors from three core aspects: data processing, transmission, and storage. The design includes real-time dynamic task scheduling, high-density on-chip network design, and optimization of the on-chip storage hierarchy. [Results] The experimental results show that these mechanisms can effectively guarantee the quality of service of tasks, improve data throughput, and simplify the on-chip memory hierarchy. [Conclusion] Given the urgent demand for high concurrency and strong real-time processing in the era of the Internet of Everything, high-throughput many-core processors are expected to become the main processing engine in future data centers.
[Objective] To analyze the performance of supercomputing systems quickly and to accelerate the optimization of HPL benchmark tests, this paper analyzes the main factors that influence HPL and establishes a corresponding parallel computing model. [Methods] Based on the parallel optimization results of the HPL benchmark on the Sugon advanced computing system, theoretical analysis is combined with experimental verification to study the upper limit of HPL efficiency, its fast prediction, and the influence of different parameters, and the corresponding parallel computing model is built on this basis. [Results] The predictions agree well with the measurements on the Sugon advanced computing system. This indicates that factors such as the balance between computing performance and tasks, the share of matrix operations in the HPL calculation, the efficiency of matrix operations, the utilization of matrix operation library functions, and network transmission largely determine the HPL efficiency of a supercomputing system. In addition, HPL efficiency is directly proportional to the matrix operation efficiency of the accelerator card. [Limitations] At present, the parallel computing model does not consider all factors comprehensively, and how the stability requirements of a large-scale computing system affect its performance needs further study. [Conclusions] Parallel computing models built for different prediction requirements provide important guidance for HPL benchmark performance prediction and parallel optimization.
[Objective] Distributed computing systems are widely used in big data processing. They are designed and implemented with a focus on scalability. With good scalability, a system can hold and process a growing amount of data by adding resources without modifying the system itself, but this comes at great cost to the absolute performance of each individual machine. We want to offer a reasonable and modern metric for evaluating the performance of distributed systems. [Methods] In this article, we discuss the performance of distributed systems by comparing them with a single machine on the same task, using the proposed metric COS, the Configuration that Outperforms a Single machine. The COS of a system on a given problem is the number of machines required for the system to outperform a competent single-machine implementation. With limited hardware resources, the COS of a distributed system is usually too large to measure, so we offer a second metric by giving COS a parameter n. COS(n) equals n multiplied by the ratio of the time used on n machines to the time used on a single machine. COS(n) indicates the loss in performance and cost in a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD support, and NUMA-aware memory control. [Results] Our experiments show that Apache Spark, whether through its native API or an optimized machine learning library such as MLlib, needs tens to hundreds of machines to match the performance we achieved on a single machine. [Limitations] The comparison between a single machine and a cluster is not entirely fair, because overheads in a cluster are unavoidable. [Conclusions] The COS metric can still reveal the poor absolute performance and insufficient use of hardware advantages in distributed systems.
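The two metrics defined above can be written down directly. The following Python sketch is an illustration only: the function names and the timings in the usage example are hypothetical and do not come from the paper's experiments.

```python
def cos_n(n_machines, time_on_n, time_on_single):
    """COS(n): n multiplied by (time on n machines / time on a single machine)."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    if time_on_n <= 0 or time_on_single <= 0:
        raise ValueError("running times must be positive")
    return n_machines * time_on_n / time_on_single


def cos_from_measurements(times_by_n, time_on_single):
    """COS: smallest measured machine count whose running time beats the single machine.

    times_by_n maps a machine count to the measured running time on that many machines.
    Returns None if no measured configuration outperforms the single machine.
    """
    winners = [n for n, t in times_by_n.items() if t < time_on_single]
    return min(winners) if winners else None


if __name__ == "__main__":
    # Hypothetical timings in seconds (assumed for illustration).
    single = 100.0
    cluster = {8: 180.0, 16: 120.0, 32: 90.0}
    print("COS(16) =", cos_n(16, cluster[16], single))
    print("COS =", cos_from_measurements(cluster, single))
```

Read this way, COS(n) is the cluster's total machine-time relative to the single-machine time, so a value far above 1 means that much of the added hardware is spent on overhead rather than on useful work.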
[Objective] Direct numerical simulation (DNS) of hypersonic turbulence requires a great many grid points and time steps, so the amount of computation is very large. Excessively long computing time is an important reason why DNS cannot be used in real applications. To accelerate the computation, this paper introduces the design of OpenCFD-SCU, a high-performance computational fluid dynamics program for CPU/GPU heterogeneous system architectures (HSA). [Method] The program is based on OpenCFD-SC, a high-precision finite-difference solver for CPUs written in Fortran and developed by the authors. OpenCFD-SCU has the same program framework as OpenCFD-SC, and its computing part is written in CUDA so that all arithmetic operations are completed on the GPU. [Results] On the same DNS task, the GPU version, OpenCFD-SCU, is 60 times faster than the CPU version, OpenCFD-SC. The computing power of the GPU is much higher than that of the CPU, so using GPUs can effectively accelerate the computation, which is the future trend for DNS programs for hypersonic turbulence. [Conclusion] We believe that more and more hypersonic turbulence simulations can be moved to the GPU in the future.
[Objective] To promote precision medicine research, large-scale population genomic studies have been carried out around the world, and population-specific genome variation maps have been built by whole-genome sequencing of thousands of individuals. These projects produce massive genomic data, which require high-throughput computing (HTC) to process. However, because of the characteristics of genomic data and the diversity and complexity of processing workflows, HTC resources are not fully utilized in genomic data analysis tasks, so computing is slow and data exchange between servers is inconvenient. It is therefore necessary to optimize HTC platforms for genomic data analysis in both software and hardware. This paper analyzes and summarizes these optimization methods. [Methods] In an HTC system, the system IO bottleneck is the main cause of low parallel efficiency in genomic data processing. Distributed unstructured storage databases and object storage systems are generally used to improve the scalability of large-scale IO and solve the IO problems in data processing. The IO load can also be reduced by using efficient compression algorithms for genomic data. To accelerate genomic data processing, algorithms such as neural networks can be used to optimize genome analysis methods, and FPGA or GPU heterogeneous computing can be used to speed up data analysis. [Results] In brief, these optimizations can greatly improve HTC performance by solving the IO wall problem in genomic data analysis and improving the utilization of HTC resources, which greatly reduces the computing time of genome-wide variation analysis. [Conclusions] The software and hardware improvements can significantly increase HTC efficiency and speed in genomic data analysis and can promote the application of high-throughput computing to large-scale cohort studies in the future.
[Objective] In this paper, we introduce the application of the materials genome approach to materials design, including the exploration of catalytic materials, thermoelectric materials, metal-organic framework (MOF) materials, lithium battery materials, and perovskite photovoltaic materials. [Methods] High-throughput computing is combined with data mining techniques such as machine learning. A database is generated from the high-throughput computations, and data mining and in-depth analysis are then performed on it. [Results] Potential novel materials are screened and discovered on the basis of the data analysis. [Limitations] Currently, some hypothetical materials are difficult to realize in experiments, so theoretical predictions and experiments need to be integrated more deeply. [Conclusion] With the further development of computational and experimental technology, the materials genome approach will play a more significant role in materials development.