Frontiers of Data and Computing, 2022, Vol. 4, Issue (5): 3-10.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.05.001

doi: 10.11871/jfdc.issn.2096-742X.2022.05.001

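• Special Issue: East-West Computing Requirement Transfer: A Century Project Ushering in the Era of the Computing Power Economy (Part I) •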

Application-Aware Method for Optimized Computing Power Scheduling

KOU Dazhi1,*, WEI Jianwen2, TANG Xiaoyong3

  1. Shanghai Supercomputer Center, Shanghai 201203, China
  2. Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240, China
  3. School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, Hunan 410114, China
  • Received: 2022-07-11 Online: 2022-10-20 Published: 2022-10-27
  • Contact: KOU Dazhi
  • About the author: KOU Dazhi is a senior engineer at the Shanghai Supercomputer Center. His research interests include HPC cluster systems and HPC applications.
    In this paper, he was responsible for drawing up the paper framework and writing Section 1 (application run-history database), Section 3 (container migration of HPC applications), and Section 5 (conclusion and prospects).
    E-mail: dzkou@ssc.net.cn
  • Supported by:
    National Key Research and Development Program of China, “Application-Based Optimized Scheduling Methods and Implementation” (2018YFB0204004)

Abstract:

[Objective] Against the background of the “East-West Computing Requirement Transfer” project, supercomputing resources distributed across different regions need to be scheduled and managed in a unified way. To address the uneven load across computing resources, where some systems are busy while others sit idle, this work studies the runtime characteristics of typical application jobs and develops a multi-center task scheduling system, targeting the key technical problems of unified scheduling in the national high-performance computing environment. [Methods] First, application run histories are collected from several supercomputing centers and an application run-history database is built. Second, users' resource requests are combined with an analysis of the resource-usage characteristics of typical applications, and machine learning is used to establish a framework that accurately describes application characteristics. Third, container-based migration of HPC applications across clusters is implemented. Finally, task scheduling methods based on multi-center application characteristics are studied, and an application-aware global resource scheduling optimization system is developed. [Results] The system provides strong technical support for the service-oriented, stable, and efficient operation of the national high-performance computing environment. [Conclusions] The application-aware method for optimized computing power scheduling is expected to effectively improve the reliability, availability, and maintainability of the “East-West Computing Requirement Transfer” project.

Key words: high-performance computing system, historical database, application characteristics, computing power scheduling method
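
The idea summarized in the abstract, choosing a center for a job from its application's run history, can be illustrated with a minimal sketch. This is not the authors' system: the paper describes a machine-learning framework, whereas the sketch below substitutes a plain historical mean as the runtime predictor. The names `Center`, `predict_runtime`, `choose_center`, the `queue_wait_s` field, and the layout of `history` are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Center:
    """A supercomputing center (hypothetical model, not from the paper)."""
    name: str
    queue_wait_s: float  # assumed current estimate of queue wait, in seconds


def predict_runtime(history, app, center_name):
    """Predict runtime in seconds from past runs.

    `history` maps (app, center_name) -> list of past runtimes in seconds.
    A historical mean stands in for the paper's machine-learning model.
    """
    runs = history.get((app, center_name))
    if runs:
        return mean(runs)
    # No runs on this center: fall back to the app's mean over all centers.
    all_runs = [r for (a, _), rs in history.items() if a == app for r in rs]
    return mean(all_runs) if all_runs else float("inf")


def choose_center(history, app, centers):
    """Pick the center with the smallest estimated wait plus runtime."""
    return min(
        centers,
        key=lambda c: c.queue_wait_s + predict_runtime(history, app, c.name),
    )


history = {("vasp", "A"): [100.0, 120.0], ("vasp", "B"): [90.0]}
centers = [Center("A", queue_wait_s=10.0), Center("B", queue_wait_s=50.0)]
best = choose_center(history, "vasp", centers)
```

Here center B runs the application faster, but its longer queue makes center A the better choice overall, which is the kind of trade-off an application-aware scheduler has to weigh.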