Frontiers of Data and Computing, 2023, Vol. 5, Issue (2): 164-174.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.02.013

doi: 10.11871/jfdc.issn.2096-742X.2023.02.013

• Technology and Application •

A Job Duration Prediction Method for Computing Clusters Based on Ensemble Machine Learning

LI He1,2, XIU Hanwen1,2, LIU Yanjun3, CAO Rongqiang1,2,*, ZHOU Chunbao1,2, WANG Yangang1,2

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  3. School of Software, Beihang University, Beijing 100191, China
  • Received: 2022-08-23 Online: 2023-04-20 Published: 2023-04-24
  • Contact: CAO Rongqiang
  • About the authors:
    LI He is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research interest is federated learning.
    In this paper, he was mainly responsible for model architecture design, data preparation, and paper writing.
    E-mail: lihe@cnic.cn
    CAO Rongqiang is an associate researcher at the Computer Network Information Center, Chinese Academy of Sciences. His main research interest is artificial intelligence platforms.
    In this paper, he was responsible for overall planning, model revision, experimental design, and guidance on the paper.
    E-mail: caorq@cnic.cn
  • Funding:
    State Grid Corporation of China headquarters-managed science and technology project "Research on Key Technologies of an Independent and Controllable Open Platform for Electric Power Artificial Intelligence" (5700-202158-261A-0-0-00)

Abstract:

[Objective] This study aims to improve the accuracy of job duration prediction and the performance of job backfill scheduling, and thereby to raise the utilization of computing resources in computing clusters. [Context] Job scheduling plays an essential role in improving resource utilization in computing clusters, and the predicted job duration is the key input to backfill scheduling decisions. [Methods] We predict job duration in computing clusters with an ensemble learning method that integrates support vector regression, random forest, gradient boosting regression trees, and automated machine learning, and we use the predicted durations in job backfill scheduling experiments. [Results] The method was tested on three representative datasets: HPC2N, CEA Curie, and KIT FH2. Compared with user estimates, the root mean square error (RMSE) of the predicted job duration is reduced by 60.30%, 51.91%, and 63.51%, respectively; compared with linear regression, it is reduced by 44.37%, 31.98%, and 52.69%, respectively. [Conclusions] Backfill scheduling simulations show that the method greatly improves the accuracy of job duration prediction and improves the performance of backfill scheduling. Relative to user estimates, the average job waiting time is reduced by 9.07%, 8.80%, and 1.83%, and the average bounded slowdown is reduced by 7.72%, 0.96%, and 9.05%, respectively, which improves the utilization of computing resources in the cluster.
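
The results above are stated as relative reductions in RMSE and in scheduling metrics. The following minimal sketch shows how such figures could be computed. The toy durations are hypothetical, and the bounded-slowdown threshold `tau = 10` seconds is a commonly used value assumed here, not one taken from the paper.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences (seconds)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


def bounded_slowdown(wait, run, tau=10.0):
    """Bounded slowdown of one job; tau (assumed) damps very short runtimes."""
    return max((wait + run) / max(run, tau), 1.0)


def average_bounded_slowdown(jobs, tau=10.0):
    """Mean bounded slowdown over (wait, run) pairs."""
    return sum(bounded_slowdown(w, r, tau) for w, r in jobs) / len(jobs)


# Hypothetical toy data, in seconds (not from the paper's datasets).
actual = [100.0, 200.0, 300.0]
user_estimate = [400.0, 500.0, 600.0]    # users often over-request time
model_prediction = [130.0, 170.0, 330.0]

baseline_rmse = rmse(user_estimate, actual)
model_rmse = rmse(model_prediction, actual)
reduction = relative_reduction(baseline_rmse, model_rmse)
print(f"RMSE reduced by {reduction:.2f}% relative to user estimates")
```

The same `relative_reduction` helper applies to average waiting time and average bounded slowdown when comparing two scheduling runs.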

Key words: ensemble learning, computing cluster, job duration prediction, backfill scheduling strategy
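
To illustrate why predicted job duration drives backfill decisions, here is a minimal sketch of an EASY-style backfill check. The EASY variant, the `Job` class, and all parameter names are assumptions made for illustration; the abstract does not specify which backfill policy or simulator the authors used.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cores: int                 # cores requested by the job
    predicted_runtime: float   # predicted duration in seconds


def can_backfill(job, now, free_cores, shadow_time, extra_cores):
    """Return True if `job` may start now without delaying the queue head.

    shadow_time: earliest time the head-of-queue job is expected to start.
    extra_cores: cores that stay free at shadow_time once the head job starts.
    """
    if job.cores > free_cores:
        return False  # does not fit in the currently idle cores
    finishes_before_head = now + job.predicted_runtime <= shadow_time
    fits_beside_head = job.cores <= extra_cores
    return finishes_before_head or fits_beside_head


# Hypothetical scenario: the head job can start in 1800 s, 8 cores are idle now.
overestimated = Job(cores=4, predicted_runtime=3600.0)  # inflated estimate
accurate = Job(cores=4, predicted_runtime=600.0)        # tighter prediction

print(can_backfill(overestimated, 0.0, 8, 1800.0, 0))  # False
print(can_backfill(accurate, 0.0, 8, 1800.0, 0))       # True
```

In this toy scenario the same job is rejected under an inflated estimate and admitted under a tighter prediction, which is the mechanism by which more accurate durations can shorten waiting times.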