Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (2): 164-174.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.02.013

doi: 10.11871/jfdc.issn.2096-742X.2023.02.013

• Technology and Application •

A Job Duration Prediction Method for Computing Clusters Based on Ensemble Machine Learning

LI He1,2, XIU Hanwen1,2, LIU Yanjun3, CAO Rongqiang1,2,*, ZHOU Chunbao1,2, WANG Yangang1,2

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  3. School of Software, Beihang University, Beijing 100191, China
  • Received: 2022-08-23 Online: 2023-04-20 Published: 2023-04-24
  • Contact: CAO Rongqiang E-mail: lihe@cnic.cn; caorq@cnic.cn

Abstract:

[Objective] This paper presents a method that improves the accuracy of job duration prediction, and thereby the performance of job backfill scheduling and the utilization of computing resources in computing clusters. [Context] Job scheduling plays an essential role in improving resource utilization in supercomputing clusters, and the predicted job duration is the key input on which a backfill scheduling strategy bases its decisions. [Methods] Using ensemble learning, the method predicts job duration on computing clusters by integrating algorithms such as support vector regression, random forest, gradient boosting regression trees, and automated machine learning. The predicted job durations are then used in job backfill scheduling experiments. [Results] The method is tested on three typical data sets: HPC2N, CEA Curie, and KIT FH2. Compared with the users' own estimates, the root mean square error of the predicted job duration is reduced by 60.30%, 51.91%, and 63.51%, respectively. Compared with the linear regression method, it is reduced by 44.37%, 31.98%, and 52.69%, respectively. [Conclusions] The experimental results show that the method substantially improves the accuracy of job duration prediction. Compared with scheduling based on user estimates, the average job waiting time is reduced by 9.07%, 8.80%, and 1.83%, respectively, and the average bounded slowdown is reduced by 7.72%, 0.96%, and 9.05%, respectively, which improves both the performance of job backfill scheduling and the utilization of computing resources in supercomputing clusters.

Key words: ensemble learning, supercomputing cluster, job duration prediction, backfill scheduling strategy
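
The abstract reports its results as relative reductions in root mean square error (RMSE) and in average bounded slowdown. As a minimal sketch of how such figures are typically computed, the Python below uses the conventional definitions of both metrics. The function names, the toy numbers, and the 10-second slowdown threshold are illustrative assumptions and are not taken from the paper.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual durations (seconds)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


def bounded_slowdown(wait, run, threshold=10.0):
    """Conventional bounded slowdown of one job.

    The threshold (assumed here to be 10 s) keeps very short jobs from
    dominating the average; the result is never below 1.
    """
    return max((wait + run) / max(run, threshold), 1.0)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    actual = [100.0, 200.0, 300.0]
    user_estimate = [400.0, 500.0, 600.0]
    model_prediction = [130.0, 160.0, 300.0]

    baseline = rmse(user_estimate, actual)
    improved = rmse(model_prediction, actual)
    print(f"RMSE reduction vs. user estimate: "
          f"{relative_reduction(baseline, improved):.2f}%")
    print(f"Bounded slowdown (wait=50 s, run=2 s): {bounded_slowdown(50.0, 2.0)}")
```

The same `relative_reduction` helper applies to average waiting time and average bounded slowdown once each has been measured under both scheduling runs.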