Frontiers of Data and Computing, 2022, Vol. 4, Issue (3): 46-65.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.004

doi: 10.11871/jfdc.issn.2096-742X.2022.03.004

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II)

Evaluation of KPI Anomaly Detection Methods

SUN Yongqian1,2, ZHANG Ruru1, LIN Zihan1, ZHANG Shenglin1,2,3,*, TAN Zhiyuan1, ZHANG Yuzhi1,2,3

  1. College of Software, Nankai University, Tianjin 300350, China
  2. Tianjin Key Laboratory of Operating System, Tianjin 300350, China
  3. Haihe Laboratory of Information Technology Application Innovation, Tianjin 300350, China
  • Received: 2022-02-14 Online: 2022-06-20 Published: 2022-06-20
  • Contact: ZHANG Shenglin
  • About the authors:
    SUN Yongqian, Ph.D., is an assistant professor in the College of Software, Nankai University, Tianjin, China. His research interests include anomaly detection, root cause localization, and high-performance switching in data centers. In this paper, he is mainly responsible for designing the paper's structure and revising the paper. E-mail: sunyongqian@nankai.edu.cn
    ZHANG Ruru is a master's student in the College of Software, Nankai University, Tianjin, China. Her research interests include anomaly detection and root cause localization. In this paper, she is responsible for the related work investigation, experiments, and evaluation. E-mail: 1852917912@qq.com
    LIN Zihan is a master's student in the College of Software, Nankai University, Tianjin, China. His research interests include anomaly detection and root cause localization. In this paper, he is responsible for the related work investigation and experiments. E-mail: 2120210568@mail.nankai.edu.cn
    ZHANG Shenglin, Ph.D., is an associate professor in the College of Software, Nankai University, Tianjin, China. His research interests include failure detection, diagnosis, and prediction in data center networks. He has published more than 15 papers indexed by SCI/EI. In this paper, he is responsible for the related work investigation and supervision. E-mail: zhangsl@nankai.edu.cn
    TAN Zhiyuan is an undergraduate in the College of Software, Nankai University, Tianjin, China. His main research interests include anomaly detection and deep learning. In this paper, he is responsible for the related work investigation and experiments. E-mail: bhbean42@qq.com
    ZHANG Yuzhi, Ph.D., is a distinguished professor and the dean of the College of Software, Nankai University, Tianjin, China. His research interests include deep learning and other aspects of artificial intelligence. In this paper, he is responsible for the related work investigation. E-mail: zyz@nankai.edu.cn
  • Funding:
    National Key Research and Development Program of China (2018YFB0204304); Natural Science Foundation of Tianjin, Youth Program (21JCQNJC00180); National Natural Science Foundation of China, Young Scientists Fund (61902200); China Postdoctoral Science Foundation, General Program (2019M651015)

Abstract:

[Objective] Anomaly detection on key performance indicators (KPIs, such as page views, page-view latency, server CPU utilization, router memory utilization, switch throughput, and server disk I/O) is the basis of rapid fault discovery and repair, and it is becoming increasingly important for rapidly developing cloud computing services. [Coverage] We extensively survey recent work on KPI anomaly detection in China and abroad. [Methods] We study and analyze in depth the KPI anomaly detection methods from each stage of development and select 13 representative methods for experimental evaluation. [Results] We summarize the general problems, challenges, and frameworks of KPI anomaly detection, and evaluate the accuracy, robustness, and efficiency of the selected methods on KPI datasets collected from three top-tier Chinese Internet companies. [Conclusions] These methods cover statistics-based, supervised, semi-supervised, and unsupervised approaches, each with its own advantages and disadvantages. Our research and analysis provide a basis for future researchers to quickly and accurately select the KPI anomaly detection method best suited to their scenarios.

Key words: key performance indicator, anomaly detection, method evaluation, machine learning