Frontiers of Data and Computing ›› 2020, Vol. 2 ›› Issue (3): 87-100.

doi: 10.11871/jfdc.issn.2096-742X.2020.03.008

Special topic: Next-Generation Internet Technologies and Applications

• Special issue: Next-Generation Internet Technologies and Applications (Part I) •

Research on Unsupervised KPI Anomaly Detection Based on Deep Learning

Zhang Shenglin1, Lin Xiaofei1, Sun Yongqian1,*, Zhang Yuzhi1, Pei Dan2

  1. College of Software, Nankai University, Tianjin 300350, China
  2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2020-04-10 Online: 2020-06-20 Published: 2020-08-19
  • Corresponding author: Sun Yongqian
  • About the authors:
    Zhang Shenglin, Ph.D., is an assistant professor in the College of Software, Nankai University, Tianjin, China. His research interests include failure detection, diagnosis, and prediction in data center networks. He has published more than 15 papers indexed by SCI/EI.
    In this paper he is mainly responsible for designing the paper architecture and revising the paper.
    E-mail: zhangsl@nankai.edu.cn
    Lin Xiaofei is a master's student in the College of Software, Nankai University, Tianjin, China. Her research interests include anomaly detection and deep learning.
    In this paper she is mainly responsible for the related work investigation and the experimental evaluation.
    E-mail: filler.helloworld@gmail.com
    Sun Yongqian, Ph.D., is an assistant professor in the College of Software, Nankai University, Tianjin, China. His research interests include anomaly detection, root cause localization, and high-performance switching in data centers.
    In this paper he is mainly responsible for the related work investigation.
    E-mail: sunyongqian@nankai.edu.cn
    Zhang Yuzhi, Ph.D., is a distinguished professor and the dean of the College of Software, Nankai University. His research interests include deep learning and other aspects of artificial intelligence.
    In this paper he is mainly responsible for the related work investigation and guidance.
    E-mail: zyz@nankai.edu.cn
    Pei Dan, Ph.D., is an associate professor in the Department of Computer Science and Technology, Tsinghua University. His research interests include network and service management in general.
    In this paper he is mainly responsible for the related work investigation.
    E-mail: peidan@tsinghua.edu.cn
  • Funding:
    National Key Research and Development Program of China (2018YFB0204304)

Abstract:

[Objective] Key performance indicator (KPI) anomaly detection, the basis of Internet artificial intelligence operations (AIOps), is of vital importance to rapid failure detection and mitigation. [Scope of the literature] This paper surveys unsupervised KPI anomaly detection methods, from China and abroad, that are based on deep generative models. [Methods] We systematically describe the theoretical models of Donut, Bagel, and Buzz, three unsupervised KPI anomaly detection methods, and analyze their advantages and limitations in terms of accuracy and efficiency. [Results] We evaluate the performance of the three methods on KPI data from a production environment. [Limitations] KPI anomaly detection methods based on deep generative models are still evolving, and we will explore more methods in this area in future work. [Conclusions] The choice of deep generative model should depend on the characteristics of the KPI data: for KPI data that is sensitive to timing information, Bagel should be used for anomaly detection; for non-seasonal, complex KPI data, Buzz should be used.

Key words: deep learning, unsupervised learning, key performance indicator, anomaly detection, generative model