Frontiers of Data and Computing (数据与计算发展前沿) ›› 2024, Vol. 6 ›› Issue (4): 87-95.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.007

doi: 10.11871/jfdc.issn.2096-742X.2024.04.007

• Special Issue: Basic Software Stack and Systems for National Science Data Centers •

A Root Cause Localization Method Based on Service Dependency Graph for Microservice System Failures

ZHANG Qixun (张齐勋)1,*, JIA Tong (贾统)2, YANG Yong (杨勇)3, LI Ying (李影)4

  1. School of Software and Microelectronics, Peking University, Beijing 102600, China
  2. Institute for Artificial Intelligence, Peking University, Beijing 100871, China
  3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  4. National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China
  • Received: 2024-02-05 Online: 2024-08-20 Published: 2024-08-20
  • Corresponding author: *ZHANG Qixun (E-mail: zhangqx@pku.edu.cn)
  • About the author: ZHANG Qixun, School of Software and Microelectronics, Peking University, Ph.D., Associate Professor, CCF Member. His main research areas are software engineering and Artificial Intelligence for IT Operations.
    In this paper, he is responsible for writing the paper and designing the root cause localization method.
    E-mail: zhangqx@pku.edu.cn
  • Funding:
    National Key Research and Development Program of China (2021YFF0704202)

Abstract:

[Objective] Microservice architectures suffer from frequent system failures and rapid anomaly propagation, and fine service granularity, frequent updates, and complex service dependencies make these failures hard to diagnose. To address this, this paper proposes a rapid root cause localization method based on dynamic service dependency graphs. [Methods] The method uses the configuration information and log data of microservices to dynamically generate service dependency graphs, capturing changes in the dependencies between services. When a failure occurs, it uses the service dependency graph and anomaly event data to infer the causal chains among anomalies and constructs an anomaly causality graph. Taking the weights of service dependencies into account, it then searches the service dependency graph for candidate root cause nodes and ranks them to locate the source of the anomaly. [Results] Experimental results show that the proposed method achieves an average top-5 root cause localization precision of 66%, outperforming existing methods of the same kind.

Key words: microservice, service dependency, anomaly causal relationship, root cause localization
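
The pipeline outlined in the abstract (dependency graph, anomaly causality graph, weighted ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract does not specify how weights or scores are computed. The sketch assumes that dependency weights are call-frequency shares taken from hypothetical (caller, callee) log records, that an anomaly propagates from a callee to its caller, and that a candidate is scored by the weighted amount of anomalies it could explain. All function names and service names are hypothetical.

```python
from collections import defaultdict


def build_dependency_graph(call_records):
    """Build {caller: {callee: weight}} from (caller, callee) log records.

    Assumption: weight = share of the caller's outgoing calls to that callee.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for caller, callee in call_records:
        counts[caller][callee] += 1
    graph = {}
    for caller, callees in counts.items():
        total = sum(callees.values())
        graph[caller] = {callee: n / total for callee, n in callees.items()}
    return graph


def build_causality_graph(dep_graph, anomalous):
    """Add a causal edge callee -> caller when both services are anomalous.

    Assumption: anomalies propagate against the call direction.
    """
    causes = defaultdict(dict)
    for caller, callees in dep_graph.items():
        for callee, weight in callees.items():
            if caller in anomalous and callee in anomalous:
                causes[callee][caller] = weight
    return causes


def rank_root_causes(causes, anomalous, top_k=5):
    """Rank anomalous services by how much anomaly they could explain."""

    def explained(start):
        # Strongest (maximum-product) causal path weight to each reachable node.
        best = {start: 1.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, weight in causes.get(node, {}).items():
                candidate = best[node] * weight
                if candidate > best.get(nxt, 0.0):
                    best[nxt] = candidate
                    stack.append(nxt)
        return sum(best.values())

    scores = {service: explained(service) for service in anomalous}
    # Highest score first; ties are broken by service name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical call records and anomaly set for illustration only.
records = [
    ("gateway", "order"), ("gateway", "order"), ("gateway", "order"),
    ("gateway", "user"),
    ("order", "payment"), ("order", "payment"),
    ("order", "inventory"), ("order", "inventory"),
    ("payment", "db"),
]
anomalous = {"gateway", "order", "payment", "db"}

dep_graph = build_dependency_graph(records)
causality = build_causality_graph(dep_graph, anomalous)
ranking = rank_root_causes(causality, anomalous)
print(ranking)
```

In this toy scenario `db` ranks first, because it sits at the end of the call chain and its anomaly can explain the anomalies of every service that depends on it, directly or indirectly. A real system would replace the frequency-based weights and the scoring rule with whatever the method actually defines.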