Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (4): 77-86.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.006

doi: 10.11871/jfdc.issn.2096-742X.2024.04.006

• Special Issue: Foundational Software Stack and Systems for National Science Data Centers •

Large Scale Dynamic Graph Version Management: Requirements, Technologies, and Challenges

ZENG Chenglin1,2, WANG Huajin1,2, ZHU Xiaojie1,2, SHEN Zhihong1,2,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-02-05 Online: 2024-08-20 Published: 2024-08-20
  • Corresponding author: *SHEN Zhihong (E-mail: bluejoe@cnic.cn)
  • About the authors: ZENG Chenglin is a doctoral student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences, and a member of the China Computer Federation (CCF). His main research interests are graph databases and data management.
    In this paper, he was responsible for the literature review, experimental data collection, experimental design, and paper writing.
    E-mail: zengchenglin@cnic.cn
    SHEN Zhihong, Ph.D., is Director of the Department of Big Data Technology and Application Development at the Computer Network Information Center, Chinese Academy of Sciences, and Chief Engineer of the National-Local Joint Engineering Laboratory for Big Data Analysis and Computing Technology. He is a professor-level senior engineer, doctoral supervisor, and CCF member. His main research interests are scientific big data and graph data management technology. He has led the development of open-source software including the big data pipeline PiFlow and the heterogeneous data fusion management system PandaDB.
    In this paper, he was responsible for the overall framework design and research guidance.
    E-mail: bluejoe@cnic.cn
  • Funding:
    National Key Research and Development Program of China, project "Foundational Software Stack and Systems for National Science Data Centers" (2021YFF0704200); Chinese Academy of Sciences 14th Five-Year Plan Informatization Special Project "Scientific Big Data Engineering (Phase III)" (CAS-WX2022GC-02)

Abstract:

[Objective] In the era of big data, massive amounts of dynamic graph data are generated in daily life, industrial production, and scientific research. Managing and analyzing these data can effectively support process design, intelligent decision-making, and scientific research. [Coverage] We searched CNKI and Google Scholar with keywords such as dynamic graph, evolving graph, and version graph management, and collected several dozen relevant papers. [Methods] We classify and summarize the collected literature under three categories (data models, management systems, and mining and analysis methods) and analyze the state of research in China and abroad. [Results] First, the space consumption of the three mainstream storage strategies for dynamic graph data is studied theoretically, yielding preliminary conclusions. Second, existing dynamic graph query requirements are summarized in greater depth from a set-theoretic perspective. Finally, the number of papers in each category shows that current research on dynamic graphs focuses more on mining and analysis. [Limitations] The graph models in the collected literature are mainly property graphs; RDF-related literature is not covered. [Conclusions] After analyzing the requirements and technologies of large-scale dynamic graph version management, this paper identifies several open challenges: the high space expansion rate caused by managing multiple versions of a dynamic graph, efficient random retrieval of a specified version, and precise characterization of the evolution relationships between versions.

Key words: dynamic graph, version management, graph data