Frontiers of Data and Computing ›› 2026, Vol. 8 ›› Issue (3): 96-109.

doi: 10.11871/jfdc.issn.2096-742X.2026.03.009

• Technology and Application •

Research on Technology for Construction of High-Quality Multi-Source Heterogeneous Data Sets and Analysis & Identification of Associated Sensitivity

WANG Di1, AN Bing1,*, FENG Hanyu1, FAN Zihao2, LI Minghan2, RU Yiwei2

  1. State Grid Corporation of China Big Data Center, Beijing 100052, China
  2. Tianjin Zhongke Intelligent Recognition Co., Ltd., Tianjin 300457, China
  • Received: 2025-10-16 Online: 2026-06-20 Published: 2026-06-18
  • Contact: AN Bing E-mail: wangdi1220@aliyun; anikab@163.com

Abstract:

[Background] In the digital era, multi-source heterogeneous data have grown explosively, and their inherent value has become increasingly prominent. Provincial power grid companies process more than 100 million network access logs daily, covering diverse data types such as numerical data, command data, and alarm text data. These data are widely distributed across key business scenarios, including dispatching automation systems, the Internet of Things, and new energy grid-connection monitoring, and they hold great value for intelligent grid decision-making, equipment status prediction, and security risk prevention and control. However, data quality defects and associated sensitivity risks have become prominent bottlenecks restricting the realization of this value. [Methods] To address this, this paper focuses on technologies for constructing high-quality datasets from multi-source heterogeneous data and for analyzing and identifying associated sensitivity. For high-quality dataset construction, it proposes MTabGen, a diffusion-model-based method that achieves high-precision imputation of data defects through multi-modal joint optimization. For data association sensitivity analysis, it proposes DGDCN, a graph convolutional neural network that constructs data association graphs and identifies sensitive association paths. [Results] Experiments show that MTabGen significantly outperforms traditional data construction methods on accuracy and completeness metrics, and that DGDCN outperforms traditional machine learning methods across precision, recall, and F1-score.

Key words: multi-source heterogeneous data, dataset construction, data association, data sensitivity