Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (4): 46-58.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.004

doi: 10.11871/jfdc.issn.2096-742X.2024.04.004

• Special Issue: Fundamental Software Stack and Systems for National Scientific Data Centers

DPML: A Markup Language for Scientific Data Pragmatics

CAI Huaqian1,2, LIU Yihao1,3, GUAN Tianpeng1,3, WU Kaidong1,2, YANG Jingru1,2, LUO Chaoran1, ZHU Xiaojie4, LIU Jia4, HUANG Gang1,2,*

    1. National Key Laboratory of Dataspace Technology and System, Beijing 100091, China
    2. School of Computer Science, Peking University, Beijing 100871, China
    3. School of Software and Microelectronics, Peking University, Beijing 100871, China
    4. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2024-02-04 Online: 2024-08-20 Published: 2024-08-20

Abstract:

[Objective] Scientific data is being used in an increasingly diverse range of scenarios. Understanding how data is used in existing scenarios offers valuable inspiration and reference for technological exploration and discovery. However, because scientific data applications involve complex inputs, algorithms, and execution environments, describing the scenario-based use of data in a uniform way remains a challenge. Without a unified description, understanding and learning how scientific data is used in existing scenarios is difficult, costly, and inefficient. [Methods] To address the lack of a unified framework for describing data usage scenarios, this article introduces the concept of data pragmatics and uses hypergraphs to model the scenario-based use of data. It proposes a novel markup language for data pragmatics, the Data Pragmatics Markup Language (DPML), together with an AI-driven method for automatically extracting scientific data pragmatics. [Results] DPML can characterize the data pragmatics of typical scientific data analysis scenarios, and DPML descriptions can be extracted efficiently with the proposed automated method. [Conclusions] By proposing DPML and its automated extraction method, this paper enables the automated representation of the implicit data pragmatics in scenario-based scientific data usage. The pragmatic web of scientific data, formed by data and the pragmatic relationships among them, captures knowledge of how to use scientific data. It can promote interdisciplinary sharing and reuse of scientific data and open up a new path for in-depth cooperation and data-driven discovery in scientific research.

Key words: scientific data, data semantics, markup language, automated extraction, artificial intelligence
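
Illustrative sketch (not from the article): the abstract says that hypergraphs are used to model the scenario-based use of data, but it does not give the model or the DPML syntax. The Python below is only one possible reading, under the assumption that each usage scenario is a hyperedge joining its input data, algorithm, execution environment, and output data. The names `Usage`, `PragmaticHypergraph`, `usages_of`, and `derived_from`, as well as all dataset and scenario labels, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Usage:
    """One hyperedge: a single scenario in which data is used (assumed shape)."""
    scenario: str
    inputs: frozenset
    algorithm: str
    environment: str
    outputs: frozenset


@dataclass
class PragmaticHypergraph:
    """Datasets are nodes; each Usage connects any number of them at once."""
    usages: list = field(default_factory=list)

    def add(self, scenario, inputs, algorithm, environment, outputs):
        usage = Usage(scenario, frozenset(inputs), algorithm,
                      environment, frozenset(outputs))
        self.usages.append(usage)
        return usage

    def usages_of(self, dataset):
        """All recorded scenarios that take `dataset` as an input."""
        return [u for u in self.usages if dataset in u.inputs]

    def derived_from(self, dataset):
        """Datasets reachable from `dataset` by following usages forward."""
        seen, frontier = set(), [dataset]
        while frontier:
            current = frontier.pop()
            for usage in self.usages_of(current):
                for out in usage.outputs:
                    if out not in seen:
                        seen.add(out)
                        frontier.append(out)
        return seen


# Hypothetical example: two scenarios chained through an intermediate dataset.
graph = PragmaticHypergraph()
graph.add("cleaning", ["raw"], "filter_step", "env_a", ["clean"])
graph.add("alignment", ["clean", "reference"], "align_step", "env_b", ["aligned"])

downstream = graph.derived_from("raw")
```

In this sketch a single hyperedge can join several inputs and outputs, which an ordinary pairwise edge cannot express directly; whether DPML encodes usages this way is not stated in the abstract.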