Frontiers of Data and Computing (数据与计算发展前沿) ›› 2024, Vol. 6 ›› Issue (4): 46-58.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.004

doi: 10.11871/jfdc.issn.2096-742X.2024.04.004

• Special Issue: Foundational Software Stack and Systems for National Science Data Centers •

DPML: 一种面向科学数据语用的标记语言

蔡华谦1,2, 刘逸豪1,3, 关天鹏1,3, 吴恺东1,2, 杨婧如1,2, 罗超然1, 朱小杰4, 刘佳4, 黄罡1,2,*

  1. 数据空间与系统全国重点实验室,北京 100091
  2. 北京大学,计算机学院,北京 100871
  3. 北京大学,软件与微电子学院,北京 100871
  4. 中国科学院计算机网络信息中心,北京 100083
  • Received: 2024-02-04 Published: 2024-08-20 Online: 2024-08-20
  • Corresponding author: *HUANG Gang (E-mail: hg@pku.edu.cn)
  • About the authors: CAI Huaqian (蔡华谦), Ph.D., is an associate researcher at the School of Computer Science, Peking University. He has published more than 10 SCI/EI-indexed papers and holds more than 40 invention patents. His primary research interests include system software and distributed systems.
    In this paper, he is mainly responsible for the overall design of DPML and the design of the DPML automatic extraction framework.
    E-mail: caihq@pku.edu.cn
    HUANG Gang (黄罡), Ph.D., is director of the National Key Laboratory of Dataspace Technology and System and a professor and doctoral supervisor at Peking University. He has led pioneering breakthroughs in, and major applications of, multiple key software technologies and systems. His main research interests include system software and software adaptation.
    In this paper, he is mainly responsible for the theory of data pragmatic mechanisms and data pragmatic networks.
    E-mail: hg@pku.edu.cn
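  • Supported by:
    National Key Research and Development Program of China, "Foundational Software Stack and Systems for National Science Data Centers" (2021YFF0704200)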

DPML: A Markup Language for Scientific Data Pragmatics

CAI Huaqian1,2, LIU Yihao1,3, GUAN Tianpeng1,3, WU Kaidong1,2, YANG Jingru1,2, LUO Chaoran1, ZHU Xiaojie4, LIU Jia4, HUANG Gang1,2,*

  1. National Key Laboratory of Dataspace Technology and System, Beijing 100091, China
  2. School of Computer Science, Peking University, Beijing 100871, China
  3. School of Software and Microelectronics, Peking University, Beijing 100871, China
  4. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received:2024-02-04 Online:2024-08-20 Published:2024-08-20

Abstract (Chinese original, in translation):

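[Objective] Scientific data is used in an increasingly rich range of scenarios, and knowing how it is used in existing scenarios offers important inspiration and reference for scientific and technological exploration and discovery. However, the scenario-based use of scientific data involves complex inputs, algorithms, and execution environments, which makes describing such use in a unified way a challenge. Without a unified description, understanding and learning how scientific data is used in existing scenarios is difficult, costly, and inefficient. [Methods] To address the lack of a unified description of scenario-based data use, this paper adopts the concept of data pragmatics, models the scenario-based use of data from the perspective of hypergraphs, designs DPML (Data Pragmatics Markup Language), a new markup language for data pragmatics, and proposes an AI-based method for automatically extracting scientific data pragmatics. [Results] DPML can represent the data pragmatics in a variety of typical scenario-based uses of scientific data, and the automated method can extract DPML efficiently. [Conclusions] By proposing DPML and its automated extraction method, this paper achieves the automated representation of the data pragmatics implicit in the scenario-based use of scientific data. The pragmatic network of scientific data, formed by data and the pragmatic relationships among data, embodies knowledge of how scientific data is used; it can promote the cross-disciplinary sharing and reuse of scientific data and opens a new path for deeper collaboration and data-driven discovery in scientific research.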

Keywords: scientific data, data pragmatics, markup language, automated extraction, artificial intelligence

Abstract:

[Objective] The usage scenarios of scientific data are becoming increasingly diverse. Understanding how scientific data is used in existing scenarios offers valuable inspiration and reference for scientific and technological exploration and discovery. However, because the scenario-based use of scientific data involves complex inputs, algorithms, and execution environments, describing that use uniformly has become a challenge. The lack of a unified description makes it difficult, costly, and inefficient to understand and learn how scientific data is used in existing scenarios. [Methods] To address the lack of a unified description of scenario-based data use, this paper adopts the concept of data pragmatics and employs hypergraphs to model the scenario-based use of data. It designs a novel markup language for data pragmatics, the Data Pragmatics Markup Language (DPML), and proposes an AI-based method for the automated extraction of scientific data pragmatics. [Results] DPML can characterize the data pragmatics in a variety of typical scenario-based uses of scientific data, and the automated method can extract DPML efficiently. [Conclusions] By proposing DPML and its automated extraction method, this paper achieves the automated representation of the data pragmatics implicit in the scenario-based use of scientific data. The pragmatic network of scientific data, formed by data and the pragmatic relationships between data, contains knowledge of how scientific data is used. It can promote the interdisciplinary sharing and reuse of scientific data, opening a new path for in-depth collaboration and data-driven discovery in scientific research.

Key words: scientific data, data pragmatics, markup language, automated extraction, artificial intelligence