Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (1): 15-27.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.01.002

doi: 10.11871/jfdc.issn.2096-742X.2023.01.002

• Special Issue: Joint Special Issue on Scientific Data Resources, Technology, and Policy •

End-to-End Workflow Framework for Cross-Center Scientific Data Analysis

ZHU Xiaojie1, WANG Huajin1, SHEN Zhihong1,*, GUO Xuebing2, DONG Wen1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  • Received: 2022-12-27 Online: 2023-02-20 Published: 2023-02-20
  • Contact: SHEN Zhihong
  • About the authors:
    ZHU Xiaojie, Master's degree, senior engineer, works in the Big Data Technology and Application Development Department, Computer Network Information Center, Chinese Academy of Sciences. Her main research interests are big data management and processing technologies. She is currently undertaking a sub-project of the National Key Research and Development Program and a project under the Chinese Academy of Sciences informatization special program.
    In this paper, she is mainly responsible for the architecture design of the end-to-end workflow framework for cross-center scientific data analysis and for writing some sections.
    E-mail: xjzhu@cnic.cn
    SHEN Zhihong, Ph.D., professor-level senior engineer, doctoral supervisor, director of the Big Data Technology and Application Development Department, Computer Network Information Center, Chinese Academy of Sciences. His main research interests are scientific big data and graph data management technologies. He led the development of open-source software such as the big data pipeline PiFlow and the heterogeneous data fusion management system PandaDB.
    In this paper, he is mainly responsible for the methodology of cross-center workflow analysis.
    E-mail: bluejoe@cnic.cn
  • Funding:
    National Key Research and Development Program of China, "Basic Software Stack and System for National Scientific Data Centers" (2021YFF0704200); Chinese Academy of Sciences 14th Five-Year Plan Informatization Special Project, "Scientific Big Data Engineering (Phase III)" (CAS-WX2022GC-02)

Abstract:

[Objective] The rapid development of big data and artificial intelligence technologies has transformed research paradigms. The new paradigms generally require collaborative analysis of scientific data resources from different domains, with diverse task types and analysis processes that span multiple scientific data centers. [Application background] Existing workflow analysis frameworks fall short in expressing analysis processes, integrating heterogeneous computing frameworks, and scheduling jobs across centers, so they struggle to support end-to-end, cross-center workflow analysis of scientific data. [Methods] This paper proposes a software framework for end-to-end, cross-center workflow analysis of scientific data. It supports cross-center heterogeneous workflow construction, transparent data transfer across frameworks, and optimized cross-center job scheduling. [Results] The framework's functionality and performance are verified in the National Ecosystem Science Data Center scenario "cross-station online processing and quality control of grassland aboveground biomass", demonstrating that the framework is advanced and feasible.

Key words: scientific research paradigm, workflow analysis, scientific data center, cross-center computing