Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (4): 96-105.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.008

doi: 10.11871/jfdc.issn.2096-742X.2024.04.008

• Special Issue: Foundational Software Stack and Systems for National Science Data Centers •

Study on Integration Method of Algorithm Model Based on Big Data Pipeline: Taking Tree Biomass Inversion Based on Machine Learning Method and LiDAR Data as an Example

GUO Xuebing1,2,*, ZHU Xiaojie3, TANG Xinzhai1, YANG Gang3, HOU Yanfei1,2, HE Honglin1,2

  1. Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  2. National Ecosystem Science Data Center, Beijing 100101, China
  3. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2024-02-29 Online: 2024-08-20 Published: 2024-08-20
  • Corresponding author: *GUO Xuebing (E-mail: guoxb@igsnrr.ac.cn)
  • About the author: GUO Xuebing is a senior engineer at the Key Laboratory of Ecosystem Network Observation and Modeling, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences. Her research interest is eco-informatics.
    She is responsible for writing the paper and for the overall design, development, and coordination of the πFlow-based LiDAR data analysis and processing integration system.
    E-mail: guoxb@igsnrr.ac.cn
  • Funding:
    National Key Research and Development Program of China (2022YFF1300100)

Abstract:

[Background] Light Detection and Ranging (LiDAR) data are widely used in analyzing and utilizing forest resources. Researchers have developed many specialized algorithm models involving big data management and artificial intelligence, but most of these models remain scattered among individual researchers, and there is still no modern information platform that integrates them. [Methods] The big data pipeline system πFlow offers both big data management and big data algorithm integration, and lets users build, schedule, and run pipelines in a WYSIWYG (what you see is what you get) manner. This makes it suitable for integrating complex algorithm models for LiDAR data, and the resulting pipelines can be customized and reused. [Contents] This paper introduces the characteristics and functions of πFlow. Taking tree crown segmentation based on LiDAR canopy height model (CHM) data and tree biomass estimation using machine learning as an example, it presents the methods and techniques for integrating algorithms into πFlow and constructing a LiDAR data analysis and processing pipeline, and reports a test run of the pipeline. [Results] The reproducible information platform built with πFlow can support rapid biomass inversion from LiDAR data across a network of field observation stations, and offers a new technical approach for integrating data-intensive, domain-specific data processing algorithm models.

Key words: big data pipeline, algorithm model integration, LiDAR, machine learning, random forest, πFlow