Frontiers of Data and Computing ›› 2020, Vol. 2 ›› Issue (2): 31-39.

doi: 10.11871/jfdc.issn.2096-742X.2020.02.003

Special topic: Special Issue on "Data Analysis Technology and Application"


IA: An Interactive Analysis Service Management Engine in Scientific Data Cloud

Meng Zhen1,2, Wang Xuezhi1,2, Xie Zhimin3, Hu Lianglin1,2, Chen Zhiduan2,4, Ma Juncai2,5, Tong Jizhou2,6, Zhang Yanling7,*, Zhou Yuanchun1,2,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Naval Military Marine Environment Construction Office, Beijing 100081, China
  4. Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
  5. Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
  6. National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
  7. Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, Henan 450001, China
  • Received: 2020-01-07 Online: 2020-04-20 Published: 2020-06-03
  • Contact: Zhang Yanling, Zhou Yuanchun
  • About the authors:
    Meng Zhen is a senior engineer and master's supervisor at the Computer Network Information Center, Chinese Academy of Sciences, and deputy director of the Data Resource and Application Laboratory in the Department of Big Data Technology and Application Development. Her research interests include the fusion management and linking of multi-source heterogeneous data, domain-oriented big data analysis models, and cloud service technology. She has published more than 20 papers indexed by SCI/EI.
    In this paper she is mainly responsible for the methods research and application of IA.
    E-mail: zhenm99@cnic.cn
    Zhou Yuanchun, Ph.D., is a research fellow and Ph.D. supervisor at the Computer Network Information Center, Chinese Academy of Sciences, and a Distinguished Research Fellow of the Chinese Academy of Sciences. He is assistant director of the center, chairman of its Degree Evaluation Committee, director of the Department of Big Data Technology and Application Development, secretary-general of the National and Local Joint Engineering Laboratory of Big Data Analysis and Computing Technology, and chief technologist of the State Tobacco Monopoly Administration's Major Project on Big Data for Tobacco Research. His research interests include cloud computing and big data analysis and processing. He has published more than 90 papers indexed by SCI/EI.
    In this paper he is mainly responsible for the overall architecture design of IA.
  • Supported by:
    Strategic Priority Research Program of the Chinese Academy of Sciences (XDB31000000); Major Science and Technology Project of China National Tobacco Corporation (110201901025SJ-04); Guangdong Provincial Key Laboratory of Biomedical Computing (2016B030301007); Key Deployment Project of the Center for Ocean Mega-Science, Chinese Academy of Sciences (COMS2019Q17); 13th Five-Year Informatization Program of the Chinese Academy of Sciences (XXH13504-03); 13th Five-Year Informatization Program of the Chinese Academy of Sciences (XXH13506-102)

Abstract:

[Objective] With the development of scientific big data technology, problem-oriented analysis on the data side has become the norm. Given the high cost of data migration and the reliance of analysis on scientific big data, it is necessary to run scientific data processing as a cloud service where the data reside, with secure user access, optional algorithm resource libraries, efficient data access interfaces, convenient user interaction tools, and scalable computing and storage resources. Scientists are then spared large-scale data migration and the work of adapting to programming languages, algorithm environments, version issues, and resource calls, which improves the efficiency of data analysis and exploration.

[Methods] This paper presents a container-based design for an interactive analysis service management engine in the scientific data cloud. Resource nodes, which can be physical or virtual hosts, are scaled out through automatic registration: when the resources in use reach a threshold, the management node starts the registration of a new resource node through an interface, and the node's available container instances are added to the pool. The optional algorithm resource libraries and the efficient access interfaces for data and computing resources are version-controlled as container images, which are selected when a resource pool is constructed. The health of the container instance pool is maintained inside each node, and the instance lifecycle is managed according to the maximum usage time and maximum silent time of each instance. Each container instance in the internal resource pool is in one of four states (preparing, ready, in use, or disappearing), and the pool is always kept at a fixed size. The system comprises a proxy component, an orchestration component, a user authentication component, a monitoring and management component, a buffer component, and a cache database. When an authenticated user accesses the system, resources are allocated according to the user's choice of domain algorithm library and the utilization of the resource pool, and a unique identity entry point (PID) is assigned for the user through proxy configuration. Over a secure, encrypted network connection, the user accesses an interactive programming component or an interactive application component and can thereby use the data and computing resources on the data side. Each interactive component runs in a separate container instance, which provides effective resource isolation.

[Results] Built on this engine, the interactive analysis cloud service system iAnalysis (IA for short) V1.0 provides a unified management service for cloud analysis resources on the scientific data side. It can be used directly by end-user scientists through the IA service portal, or called by other existing data systems through an API, with delivery in the form of docker containers. IA has progressively built cloud analysis services for scientific data in fields such as life and health, ecological environment, and meteorology and hydrology. It has been applied in major projects such as the Strategic Priority Research Program of the Chinese Academy of Sciences (both A and B) and the Major Project of the State Tobacco Monopoly Administration; in national scientific data centers such as the National Microbial Science Data Center and the National Space Science Data Center; and on public domain platforms such as GSCloud (www.gscloud.cn) and DarwinTree (www.darwintree.cn). It also provides commonly used tool services for R, TensorFlow, Data Science, All Spark, and others. Users can access the interactive programming component (iJupyter) or the interactive application component (iWorkflow) over https to use the data and computing resources of the data cloud.
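
The instance lifecycle described in the Methods part of the abstract can be illustrated with a minimal Python sketch. It is not the IA implementation: the class names `Instance` and `InstancePool`, the `mark_ready` hook, and the reading of "fixed size" as the total number of live instances are all assumptions made for illustration. Only the four states and the two expiry criteria (maximum usage time and maximum silent time) come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class State(Enum):
    # The four instance states named in the abstract.
    PREPARING = "preparing"
    READY = "ready"
    IN_USE = "in use"
    DISAPPEARING = "disappearing"


@dataclass
class Instance:
    instance_id: int
    state: State = State.PREPARING
    started_at: float = 0.0    # when the user session began
    last_active: float = 0.0   # last observed user activity


@dataclass
class InstancePool:
    size: int          # assumed meaning: number of live instances to maintain
    max_usage: float   # longest allowed session, in seconds
    max_idle: float    # longest allowed silence, in seconds
    instances: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def __post_init__(self):
        self._refill()

    def _refill(self):
        # Keep the pool at its fixed size by preparing replacements.
        while len(self.instances) < self.size:
            self.instances.append(Instance(next(self._ids)))

    def mark_ready(self):
        # Hypothetical hook: called once container start-up has finished.
        for inst in self.instances:
            if inst.state is State.PREPARING:
                inst.state = State.READY

    def acquire(self, now):
        # Hand a ready instance to a user, or return None if none is ready.
        for inst in self.instances:
            if inst.state is State.READY:
                inst.state = State.IN_USE
                inst.started_at = inst.last_active = now
                return inst
        return None

    def sweep(self, now):
        # Mark sessions that exceeded either limit as disappearing,
        # drop them from the pool, and prepare replacements.
        for inst in self.instances:
            if inst.state is State.IN_USE and (
                now - inst.started_at > self.max_usage
                or now - inst.last_active > self.max_idle
            ):
                inst.state = State.DISAPPEARING
        self.instances = [
            i for i in self.instances if i.state is not State.DISAPPEARING
        ]
        self._refill()
```

In this sketch the disappearing state is transient because removal happens in the same sweep; a real engine would presumably hold instances in that state while the container is torn down.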

Key words: analysis in data cloud, big data, container technology, IA, cloud services and management
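
The user access flow summarized in the abstract (select an algorithm library, scale out when utilization reaches a threshold, and expose a unique entry point through the proxy) could be sketched as follows. This is a hedged illustration, not IA's actual API: `AccessManager`, `register_node`, the image tags in `IMAGES`, the host `ia.example.org`, and the use of a random UUID as the unique identifier are all hypothetical.

```python
import uuid

# Hypothetical mapping from algorithm library choice to a versioned image tag.
IMAGES = {
    "R": "ia/r:1.0",
    "TensorFlow": "ia/tensorflow:1.0",
    "Data Science": "ia/datascience:1.0",
    "All Spark": "ia/all-spark:1.0",
}


class AccessManager:
    def __init__(self, capacity_per_node, threshold=0.8):
        self.capacity_per_node = capacity_per_node  # instances one node can host
        self.threshold = threshold                  # utilization that triggers scale-out
        self.nodes = 1
        self.routes = {}  # unique id -> (user, image): stand-in for proxy config

    def utilization(self):
        # Fraction of the total instance capacity currently routed to users.
        return len(self.routes) / (self.nodes * self.capacity_per_node)

    def register_node(self):
        # Stand-in for the management node calling the registration interface.
        self.nodes += 1

    def connect(self, user, library):
        image = IMAGES[library]  # raises KeyError for an unknown library
        if self.utilization() >= self.threshold:
            self.register_node()
        pid = uuid.uuid4().hex   # assumed form of the unique identifier
        self.routes[pid] = (user, image)
        return f"https://ia.example.org/user/{pid}/"


if __name__ == "__main__":
    manager = AccessManager(capacity_per_node=2, threshold=0.5)
    print(manager.connect("alice", "R"))
    print(manager.connect("bob", "TensorFlow"))
    print("nodes:", manager.nodes)
```

Checking utilization before each allocation means a new node is registered before the existing nodes are full, which matches the threshold-triggered scale-out the abstract describes.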