Frontiers of Data and Computing ›› 2020, Vol. 2 ›› Issue (4): 105-120.

doi: 10.11871/jfdc.issn.2096-742X.2020.04.009

Special topic: Next-Generation Internet Technologies and Applications

• Technology and Application •

Integration and Optimization of Material Data Mining and Machine Learning Tools

Dong Jiayuan1,2, Yang Xiaoyu1,2,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-03-25 Online: 2020-08-20 Published: 2020-09-10
  • Corresponding author: Yang Xiaoyu
  • About the authors:
    Dong Jiayuan is a master's student at the Computer Network Information Center, Chinese Academy of Sciences. His main research interest is materials informatics.
    In this paper he was responsible for the code implementation of the material data mining and machine learning tool and its related algorithms.
    E-mail: dongjiayuan@cnic.cn

    Prof. Xiaoyu Yang is a research fellow at the Computer Network Information Center, Chinese Academy of Sciences. His research interests include high-throughput materials computation and materials informatics.
    His contributions to this paper include proposing the idea and designing the architecture, the user interface, and the optimization methods for the related algorithms of the material data mining and machine learning tool.
    E-mail: kxy@cnic.cn

Abstract:

[Objective] Materials scientists face a high barrier to entry when applying machine learning. To address this, we develop Auto-Mat, a user-friendly and highly automated material data mining and machine learning module built on MatCloud. [Methods] We integrate existing data retrieval methods and machine learning algorithms from MatMiner and scikit-learn, and define a data dictionary for reading data from different computational materials databases. We also develop our own algorithms for feature selection and feature processing. [Results] The module provides a visual interface for interaction and display, and presents data in a unified format. The algorithms we developed each improve model performance to some degree. [Limitations] Data can currently be obtained only through the MatMiner API, and the related code is written to stay fully in step with that API, so extensibility is poor. In addition, the execution speed of some core algorithms needs to be improved. [Conclusions] By integrating this module with MatCloud, users get one-stop access to data from several mainstream databases such as Materials Project and can quickly build their own material data mining and machine learning workflows. A comparative analysis of two case studies at the end of the paper shows that the module helps lower the barrier for users to carry out material data mining and machine learning.

Key words: materials science, data mining, visual interactive interface, data aggregation, feature extraction, simulated annealing algorithm, MatCloud
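
The abstract names feature selection and a simulated annealing algorithm among the paper's topics but does not describe the procedure. The following is a minimal, generic sketch of simulated-annealing feature selection on top of scikit-learn, not the Auto-Mat implementation. The function names `subset_score` and `anneal_features`, the Ridge model, the cooling schedule, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def subset_score(X, y, mask, cv=5):
    """Mean cross-validated R^2 of a Ridge model on the selected columns."""
    if not mask.any():
        return -np.inf  # an empty feature set is never acceptable
    model = Ridge(alpha=1.0)  # assumed model; any scikit-learn estimator works
    return cross_val_score(model, X[:, mask], y, cv=cv, scoring="r2").mean()


def anneal_features(X, y, n_iter=200, t_start=0.1, t_end=0.001, seed=0):
    """Search for a feature subset by simulated annealing (illustrative)."""
    rng = random.Random(seed)
    n_features = X.shape[1]

    current = np.ones(n_features, dtype=bool)  # start from all features
    current_score = subset_score(X, y, current)
    best, best_score = current.copy(), current_score

    for step in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(n_iter - 1, 1))

        # Neighbor move: toggle one randomly chosen feature.
        candidate = current.copy()
        flip = rng.randrange(n_features)
        candidate[flip] = not candidate[flip]
        cand_score = subset_score(X, y, candidate)

        # Metropolis rule: always accept improvements, sometimes accept
        # worse subsets, with probability shrinking as temperature drops.
        delta = cand_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current.copy(), current_score

    return best, best_score


if __name__ == "__main__":
    # Synthetic stand-in for a table of material descriptors.
    X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                           noise=5.0, random_state=0)
    mask, score = anneal_features(X, y, n_iter=100)
    print("selected columns:", np.flatnonzero(mask))
    print("cross-validated R^2:", round(float(score), 3))
```

Because the search starts from the full feature set and only records strict improvements, the returned score is never lower than the all-features baseline under the same cross-validation split. In practice the scoring model and the temperature range would need tuning to the dataset at hand.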