Frontiers of Data and Computing, 2020, Vol. 2, Issue 2: 145-154.

doi: 10.11871/jfdc.issn.2096-742X.2020.02.012

Special topic: Special Issue on "Data Analysis Technology and Applications"

• Technology and Applications •

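A Data Prediction Method Based on Feature Selection and Transfer Learning

Chen Tongbao1,2, Wen Liangming1,2, Li Jianhui1,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-01-29 Online: 2020-04-20 Published: 2020-06-03
  • Corresponding author: Li Jianhui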
  • About the authors:
    Chen Tongbao is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include recommendation technology, big data mining, and natural language processing. In this paper, he is mainly responsible for model design, experimental data analysis, and article writing. E-mail: chentongbao@cnic.cn
    Wen Liangming is a Ph.D. student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include scientific data sharing and data asset management. In this paper, he took part in revising the article. E-mail: wenliangming@cnic.cn
    Li Jianhui is a research fellow and Ph.D. supervisor at the Computer Network Information Center, Chinese Academy of Sciences. His research interests include open sharing of big data resources, big data management technology, and big data computing and analysis technology. In this paper, he designed the overall structure of the article and supervised the research.
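  • Supported by:
    Strategic Priority Research Program of the Chinese Academy of Sciences (Category A), sub-project "Big Data Resource Repository and Portal System" (XDA19020104)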

Abstract:

[Objective] The Sustainable Development Goals (SDGs) have become the world's most important sustainable development issue. However, the high rate of missing data for SDG indicators seriously hampers the UN's ability to monitor each country's progress toward the goals. Completing the missing SDG data is technically challenging and matters greatly for urging countries to achieve the goals. [Methods] This paper proposes TLM, a transfer learning method that uses the maximal information coefficient (MIC) for feature selection. TLM constructs features for the target variable from other public data and combines them with regression techniques to build a prediction model for the target variable's missing values. [Results] Taking the data set for SDG indicator 3.2.1 in a specific country as an example, TLM is used to predict and fill in the missing values of the target variable, which verifies the method's effectiveness. [Limitations] Because many factors influence SDG indicators, future work will focus on exploring additional correlation analysis methods that can be combined with TLM to predict missing values more accurately. [Conclusions] By combining MIC with transfer learning, TLM improves the accuracy of data prediction and can serve as a useful reference for researchers in SDG-related fields who deal with missing data.

Key words: sustainable development goals, transfer learning, regression, data missing, data completion methods
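
The pipeline outlined in the Methods part of the abstract (score candidate features drawn from other public data, keep the most relevant ones, then fit a regression to predict the target's missing values) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' TLM implementation: absolute Pearson correlation stands in for MIC, ordinary least squares stands in for the unspecified regression technique, and the names `pearson_abs`, `solve`, `impute`, and the toy series are hypothetical.

```python
import math


def pearson_abs(x, y):
    """Absolute Pearson correlation; a simple stand-in for MIC (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def impute(target, candidates, k=2):
    """Fill None entries of target using the k highest-scoring candidate series."""
    obs = [i for i, v in enumerate(target) if v is not None]
    y = [target[i] for i in obs]
    # Feature selection: score each candidate on the observed rows only.
    scores = {name: pearson_abs([s[i] for i in obs], y)
              for name, s in candidates.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    # Design matrix with an intercept column, one row per time step.
    rows = [[1.0] + [float(candidates[name][i]) for name in chosen]
            for i in range(len(target))]
    xo = [rows[i] for i in obs]
    p = len(chosen) + 1
    # Normal equations of least squares: (X^T X) w = X^T y.
    xtx = [[sum(r[a] * r[b] for r in xo) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * t for r, t in zip(xo, y)) for a in range(p)]
    w = solve(xtx, xty)
    filled = [v if v is not None else sum(wi * xi for wi, xi in zip(w, rows[i]))
              for i, v in enumerate(target)]
    return filled, chosen


# Toy data: the target follows 2 * a + 1 and has two missing entries.
target = [3.0, 5.0, None, 9.0, None, 13.0]
candidates = {"a": [1, 2, 3, 4, 5, 6], "noise": [5, 1, 4, 2, 3, 1]}
filled, chosen = impute(target, candidates, k=1)
print(chosen, [round(v, 3) for v in filled])
```

Scoring only the rows where the target is observed keeps the selection step from depending on the values being predicted. Replacing `pearson_abs` with an actual MIC estimator would bring the sketch closer to the approach the abstract describes, since MIC can also pick up nonlinear dependence.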