Frontiers of Data & Computing, 2021, Vol. 3, Issue (4): 104-115.

doi: 10.11871/jfdc.issn.2096-742X.2021.04.009

• Technology and Application •

Data Classification of the Sustainable Development Goals Based on Unsupervised Learning

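LEI Sheng1,2, LI Jianhui1,*, ZHANG Lili1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China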
  • Received: 2021-02-04 Online: 2021-08-20 Published: 2021-08-30
  • Corresponding author: LI Jianhui
  • About the authors:
    LEI Sheng is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. Her research interests include natural language processing, machine learning, and unsupervised learning. In this paper, she was responsible for data processing, model design and experiments, and writing the paper. E-mail: leisheng@cnic.cn
    LI Jianhui, Ph.D., is a researcher and doctoral supervisor at the Computer Network Information Center, Chinese Academy of Sciences. He has published more than 80 papers. His main research interests are data-intensive computing and applications, open sharing of big data resources, and big data mining and applications. In this paper, he was responsible for guidance on the classification algorithm framework and the experiments. E-mail: lijh@cnic.cn
    ZHANG Lili is a senior engineer at the Computer Network Information Center, Chinese Academy of Sciences. Her main research interests are open science, open data technology policy, and information economics. In this paper, she was responsible for guidance on the model and the experiments. E-mail: zhll@cnic.cn
  • Funding:
    Strategic Priority Research Program of the Chinese Academy of Sciences, Category A (XDA19020104)

Abstract:

[Objective] The Sustainable Development Goals (SDGs), proposed by the United Nations in 2015 to guide global development from 2015 to 2030, involve massive amounts of data spanning three dimensions: society, economy, and environment. Because SDGs data are voluminous, rarely labeled, and hard to search and use, this paper aims to classify them without supervision. [Methods] We first extract category descriptions from the SDGs metadata set with a keyword extraction algorithm that combines TextRank and relative word frequency, and then classify the SDGs data with an unsupervised text classification algorithm based on word vectors. [Results] Experiments on the official SDGs database provided by the United Nations show that the proposed model reaches an F1-micro score of 0.813, which is 33% higher than SeedBTM and 39% and 52% higher than STM and DescLDA, respectively, two models that are not well suited to short-text classification. Classification with keywords extracted by our method also yields F1-micro scores 7% and 25% higher than classification with keywords extracted by TF-IDF and TextRank, respectively. [Conclusions] The proposed keyword extraction method based on TextRank and relative word frequency is effective, and the proposed word-vector-based unsupervised classification method outperforms current mainstream topic model algorithms.

Key words: Sustainable Development Goals, unsupervised learning, text classification, extraction
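
The abstract does not spell out the classification algorithm, so the following is only a minimal sketch of the general idea of word-vector-based unsupervised classification, not the authors' implementation. It assumes that keywords for each category have already been extracted, represents a text and each category as the mean of their word vectors, and assigns the text to the category with the highest cosine similarity. The three-dimensional vectors, the keyword lists, the example texts, and all function names are hypothetical and chosen purely for illustration; a real system would load pretrained embeddings. A single-label micro-averaged F1 is included to show how an F1-micro score could be computed.

```python
import math

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
TOY_VECTORS = {
    "poverty": [0.9, 0.1, 0.0], "income": [0.8, 0.2, 0.1], "wage": [0.9, 0.0, 0.1],
    "health": [0.1, 0.9, 0.1], "disease": [0.0, 0.8, 0.2], "mortality": [0.1, 0.9, 0.0],
    "climate": [0.1, 0.1, 0.9], "emissions": [0.0, 0.2, 0.8], "temperature": [0.0, 0.1, 0.9],
}

# Hypothetical category keywords (assumed to come from a keyword extraction step).
CATEGORY_KEYWORDS = {
    "SDG 1": ["poverty", "income"],
    "SDG 3": ["health", "disease"],
    "SDG 13": ["climate", "emissions"],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text, category_keywords, vectors):
    """Return the category whose keyword centroid is most similar to the text."""
    doc_vec = mean_vector(text.lower().split(), vectors)
    if doc_vec is None:
        return None  # no known word, so no prediction
    scores = {}
    for category, keywords in category_keywords.items():
        cat_vec = mean_vector(keywords, vectors)
        if cat_vec is not None:
            scores[category] = cosine(doc_vec, cat_vec)
    return max(scores, key=scores.get) if scores else None


def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label data; a None prediction is a miss."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == g)
    fp = sum(1 for g, p in zip(gold, predicted) if p is not None and p != g)
    fn = sum(1 for g, p in zip(gold, predicted) if p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


texts = [
    "wage and income statistics",
    "disease mortality rate",
    "temperature and emissions records",
]
gold = ["SDG 1", "SDG 3", "SDG 13"]
preds = [classify(t, CATEGORY_KEYWORDS, TOY_VECTORS) for t in texts]
print(preds, micro_f1(gold, preds))
```

Averaging word vectors is the simplest way to embed a short text; it ignores word order and weights every known word equally, which is one reason the quality of the extracted category keywords matters for this kind of classifier.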