Frontiers of Data and Computing ›› 2021, Vol. 3 ›› Issue (4): 104-115.

doi: 10.11871/jfdc.issn.2096-742X.2021.04.009

• Technology and Application •

Data Classification of the Sustainable Development Goals Based on Unsupervised Learning

LEI Sheng1,2, LI Jianhui1,*, ZHANG Lili1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2021-02-04 Online: 2021-08-20 Published: 2021-08-30
  • Contact: LI Jianhui E-mail: leisheng@cnic.cn; lijh@cnic.cn; zhll@cnic.cn

Abstract:

[Objective] The Sustainable Development Goals (SDGs) were proposed by the United Nations in 2015 to guide global development from 2015 to 2030, and they involve massive amounts of data spanning three dimensions: society, economy, and environment. Because the vast amount of SDGs data is rarely labeled and hard to use, this paper classifies it with unsupervised models. [Methods] We first use a keyword extraction algorithm that combines TextRank with relative word frequency to extract category description information from the SDGs metadata set, and then develop an unsupervised text classification algorithm based on word vectors to classify SDGs data. [Results] Experiments on the official SDGs database provided by the United Nations show that the proposed method reaches an F1-micro score of 0.813, which is 33%, 39%, and 52% higher than the SeedBTM, STM, and DescLDA models, respectively. Keywords extracted by our method also outperform those extracted by TF-IDF and TextRank, yielding F1-micro scores that are 7% and 25% higher, respectively. [Conclusions] The proposed keyword extraction method based on TextRank and relative word frequency is effective. Compared with current mainstream topic model algorithms, the unsupervised classification method based on word vectors achieves better results.
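
The abstract does not spell out the classification step, so the following is only a minimal sketch of one common way to classify text with word vectors and category keywords: average the vectors of a document's words, average the vectors of each category's keywords, and pick the category with the highest cosine similarity. It is not the authors' algorithm. The names `TOY_VECTORS`, `CATEGORY_KEYWORDS`, `mean_vector`, `cosine`, and `classify` are hypothetical, and the three-dimensional vectors are made-up stand-ins for pretrained embeddings.

```python
import math

# Hypothetical toy embeddings (assumption): a real system would load
# pretrained word vectors instead of these hand-written values.
TOY_VECTORS = {
    "poverty": [0.90, 0.10, 0.00],
    "income":  [0.80, 0.20, 0.10],
    "wage":    [0.85, 0.15, 0.05],
    "forest":  [0.00, 0.90, 0.20],
    "species": [0.10, 0.80, 0.30],
    "habitat": [0.05, 0.85, 0.25],
}

# Hypothetical category descriptions (assumption): in the paper these
# keywords come from the keyword extraction step, not from a fixed list.
CATEGORY_KEYWORDS = {
    "SDG 1": ["poverty", "income"],
    "SDG 15": ["forest", "species"],
}


def mean_vector(words, vectors):
    """Average the vectors of all known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_words, category_keywords, vectors):
    """Return the label whose keyword centroid is closest to the document."""
    doc_vec = mean_vector(doc_words, vectors)
    if doc_vec is None:
        return None  # no usable words, so no prediction
    best_label, best_score = None, -1.0
    for label, keywords in category_keywords.items():
        cat_vec = mean_vector(keywords, vectors)
        if cat_vec is None:
            continue  # category has no keyword with a known vector
        score = cosine(doc_vec, cat_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    print(classify(["wage", "income"], CATEGORY_KEYWORDS, TOY_VECTORS))
    print(classify(["habitat"], CATEGORY_KEYWORDS, TOY_VECTORS))
```

No labeled training data is needed in this setup: the only supervision-like input is the keyword list per category, which is why the quality of the extracted keywords matters for the final score.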

Key words: Sustainable Development Goals, unsupervised learning, text classification, extraction