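Data Classification of the Sustainable Development Goals Based on Unsupervised Learning

Frontiers of Data and Computing, 2021, Vol. 3, Issue 4: 104-115.

doi: 10.11871/jfdc.issn.2096-742X.2021.04.009

LEI Sheng^1,2, LI Jianhui^1,*, ZHANG Lili^1

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2021-02-04 Online: 2021-08-20 Published: 2021-08-30
Corresponding author: LI Jianhui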
About the authors:
LEI Sheng is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. Her research interests include natural language processing, machine learning, and unsupervised learning. In this paper, she is responsible for data processing, model design and experiments, and writing the paper. E-mail: leisheng@cnic.cn
LI Jianhui, PhD, is a researcher and doctoral supervisor at the Computer Network Information Center, Chinese Academy of Sciences. He has published more than 80 papers. His main research interests are data-intensive computing and applications, open sharing of big data resources, and big data mining and applications. In this paper, he is responsible for guidance on the classification algorithm framework and on the experiments. E-mail: lijh@cnic.cn
ZHANG Lili is a senior engineer at the Computer Network Information Center, Chinese Academy of Sciences. Her main research interests are open science, open data technology policy, and information economics. In this paper, she is responsible for guidance on the model and on the experiments. E-mail: zhll@cnic.cn
Funding:
Strategic Priority Research Program of the Chinese Academy of Sciences, Category A (XDA19020104)

Abstract

[Objective] The Sustainable Development Goals (SDGs), proposed by the United Nations in 2015 to guide global development from 2015 to 2030, cover massive amounts of data across three dimensions: society, economy, and environment. Because SDGs data are large in volume, sparsely labeled, and hard to find and use, this paper aims to classify them without supervision. [Methods] We first extract category descriptions from the SDGs metadata set with a keyword extraction algorithm that combines TextRank and relative word frequency, then classify SDGs data with an unsupervised text classification algorithm based on word vectors. [Results] Experiments on the official SDGs database provided by the United Nations show that the proposed model reaches an F1-micro score of 0.813, a relative improvement of 33% over SeedBTM and of 39% and 52% over STM and DescLDA, two models that are not well suited to short-text classification. Compared with classification using keywords extracted by TF-IDF or by TextRank alone, the F1-micro score improves by 7% and 25%, respectively. [Conclusions] The proposed keyword extraction method based on TextRank and relative word frequency is effective, and the proposed word-vector-based unsupervised classification method outperforms current mainstream topic-model algorithms.

Key words: Sustainable Development Goals, unsupervised learning, extraction, text classification
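
The abstract does not spell out how TextRank and relative word frequency are combined, so the following is only a minimal sketch under stated assumptions: TextRank is approximated by PageRank over a sliding-window co-occurrence graph, relative frequency is taken as the share of a word's corpus-wide occurrences that fall in one category, and the two scores are multiplied. The function names, the window size, the multiplication, and the toy token lists are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def textrank_scores(tokens, window=2, damping=0.85, iterations=50):
    """PageRank over an undirected co-occurrence graph (a simplified TextRank)."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word collects an equal share of every neighbor's previous score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


def relative_frequency(category_tokens, all_tokens):
    """Assumed definition: count in this category / count in the whole corpus."""
    in_category = Counter(category_tokens)
    overall = Counter(all_tokens)
    return {word: in_category[word] / overall[word] for word in in_category}


def extract_keywords(docs_by_category, top_k=3):
    """Rank each category's words by TextRank score times relative frequency."""
    all_tokens = [t for tokens in docs_by_category.values() for t in tokens]
    keywords = {}
    for category, tokens in docs_by_category.items():
        rank = textrank_scores(tokens)
        rel = relative_frequency(tokens, all_tokens)
        combined = {word: rank[word] * rel[word] for word in rank}
        ordered = sorted(combined, key=lambda w: (-combined[w], w))
        keywords[category] = ordered[:top_k]
    return keywords


# Hypothetical, already tokenized metadata text for two SDG categories.
toy_docs = {
    1: "poverty line poor household poverty indicator data".split(),
    6: "water sanitation service water indicator data".split(),
}
keywords = extract_keywords(toy_docs)
print(keywords)
```

In this toy run, words such as "indicator" that occur in every category are down-weighted by the relative-frequency factor, which is the intuition behind adding it to a purely graph-based ranking.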

LEI Sheng, LI Jianhui, ZHANG Lili. Data Classification of the Sustainable Development Goals Based on Unsupervised Learning[J]. Frontiers of Data and Computing, 2021, 3(4): 104-115.

Category	Keyword set
1	poverty, labor, poor, service, disaster
2	food, agriculture, breed, moderate, hunger
3	healthy, vaccine, mortality, disease, infection, alcohol
4	education, teacher, numeracy, school, parity, child, skill
5	woman, proxy, gender, ownership, care
6	water, sanitation, wastewater, basin, procedure
7	energy, technology, electricity, fuel
8	employment, earn, engage, gdp, violation, labour
9	industry, manufacture, establishment, transport, mobile
10	migration, cost, income, flow, transfer
11	city, disaster, pixel, urban, space
12	policy, material, convention, fossil, waste
13	reduction, risk, climate
14	fish, marine, fishery, ocean, sustainable, ph
15	specie, wildlife, forest, biodiversity
16	right, develop, victim, traffic, chamber
17	development, least, worldwide, broadband, statistical, partnership
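
Keyword sets like those in the table above can serve as category descriptions for a word-vector classifier. The sketch below shows one simple way this could work, assuming each data description and each keyword set is represented by the mean of its word vectors and the category with the highest cosine similarity wins. This is an illustration, not the paper's algorithm: the 3-dimensional `TOY_VECTORS` are made-up stand-ins for pretrained embeddings such as GloVe, and the function names are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real run would load pretrained word vectors.
TOY_VECTORS = {
    "poverty": [0.9, 0.1, 0.0],
    "poor": [0.8, 0.2, 0.0],
    "water": [0.0, 0.9, 0.1],
    "sanitation": [0.1, 0.8, 0.1],
    "drinking": [0.0, 0.8, 0.2],
    "energy": [0.0, 0.1, 0.9],
    "electricity": [0.1, 0.0, 0.9],
    "population": [0.3, 0.3, 0.3],
    "access": [0.2, 0.3, 0.4],
}

# A subset of the keyword sets from the table above.
CATEGORY_KEYWORDS = {
    1: ["poverty", "poor"],
    6: ["water", "sanitation"],
    7: ["energy", "electricity"],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(description, category_keywords, vectors):
    """Return the category whose keyword centroid is closest to the text."""
    doc_vec = mean_vector(description.lower().split(), vectors)
    if doc_vec is None:
        return None
    scores = {}
    for category, words in category_keywords.items():
        centroid = mean_vector(words, vectors)
        if centroid is not None:
            scores[category] = cosine(doc_vec, centroid)
    return max(scores, key=scores.get) if scores else None


label_energy = classify("population with access to electricity",
                        CATEGORY_KEYWORDS, TOY_VECTORS)
label_water = classify("population using safely managed drinking water",
                       CATEGORY_KEYWORDS, TOY_VECTORS)
print(label_energy, label_water)
```

Because no labeled training data is involved, the quality of such a classifier depends entirely on the keyword sets and the embeddings, which is consistent with the paper's emphasis on the keyword extraction step.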

Model	F1-micro	F1-macro	Recall	Precision
SeedBTM	0.612	0.576	0.552	0.603
STM	0.583	0.558	0.531	0.589
DescLDA	0.535	0.495	0.486	0.505
SeedETC-glove (TFIDF)	0.757	0.729	0.721	0.740
SeedETC-glove (textrank)	0.651	0.613	0.598	0.625
SeedETC-glove (textrank+RF)	0.813	0.772	0.768	0.782
SeedETC-bert (textrank+RF)	0.770	0.736	0.719	0.753
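
The percentage gains quoted in the abstract appear to be relative improvements in F1-micro rather than differences in percentage points. A short check, using only the F1-micro values from the table above (the helper name `relative_gain` is ours), reproduces them:

```python
# F1-micro scores copied from the results table.
baselines = {
    "SeedBTM": 0.612,
    "STM": 0.583,
    "DescLDA": 0.535,
    "SeedETC-glove (TFIDF)": 0.757,
    "SeedETC-glove (textrank)": 0.651,
}
best = 0.813  # SeedETC-glove (textrank+RF)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Round to whole percentages, as the abstract reports them.
gains = {name: round(100 * relative_gain(best, score))
         for name, score in baselines.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The rounded values match the 33%, 39%, 52%, 7%, and 25% reported in the abstract, whereas the absolute differences would be roughly 20, 23, 28, 6, and 16 points.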

Category	Metadata set: documents	Metadata set: average length (words)	Database: data items	Database: average description length (words)
1	15	295	61	10
2	15	258	53	8
3	28	367	58	9
4	13	383	110	14
5	15	235	45	21
6	12	539	90	13
7	6	334	6	10
8	16	295	58	8
9	12	260	30	7
10	14	364	46	10
11	14	440	50	10
12	14	473	37	9
13	8	398	23	10
14	10	388	14	9
15	14	361	33	10
16	26	410	67	12
17	25	235	70	11
Total	257	355	851	11
