Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (4): 86-100.
CSTR: 32002.14.jfdc.CN10-1649/TP.2023.04.008
doi: 10.11871/jfdc.issn.2096-742X.2023.04.008
Received: 2022-01-20
Online: 2023-08-20
Published: 2023-08-23
Corresponding author: *WANG Yang (E-mail: )
About the first author: LI JunFei, Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; master's student. Research interests: natural language processing and citation analysis.
Funding:
LI JunFei 1,2, XU LiMing 1,2, WANG Yang 1,2,*, WEI Xin 1
Abstract:
[Objective] Citation classification for scientific and technical literature underpins tasks such as academic-impact evaluation and literature retrieval and recommendation. With the development of deep neural networks and pretrained language models, research on citation classification has made substantial progress, and many deep-learning-based classification methods, models, and datasets have been proposed. However, a comprehensive survey of existing methods and recent trends is still lacking, and this paper explores that gap. [Methods] This paper reviews deep-learning-based citation classification models and datasets, compares and analyzes the classification performance of different models, summarizes the strengths and weaknesses of each model, and surveys citation classification techniques; it also discusses future directions and offers suggestions. [Results] Pretrained language models learn global semantic representations effectively; they mitigate the low training efficiency of RNNs (Recurrent Neural Networks) and the limited span over which CNNs (Convolutional Neural Networks) capture sequential dependencies in text, and they significantly improve classification accuracy. [Limitations] This paper focuses on the progress of citation classification techniques and does not attempt a comprehensive forecast of future technical directions.
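As a toy illustration of the citation intent classification task surveyed here, the sketch below assigns a SciCite-style label (Background / Method / Result) to a citation context using hand-written cue phrases, in the spirit of the early cue-phrase methods that preceded deep learning. The cue lists and example sentences are hypothetical, not drawn from any surveyed system or dataset.

```python
# Toy cue-phrase citation intent classifier (illustrative only; the cue
# lists below are hypothetical, not from any surveyed model or dataset).
LABELS = ["Background", "Method", "Result"]

CUES = {
    "Background": ["has been studied", "is known to", "prior work"],
    "Method": ["we adopt", "we use", "following the approach"],
    "Result": ["outperforms", "achieves", "improves over"],
}

def classify(citation_context: str) -> str:
    """Return the label whose cue phrases match the context most often."""
    text = citation_context.lower()
    scores = {lab: sum(cue in text for cue in cues) for lab, cues in CUES.items()}
    # Ties fall back to Background, the majority class in most datasets (Table 1).
    return max(LABELS, key=lambda lab: (scores[lab], lab == "Background"))

print(classify("We adopt the segmentation model of [12]."))  # prints: Method
```

The deep models covered in this survey replace the hand-written cues with learned representations of the citation context, but the input/output contract (citation context in, intent label out) is the same.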
LI JunFei, XU LiMing, WANG Yang, WEI Xin. Review of Automatic Citation Classification Based on Deep Learning Technology[J]. Frontiers of Data and Computing, 2023, 5(4): 86-100, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.04.008.
Table 1
Citation classification schemes for scientific and technical literature

Dataset | Samples | Labels (proportion) |
---|---|---|
Teufel et al. (2006b) | 2829 | Weak (3.1%), CoCoGM (3.9%), CoCoR0 (0.8%), CoCo (1.0%), CoCoXY (2.9%), PBas (1.5%), PUse (15.8%), PModi (1.6%), PMot (2.2%), PSim (3.8%), PSup (1.1%), Neut (62.7%) |
Ulrich (2011) | 1768 | Idea (23.80%), Basis (7.18%), Background (65.04%), Compare (3.95%) |
Li et al. (2013) | 6355 | Based on (2.8%), Corroboration (3.6%), Discover (12.3%), Positive (0.1%), Significant (0.6%), Standard (0.2%), Supply (1.2%), Contrast (0.6%), Co-citation (33.3%) |
Hernandez-Alvarez et al. (2016) | 2120 | Use (49.8%), Background (37.4%), Comparison (5.3%), Critique (7.8%) |
Matthew et al. (2018) | 3083 | Background (51.8%), Uses (18.5%), Compares (17.5%), Motivation (4.9%), Continuation (3.7%), Future (3.6%) |
Cohan et al. (2019) | 11020 | Background (58%), Method (29%), Result (13%) |
Zhu et al. (2015) | 3143 | Influential, Non-influential |
Valenzuela et al. (2015) | 450 | Important, Incidental |
Jha et al. (2016) | 3271 | Criticizing (16.3%), Comparison (8.1%), Use (18.0%), Substantiating (8%), Basis (5.3%), Neutral (44.3%) |
Table 4
Classification performance of CNN-based models

Model | Precision (%) | Recall (%) | F1 (%) | Dataset |
---|---|---|---|---|
CNN General emb | 79.9 | 68.2 | 73.6 | Jha et al. (2016) |
CNN CORE emb | 80.8 | 68.8 | 74.3 | |
CNN ACL emb | 76.7 | 68.4 | 72.3 | |
SciBERT-BiGRU-Multi-CNN | 84.68 | 81.59 | 83.11 | |
SciBERT-Multi-BiGRU-CNN-Attention | 85.58 | 82.75 | 84.14 | |
SciBERT-BiGRU-Multi-CNN-Attention | 86.67 | 83.24 | 84.92 | |
Table 5
Classification performance of RNN-based models

Model | Samples | Label type | Labels (proportion) | F1 (%) |
---|---|---|---|---|
LSTMs | 3422 | single-label | Background (30.5%), Method (23.9%), Results/findings (45.3%), Don't know (0.1%) | 66.42 |
LSTMs + Global Attention | | | | 68.61 |
BiLSTMs | | | | 67.88 |
BiLSTMs + Global Attention | | | | 68.61 |
BiLSTM-Attn | 11020 | single-label | Background (58%), Method (29%), Result comparison (13%) | 77.2 |
BiLSTM-Attn w/ ELMo | | | | 82.6 |
BiLSTM-Attn + section title scaffold | | | | 77.8 |
BiLSTM-Attn + citation worthiness scaffold | | | | 78.1 |
BiLSTM-Attn + both scaffolds | | | | 79.1 |
BiLSTM-Attn w/ ELMo + both scaffolds | | | | 84 |
Table 6
Classification performance of pretrained language models

Model | ACL-ARC Precision (%) | ACL-ARC Recall (%) | ACL-ARC F1 (%) | SciCite Precision (%) | SciCite Recall (%) | SciCite F1 (%) |
---|---|---|---|---|---|---|
BERT-KMeans | * | * | * | 81 | 82 | 81 |
BERT-HDBSCAN | * | * | * | 77 | 79 | 78 |
BASE-BERT | * | * | 63.91 | * | * | 84.85 |
ELMo | * | * | 67.9 | * | * | 84 |
ALBERT | * | * | * | * | * | 82.86 |
SciBERT | * | * | 70.98 | * | * | 85.49 |
XLNet | * | * | * | * | * | 88.93 |
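The Precision, Recall, and F1 columns in Tables 4-6 are the standard classification metrics, typically macro-averaged over the citation labels (a common choice given the label imbalance shown in Table 1, though individual papers vary in their averaging scheme). A minimal sketch of the macro-averaged computation, under that assumption:

```python
def macro_prf1(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1 over the given label set."""
    per_label = []
    for lab in labels:
        # Per-label counts of true positives, false positives, false negatives.
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label.append((prec, rec, f1))
    n = len(labels)  # unweighted mean: every label counts equally
    return tuple(sum(m[i] for m in per_label) / n for i in range(3))
```

Because each label contributes equally regardless of its frequency, macro averaging penalizes models that ignore rare classes such as Motivation or Future in Table 1.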
[1] | WEI X, WANG Y. Research and Practice on Evaluation System of Science and Technology Competitiveness[J]. Frontiers of Data and Computing, 2021, 3(1): 74-67. |
[2] | WANG S W, XU Y J, CHEN Y P, et al. Influence Mechanism of Code-Sharing on Paper Citations: An Empirical Analysis on Computer Science Field[J]. Frontiers of Data and Computing, 2021, 3(2): 93-102. |
[3] | HJORLAND B, NIELSEN L K. Subject Access Points in Electronic Retrieval[J]. Annual Review of Information Science and Technology (ARIST), 2001, 35: 249-298. |
[4] | HIRSCH J E. An index to quantify an individual's scientific research output[J]. Proceedings of the National Academy of Sciences of the United States of America, 2005, 102(46): 16569-16572. |
[5] | CHEN Yunwei. A review of metric methods for science and technology evaluation[J]. Journal of Library and Information Science in Agriculture, 2020, 32(8): 8. |
[6] | VOOS H, DAGAEV K S. Are All Citations Equal? Or, Did We Op. Cit. Your Idem?[J]. Journal of Academic Librarianship, 1976, 1(6): 19-21. |
[7] | HERLACH G. Can retrieval of information from citation indexes be simplified? Multiple mention of a reference as a characteristic of the link between cited and citing article[J]. Journal of the Association for Information Science & Technology, 2010, 29(6): 308-310. |
[8] | SMALL H G. Cited Documents as Concept Symbols[J]. Social Studies of Science, 1978, 8(3): 327-340. doi: 10.1177/030631277800800305 |
[9] | GARFIELD E. Can citation indexing be automated[C]// Statistical association methods for mechanized documentation, symposium proceedings, 1965, 269: 189-192. |
[10] | MORAVCSIK M J, MURUGESAN P. Some results on the function and quality of citations[J]. Social Studies of Science, 1975, 5(1): 86-92. doi: 10.1177/030631277500500106 |
[11] | TEUFEL S, SIDDHARTHAN A, TIDHAR D. Automatic classification of citation function[C]// Proceedings of the 2006 conference on empirical methods in natural language processing, 2006: 103-110. |
[12] | ULRICH S. Ensemble-style Self-training on Citation Classification[J]. Proceedings of IJCNLP, 2011: 623-631. |
[13] | LI X, HE Y, MEYERS A, et al. Towards fine-grained citation function classification[C]// Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 2013: 402-407. |
[14] | HERNANDEZ-ALVAREZ M, GOMEZ J M. Survey about citation context analysis: Tasks, techniques, and resources[J]. Natural Language Engineering, 2016, 22(3): 327-349. doi: 10.1017/S1351324915000388 |
[15] | PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv:1802.05365, 2018. |
[16] | COHAN A, AMMAR W, VAN ZUYLEN M, et al. Structural Scaffolds for Citation Intent Classification in Scientific Publications[C]// Proceedings of NAACL-HLT, 2019: 3586-3596. |
[17] | ZHU X, TURNEY P, LEMIRE D, et al. Measuring academic influence: Not all citations are equal[J]. Journal of the Association for Information Science and Technology, 2015, 66(2): 408-427. |
[18] | VALENZUELA M, HA V, ETZIONI O. Identifying meaningful citations[C]// Workshops at the twenty-ninth AAAI conference on artificial intelligence, 2015: 21-26. |
[19] | JHA R, JBARA A A, QAZVINIAN V, et al. NLP-driven citation analysis for scientometrics[J]. Natural Language Engineering, 2016, 1(PT.1): 1-38. doi: 10.1017/S1351324900000036 |
[20] | GARZONE M A. Automated classification of citations using linguistic semantic grammars[D]. The University of Western Ontario (Canada), 1997. |
[21] | NANBA H, KANDO N, OKUMURA M. Classification of research papers using citation links and citation types: Towards automatic review article generation[J]. Advances in Classification Research Online, 2000, 11(1): 117-134. |
[22] | PHAM S B, HOFFMANN A. A new approach for scientific citation classification using cue phrases[C]// Australasian Joint Conference on Artificial Intelligence, Springer, Berlin, Heidelberg, 2003: 759-771. |
[23] | COVER T, HART P. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-27. doi: 10.1109/TIT.1967.1053964 |
[24] | ANGROSH M A, CRANEFIELD S, STANGER N. Context identification of sentences in related work sections using a conditional random field: towards intelligent digital libraries[C]// Proceedings of the 10th annual joint conference on Digital libraries, 2010: 293-302. |
[25] | LAFFERTY J, MCCALLUM A, PEREIRA F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of ICML-01, 2001: 282-289. |
[26] | YIN Li, GUO Lu, LI Xufen. A citation classification model based on citation function and citation polarity[J]. Journal of Intelligence, 2018, 37(7): 139-145. |
[27] | CORTES C, VAPNIK V. Support-Vector Networks[J]. Machine Learning, 1995, 20(3): 273-297. |
[28] | BAI Han. Research on Bayesian classification based on weighted citations[D]. Nanjing: Nanjing University, 2016. |
[29] | LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-Based Learning Applied to Document Recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. doi: 10.1109/5.726791 |
[30] | ELMAN J L. Finding Structure in Time[J]. Cognitive Science, 1990, 14(2): 179-211. doi: 10.1207/s15516709cog1402_1 |
[31] | HOCHREITER S, SCHMIDHUBER J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735 |
[32] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need[J]. arXiv preprint arXiv:1706.03762, 2017. doi: 10.48550/arXiv.1706.03762 |
[33] | CHEN Y. Convolutional neural network for sentence classification[D]. University of Waterloo, 2015. |
[34] | GRAVE E. fastText[EB/OL]. https://github.com/facebookresearch/fastText. |
[35] | JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of Tricks for Efficient Text Classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017: 427-431. |
[36] | JOULIN A, GRAVE E, BOJANOWSKI P, et al. FastText.zip: Compressing text classification models[J]. arXiv preprint arXiv:1612.03651, 2016. |
[37] | YIN W, KANN K, YU M, et al. Comparative study of CNN and RNN for natural language processing[J]. arXiv preprint arXiv:1702.01923, 2017. |
[38] | LAUSCHER A, GLAVAŠ G, PONZETTO S P, et al. Investigating convolutional networks and domain-specific embeddings for semantic classification of citations[C]// Proceedings of the 6th international workshop on mining scientific publications, 2017: 24-28. |
[39] | ZHOU Wenyuan, WANG Mingyang, JING Yu. Automatic classification of citation sentiment and citation purpose based on an Attention-SBGMC model[J]. Data Analysis and Knowledge Discovery, 2021, 5(12): 12. |
[40] | CHO K, VAN MERRIËNBOER B, BAHDANAU D, et al. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches[C]// Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014: 103-111. |
[41] | SUTSKEVER I, VINYALS O, LE Q V. Sequence to Sequence Learning with Neural Networks[C]// Advances in Neural Information Processing Systems, 2014: 3104-3112. |
[42] | BOWMAN S, ANGELI G, POTTS C, et al. A large annotated corpus for learning natural language inference[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015: 632-642. |
[43] | MUNKHDALAI T, LALOR J P, YU H. Citation analysis with neural attention models[C]// Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 2016: 69-77. |
[44] | HASSAN S U, IMRAN M, IQBAL S, et al. Deep context of citations using machine-learning models in scholarly full-text articles[J]. Scientometrics, 2018, 117(3): 1645-1662. doi: 10.1007/s11192-018-2944-y |
[45] | BREIMAN L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32. doi: 10.1023/A:1010933404324 |
[46] | PRESTER J, WAGNER G, SCHRYEN G, et al. Classifying the ideational impact of information systems review articles: A content-enriched deep learning approach[J]. Decision Support Systems, 2021, 140: 113432. doi: 10.1016/j.dss.2020.113432 |
[47] | PENNINGTON J, SOCHER R, MANNING C D. GloVe: Global vectors for word representation[C]// Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014: 1532-1543. |
[48] | JURGENS D, KUMAR S, HOOVER R, et al. Measuring the Evolution of a Scientific Field through Citation Frames[J]. Transactions of the Association for Computational Linguistics, 2018, 6: 391-406. |
[49] | NICHOLSON J M, MORDAUNT M, LOPEZ P, et al. scite: a smart citation index that displays the context of citations and classifies their intent using deep learning[J]. Quantitative Science Studies, 2021, 2(3): 882-898. doi: 10.1162/qss_a_00146 |
[50] | BELTAGY I, LO K, COHAN A. SciBERT: A Pretrained Language Model for Scientific Text[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019: 3615-3620. |
[51] | GE Y, DINH L, LIU X, et al. BACO: A Background Knowledge- and Content-Based Framework for Citing Sentence Generation[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021: 1466-1478. |
[52] | LAN Z, CHEN M, GOODMAN S, et al. ALBERT: A lite BERT for self-supervised learning of language representations[J]. arXiv preprint arXiv:1909.11942, 2019. |
[53] | YANG Z, DAI Z, YANG Y, et al. XLNet: generalized autoregressive pretraining for language understanding[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 5753-5763. |
[54] | ZHUANG L, WAYNE L, YA S, et al. A Robustly Optimized BERT Pre-training Approach with Post-training[C]// Proceedings of the 20th Chinese National Conference on Computational Linguistics, 2021: 1218-1227. |
[55] | LIU G H, YANG J Y. Image retrieval based on the texton co-occurrence matrix[J]. Pattern Recognition, 2008, 41(12): 3521-3527. doi: 10.1016/j.patcog.2008.06.010 |
[56] | ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]// KDD, 1996, 96(34): 226-231. |
[57] | DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of NAACL-HLT, 2019: 4171-4186. |
[58] | LI B, ZHU Z, THOMAS G, et al. How is BERT surprised? Layerwise detection of linguistic anomalies[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021: 4215-4228. |
[59] | TUAROB S, KANG S W, WETTAYAKORN P, et al. Automatic classification of algorithm citation functions in scientific literature[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 32(10): 1881-1896. |
[60] | MERCIER D, RIZVI S T R, RAJASHEKAR V, et al. ImpactCite: An XLNet-based method for Citation Impact Analysis[J]. arXiv preprint arXiv:2005.06611, 2020. |
[61] | ROMAN M, SHAHID A, KHAN S, et al. Citation intent classification using word embedding[J]. IEEE Access, 2021, 9: 9982-9995. doi: 10.1109/ACCESS.2021.3050547 |
[62] | CHEN H, NGUYEN H. Fine-tuning Pre-trained Contextual Embeddings for Citation Content Analysis in Scholarly Publication[J]. arXiv preprint arXiv:2009.05836, 2020. |
[63] | LEI T. When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021: 7633-7648. |
[64] | ULRICH S. Ensemble-style Self-training on Citation Classification[C]// Proceedings of 5th International Joint Conference on Natural Language Processing, 2011: 623-631. |
[65] | JHA R, JBARA A A, QAZVINIAN V, et al. NLP-driven citation analysis for scientometrics[J]. Natural Language Engineering, 2017, 23(1): 93-130. doi: 10.1017/S1351324915000443 |
[66] | LAUSCHER A, KO B, KUEHL B, et al. MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting[J]. arXiv preprint arXiv:2107.00414, 2021. |
[67] | XIE Y, SUN Y, BERTINO E. Learning domain semantics and cross-domain correlations for paper recommendation[C]// Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021: 706-715. |
[1] | JU Jiaji, HUANG Bo, ZHANG Shuai, GUO Ruyan. A Dual-Channel Sentiment Analysis Model Integrating a Sentiment Lexicon and Self-Attention[J]. Frontiers of Data and Computing, 2023, 5(4): 101-111. |
[2] | ZHAO Yueyang, CUI Lei. Advances in the Research and Application of Text Embedding Techniques[J]. Frontiers of Data and Computing, 2023, 5(3): 92-110. |
[3] | LI Yan, HE Hongbo, WANG Runqiang. A Survey of Weibo Popularity Prediction[J]. Frontiers of Data and Computing, 2023, 5(2): 119-135. |
[4] | LIU Yunfan, LI Qi, SUN Zhenan, TAN Tieniu. A Survey of GAN-Based Face Age Editing Methods[J]. Frontiers of Data and Computing, 2023, 5(2): 2-23. |
[5] | TU Youyou, ZHENG Qijing, ZHAO Jin. Deep-Learning Study of Quantized Proton-Coupled Charge Transfer at Molecule/Solid Interfaces[J]. Frontiers of Data and Computing, 2023, 5(2): 37-49. |
[6] | XU Songyuan, LIU Feng. ESDRec: A Data Recommendation Model for Earth Big Data Platforms[J]. Frontiers of Data and Computing, 2023, 5(1): 55-64. |
[7] | LI Dongwen, ZHONG Zhenyu, SHEN Junyu, WANG Haotian, SUN Yufei, ZHANG Yuzhi. NKCorpus: Building a Large High-Quality Chinese Dataset from Massive Web Data[J]. Frontiers of Data and Computing, 2022, 4(3): 30-45. |
[8] | CHEN Qiong, YANG Yong, HUANG Tianlin, FENG Yuan. A Survey of Few-Shot Semantic Image Segmentation[J]. Frontiers of Data and Computing, 2021, 3(6): 17-34. |
[9] | PU Xiaorong, HUANG Jiaxin, LIU Junchi, SUN Jiayu, LUO Jixiang, ZHAO Yue, CHEN Kecheng, REN Yazhou. A Survey of CT Image Denoising for Clinical Needs[J]. Frontiers of Data and Computing, 2021, 3(6): 35-49. |
[10] | HE Tao, WANG Guifang, MA Tingcan. Discovering Interdisciplinary Research Content via Semantic Anomalies in Word Embeddings[J]. Frontiers of Data and Computing, 2021, 3(6): 50-59. |
[11] | ZHANG Yining, HE Hongbo, WANG Runqiang. A Survey of Popular Digital Audio Prediction Techniques[J]. Frontiers of Data and Computing, 2021, 3(4): 81-92. |
[12] | CHEN Zijian, LI Jun, YUE Zhaojuan, ZHAO Zefang. A Hybrid Recommendation Model Based on Autoencoders and Attribute Information[J]. Frontiers of Data and Computing, 2021, 3(3): 148-155. |
[13] | XIAO Jianping, LONG Chun, ZHAO Jing, WEI Jinxia, HU Anlei, DU Guanyao. A Survey of Deep-Learning-Based Network Intrusion Detection[J]. Frontiers of Data and Computing, 2021, 3(3): 59-74. |
[14] | LI Xu, LIAN Yifeng, ZHANG Haixia, HUANG Kezhen. Key Technologies of Cybersecurity Knowledge Graphs[J]. Frontiers of Data and Computing, 2021, 3(3): 9-18. |
[15] | ZHAO Weiyu, ZHANG Honghai, ZHONG Bo. A Deep-Learning Method for Land Parcel Segmentation in Remote Sensing Images[J]. Frontiers of Data and Computing, 2021, 3(2): 133-141. |