数据与计算发展前沿 (Frontiers of Data & Computing), 2023, Vol. 5, Issue 4: 86-100.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.04.008

doi: 10.11871/jfdc.issn.2096-742X.2023.04.008

• Technology and Applications •

A Survey of Research on Citation Classification of Scientific and Technological Literature Based on Deep Learning

李俊飞 (1,2), 徐黎明 (1,2), 汪洋 (1,2,*), 魏鑫 (1)

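  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-01-20 Online: 2023-08-20 Published: 2023-08-23
  • Corresponding author: *WANG Yang (E-mail: wangyang@cnic.cn)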
  • About the authors: LI JunFei is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His main research fields are natural language processing and literature citation analysis.
    In this paper, he is mainly responsible for summarizing research progress in automatic citation classification and for writing the article.
    E-mail: lijunfei@cnic.cn
    WANG Yang, Ph.D., is a Senior Engineer, master's supervisor, and Director of the Information Development Strategy and Evaluation Center at the Computer Network Information Center, Chinese Academy of Sciences. His main research fields are information development strategy and big data analysis.
    In this paper, he is mainly responsible for designing the overall framework of the paper.
    E-mail: wangyang@cnic.cn
  • Funding:
    Chinese Academy of Sciences Situational Awareness Operation, Maintenance and Application Support Project (WX1450201-0105-02)

Review of Automatic Citation Classification Based on Deep Learning Technology

LI JunFei (1,2), XU LiMing (1,2), WANG Yang (1,2,*), WEI Xin (1)

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-01-20 Online: 2023-08-20 Published: 2023-08-23

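Abstract:

[Objective] Citation classification of scientific and technological literature is foundational to tasks such as academic impact assessment and literature retrieval and recommendation. With the development of deep neural networks and pre-trained language models, research on citation classification has made substantial progress, and many deep-learning-based methods, models, and datasets have been proposed. However, a comprehensive survey of existing methods and the latest trends is still lacking, which this paper sets out to address. [Methods] This paper reviews deep-learning-based citation classification models and datasets and compares and analyzes the classification performance of different models; it summarizes the strengths and weaknesses of the models and the state of citation classification techniques; and it discusses future directions and offers recommendations. [Results] Pre-trained language models can effectively learn global semantic representations, alleviating problems such as the low training efficiency of RNNs (Recurrent Neural Networks) and the limited length of the sequence dependencies that CNNs (Convolutional Neural Networks) can extract from text, and significantly improving classification accuracy. [Limitations] This paper focuses on introducing progress in citation classification techniques for scientific and technological literature and does not comprehensively forecast future technical directions.

Keywords: citation classification of scientific and technological literature, pre-trained language model, deep learning, natural language processing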

Abstract:

[Objective] Citation classification of scientific and technological literature underpins tasks such as academic influence evaluation and literature retrieval and recommendation. With the development of deep neural networks and pre-trained language models, research on citation classification has made substantial progress, and many deep-learning-based models, datasets, and methods have been proposed. However, a comprehensive survey of existing methods and the latest trends is still lacking; this paper aims to fill that gap. [Methods] This paper reviews deep-learning-based citation classification models and datasets for scientific and technological literature, compares and analyzes the performance of different models along with their advantages and disadvantages, summarizes the state of citation classification technology, and discusses future development directions. [Results] Classification models built on pre-trained language models can effectively learn global semantic representations. They alleviate the low training efficiency of RNNs (Recurrent Neural Networks) and the limited length of the sequence dependencies that CNNs (Convolutional Neural Networks) can extract from text, and they significantly improve classification accuracy. [Limitations] This paper mainly introduces progress in citation classification technology for scientific and technological literature and does not comprehensively predict future directions of the technology.

Key words: citation classification of scientific and technological documents, pre-trained language model, deep learning, natural language processing