Frontiers of Data & Computing, 2022, Vol. 4, Issue (2): 87-98.

doi: 10.11871/jfdc.issn.2096-742X.2022.02.008

• Technology and Application •

Research on Domain Term Extraction Method Based on Deep Learning and Statistical Information

LI Zhenzhen1,2,*, ZHONG Yongheng1,2, WANG Hui1,2, LIU Jia1,2, SUN Yuan1,2

  1. Wuhan Library of Chinese Academy of Sciences, Wuhan, Hubei 430071, China
  2. Hubei Key Laboratory of Big Data in Science and Technology, Wuhan, Hubei 430071, China
  • Received: 2021-06-04 Online: 2022-04-20 Published: 2022-04-30
  • Contact: LI Zhenzhen
  • About the authors:
    LI Zhenzhen is a librarian of Wuhan Library of Chinese Academy of Sciences. Her research fields include technology and industry big data analysis methods and techniques.
    In this paper, she is responsible for the design of ideas, the design and implementation of experiments, and the writing of the paper.
    E-mail: lizz@mail.whlib.ac.cn
    ZHONG Yongheng is a researcher of Wuhan Library of Chinese Academy of Sciences. His research fields include science and technology policy and subject information research, industrial competitive intelligence and industrial technology analysis, and industrial think tank and big data construction research.
    In this paper, he is responsible for guiding the research ideas.
    E-mail: zyh@mail.whlib.ac.cn
    WANG Hui is a librarian of Wuhan Library of Chinese Academy of Sciences. His research fields include industrial technology analysis and big data analysis.
    In this paper, he is responsible for providing suggestions for the revision of the paper.
    E-mail: wangh@mail.whlib.ac.cn
    LIU Jia is an associate researcher of Wuhan Library of Chinese Academy of Sciences. Her research fields include big data intelligence analysis, and intellectual property analysis and evaluation.
    In this paper, she is responsible for data collection, corpus annotation and proofreading.
    E-mail: liuj@mail.whlib.ac.cn
    SUN Yuan is a librarian of Wuhan Library of Chinese Academy of Sciences. His research fields include data mining and natural language processing technology.
    In this paper, he is responsible for data cleaning, corpus annotation and proofreading.
    E-mail: suny@mail.whlib.ac.cn
  • Funding:
    Prospective Project of Wuhan Library of Chinese Academy of Sciences (Y9KZ401); Qinghai Province Special Project for Enterprise Research Transformation and Industrialization (2018-GX-C22); Qinghai Province Special Project for Innovation Platform Construction (2019-ZJ-T02)

Abstract:

[Background] Keeping up with domain terms in a timely manner helps track how a field is developing and reveals its core knowledge and research hotspots. [Objective] To improve the accuracy of domain term extraction, this paper proposes a domain term extraction method based on deep learning and statistical information. [Methods] First, Chinese patent text from the domain is represented with character embeddings: character-level vector representations obtained from BERT (Bidirectional Encoder Representations from Transformers) serve as the input of the model. Then, a BiLSTM-CRF (Bidirectional Long Short-Term Memory-Conditional Random Field) deep learning model extracts the semantic features of the serialized text and produces the domain term annotation sequence. Finally, the mutual information and the left and right entropy of compound-structure terms are calculated together, and the extraction results are corrected with the help of a domain knowledge base. [Results] Experiments in the domain of "lithium extraction from salt lakes" show that the BERT-BiLSTM-CRF model reaches an accuracy of 77.33% when extracting terms in this domain, and that correcting the extraction results raises the accuracy by a further 3.68%, which indicates that the method is effective for domain term extraction.

Key words: domain term extraction, BERT, bidirectional long short-term memory, conditional random field, mutual information, left and right information entropy
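
The statistical correction step named in the abstract (mutual information plus left and right entropy for compound-structure terms) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the paper's formulas, thresholds, or how the scores are combined. The sketch uses pointwise mutual information over character counts and Shannon entropy over neighbouring characters; the toy corpus, the function names, and the thresholds in `keep_candidate` are hypothetical.

```python
import math
from collections import Counter


def find_all(sentence, s):
    """Yield the start index of every (possibly overlapping) occurrence of s."""
    start = sentence.find(s)
    while start != -1:
        yield start
        start = sentence.find(s, start + 1)


def count_occurrences(corpus, s):
    """Count occurrences of s across all sentences in the corpus."""
    return sum(1 for sent in corpus for _ in find_all(sent, s))


def pmi(corpus, left, right):
    """Pointwise mutual information of two adjacent parts of a candidate term.

    Probabilities are estimated as count / total number of characters,
    which is a simplification chosen for this sketch.
    """
    n = sum(len(sent) for sent in corpus)
    c_xy = count_occurrences(corpus, left + right)
    c_x = count_occurrences(corpus, left)
    c_y = count_occurrences(corpus, right)
    if min(c_xy, c_x, c_y) == 0:
        return float("-inf")
    return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))


def entropy(counter):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())


def left_right_entropy(corpus, term):
    """Entropy of the characters immediately left and right of each occurrence."""
    left, right = Counter(), Counter()
    for sent in corpus:
        for start in find_all(sent, term):
            end = start + len(term)
            # Sentence boundaries are treated as ordinary neighbour symbols.
            left[sent[start - 1] if start > 0 else "<s>"] += 1
            right[sent[end] if end < len(sent) else "</s>"] += 1
    return entropy(left), entropy(right)


def keep_candidate(corpus, left, right, min_pmi=1.0, min_entropy=1.0):
    """Hypothetical decision rule: keep a compound term when its parts are
    strongly associated and its outer context is varied."""
    le, re_ = left_right_entropy(corpus, left + right)
    return pmi(corpus, left, right) >= min_pmi and min(le, re_) >= min_entropy


# Toy corpus (hypothetical sentences, not taken from the paper's data).
corpus = [
    "采用吸附法从盐湖卤水中提锂",
    "盐湖卤水经蒸发浓缩",
    "将盐湖卤水送入吸附塔",
]
print(pmi(corpus, "盐湖", "卤水"))
print(left_right_entropy(corpus, "盐湖卤水"))
print(keep_candidate(corpus, "盐湖", "卤水"))
```

A high mutual information score suggests the two parts belong together, while high left and right entropy suggests the whole string appears in varied contexts and is therefore a plausible term boundary. How the paper weighs these signals against the BiLSTM-CRF output and the domain knowledge base is not stated in the abstract.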