Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (2): 87-98.

doi: 10.11871/jfdc.issn.2096-742X.2022.02.008

• Technology and Application •

Research on a Domain Term Extraction Method Based on Deep Learning and Statistical Information

LI Zhenzhen1,2,*, ZHONG Yongheng1,2, WANG Hui1,2, LIU Jia1,2, SUN Yuan1,2

  1. Wuhan Library of Chinese Academy of Sciences, Wuhan, Hubei 430071, China
  2. Hubei Key Laboratory of Big Data in Science and Technology, Wuhan, Hubei 430071, China
  • Received: 2021-06-04 Online: 2022-04-20 Published: 2022-04-30
  • Contact: LI Zhenzhen E-mail: lizz@mail.whlib.ac.cn; zyh@mail.whlib.ac.cn; wangh@mail.whlib.ac.cn; liuj@mail.whlib.ac.cn; suny@mail.whlib.ac.cn

Abstract:

[Background] Keeping up with domain terms in a timely manner helps track the development trends of a field and reveals its core knowledge and research hotspots. [Objective] To improve the accuracy of domain term extraction, this paper proposes a domain term extraction method based on deep learning and statistical information. [Methods] First, Chinese patent text from the domain is represented with word embeddings, and character-level vector representations obtained from BERT (Bidirectional Encoder Representations from Transformers) serve as the input of the model. Then, a BiLSTM-CRF (Bidirectional Long Short-Term Memory-Conditional Random Field) deep learning model extracts the semantic features of the serialized text and produces the domain term annotation sequences. Finally, the mutual information and the left and right entropy of composite-structure terms are calculated together, and the extracted results are corrected with the domain knowledge base. [Results] The model is applied to the field of "lithium extraction from salt lakes". The results show that the BERT-BiLSTM-CRF model achieves an accuracy of 77.33%, and that correcting the extraction results improves the accuracy by a further 3.68%, indicating that the approach is an effective domain term extraction method.

Key words: domain term extraction, BERT, Bidirectional Long Short-Term Memory, conditional random field, mutual information, left and right information entropy
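
The statistical step named in the abstract (mutual information plus left and right entropy of a candidate composite term) can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, thresholds, or the rule for combining the scores, so everything below is an assumption for illustration: the function names `pmi` and `boundary_entropy`, the use of pointwise mutual information over adjacent token pairs, and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def pmi(tokens, left, right):
    """Pointwise mutual information (base 2) of the adjacent pair (left, right).

    A high value suggests the two parts co-occur more often than chance,
    i.e. the pair is internally cohesive.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(left, right)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs in the corpus
    p_pair = pair_count / (n - 1)   # n - 1 adjacent pairs in total
    p_left = unigrams[left] / n
    p_right = unigrams[right] / n
    return math.log2(p_pair / (p_left * p_right))


def _entropy(counter):
    """Shannon entropy (base 2) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def boundary_entropy(tokens, left, right):
    """Return (left entropy, right entropy) of the pair (left, right).

    Each entropy is computed over the tokens seen immediately before or
    after the pair; varied neighbors suggest a free-standing unit.
    """
    left_ctx, right_ctx = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == left and tokens[i + 1] == right:
            if i > 0:
                left_ctx[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right_ctx[tokens[i + 2]] += 1
    return _entropy(left_ctx), _entropy(right_ctx)


# Toy corpus (hypothetical): the pair ("x", "y") occurs twice with
# different neighbors on both sides.
toy_tokens = ["a", "x", "y", "b", "c", "x", "y", "d"]
cohesion = pmi(toy_tokens, "x", "y")
left_h, right_h = boundary_entropy(toy_tokens, "x", "y")
```

In a scheme of this kind, a candidate would typically be kept when both its cohesion score and its boundary entropies exceed chosen thresholds; how the paper sets or combines them is not stated in the abstract.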