Research On Domain Term Extraction Method Based on Deep Learning and Statistical Information

doi:10.11871/jfdc.issn.2096-742X.2022.02.008

Abstract

[Background] Keeping track of domain terms as they emerge helps researchers follow a field's development and identify its core knowledge and research hotspots. [Objective] To improve the precision of domain term extraction, this paper proposes a method that combines deep learning with statistical information. [Methods] First, Chinese patent text from the target domain is represented with embeddings, using character-level vectors obtained from BERT (Bidirectional Encoder Representations from Transformers) as the model input. Then, a BiLSTM-CRF (Bidirectional Long Short-Term Memory-Conditional Random Field) model extracts semantic features from the text sequence and outputs the domain term annotation sequence. Finally, mutual information and left and right entropy are calculated for compound terms, and the extracted results are corrected against a domain knowledge base. [Results] The method is applied to the field of lithium extraction from salt lakes. The BERT-BiLSTM-CRF model reaches a precision of 77.33%, and correcting the extraction results raises precision by a further 3.68 percentage points. These results indicate that the method is effective for domain term extraction.
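The two statistics named in the Methods step can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `pmi` and `neighbor_entropy`, the toy corpus, and the crude probability estimate (substring count divided by corpus length) are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi(corpus: str, left: str, right: str) -> float:
    """Pointwise mutual information between two adjacent parts of a candidate term.

    Probabilities are estimated as (non-overlapping substring count) / (corpus length),
    which is a simplifying assumption for this sketch.
    """
    n = len(corpus)
    p_left = corpus.count(left) / n
    p_right = corpus.count(right) / n
    p_joint = corpus.count(left + right) / n
    if p_joint == 0:
        return float("-inf")  # the two parts never occur together
    return math.log2(p_joint / (p_left * p_right))


def neighbor_entropy(corpus: str, term: str, side: str) -> float:
    """Shannon entropy of the characters adjacent to `term` on the given side.

    `side` is "left" or "right". A high value means the term appears in
    varied contexts, which suggests it is a complete unit on that side.
    """
    neighbors = Counter()
    start = corpus.find(term)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(term)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(term, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


# Hypothetical toy corpus, not taken from the paper's patent data.
corpus = "从盐湖卤水中提锂。盐湖卤水的蒸发。饱和卤水与盐湖卤水混合。"
print(pmi(corpus, "盐湖", "卤水"))
print(neighbor_entropy(corpus, "盐湖卤水", "left"))
print(neighbor_entropy(corpus, "卤水", "left"))
```

In this toy corpus, 盐湖卤水 has three different left neighbors while 卤水 is preceded by 湖 in three of four occurrences, so the left entropy of 卤水 is lower. Under these assumptions that is the kind of signal that would favor the longer compound as the term boundary.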

Keywords: domain term extraction, BERT, bidirectional long short-term memory, conditional random field, mutual information, left and right information entropy

LI Zhenzhen, ZHONG Yongheng, WANG Hui, LIU Jia, SUN Yuan. Research On Domain Term Extraction Method Based on Deep Learning and Statistical Information[J]. Frontiers of Data and Computing, 2022, 4(2): 87-98.

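Table 1
Type	Text
Original text	本发明提供了一种从盐湖卤水中制备高纯氯化镁的方法,属于氯化镁的制备技术领域。 (English: "The present invention provides a method for preparing high-purity magnesium chloride from salt lake brine, and belongs to the technical field of magnesium chloride preparation.")
Character-level annotated text	本/O 发/O 明/O 提/O 供/O 了/O 一/O 种/O 从/O 盐/B 湖/I 卤/I 水/I 中/O 制/O 备/O 高/B 纯/I 氯/I 化/I 镁/I 的/O 方/O 法/O ,/O 属/O 于/O 氯/B 化/I 镁/I 的/O 制/O 备/O 技/O 术/O 领/O 域/O 。/O
Word-level annotated text	本发明/O 提供/O 了/O 一/O 种/O 从/O 盐湖/B 卤水/I 中/O 制备/O 高纯/B 氯化镁/I 的/O 方法/O ,/O 属于/O 氯化镁/S 的/O 制备/O 技术/O 领域/O 。/O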
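The character-level annotation above can be reproduced with a short Python sketch. It assumes the annotated terms are available as a plain list and resolves overlaps by tagging longer terms first; the function name `bio_tag` and that overlap rule are illustrative choices, not details given in the source.

```python
def bio_tag(sentence: str, terms: list[str]) -> list[tuple[str, str]]:
    """Assign B/I/O tags to each character, tagging longer terms first."""
    tags = ["O"] * len(sentence)
    for term in sorted(terms, key=len, reverse=True):
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            # Only tag spans that are still free, so a shorter term
            # nested inside a longer one is not re-tagged.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B"
                tags[start + 1:end] = ["I"] * (len(term) - 1)
            start = sentence.find(term, end)
    return list(zip(sentence, tags))


sentence = "本发明提供了一种从盐湖卤水中制备高纯氯化镁的方法,属于氯化镁的制备技术领域。"
terms = ["盐湖卤水", "高纯氯化镁", "氯化镁"]
tagged = bio_tag(sentence, terms)
print(" ".join(f"{ch}/{tag}" for ch, tag in tagged))
```

The word-level row in the table additionally uses an S tag for a term that consists of a single segmented word (氯化镁/S). That variant needs a word segmenter and is not covered by this sketch.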

Table 2
Model	Precision P (%)	Recall R (%)	F1 (%)
CRF	54.73	55.22	53.7
LSTM	62.44	57.46	59.77
BiLSTM	65.7	59.62	61.93
Word2Vec-BiLSTM-CRF	67.64	66.13	66.88
BiLSTM-CRF	68.51	75.9	71.81
BERT-BiLSTM-CRF	77.33	77.55	77.41
BERT-BiLSTM-CRF + correction	81.01	80.48	82.1
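For readers who want to relate the three metric columns, the standard F1 definition is the harmonic mean of precision and recall. The Python sketch below applies that formula to two rows of the table above; the helper name `f1_score` is illustrative, and whether the paper computed its F1 column this way (rather than, say, averaging over categories or folds) is not stated in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above, in percent.
rows = {
    "BERT-BiLSTM-CRF": (77.33, 77.55, 77.41),
    "BERT-BiLSTM-CRF + correction": (81.01, 80.48, 82.1),
}

for model, (p, r, reported) in rows.items():
    print(f"{model}: harmonic-mean F1 = {f1_score(p, r):.2f}, reported F1 = {reported}")

# Precision gain from the correction step, in percentage points.
gain = round(rows["BERT-BiLSTM-CRF + correction"][0] - rows["BERT-BiLSTM-CRF"][0], 2)
print(f"precision gain: {gain}")
```

The harmonic mean always lies between precision and recall, so a reported F1 outside that range (as in the corrected row) suggests the column was computed by some other aggregation or contains a transcription error; the source does not say which.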

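Table 3
Term length	Terms (frequency)
n-gram (n<=2)	卤水 / brine (2796), 锂 / lithium (2148), 分离 / separation (825), 离子 / ion (801), 蒸发 / evaporation (696), 溶液 / solution (588), 萃取 / solvent extraction (528), 提取 / extraction (421), 浓缩 / concentration (402), 反应 / reaction (401)
n-gram (n=3)	氯化钾 / potassium chloride (407), 碳酸锂 / lithium carbonate (372), 光卤石 / carnallite (354), 氧化镁 / magnesium oxide (348), 氯化镁 / magnesium chloride (290), 氯化钠 / sodium chloride (272), 镁锂比 / Mg/Li ratio (177), 锂离子 / lithium ion (175), 硫酸盐 / sulfate (138), 萃取剂 / extractant (107)
n-gram (n=4)	盐湖卤水 / salt lake brine (599), 氢氧化镁 / magnesium hydroxide (186), 固液分离 / solid-liquid separation (167), 高镁锂比 / high Mg/Li ratio (116), 蒸发浓缩 / evaporative concentration (98), 软钾镁矾 / picromerite (83), 水氯镁石 / bischofite (78), 饱和卤水 / saturated brine (77), 自然蒸发 / natural evaporation (74), 卤水蒸发 / brine evaporation (54)
n-gram (n>=5)	镁锂比盐湖卤水 / Mg/Li-ratio salt lake brine (70), 电池级碳酸锂 / battery-grade lithium carbonate (58), 镁锂比卤水 / Mg/Li-ratio brine (30), 硫酸根离子 / sulfate ion (26), 锂盐湖卤水 / lithium salt lake brine (22), 氯化镁溶液 / magnesium chloride solution (22), 光卤石分解 / carnallite decomposition (22), 氧化镁晶须 / magnesium oxide whisker (22), 氯化镁卤水 / magnesium chloride brine (21), 碳酸盐型盐湖卤水 / carbonate-type salt lake brine (21)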
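A frequency table grouped by term length, like the one above, could be assembled from a list of extracted term occurrences as in the following Python sketch. The function name `top_terms_by_length`, the bucket labels, and the sample input are assumptions for illustration; length is measured in characters, matching the n values in the table.

```python
from collections import Counter, defaultdict


def top_terms_by_length(terms: list[str], k: int = 10) -> dict[str, list[tuple[str, int]]]:
    """Count term occurrences and keep the k most frequent terms per length bucket."""
    counts = Counter(terms)
    buckets: dict[str, list[tuple[str, int]]] = defaultdict(list)
    # most_common() yields terms in descending frequency order.
    for term, freq in counts.most_common():
        n = len(term)
        key = "n<=2" if n <= 2 else "n>=5" if n >= 5 else f"n={n}"
        if len(buckets[key]) < k:
            buckets[key].append((term, freq))
    return dict(buckets)


# Hypothetical extraction output, not the paper's data.
extracted = ["卤水"] * 3 + ["锂"] * 2 + ["碳酸锂"] + ["盐湖卤水"] * 2 + ["电池级碳酸锂"]
for bucket, items in top_terms_by_length(extracted).items():
    print(bucket, items)
```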

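Table 4
Model	Recognized terms
Manual annotation	氯化钾 (potassium chloride), 堆密度 (bulk density), 测定方法 (determination method), 堆滤 (heap filtration), 平均值 (mean value), 饱和卤水 (saturated brine)
BiLSTM	氯化钾 (potassium chloride), 测定 (determination), 氯化钾样 (potassium chloride sample, truncated), 卤水 (brine), 化 (single-character fragment)
BiLSTM-CRF	氯化钾 (potassium chloride), 测定 (determination), 氯化钾样 (potassium chloride sample, truncated), 饱和 (saturated), 卤水 (brine), 化 (single-character fragment)
BERT-BiLSTM-CRF	氯化钾 (potassium chloride), 测定 (determination), 平均值 (mean value), 饱和 (saturated), 卤水 (brine)
BERT-BiLSTM-CRF + correction	氯化钾 (potassium chloride), 测定 (determination), 平均值 (mean value), 饱和卤水 (saturated brine)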
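The last two rows differ only in that 饱和 and 卤水 are merged into 饱和卤水. A much simplified Python sketch of such a merge step follows. It uses only a lexicon lookup, whereas the paper combines mutual information, left and right entropy, and a domain knowledge base; the function name `merge_with_lexicon`, the lexicon contents, and the example sentence are hypothetical.

```python
def merge_with_lexicon(text: str, terms: list[str], lexicon: set[str]) -> list[str]:
    """Merge consecutive extracted terms when their concatenation is a known
    domain term that also occurs contiguously in the source text."""
    merged = []
    i = 0
    while i < len(terms):
        if i + 1 < len(terms):
            joined = terms[i] + terms[i + 1]
            if joined in lexicon and joined in text:
                merged.append(joined)
                i += 2  # both fragments are consumed by the merged term
                continue
        merged.append(terms[i])
        i += 1
    return merged


# Hypothetical sentence and lexicon; the extracted list mirrors the
# BERT-BiLSTM-CRF row of the table above.
text = "测定氯化钾在饱和卤水中的平均值"
extracted = ["氯化钾", "测定", "平均值", "饱和", "卤水"]
lexicon = {"饱和卤水", "盐湖卤水"}
print(merge_with_lexicon(text, extracted, lexicon))
```

A statistical variant would replace the lexicon test with a threshold on the mutual information of the two fragments and on the neighbor entropy of the merged string; the thresholds used in the paper are not given in this excerpt.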