Research on Military Domain Named Entity Recognition Based on Pre-Training Model

doi:10.11871/jfdc.issn.2096-742X.2022.05.013

Abstract

[Objective] This paper addresses Named Entity Recognition (NER) for open-source, unstructured military-domain data. [Methods] It proposes an NER method based on the Bidirectional Encoder Representations from Transformers (BERT) model. The method first generates dynamic character-level feature vectors based on a self-built open-source military corpus, then extracts semantic features with a Bi-directional Long Short-Term Memory (BiLSTM) network, and finally selects the optimal label sequence with a Conditional Random Field (CRF). [Results] On a self-built open-source military dataset, the proposed method achieves an 8% improvement in accuracy, an 11% improvement in F-value, and a 10% improvement in recall compared with methods based on statistical models and neural networks. [Limitations] Because publicly annotated datasets in the open-source military domain are lacking at this stage, it has not been possible to train BERT models on the open-source military corpus. [Conclusions] Nevertheless, the pre-trained-model-based method for open-source military named entity recognition proposed in this paper addresses, to some extent, the boundary delineation problem and the poor entity recognition performance that arise when data sets are insufficient.
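
The final step of the method, selecting the optimal label sequence with a CRF, is typically done at inference time with Viterbi decoding. The following is a minimal sketch of that decoding step only, not the authors' implementation: the function name `viterbi_decode`, the emission scores (standing in for BiLSTM outputs), and the hand-set transition scores are all illustrative assumptions. In a trained CRF the transition scores are learned rather than fixed.

```python
from typing import Dict, List

def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the label sequence with the highest total score.

    emissions[t][y]      : score of label y at position t (e.g. from a BiLSTM)
    transitions[prev][y] : score of moving from label prev to label y
    """
    if not emissions:
        return []
    # best[t][y] = best score of any path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical scores for a three-character entity; values are made up.
LABELS = ["B-ORG", "M-ORG", "E-ORG", "O"]
ALLOWED = {
    ("B-ORG", "M-ORG"), ("B-ORG", "E-ORG"), ("M-ORG", "M-ORG"),
    ("M-ORG", "E-ORG"), ("E-ORG", "B-ORG"), ("E-ORG", "O"),
    ("O", "O"), ("O", "B-ORG"),
}
# Assumption: disallowed transitions get a large fixed penalty.
transitions = {
    p: {y: (0.0 if (p, y) in ALLOWED else -10.0) for y in LABELS}
    for p in LABELS
}
emissions = [
    {"B-ORG": 2.0, "M-ORG": 0.1, "E-ORG": 0.0, "O": 1.0},
    {"B-ORG": 0.0, "M-ORG": 1.0, "E-ORG": 0.2, "O": 1.2},  # "O" wins per position
    {"B-ORG": 0.0, "M-ORG": 0.1, "E-ORG": 2.0, "O": 0.5},
]
decoded = viterbi_decode(emissions, transitions, LABELS)
```

In this toy input, picking the best label at each position independently would tag the middle character as `O` and break the entity in two. Scoring whole sequences keeps the boundaries consistent, which is the role the abstract assigns to the CRF layer.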

Keywords: Named Entity Recognition, Pre-trained Model, neural network

TONG Zhao,WANG Ludi,ZHU Xiaojie,DU Yi. Research on Military Domain Named Entity Recognition Based on Pre-Training Model[J]. Frontiers of Data and Computing, 2022, 4(5): 120-128, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2022.05.013.

Figures/Tables (8): Table 1, Fig. 1, Fig. 2, Fig. 3, Table 2, Table 3, Table 4, Fig. 4

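Entity	Label	Example
Organization	ORG	美空军 (US Air Force)
Model	VER	KC-135R加油机 (KC-135R tanker)
Action	ACT	返回基地 (return to base)
Departure location	TAF	南海 (South China Sea)
Destination	DES	冲绳嘉手纳基地 (Kadena Air Base, Okinawa)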

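Entity type	Entity start character	Entity internal character	Entity end character
Organization (ORG)	B-ORG	M-ORG	E-ORG
Model (VER)	B-VER	M-VER	E-VER
Action (ACT)	B-ACT	M-ACT	E-ACT
Departure location (TAF)	B-TAF	M-TAF	E-TAF
Destination (DES)	B-DES	M-DES	E-DES
Non-entity (Other)	O	O	O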
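
The tagging scheme in the table above can be sketched as a small function that assigns one tag per character. This is an illustrative sketch, not the authors' annotation tool: the function name `tag_characters` and the span format are hypothetical, the sample sentence is composed from two example entities in the entity table, and tagging a single-character entity with only a `B-` tag is an assumption, since the table defines no separate single-character tag.

```python
from typing import List, Tuple

def tag_characters(text: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign one B-/M-/E-/O tag per character.

    entities: (start, end, label) spans with end exclusive, e.g. (0, 3, "ORG").
    """
    tags = ["O"] * len(text)  # characters outside any entity stay "O"
    for start, end, label in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"bad span ({start}, {end}) for text of length {len(text)}")
        tags[start] = f"B-{label}"              # first character of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"M-{label}"              # internal characters
        if end - start > 1:
            tags[end - 1] = f"E-{label}"        # last character of the entity
    return tags

# Illustrative sentence: "美空军" (ORG) followed by "返回基地" (ACT).
sample_text = "美空军返回基地"
sample_tags = tag_characters(sample_text, [(0, 3, "ORG"), (3, 7, "ACT")])
for char, tag in zip(sample_text, sample_tags):
    print(char, tag)
```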

Parameter	Value
Batch_Size	128
Word vector dimension	200
Dropout rate	0.5
Learning_rate	10^-4
Max Seq_Length	128
LSTM Size	128
Optimizer	Adam
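
For reproduction, the parameters above could be collected in a single configuration object. The sketch below is one possible layout: the class name `TrainingConfig` and its field names are hypothetical, and only the values come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    # Field names are hypothetical; values are taken from the parameter table.
    batch_size: int = 128
    word_vector_dim: int = 200
    dropout_rate: float = 0.5
    learning_rate: float = 1e-4   # 10^-4
    max_seq_length: int = 128
    lstm_size: int = 128
    optimizer: str = "Adam"

config = TrainingConfig()
print(config)
```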

Model	Entity start character	Entity internal character	Entity end character
CRF	84.46	82.34	83.57
HMM	87.26	85.18	86.19
BiLSTM	88.72	87.73	87.86
BiLSTM-CRF	90.14	91.48	92.74
BERT-BiLSTM-CRF	92.78	93.19	94.61
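
The improvements quoted in the abstract can be checked against the table with simple subtraction. The sketch below computes the per-column gain of BERT-BiLSTM-CRF over a chosen baseline. The helper name `improvement_over` is hypothetical, and because the column headings do not name the metrics, the three columns are treated as unnamed scores in table order.

```python
from typing import Dict, Tuple

# Scores copied from the results table, in column order.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "CRF": (84.46, 82.34, 83.57),
    "HMM": (87.26, 85.18, 86.19),
    "BiLSTM": (88.72, 87.73, 87.86),
    "BiLSTM-CRF": (90.14, 91.48, 92.74),
    "BERT-BiLSTM-CRF": (92.78, 93.19, 94.61),
}

def improvement_over(baseline: str, model: str = "BERT-BiLSTM-CRF") -> Tuple[float, ...]:
    """Per-column difference (in points) between model and baseline."""
    return tuple(round(m - b, 2) for m, b in zip(RESULTS[model], RESULTS[baseline]))

gain_vs_crf = improvement_over("CRF")
gain_vs_bilstm_crf = improvement_over("BiLSTM-CRF")
print("vs CRF:", gain_vs_crf)
print("vs BiLSTM-CRF:", gain_vs_bilstm_crf)
```

Against the CRF row the gains are 8.32, 10.85, and 11.04 points, which is in line with the roughly 8%, 10%, and 11% improvements reported in the abstract.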