Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (5): 120-128.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.05.013

doi: 10.11871/jfdc.issn.2096-742X.2022.05.013

• Technology and Application •

Research on Military Domain Named Entity Recognition Based on Pre-Training Model

TONG Zhao*, WANG Ludi, ZHU Xiaojie, DU Yi

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2021-12-14  Online: 2022-10-20  Published: 2022-10-27
  • Contact: TONG Zhao
  • About the author: TONG Zhao, master's degree, is an assistant engineer at the Computer Network Information Center, Chinese Academy of Sciences. His main research interests include natural language processing and multi-modal learning. In this article, he was mainly responsible for implementing the algorithm and writing the paper.
    E-mail: ztong@cnic.cn
  • Funding: Key Program of the National Natural Science Foundation of China (61836013); Regional Key Project of the Science and Technology Service Network Initiative (STS Program) of the Chinese Academy of Sciences (KFJ-STS-QYZD-2021-11-001); Beijing Natural Science Foundation (4212030); Beijing Municipal Science and Technology Program (Z191100001119090)


Abstract:

[Objective] This study aims to solve the Named Entity Recognition (NER) problem for open-source, unstructured military-domain data. [Methods] This paper proposes an NER method based on the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model: BERT first generates dynamic, context-dependent character representations of a self-built open-source military corpus, a Bi-directional Long Short-Term Memory (BiLSTM) network then extracts semantic features, and a Conditional Random Field (CRF) layer finally selects the optimal label sequence to complete the entity recognition task. [Results] Experimental results on the self-built open-source military dataset show that, compared with methods based on statistical models and neural networks, the proposed method improves accuracy by 8%, F-value by 11%, and recall by 10%. [Limitations] Because publicly annotated datasets in the open-source military domain are still scarce, it has not been possible to pre-train the BERT model on an open-source military corpus. [Conclusions] Nevertheless, the proposed pre-training-based open-source military NER method alleviates the entity boundary delineation problem to some extent and mitigates the poor performance of entity recognition under insufficient training data.

Key words: Named Entity Recognition, Pre-Training Model, Neural Network
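
Illustrative sketch: to make the pipeline described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of a BERT-BiLSTM-CRF tagger, assuming the Hugging Face Transformers library and the pytorch-crf package. The tag set, the "bert-base-chinese" checkpoint, the hyperparameters, and the example sentence are illustrative assumptions, not the authors' implementation or data.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast
from torchcrf import CRF

# Hypothetical BIO tag set for illustration only.
TAGS = ["O", "B-WEAPON", "I-WEAPON", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

class BertBiLSTMCRF(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", lstm_hidden=256, num_tags=len(TAGS)):
        super().__init__()
        # Pre-trained BERT yields dynamic, context-dependent character representations.
        self.bert = BertModel.from_pretrained(bert_name)
        # BiLSTM extracts sequential semantic features from the BERT outputs.
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Linear layer maps BiLSTM features to per-tag emission scores.
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF layer selects the globally optimal tag sequence.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bilstm(hidden)
        emissions = self.emission(features)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)

# Usage sketch (untrained weights, so the predicted tags are arbitrary; shown only for the interface).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertBiLSTMCRF().eval()
batch = tokenizer(["某型驱逐舰于近日完成海上测试"], return_tensors="pt")
with torch.no_grad():
    pred = model(batch["input_ids"], batch["attention_mask"])
print([TAGS[i] for i in pred[0]])  # one BIO tag per token, including [CLS]/[SEP]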