Research on Chinese Name Recognition Based on Deep Learning and Coreference Resolution

doi:10.11871/jfdc.issn.2096-742X.2022.02.006

Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (2): 63-73.

• Special Issue: Advanced Intelligent Computing Platforms and Applications •

CHEN Yu, XUAN Yuhang, ZHANG Yuzhi*

School of Software, Nankai University, Tianjin 300450, China

Received: 2022-02-13 Online: 2022-04-20 Published: 2022-04-30
Corresponding author: ZHANG Yuzhi
About the authors:
CHEN Yu is a master's student at the School of Software, Nankai University, with a background in communications. Her main research interests are natural language processing and data analysis. In this paper, she combined early rule-based algorithms with deep learning methods and coreference resolution, working on People's Daily and listing announcement data, to address the recognition of person names in text. E-mail: 2320200003@mail.nankai.edu.cn
XUAN Yuhang is a research assistant at the School of Software, Nankai University. His main work is data analysis and writing data processing programs. In this paper, he was mainly responsible for acquiring and labeling the listing announcement data, using Scrapy to collect the documents and the BIO scheme to label them. E-mail: xyh2575179890@163.com
ZHANG Yuzhi is a chair professor and the Dean of the School of Software at Nankai University. His research interests include artificial intelligence, pattern recognition, and natural language processing. In this paper, he provided guidance on the key and difficult problems encountered in the first author's research. E-mail: zyz@nankai.edu.cn
Funding:
National Key Research and Development Program of China (2021YFB0300104)

Abstract:

[Objective] Named entity recognition is a basic task in natural language processing; entities include person names, place names, and organization names. Unlike other entity types, person names are closely tied to job titles, job changes, and personal pronouns. In person name recognition, incomplete name corpora and ambiguous pronoun reference are the main difficulties. Based on this observation, this paper proposes a sequence labeling method that integrates coreference resolution to improve person name recognition. The method alleviates the problem of incomplete name corpora, resolves ambiguous personal pronouns, and reduces the manual effort required. [Methods] First, job changes are used for data augmentation, which addresses the shortage of labeled data in practical applications. Next, to better learn contextual features, the pre-trained language model BERT is combined with a bidirectional long short-term memory network (BiLSTM), and a conditional random field (CRF) models the dependencies between labels in the sequence. Finally, a coreference resolution algorithm is applied to the personal pronouns in the text to further improve name recognition. [Results] Experimental results on both public datasets and the datasets proposed in this paper demonstrate the effectiveness of the proposed method.

Key words: named entity recognition, coreference resolution, BERT, long short-term memory network
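The abstract does not spell out the coreference resolution algorithm, so the following is only a minimal sketch of the general idea of the final step: once person names have been tagged, each personal pronoun is linked to a name. The nearest-preceding-name heuristic, the pronoun list, the function name `resolve_pronouns`, and the example sentence are all assumptions made for illustration, not the authors' method.

```python
# Hypothetical sketch: a nearest-antecedent heuristic, not the paper's algorithm.
PRONOUNS = {"他", "她"}  # assumed pronoun list ("he", "she")


def resolve_pronouns(chars, tags):
    """Link each pronoun to the closest person name that precedes it.

    chars: list of characters in one sentence.
    tags:  BIO tags of the same length ("B-PER", "I-PER", or "O").
    Returns a list of (pronoun_index, antecedent_name) pairs.
    """
    links = []
    last_name = None   # most recently completed person name
    current = []       # characters of the name span being read
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag == "I-PER" and current:
            current.append(ch)
            continue
        if current:  # the previous name span has just ended
            last_name = "".join(current)
            current = []
        if tag == "B-PER":
            current = [ch]
        elif ch in PRONOUNS and last_name is not None:
            links.append((i, last_name))
    return links


# Made-up sentence: "Wang Xiaoming serves as dean; he chairs the meeting."
sentence = list("王小明任院长，他主持会议")
tags = ["B-PER", "I-PER", "I-PER"] + ["O"] * 9
print(resolve_pronouns(sentence, tags))  # [(7, '王小明')]
```

A heuristic like this ignores gender, number, and sentence boundaries; a real system would need additional filtering or a learned model to choose among candidate antecedents.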

CHEN Yu, XUAN Yuhang, ZHANG Yuzhi. Research on Chinese Name Recognition Based on Deep Learning and Coreference Resolution[J]. Frontiers of Data and Computing, 2022, 4(2): 63-73.

Figures/Tables (14): Figures 1-7, Table 1, Figure 8, Tables 2-3, Figure 9, Tables 4-5

Table 1
Corpus	Training set	Test set	Validation set
People's Daily	12122	4040	4040
Listing announcements	7322	2444	2444
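The abstract mentions using job changes for data augmentation to offset the shortage of labeled data, but this page does not describe the procedure. As a rough illustration only, the sketch below assumes a template-based approach: job-change sentence patterns are filled with different names and titles, and BIO tags are generated automatically. The templates, names, titles, and the functions `fill_template` and `augment` are hypothetical.

```python
import re
from itertools import product


def fill_template(template, name, title):
    """Fill one template and return (characters, BIO tags).

    The template uses the placeholders {name} and {title}; only the
    characters of the name receive B-PER / I-PER tags.
    """
    if not name:
        raise ValueError("name must be non-empty")
    chars, tags = [], []
    # The capture group keeps the placeholders as separate list items.
    for part in re.split(r"(\{name\}|\{title\})", template):
        if part == "{name}":
            chars.extend(name)
            tags.extend(["B-PER"] + ["I-PER"] * (len(name) - 1))
        else:
            text = title if part == "{title}" else part
            chars.extend(text)
            tags.extend(["O"] * len(text))
    return chars, tags


def augment(templates, names, titles):
    """Generate one labeled sentence per (template, name, title) combination."""
    return [fill_template(t, n, j) for t, n, j in product(templates, names, titles)]


# Made-up job-change patterns, names, and titles (assumptions for illustration).
templates = ["{name}辞去{title}职务", "聘任{name}为{title}"]
names = ["王小明", "李华"]
titles = ["董事", "总经理"]

samples = augment(templates, names, titles)
print(len(samples))  # 2 templates x 2 names x 2 titles = 8
```

Because the tags are produced together with the text, every generated sentence is labeled consistently without manual annotation, which is the main appeal of this kind of augmentation.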

Table 2
Model	Precision (%)	Recall (%)	F1 (%)
LSTM-CRF	86.02	81.37	83.63
BiLSTM	83.56	80.12	81.80
BiLSTM-CRF	86.78	82.13	84.39
Baseline model	90.37	88.54	89.45
Baseline + coreference resolution model	90.62	88.78	89.69
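The F1 column in the table above appears to be the usual harmonic mean of the first two columns, F1 = 2PR / (P + R), which also suggests that the first column is precision rather than overall accuracy. The short check below recomputes F1 from the reported values; the function name `f1_score` and the row labels are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "LSTM-CRF": (86.02, 81.37, 83.63),
    "BiLSTM": (83.56, 80.12, 81.80),
    "BiLSTM-CRF": (86.78, 82.13, 84.39),
    "Baseline": (90.37, 88.54, 89.45),
    "Baseline + coreference resolution": (90.62, 88.78, 89.69),
}

for model, (p, r, reported) in rows.items():
    print(f"{model}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

The same check can be run on the other result tables by swapping in their rows.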

Table 3
Model	Precision (%)	Recall (%)	F1 (%)
LSTM-CRF	94.38	92.72	93.54
BiLSTM	93.43	91.52	92.47
BiLSTM-CRF	95.42	95.07	95.24
Baseline model	97.88	97.32	97.60
Baseline + coreference resolution model	97.92	97.79	97.85

Table 4
Model	Precision (%)	Recall (%)	F1 (%)
Baseline + coreference resolution model	90.62	88.78	89.69
Multi-feature model	88.71	90.43	89.56

Table 5
Model	Precision (%)	Recall (%)	F1 (%)
Baseline + coreference resolution model	97.92	97.79	97.85
Multi-feature model	95.24	96.92	96.07
