Frontiers of Data and Computing (数据与计算发展前沿) ›› 2022, Vol. 4 ›› Issue (2): 63-73.

doi: 10.11871/jfdc.issn.2096-742X.2022.02.006

• Special Issue: Advanced Intelligent Computing Platforms and Applications •

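Chinese Person Name Recognition Based on Deep Learning and Coreference Resolution

CHEN Yu, XUAN Yuhang, ZHANG Yuzhi*

  1. School of Software, Nankai University, Tianjin 300450, China
  • Received: 2022-02-13 Online: 2022-04-20 Published: 2022-04-30
  • Corresponding author: ZHANG Yuzhi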
  • About the authors:
    CHEN Yu is a master's student at the School of Software, Nankai University, with a background in communications. Her main research interests are natural language processing and data analysis.
    In this paper, she combined early rule-based algorithms with deep learning methods and coreference resolution to address person name recognition in People's Daily and listing announcement data.
    E-mail: 2320200003@mail.nankai.edu.cn
    XUAN Yuhang is a research assistant at the School of Software, Nankai University. His main work is data analysis and writing data processing programs.
    In this paper, he was mainly responsible for acquiring and labeling the listing announcement data: he used Scrapy to collect the documents and the BIO scheme to label the data.
    E-mail: xyh2575179890@163.com
    ZHANG Yuzhi is a chair professor and the Dean of the School of Software at Nankai University. His research interests include artificial intelligence, pattern recognition, and natural language processing.
    In this paper, he provided guidance on the key and difficult problems encountered in the first author's research.
    E-mail: zyz@nankai.edu.cn
  • Funding:
    National Key Research and Development Program of China (2021YFB0300104)

Research on Chinese Name Recognition Based on Deep Learning and Coreference Resolution

CHEN Yu, XUAN Yuhang, ZHANG Yuzhi*

  1. School of Software, Nankai University, Tianjin 300450, China
  • Received: 2022-02-13 Online: 2022-04-20 Published: 2022-04-30
  • Contact: ZHANG Yuzhi

Abstract:

[Objective] Named entity recognition is a fundamental task in natural language processing. Entities include person names, place names, and organization names. Unlike other entity types, person names are closely tied to job titles, job changes, and personal pronouns. In person name recognition, incomplete name corpora and ambiguous pronoun reference are the main difficulties. Based on this observation, this paper proposes a sequence labeling method that integrates coreference resolution to improve person name recognition. The method alleviates the problem of incomplete name corpora and addresses ambiguous personal pronouns and the high cost of manual effort. [Methods] First, job-change information is used for data augmentation, which mitigates the shortage of labeled data in practical applications. Next, to better learn contextual features, the pre-trained language model BERT is combined with a bidirectional long short-term memory network, and a conditional random field models the dependencies within the label sequence. Finally, a coreference resolution algorithm is applied to the personal pronouns in the text to further improve person name recognition. [Results] Experimental results on both public datasets and the datasets proposed in this paper demonstrate the effectiveness of the proposed method.

Key words: named entity recognition, coreference resolution, BERT, long short-term memory network
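
To make the labeling and pronoun-linking steps mentioned in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes character-level BIO tags with a single `PER` type (in the paper these would come from the BERT, bidirectional LSTM, and CRF tagger), and it stands in for the coreference step with a simple nearest-preceding-name heuristic. The function names, the `PRONOUNS` tuple, and the example sentence are all hypothetical.

```python
from typing import List, Tuple

# Assumption: only these two third-person singular pronouns are handled.
PRONOUNS = ("他", "她")

Person = Tuple[int, int, str]  # (start, end, name) with end exclusive


def to_bio(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Build character-level BIO tags from person-name spans [start, end)."""
    tags = ["O"] * len(text)
    for start, end in spans:
        tags[start] = "B-PER"
        for i in range(start + 1, end):
            tags[i] = "I-PER"
    return tags


def decode_persons(text: str, tags: List[str]) -> List[Person]:
    """Recover (start, end, name) triples from a BIO tag sequence."""
    persons: List[Person] = []
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != "I-PER":
            persons.append((start, i, text[start:i]))
            start = None
        if tag == "B-PER":
            start = i
    return persons


def resolve_pronouns(text: str, persons: List[Person]) -> List[Tuple[int, str, str]]:
    """Link each pronoun to the nearest person name that ends before it.

    This is a naive heuristic used for illustration only; it ignores
    gender, syntax, and discourse structure.
    """
    links = []
    for i, ch in enumerate(text):
        if ch not in PRONOUNS:
            continue
        if any(s <= i < e for s, e, _ in persons):
            continue  # the character is part of a name, not a pronoun
        antecedents = [p for p in persons if p[1] <= i]
        if antecedents:
            links.append((i, ch, antecedents[-1][2]))
    return links


if __name__ == "__main__":
    sentence = "张三辞去董事职务,他将由李四接任。"
    tags = to_bio(sentence, [(0, 2), (12, 14)])
    people = decode_persons(sentence, tags)
    print(people)
    print(resolve_pronouns(sentence, people))
```

In this sketch the pronoun 他 is linked to 张三 because that is the closest name ending before it. A real coreference resolver would need more evidence than distance, which is why the heuristic is labeled as a stand-in.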