Frontiers of Data & Computing (数据与计算发展前沿) ›› 2025, Vol. 7 ›› Issue (2): 141-148.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.02.014

doi: 10.11871/jfdc.issn.2096-742X.2025.02.014

• Technology and Applications •

Normalization of Chinese Institutional Names Based on Trie Tree Search and Unessential Words Elimination

ZHAO Jing1, JIANG Shuming2,*, MA Qiyun1

  1. School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China
  2. Information Research Institute of Shandong Academy of Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China
  • Received: 2024-11-01 Online: 2025-04-20 Published: 2025-04-23
  • Corresponding author: JIANG Shuming
  • About the authors:
    Zhao Jing is a master's student at Qilu University of Technology (Shandong Academy of Sciences). Her main research interests include data mining applications and data processing.
    In this paper, she is mainly responsible for algorithm design and implementation.
    E-mail: zhaoj_0321@163.com
    Jiang Shuming is a master's supervisor at the Information Research Institute of Shandong Academy of Sciences, Qilu University of Technology (Shandong Academy of Sciences). His main research interests include multimedia data processing and data mining application research.
    In this paper, he is mainly responsible for guiding algorithm design and optimization.
    E-mail: jsm@qlu.edu.cn
  • Funding:
    Innovation Capability Improvement Project for Science and Technology-Based Small and Medium-Sized Enterprises of Shandong Province (2023TSGC0135)

Abstract:

[Background] Institution name data are often inconsistent. Because of cognitive differences and subjective preferences among individuals, the same institution may be referred to by multiple non-standard names. These non-standard names are usually grounded in common knowledge and are widely understood and accepted, and a single non-standard name rarely corresponds to more than one standard name. [Methods] Based on this observation, this paper proposes a normalization algorithm for Chinese institution names based on Trie tree search and unessential (non-key) word elimination. The algorithm normalizes Chinese institution names automatically in three steps: unessential word elimination, Trie tree fuzzy matching, and a review step that selects the best candidate. This improves the accuracy and efficiency of data integration. [Conclusions] Experimental results show that the method performs well in terms of both normalization accuracy and matching efficiency.

Key words: normalization, unessential words elimination, data cleaning, Trie tree, edit distance, review for optimization
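
The three steps named in the abstract (unessential word elimination, Trie tree fuzzy matching, and review to select the best candidate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the word list, the distance threshold, or the review criterion, so `NON_KEYWORDS`, `max_dist`, the tie-break rule, and all function names below are assumptions chosen for illustration. The fuzzy match uses the common technique of computing one Levenshtein row per Trie node and pruning branches whose best possible distance already exceeds the threshold.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical non-key words; the paper's actual list is not given in the abstract.
NON_KEYWORDS = ("有限责任公司", "股份有限公司", "有限公司", "省", "市")


def strip_non_keywords(name: str) -> str:
    """Remove non-key words, longest first; keep the name if nothing would remain."""
    stripped = name
    for word in sorted(NON_KEYWORDS, key=len, reverse=True):
        stripped = stripped.replace(word, "")
    return stripped or name


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.canonical: Optional[str] = None  # standard name ending at this node


def build_trie(standard_names: List[str]) -> TrieNode:
    """Index each standard name under its keyword-only form."""
    root = TrieNode()
    for name in standard_names:
        node = root
        for ch in strip_non_keywords(name):
            node = node.children.setdefault(ch, TrieNode())
        node.canonical = name
    return root


def fuzzy_search(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (standard name, edit distance) pairs within max_dist of query."""
    results: List[Tuple[str, int]] = []

    def walk(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the Levenshtein table by one row for the character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.canonical is not None and row[-1] <= max_dist:
            results.append((node.canonical, row[-1]))
        if min(row) <= max_dist:  # prune branches that cannot match
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


def normalize(name: str, root: TrieNode, max_dist: int = 1) -> Optional[str]:
    """Strip non-key words, fuzzy-match in the Trie, then pick the best candidate."""
    candidates = fuzzy_search(root, strip_non_keywords(name), max_dist)
    if not candidates:
        return None
    # Assumed review rule: smallest edit distance, then closest length to the input.
    return min(candidates, key=lambda c: (c[1], abs(len(c[0]) - len(name))))[0]


trie = build_trie(["齐鲁工业大学", "山东省科学院", "浪潮集团有限公司"])
for raw in ["山东科学院", "浪潮集团股份有限公司", "齐鲁工业大", "清华大学"]:
    print(raw, "->", normalize(raw, trie))
```

In this sketch, standard names are indexed by their keyword-only form, so a variant that differs only in a non-key word matches with distance 0, while a small typo is still caught by the edit-distance tolerance. A name with no candidate within `max_dist` returns `None` rather than being forced onto the nearest standard name.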