Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (2): 141-148.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.02.014

doi: 10.11871/jfdc.issn.2096-742X.2025.02.014

• Technology and Application •

Normalization of Chinese Institutional Names Based on Trie Tree Search and Unessential Words Elimination

ZHAO Jing1, JIANG Shuming2,*, MA Qiyun1

  1. School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China
  2. Information Research Institute of Shandong Academy of Sciences, Qilu University of Technology (Shandong Academy of Sciences), Jinan, Shandong 250000, China
  • Received: 2024-11-01 Online: 2025-04-20 Published: 2025-04-23
  • Contact: JIANG Shuming E-mail: zhaoj_0321@163.com; jsm@qlu.edu.cn

Abstract:

[Background] Institution name data often contain inconsistent names for the same institution. Because individuals differ in knowledge and preference, one institution may be referred to by several non-standard names. These non-standard names usually rest on common knowledge and are widely understood and accepted, so a single non-standard name usually does not correspond to more than one standard name. [Methods] On this basis, this article proposes a normalization algorithm for Chinese institution names based on Trie tree search and unessential words elimination. The algorithm normalizes Chinese institution names automatically in three steps: unessential words elimination, Trie tree fuzzy matching, and a review step to optimize the results, which improves the accuracy and efficiency of data integration. [Conclusions] Experimental results show that the method improves both the accuracy of institution name normalization and the efficiency of matching.

Key words: normalization, unessential words elimination, data cleaning, Trie tree, edit distance, review for optimization
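
The pipeline outlined in the abstract (unessential words elimination, Trie tree fuzzy matching, review) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word list `UNESSENTIAL_WORDS`, the threshold `max_dist`, all function and class names, and the choice to implement "review" as picking the candidate with the smallest edit distance are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical unessential words; the article's actual list is not given here.
UNESSENTIAL_WORDS = ["省", "市"]


def strip_unessential(name: str) -> str:
    """Delete every listed unessential word, longest first."""
    for word in sorted(UNESSENTIAL_WORDS, key=len, reverse=True):
        name = name.replace(word, "")
    return name


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.standard: Optional[str] = None  # set where a reduced key ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, standard: str) -> None:
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.standard = standard

    def fuzzy_search(self, query: str, max_dist: int) -> List[Tuple[str, int]]:
        """Return (standard name, edit distance) for keys within max_dist."""
        results: List[Tuple[str, int]] = []
        first_row = list(range(len(query) + 1))
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, first_row, max_dist, results)
        return results

    def _walk(self, node: TrieNode, ch: str, query: str, prev_row: List[int],
              max_dist: int, results: List[Tuple[str, int]]) -> None:
        # Extend the Levenshtein table by one row for the character `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.standard is not None and row[-1] <= max_dist:
            results.append((node.standard, row[-1]))
        # Prune: if every cell exceeds max_dist, no descendant can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._walk(child, next_ch, query, row, max_dist, results)


def build_trie(standard_names: List[str]) -> Trie:
    """Index each standard name under its reduced (stripped) form."""
    trie = Trie()
    for name in standard_names:
        trie.insert(strip_unessential(name), name)
    return trie


def normalize(raw: str, trie: Trie, max_dist: int = 1) -> Optional[str]:
    """Strip, fuzzy-match, then keep the closest candidate (assumed review step)."""
    candidates = trie.fuzzy_search(strip_unessential(raw), max_dist)
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])[0]


trie = build_trie(["山东省科学院", "齐鲁工业大学"])
exact_after_stripping = normalize("山东科学院", trie)  # variant without "省"
one_char_missing = normalize("齐鲁工大学", trie)       # variant missing "业"
no_match = normalize("北京大学", trie)                 # not in the standard list
```

Indexing the reduced form of each standard name means a variant that differs only by an unessential word matches at distance zero, while the row-by-row edit distance over the Trie lets whole subtrees be skipped once no cell stays within the threshold.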