Frontiers of Data and Computing ›› 2023, Vol. 5 ›› Issue (3): 92-110.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.03.007

doi: 10.11871/jfdc.issn.2096-742X.2023.03.007

• Technology and Application •

Progress in Research and Application of Text Embedding Technology

ZHAO Yueyang1,*, CUI Lei2

  1. Library of Shengjing Hospital, China Medical University, Shenyang, Liaoning 110004, China
  2. Health Management School, China Medical University, Shenyang, Liaoning 110122, China
  • Received: 2022-02-21 Online: 2023-06-20 Published: 2023-06-21
  • Corresponding author: *ZHAO Yueyang (E-mail: zhaoyy@sj-hospital.org)
  • About the author: ZHAO Yueyang, Deputy Director of the Library of Shengjing Hospital Affiliated to China Medical University, Associate Research Librarian, Master. Her main research interests are text mining, bioinformatics, and bibliometrics. She has written 1 monograph, undertaken 1 provincial project, and published 13 papers.
    In this paper, she is responsible for writing the first draft.
    E-mail: zhaoyy@sj-hospital.org
  • Supported by:
    Liaoning Provincial Social Science Planning Fund Project "Descriptive Document Clustering for Knowledge Discovery via Discriminative Learning in a Multi-Level Similarity Co-Embedding Space" (L20BTQ003)

Abstract:

[Objective] This article analyzes and compares published research, from China and abroad, on text embedding in natural language processing. It describes the knowledge structure and development of text embedding and the model improvement methods proposed for different fields and data sets, discusses popular embedding models and compares the advantages and disadvantages of each, and identifies the challenges text embedding faces along with possible solutions. [Methods] Relevant literature on text embedding was retrieved from the Web of Science, CNKI, and Wanfang Data databases. Content analysis was used to review the literature systematically and to compare the text embedding techniques, improvement schemes, modeling ideas, and generation processes it employs. [Results] After deduplication and merging, the 61 most relevant documents were retained. Text embedding methods fall into three categories: frequency-based, neural network-based, and topic modeling-based text embedding. For the challenges text embedding faces, such as corpus size, the embedding of polysemous words, and the domain adaptation of general-purpose embeddings, possible solutions are drawn from the surveyed articles.

Key words: text embedding, natural language processing, content analysis