Progress in Research and Application of Text Embedding Technology

doi: 10.11871/jfdc.issn.2096-742X.2023.03.007

Frontiers of Data and Computing, 2023, Vol. 5, Issue (3): 92-110.

CSTR: 32002.14.jfdc.CN10-1649/TP.2023.03.007

ZHAO Yueyang^1,*, CUI Lei^2

1. Library of Shengjing Hospital, China Medical University, Shenyang, Liaoning 110004, China
2. Health Management School, China Medical University, Shenyang, Liaoning 110122, China

Received: 2022-02-21 Online: 2023-06-20 Published: 2023-06-21
Corresponding author: *ZHAO Yueyang (E-mail: zhaoyy@sj-hospital.org)
About the author: ZHAO Yueyang is Deputy Director of the Library of Shengjing Hospital, China Medical University, and an Associate Research Librarian with a master's degree. Her main research interests are text mining, bioinformatics, and bibliometrics. She has written 1 monograph, undertaken 1 provincial project, and published 13 papers. In this paper, she was responsible for writing the first draft.
Funding:
Liaoning Provincial Social Science Planning Fund Project "Descriptive Document Clustering for Knowledge Discovery via Discriminant Learning in a Multi-level Similarity Co-embedding Space" (L20BTQ003)

Abstract:

[Objective] This article analyzes and compares published research on text embedding in natural language processing, in China and abroad. It describes the knowledge structure and development of text embedding, together with model improvement methods for different fields and data sets, discusses popular embedding models and compares their advantages and disadvantages, and identifies the challenges that text embedding faces along with possible solutions. [Methods] Relevant literature on text embedding was retrieved from the Web of Science, CNKI, and WanFang databases. Content analysis was used to review the literature systematically, comparing the text embedding technologies used, their improvement schemes, modeling ideas, and generation processes. [Results] After deduplication and merging, the 61 most relevant documents were retained. Text embedding methods fall into three categories: frequency-based, neural-network-based, and topic-modeling-based. For the challenges facing text embedding, such as corpus size, polysemous word embedding, and domain adaptation of general-purpose embeddings, possible solutions are extracted from the surveyed articles.

Key words: text embedding, natural language processing, content analysis

ZHAO Yueyang, CUI Lei. Progress in Research and Application of Text Embedding Technology[J]. Frontiers of Data and Computing, 2023, 5(3): 92-110, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.03.007.

Table 1. Models applied in the 61 text embedding research papers and their contributions

Base model	Improved model	Application	Effect of improvement
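BOW	1. Bag-of-words support vector machine (BOW-SVM); 2. Document embedding modeled with BOW, SVM, and a radial basis function (RBF)	Societal risk classification; safety event reports	1. Compared with using document embeddings alone, the BOW-SVM model achieved the highest AUC and F1-score, 72% and 66% respectively^[15]; 2. The BOW-SVM model classified the societal risk of BBS posts better than the PV-SVM model^[16]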
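TF-IDF	1. Multi-co-training (MCT); 2. Integrating classical term frequency (TF) statistics into a mathematical equation	Document classification; author profiling; sentiment analysis; mapping clinical text to medical codes; defect repair; maintaining and enhancing the semantic representation of short texts	1. MCT improved the classification performance of traditional SSL methods, with a more pronounced gain for the NB classifier^[17]; 2. Tested on 9 data sets, the GPE-WS model ranked in the top quartile against other methods on all of them^[18]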
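Word2vec (Skip-gram, PV-DBOW)	1. Using multiple discriminant analysis (MDA); 2. A conceptual short-text embedding (CSE) model that assigns associated concepts to each short text and then feeds the conceptualization results into learning conceptual short-text embeddings; 3. A new objective function based on manifold constraints (semi-DBOW); 4. An automated semantic enrichment system: a semantically enriched article similarity method (W2V-PPS) developed for a content-based popularity prediction system; 5. Babel2Vec, a method for representing document collections; 6. Short Texts Embedding AutoEncoders (STE-AE); 7. Integrating time series using the Jaccard similarity coefficient × IDF (EDM-JBW); 8. A personality-enhanced probabilistic matrix factorization method (P2MF)	Sentiment classification; predicting the popularity of news articles related to cybersecurity; improving text classification performance; extracting discriminative low-dimensional short-text embeddings; news event detection; improving recommendation performance in recommender systems	1. The accuracy of discriminant document embeddings improved by 21%^[19]; 2. On the Twitter data set, the recall of aCSE-1 (Ours) exceeded the best baseline models, TWE by 5.3% and PV-DM by 9.0%^[20]; 3. semi-DBOW performed well at separating positive from negative sentiment in documents; 70% of sentiment labels were clearly improved^[21]; 4. W2V-PPS outperformed the baseline models, with F-scores of 98%, 76%, 71%, and 72% in predicting four levels of news popularity^[22]; 5. The best Micro-F1 of Babel2Vec exceeded the best accuracy of the BOW representation on 8 of 12 test sets, and Babel2Vec ranked first in the Macro-F1 analysis, which is not dominated by large classes^[23-24]; 6. P2MF obtained root-mean-square-error performance gains of about 3%, 21%, 36%, 6%, and 16%^[25]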
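Doc2Vec	1. A negative-sampling-based, domain-adaptive method for learning distributed representations of words and documents; 2. A visual analytics system for exploring neural document embeddings; 3. PathEmb (path embedding), a global pathway similarity search algorithm that combines random walks with document embedding; 4. A fine-grained mobile application clustering model that uses word embeddings and document embeddings to merge similar clusters; 5. A new cross-topic authorship attribution method; 6. A new link prediction method that predicts potential links between documents by reflecting the functional context of technical words; 7. A model for selecting the optimal dimension and optimal window of a document embedding model, with document embedding vectors in high-dimensional space built from word usage and document topic semantic features; 8. mal2vec, an automatic, efficient, and fine-grained malware analysis method	Resolving the domain separation of document embeddings from different domains; classifying products and applications; learning the semantic, syntactic, and grammatical patterns of authors hidden in documents to identify writing style; link prediction, predicting connections between documents that are similar but not yet connected; topic discovery; malware family classification; literature retrieval	1. The proposed method was consistently good at mixing document representations from two domains, and the proxy A-distance was always low^[26]; 2. The model was evaluated with experts, who confirmed its usefulness and significance and quickly found that the green cluster (C2: ADHD treatment) was the most relevant^[27]; 3. PathEmb had a query time under 7 ms and outperformed existing methods in computational efficiency and search accuracy^[28]; 4. Compared with the K-means algorithm, the improved model raised purity by 0.19 and lowered entropy by 1.18; compared with a support vector machine classifier, average precision rose by more than 0.09^[29]; 5. Compared with the 57% accuracy reported on the Guardian corpus, the improved model raised accuracy by 12.83%^[30]; 6. The improved model showed high recall in link prediction, meaning that the proposed method can find many cases that may be connected in the future; its precision was low, though higher than that of word vector techniques^[31]; 7. Classification accuracy reached 99.6% for works of known authorship, and authors with similar styles could also be distinguished effectively^[32]; 8. mal2vec reached a classification accuracy of 100% and an F1 score of 96%^[33]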
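GloVe	1. DocTopic2Vec, a document-topic embedding model; 2. A study of different combinations of spatial and temporal distance with text similarity measures to improve event detection results (DSTTM); 3. Training a multilayer perceptron to map text embeddings to product attribute options	Sentiment analysis; disaster detection: spatiotemporal and textual analysis of geotagged tweets extracts information about events emerging in a large study area, making it possible to monitor the current state of a disaster; document screening in evidence-based medicine; product configuration	1. DocTopic2Vec obtained the best results and the best overall accuracy (0.7718)^[34]; 2. DSTTM extracted clusters of different densities and, compared with other models, extracted more detailed local clusters^[35]; 3. The model extracts salient and important information from text through a max-pooling operation and has lower computational complexity than LSTM and CNN^[36]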
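FastText	1. DocTopic2Vec, a document-topic embedding model; 2. A study of different combinations of spatial and temporal distance with text similarity measures to improve event detection results; 3. Integrating classical term frequency (TF) statistics into a mathematical equation	Sentiment analysis; author profiling; disaster detection: spatiotemporal and textual analysis of geotagged tweets extracts information about events emerging in a large study area, making it possible to monitor the current state of a disaster	1. DocTopic2Vec obtained the best results and the best overall accuracy (0.7718)^[34]; 2. DSTTM extracted clusters of different densities and, compared with other models, extracted more detailed local clusters^[35]; 3. Tested on 9 data sets, the GPE-WS model ranked in the top quartile against other methods on all of them^[18]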
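CNN	1. A dual-path CNN model for image recognition; 2. A new CNN architecture based on multiple representations for text classification, which builds multiple planes so that more information can be fed into the network, such as different parts of the text obtained by a named entity recognizer or a part-of-speech tagger, text embeddings at different levels, or contextual sentences; 3. Applying convolutional neural networks (CNN) to hierarchical text classification with a local top-down approach; 4. Identifying drug labels and detecting suspicious drugs with an image and text embedding model (DLI-IT); 5. A Multiple Angle Feature (MAF) hybrid model that fuses functional information to cluster patents dynamically, adjusting network parameters during clustering for better performance	Image recognition; text classification; hierarchical text classification; drug label identification; patent information extraction	1. The improved model achieved competitive results on two general retrieval data sets, Flickr30k and MSCOCO, and an 18% improvement on the person retrieval data set CUHK^[37]; 2. Although the multi-plane, character-level convolutional network model can reach state-of-the-art performance on classification tasks, there is still room for improvement compared with traditional text classification models such as n-grams^[38]; 3. The hierarchical local classification approach using a CNN with static text embeddings exceeded the flat SVM and LR baselines by 7% and 13%, the flat CNN baseline by 5%, and the h-SVM and h-LR models by 5% and 10% respectively^[39]; 4. DLI-IT achieved up to 88% precision in drug label identification, up to 35% higher than previous image-based or text-based identification methods^[40]; 5. Extracting features from functional-information sentences reduces some of the noise; the F-measure improved and clustering time decreased^[41]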
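BRNN	A multi-objective, collaborative, and attentive framework called MOCA for document-level sentiment analysis. MOCA has three important features: (1) an attentive model of explicit influence: MOCA applies a bidirectional recurrent neural network with an attention mechanism to learn user-item-specific text embeddings, exploiting the explicit influence of users and items; (2) a collaborative model of implicit influence: MOCA designs a new neural collaborative filtering model based on a multilayer perceptron to capture the implicit influence in the highly personalized interactions between users and items; (3) multi-objective optimization: MOCA models the problem as both a classification task and a regression task and optimizes the two objectives simultaneously so that they reinforce each other.	Text sentiment analysis	MOCA improved accuracy by 3% to 5% and reduced RMSE by 5% to 10%^[42]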
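LSTM	1. Fusion-CNN and Fusion-LSTM; 2. A general deep sentiment hashing model with three steps. First, a hierarchical attention-based long short-term memory (LSTM) network is trained to obtain sentiment-specific document representations. Second, given the document embeddings, the k-nearest neighbor (kNN) algorithm is used to build a Laplacian matrix, which is then projected into hash labels by Laplacian eigenmaps (LapEig). Third, a deep model is built for hash function learning, jointly supervised by the generated hash labels and the original sentiment labels; 3. A recurrent neural network model with feedback	Predictive modeling of patient information; sign language recognition and extraction of sign language information; sentiment-based text retrieval; text analysis	1. Taking unstructured text into account, the AUROC scores of Fusion-CNN and Fusion-LSTM improved by 0.043 and 0.034 respectively^[43]; 2. The improved model's average precision was higher than that of the other methods, with F1 scores 0.3% higher and F0.5 scores 0.4% higher on average^[44]; 3. Efficient in both time and space, with better topic analysis results^[45]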
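Bi-LSTM	1. A Multiple Angle Feature (MAF) hybrid model that fuses functional information to cluster patents dynamically, adjusting network parameters during clustering for better performance; 2. Vectorizing users' short answer texts with Word2Vec and then classifying them with a bidirectional LSTM (BiLSTM), reported as the best choice for classifying user answer texts; 3. An adversarial attribute-text embedding (AATE) network; 4. Label Semantic Attention Multi-label Classification (LASA); 5. A Chinese urban data recognition model based on whole word masking, combining a Bidirectional Encoder Representations from Transformers (BERT-WWM) embedding model with a bidirectional long short-term memory with conditional random field (BLSTM-CRF) sequence labeling model	Patent information extraction; short-text classification; person search; multi-label text classification; urban text data	1. The F-measure improved by nearly 10% and clustering time fell by nearly 2 seconds, demonstrating the effectiveness of BiLSTM Attention for feature extraction^[41]; 2. The model's final accuracy was 95%; across classes, precision ranged from a high of 98.1% to a low of 91.7%, a fairly good classification result^[46]; 3. The AATE model's top-1, top-5, and top-10 accuracies were 52.42%, 74.98%, and 82.74%, a substantial improvement over existing similarity learning network methods^[47]; 4. Compared with XML-CNN, LASA improved precision by 235.79% and 257.14% on different data sets; compared with AttentionXML, by 112.67% and 245.30%^[48]; 5. The improved model achieved the highest P (90.18%, 92.78%), R (98.36%, 99.32%), and F (93.60%, 95.94%)^[49]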
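GRU	1. BBGA (BERT based Bidirectional GRU with Attention), a text classification model that trains word vectors with BERT, extracts features efficiently with a bidirectional GRU network, and fuses an attention mechanism as auxiliary feature embedding; 2. A model built on SOTA text embedding techniques and GRU to predict changes in emotional tone	Chinese text classification; sentiment analysis: social networks	1. On the THUCNews data set, BBGA reached a precision of 94.34%, 5.20% higher than TextCNN and 1.01% higher than BERT_RNN^[50]; 2. The prediction model correctly captured the interaction effect between authors and commenters through the emotional tone EmT(cn)^[51]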
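BERT	1. BBGA (BERT based Bidirectional GRU with Attention), a text classification model that trains word vectors with BERT, extracts features efficiently with a bidirectional GRU network, and fuses an attention mechanism as auxiliary feature embedding; 2. LEMLTC (label embedding multi label text classification), a multi-label text classification model that introduces label semantic information embedding; 3. A new machine reading comprehension model that fuses sequence and graph structure	Chinese text classification; multi-label text classification; machine reading; sentiment classification of Weibo texts	1. On the THUCNews data set, BBGA reached a precision of 94.34%, 5.20% higher than TextCNN and 1.01% higher than BERT_RNN^[50]; 2. In experiments on the AAPD and Reuters-21578 data sets, F1 improved by 3.92% and 0.3% respectively^[52]; 3. Compared with BERT, the new model improved EM by 7.8% and F1 by 6.6%^[53]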
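LDA	1. A Multiple Angle Feature (MAF) hybrid model that fuses functional information to cluster patents dynamically, adjusting network parameters during clustering for better performance; 2. Discriminative locally document embedding (Disc-LDE), which builds a smooth affine map for document embedding by preserving the document generative structure on subspaces; 3. A self-aggregating topic model that handles the sparsity of word co-occurrence information by incorporating document embeddings (DESTM); 4. A bi-directional recurrent attentional topic model (bi-RATM) for document embedding; 5. Using Node2vec to obtain structured representation vectors derived from service invocation and tagging graphs	Patent information extraction; improving clustering and classification performance; short-text topic modeling; document modeling and classification; topic detection; Web service classification	1. Although the F-measure gain was limited, it indirectly demonstrated the effectiveness of the improved LDA for extracting topic vectors^[41]; 2. Disc-LDE captures better embeddings that improve the similarity measure for documents of the same class, and it also achieved satisfactory results on imbalanced, noisy data sets such as Amazon reviews^[54]; 3. DESTM successfully aggregated semantically related short texts based on document embeddings and produced sufficient, semantically related additional word co-occurrence information^[55]; 4. bi-RATM can capture and extract latent relationships between adjacent sentences^[56]; 5. On the Mashup and API data sets, the improved model's F1 was substantially higher than that of the other methods^[57]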
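GCN	SynSeq4ED, which integrates a GCN-based text syntactic encoder and pre-trained BERT sequence embeddings with an event-aware masked language mechanism	Event information extraction and monitoring	SynSeq4ED achieved the highest F1 scores in event trigger identification/classification, argument identification, and argument role classification^[4]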

References (104)

[1]	Cao QM, Guo Q, Wang YL, et al. Text clustering using VSM with feature clusters[J]. Neural Computing & Applications, 2015, 26(4):995-1003.
[2]	Bing LD, Jiang S, Lam W, et al. Adaptive Concept Resolution for document representation and its applications in text mining[J]. Knowledge Based Systems, 2015, 74:1-13. doi: 10.1016/j.knosys.2014.10.003
[3]	Dragoni M, Pereira C, Tettamanzi A. A conceptual representation of documents and queries for information retrieval systems by using light ontologies[J]. Expert Systems with Applications, 2012, 39(12):10376-10388. doi: 10.1016/j.eswa.2012.01.188
[4]	Vo T. SynSeq4ED: A Novel Event-Aware Text Representation Learning for Event Detection[J]. Neural Processing Letters, 2022, 54: 227-249.
[5]	Jing L, Ng M K, Huang J Z. Knowledge-based vector space model for text clustering[J]. Knowledge & Information Systems, 2010, 25(1):35-55.
[6]	Yang L, Xu S. A local context-aware LDA model for topic modeling in a document network[J]. Journal of the Association for Information Science & Technology, 2017, 68:1429-1448.
[7]	Duan D, Li Y, Li R, et al. LIMTopic: A Framework of Incorporating Link Based Importance into Topic Modeling[J]. IEEE Transactions on Knowledge & Data Engineering, 2014, 26(10):2493-2506.
[8]	Xu G, Meng Y, Chen Z, et al. Research on Topic Detection and Tracking for Online News Texts[J]. IEEE Access, 2019, 7(99):58407-58418.
[9]	Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26(1):3111-3119.
[10]	Le Q, Mikolov T. Distributed representations of sentences and documents[C]. Proceedings of the 31st International Conference on Machine Learning, PMLR, 2014, 32(2):1188-1196.
[11]	Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[J]. 2018. arXiv:1810.04805v2 [cs.CL].
[12]	Chen M. Efficient Vector Representation for Documents through Corruption[C]. In ICLR, 2017: 1-13.
[13]	Rao G, Huang W, Feng Z, et al. LSTM with sentence representations for Document-level Sentiment Classification[J]. Neurocomputing, 2018, 308:49-57. doi: 10.1016/j.neucom.2018.04.045
[14]	Fu M, Qu H, Huang L, et al. Bag of Meta-words: A Novel Method to Represent Document for The Sentiment Classification[J]. Expert Systems with Applications, 2018, 113:33-43. doi: 10.1016/j.eswa.2018.06.052
[15]	Fong A, Komolafe T, Adams K T, et al. Exploration and Initial Development of Text Classification Models to Identify Health Information Technology Usability-Related Patient Safety Event Reports[J]. Applied Clinical Informatics, 2019, 10(3): 521-527. doi: 10.1055/s-0039-1693427
[16]	Chen JD, Zhou XL, Tang XJ. An empirical feasibility study of societal risk classification toward bbs posts[J]. Journal of Systems Science and Systems Engineering, 2018, 27(6):709-726. doi: 10.1007/s11518-018-5372-x
[17]	Kim D, Seo D, Cho S, et al. Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec[J]. Information Sciences, 2018, 477: 15-29.
[18]	López-Santillán R, Montes-Y-Gómez M, González-Gurrola L C, et al. Richer Document Embeddings for Author Profiling tasks based on a heuristic search[J]. Information Processing & Management, 2020, 57(4): 102227. doi: 10.1016/j.ipm.2020.102227
[19]	Lauren P, Qu GZ, Zhang F, et al. Discriminant document embeddings with an extreme learning machine for classifying clinical narratives[J]. Neurocomputing, 2018, 277: 129-138. doi: 10.1016/j.neucom.2017.01.117
[20]	Huang H, Wang Y, Feng C, et al. Leveraging Conceptualization for Short-Text Embedding[J]. IEEE Transactions on Knowledge & Data Engineering, 2018, 30(7): 1282-1295.
[21]	Park S, Lee J, Kim K. Semi-supervised distributed representations of documents for sentiment analysis[J]. Neural Networks, 2019, 119:139-150. doi: 10.1016/j.neunet.2019.08.001
[22]	Saeed R, Rubab S, Asif S, et al. An Automated System to Predict Popular Cybersecurity News using Document Embeddings[J]. Computer Modeling in Engineering and Sciences, 2021, 5:533-547.
[23]	Sinoara RA, Camacho-Collados J, Rossi RG, et al. Knowledge-enhanced document embeddings for text classification[J]. Knowledge-Based Systems, 2019, 163: 955-971. doi: 10.1016/j.knosys.2018.10.026
[24]	Zhou P, Cao Z, Wu B, et al. EDM-JBW: A novel event detection model based on JS-ID’F-order and Bikmeans with word embedding for news streams[J]. Journal of Computational Science, 2017, 28:336-342. doi: 10.1016/j.jocs.2017.11.002
[25]	Wang H, Zuo Y, Li H, et al. Cross-domain recommendation with user personality[J]. Knowledge-Based Systems, 2021, 213(8):106664. doi: 10.1016/j.knosys.2020.106664
[26]	Park S, Lee W, Lee J. Learning of indiscriminate distributions of document embeddings for domain adaptation[J]. Intelligent Data Analysis, 2019, 23(4):779-797. doi: 10.3233/IDA-184131
[27]	Ji X, Shen HW, Ritter A, et al. Visual Exploration of Neural Document Embedding in Information Retrieval: Semantics and Feature Selection[C]. IEEE Transactions on Visualization and Computer Graphics, 2019, 25(6): 2181-2192.
[28]	Jiao Z, Kwong S, Liu G, et al. PathEmb: Random Walk Based Document Embedding for Global Pathway Similarity Search[J]. IEEE Journal of Biomedical and Health Informatics, 2018, 23(3): 1329-1335. doi: 10.1109/JBHI.6221020
[29]	Yoon YC, Lee J, Park SY, et al. Fine-Grained Mobile Application Clustering Model Using Retrofitted Document Embedding[J]. ETRI Journal, 2017, 39(4): 443-454. doi: 10.4218/(ISSN)2233-7326
[30]	Gómez-Adorno H, Posadas-Durán J P, Sidorov G, et al. Document embeddings learned on various types of n-grams for cross-topic authorship attribution[J]. Computing, 2018, 100(7):741-756.
[31]	Yoon B, Kim S, Kim S, et al. Doc2vec-based link prediction approach using SAO structures: application to patent network[J/OL]. Scientometrics, 2021, https://doi.org/10.1007/s11192-021-04187-4.
[32]	薛扬, 梁循, 谢华伦, 等. 基于最优文档嵌入的《红楼梦》作者辨析[J]. 中文信息学报, 2020(9): 97-110.
[33]	张涛, 王俊峰. 基于文本嵌入特征表示的恶意软件家族分类[J]. 四川大学学报(自然科学版), 2019, 56(3):441-449.
[34]	Truică C, Apostol E, Șerban M, et al. Topic-Based Document-Level Sentiment Analysis Using Contextual Cues[J]. Mathematics, 2021, 9(21):2722. doi: 10.3390/math9212722
[35]	Farnaghi M, Ghaemi Z, Mansourian A. Dynamic Spatio-Temporal Tweet Mining for Event Detection: A Case Study of Hurricane Florence[J]. International Journal of Disaster Risk Science, 2020, 3:16.
[36]	Wang Y, Li X, Zhang L L, et al. Configuring products with natural language: a simple yet effective approach based on text embeddings and multilayer perceptron[J/OL]. International journal of production research, 2021, https://www.tandfonline.com/doi/abs/10.1080/00207543.2021.1957508.
[37]	Zheng ZD, Zheng L, Garrett M, et al. Dual-path Convolutional Image-Text Embeddings with Instance Loss[J]. ACM Transactions on Multimedia Computing Communications and Applications, 2020, 16(2): 51.
[38]	Jin RZ, Lu LF, Lee J, et al. Multi-representational convolutional neural networks for text classification[J]. Computational Intelligence, 2019, 35(3): 599-609.
[39]	Krendzelak M, Jakab F. Hierarchical Text Classification Using CNNs with Local Approaches[J]. Computing and Informatics, 2020, 39(5):907-924. doi: 10.31577/cai_2020_5_907
[40]	Liu X, Meehan J, Tong W, et al. DLI-IT: A Deep Learning Approach to Drug Label Identification Through Image and Text Embedding[J]. BMC Medical Informatics and Decision Making, 2020, 20(1): 68. doi: 10.1186/s12911-020-1078-3 pmid: 32293428
[41]	马建红, 张少光, 曹文斌, 等. 面向功能信息的相似专利动态聚类混合模型[J]. 计算机应用与软件, 2021, 38(5):201-207.
[42]	Zhang JD, Chow CY. MOCA: Multi-Objective, Collaborative, and Attentive Sentiment Analysis[J]. IEEE Access, 2019, 7: 10927-10936. doi: 10.1109/Access.6287639
[43]	Zhang D, Yin C, Zeng J, et al. Combining structured and unstructured data for predictive models: a deep learning approach[J]. BMC Medical Informatics and Decision Making, 2020, 20(1): 280. doi: 10.1186/s12911-020-01297-6 pmid: 33121479
[44]	Ke Z, Zeng J, Yu L, et al. Deep sentiment hashing for text retrieval in social CIoT[J]. Future Generation Computer Systems, 2018, 86(SEP.):362-371. doi: 10.1016/j.future.2018.03.047
[45]	何永强, 秦勤, 王俊鹏. 基于深度神经网络的嵌入式向量及话题模型[J]. 计算机工程与设计, 2016, 37(12): 3384-3388, 3399.
[46]	张良君. 基于Word2Vec词嵌入和双向LSTM模型对用户回答文本进行分类[J]. 电子技术与软件工程, 2021,(14): 208-211.
[47]	Zha Z, Liu J, Chen D, et al. Adversarial Attribute-Text Embedding for Person Search With Natural Language Query[J]. IEEE Transactions on Multimedia, 2020, 22(7): 1836-1846. doi: 10.1109/TMM.6046
[48]	肖琳, 陈博理, 黄鑫, 等. 基于标签语义注意力的多标签文本分类[J]. 软件学报, 2020, 31(4): 1079-1089.
[49]	Zhou C, Zhao J, Ren C. SUDIR: An Approach of Sensing Urban Text Data From Internet Resources Based on Deep Learning[J]. IEEE Access, 2020, 8:214454-214468. doi: 10.1109/Access.6287639
[50]	陈强, 越孙红. 融合BERT词嵌入和注意力机制的中文文本分类[J]. 小型微型计算机系统, 2022, 43(1): 22-26.
[51]	Silveira B, Silva H S, Murai F, et al. Predicting user emotional tone in mental disorder online communities[J]. Future Generation Computer Systems, 2021, 125: 641-651. doi: 10.1016/j.future.2021.07.014
[52]	张万杰. 引入标签语义信息的多标签文本分类[J]. 信息技术与信息化, 2021, (8):8-11.
[53]	陈峥, 任建坤, 袁浩瑞. 融合序列和图结构的机器阅读理解[J]. 中文信息学报, 2021, 35(04):120-128.
[54]	Wei C, Luo S, Guo J, et al. Discriminative Locally Document Embedding: learning a smooth affine map by approximation of the probabilistic generative structure of subspace[J]. Knowledge-Based Systems, 2017, 121 (APR.1):41-57. doi: 10.1016/j.knosys.2017.01.012
[55]	Niu Y, Zhang H, Li J. A Nested Chinese Restaurant Topic Model for Short Texts with Document Embeddings[J]. Applied sciences, 2021, 11(18):8708. doi: 10.3390/app11188708
[56]	Li S, Zhang Y, Pan R. Bi-Directional Recurrent Attentional Topic Model[J]. ACM Transactions on Knowledge Discovery from Data, 2020, 14(6):1-30.
[57]	Xiao Y, Liu J, Kang G, et al. LDNM: A General Web Service Classification Framework via Deep Fusion of Structured and Unstructured Features[J]. IEEE Transactions on Network and Service Management, 2021, 18(3):3858-3872.
[58]	Harris ZS. Distributional Structure[J]. Word, 1954, 10(2-3):146-162. doi: 10.1080/00437956.1954.11659520
[59]	Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval[J]. Journal of Documentation, 1972, 28(1):11-21.
[60]	Arora S, Liang Y, Ma T. A Simple But Tough-To-Beat Baseline For Sentence Embeddings[C]. In ICLR, 2017: 1-6.
[61]	Zhao R, Mao K. Fuzzy Bag-of-Words Model for Document Representation[J]. IEEE Transactions on Fuzzy Systems, 2018, 26(2):794-804. doi: 10.1109/TFUZZ.2017.2690222
[62]	Lakshmi R, Baskar S. Novel term weighting schemes for document representation based on ranking of terms and Fuzzy logic with semantic relationship of terms[J]. Expert Systems with Applications, 2019, 137:493-503. doi: 10.1016/j.eswa.2019.07.022
[63]	Wei W, Guo C, Chen J, et al. CCODM: conditional co-occurrence degree matrix document representation method[J]. Soft Computing, 2019, 23(4):1239-1255. doi: 10.1007/s00500-017-2844-8
[64]	Rumelhart DE, Hinton GE, Williams RJ. Learning Representations by Back Propagating Errors[J]. Nature, 1986, 323(6088):533-536. doi: 10.1038/323533a0
[65]	Elman JL. Distributed representations, simple recurrent networks, and grammatical structure[J]. Machine Learning, 1991, 7(2): 195-225.
[66]	Glenberg AM, Robertson DA. Symbol Grounding and Meaning: A Comparison of High-Dimensional and Embodied Theories of Meaning[J]. Journal of Memory and Language, 2000, 43(3):379-401. doi: 10.1006/jmla.2000.2714
[67]	Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]. Conference of the International Speech Communication Association, 2010: 1045-1048.
[68]	Blei DM, Ng A, Jordan MI. Latent dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3(4-5): 993-1022.
[69]	Bengio Y, Ducharme R, Vincent P, et al. A Neural Probabilistic Language Model[J]. Journal of Machine Learning Research, 2003, 3(6): 1137-1155.
[70]	Wei C, Zhu L, Shi J. Short Text Embedding Autoencoders With Attention-Based Neighborhood Preservation[J]. IEEE Access, 2020, 8(99):1. doi: 10.1109/Access.6287639
[71]	Aubaid AM, Mishra A. A Rule-Based Approach to Embedding Techniques for Text Document Classification[J]. Applied Sciences, 2020, 10(11): 4009. doi: 10.3390/app10114009
[72]	Pennington J, Socher R, Manning C. Glove: Global Vectors for Word Representation[C]. Conference on Empirical Methods in Natural Language Processing, 2014: 1532-1543.
[73]	Bojanowski P, Grave E, Joulin A, et al. Enriching Word Vectors with Subword Information[J]. Transactions of the Association for Computational Linguistics, 2017, 5:135-146. doi: 10.1162/tacl_a_00051
[74]	范昊, 李鹏飞. 基于FastText字向量与双向GRU循环神经网络的短文本情感分析研究——以微博评论文本为例[J]. 情报科学, 2021, 39(4):15-22.
[75]	Fukushima K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position[J]. Biological Cybernetics, 1980, 36(4):193-202. doi: 10.1007/BF00344251 pmid: 7370364
[76]	Kim Y. Convolutional Neural Networks for Sentence Classification[J]. 2014: arXiv:1408.5882 [cs.CL].
[77]	Kalchbrenner N, Grefenstette E, Blunsom P. A Convolutional Neural Network for Modelling Sentences[J/OL]. Eprint Arxiv, 2014, https://doi.org/10.48550/arXiv.1404.2188.
[78]	Socher R, Perelygin A, Wu JY, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank[C]. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013: 1631-1642.
[79]	Zhou SK, Rueckert D, Fichtinger G. Handbook of Medical Image Computing and Computer Assisted Intervention[M]. Academic Press, 2020:503-519.
[80]	Liu C, Zhao S, Volkovs M. Unsupervised Document Embedding With CNNs[J]. 2017: arXiv:1711.04168v3 [cs.CL].
[81]	Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8):1735-1780. doi: 10.1162/neco.1997.9.8.1735 pmid: 9377276
[82]	Papastratis I, Dimitropoulos K, Konstantinidis D, et al. Continuous Sign Language Recognition through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space[J]. IEEE Access, 2020, 8: 91170-91180. doi: 10.1109/Access.6287639
[83]	Liu G, Guo J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification[J]. Neurocomputing, 2019, 337:325-338. doi: 10.1016/j.neucom.2019.01.078
[84]	Cho K, Merrienboer B V, Gulcehre C, et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation[J]. Computer Science, 2014. arXiv:1406.1078v3 [cs.CL].
[85]	Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining[J]. Bioinformatics, 2019, 36(4): 1234-1240. doi: 10.1093/bioinformatics/btz682
[86]	Peters M, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[J]. 2018: arXiv preprint arXiv:1802.05365[cs.CL].
[87]	Carvallo A, Parra D, Lobel H, et al. Automatic document screening of medical literature using word and text embeddings in an active learning setting[J]. Scientometrics, 2020, 125(3): 3047-3084. doi: 10.1007/s11192-020-03648-6
[88]	Deerwester S, Dumais ST, Furnas GW, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407. doi: 10.1002/(ISSN)1097-4571
[89]	Hofmann T. Probabilistic Latent Semantic Analysis[J]. 2013: arXiv:1301.6705v1 [cs.LG].
[90]	Hashimoto K, Kontonatsios G, Miwa M, et al. Topic detection using paragraph vectors to support active learning in systematic reviews[J]. Journal of Biomedical Informatics, 2016, 62(C):59-65.
[91]	Jiang BB, Li ZY, Chen HH, et al. Latent Topic Text Representation Learning on Statistical Manifolds[J]. IEEE Transactions on Neural Networks & Learning Systems, 2018, 29(11): 5643-5654.
[92]	Blei DM, McAuliffe JD. Supervised topic models[C]. Proceedings of the 20th International Conference on Neural Information Processing Systems, 2007: 121-128.
[93]	Zhu J, Ahmed A, Xing EP. MedLDA: maximum margin supervised topic models[J]. The Journal of Machine Learning Research, 2012, 13: 2237-2278.
[94]	Cai D, Mei QZ, Han JW, et al. Modeling hidden topics on document manifold[C]. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008: 911-920.
[95]	Cai D, Wang XH, He XF. Probabilistic dyadic data analysis with local and global consistency[C]. Proceedings of the 26th Annual International Conference on Machine Learning, 2009, Montreal, Quebec, Canada, June 14-18, 2009: 105-112.
[96]	Huh S, Fienberg SE. Discriminative Topic Modeling Based on Manifold Learning[J]. ACM Transactions on Knowledge Discovery from Data, 2011, 5(4) SI: 20.
[97]	Chao W, Luo S, Ma X, et al. Locally Embedding Autoencoders: A Semi-Supervised Manifold Learning Approach of Document Representation[J]. PLOS ONE, 2016, 11(1):e146672.
[98]	Rahimi Z, Homayounpour MM. Tens-embedding: A Tensor-based document embedding method[J]. Expert Systems with Applications, 2020, 162:113770.
[99]	张爽, 刘非凡, 罗双玲, 等. 基于领域语义地图的区块链研究主题发现及演化分析[J]. 情报工程, 2021, 7(02):3-14.
[100]	Lee H, Yoon Y. Engineering doc2vec for automatic classification of product descriptions on O2O applications[J]. Electronic Commerce Research, 2018, 18(3):1-24. doi: 10.1007/s10660-018-9290-2
[101]	王宇晗, 林民, 李艳玲, 赵佳鹏. 基于BERT的嵌入式文本主题模型研究[J/OL]. 计算机工程与应用:1-13.[2022-06-21]. http://kns.cnki.net/kcms/detail/11.2127.TP.20211026.1535.010.html.
[102]	Jiang ZH, Yu W, Zhou D, Chen Y, Feng J, Yan S. ConvBERT: Improving BERT with span-based dynamic convolution[J]. 2020: arXiv:2008.02496 [cs.CL].
[103]	He P, Liu X, Gao J, Chen W. DeBERTa: Decoding-enhanced BERT with disentangled attention[J]. 2020: arXiv:2006.03654 [cs.CL].
[104]	Berhane F. Building a recurrent neural network - step by step - v1[EB/OL]. [2022-4-12]. https://datascience-enthusiast.com/DL/Building_a_Recurrent_Neural_Network-Step_by_Step_v1.html.

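Method	Models	Advantages	Disadvantages
Frequency-based methods: each document is represented as a vector of word counts whose length equals the vocabulary size, and each element gives the frequency of the corresponding vocabulary word in the document	BOW, TF-IDF	Very simple and fast; human-interpretable; suitable for linear classifiers	Cannot account for word order; vectors are very sparse, so the curse of dimensionality may arise; BOW cannot represent the semantic properties of a document or the relationships among its most relevant terms; BOW considers only how often a word occurs in the document, not how often it occurs across the whole corpus
Neural-network-based methods: these methods generate document representations with neural networks such as convolutional neural networks, recursive neural networks, recurrent neural networks (RNN), or long short-term memory (LSTM)	Skip-Gram, CBOW, Word2vec, Doc2Vec, GloVe, FastText, LSTM, BERT, ELMo	Capture document semantics; RNN and LSTM networks explicitly take word order into account; attention mechanisms take longer context into account; vector representations are dense; the "semantic gap" problem is alleviated to some extent	Hard to train; most consider only first-order co-occurrence of words in the local context; treat a document only as a sequence of words, without using other useful features such as topics
Topic-modeling-based methods: methods such as LSI, LDA, and PLSI use probabilistic models for document modeling	LSI, LSA, PLSI, PLSA, LDA, LTTR	Place documents with the same topic close to each other; generalize well on text classification tasks	Cannot account for word order; choosing an appropriate number of topics is challenging; because documents are mapped into topic space, some important features, such as rare words in the document, are ignored; consider only word co-occurrence in the global context; represent documents only by topic, without considering words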
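
The frequency-based row of the table above can be made concrete with a small sketch. The following Python is illustrative only and is not taken from the article or from any of the surveyed papers: the toy corpus, the function names (`build_vocab`, `bow_vector`, `tfidf_vectors`), and the particular TF-IDF variant (term frequency normalized by document length, unsmoothed `log(N / df)`) are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Return a sorted vocabulary over all tokenized documents."""
    return sorted({tok for doc in docs for tok in doc})


def bow_vector(doc, vocab):
    """Bag-of-words: one raw count per vocabulary entry (word order is lost)."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tfidf_vectors(docs, vocab):
    """TF-IDF with tf = count / document length and idf = log(N / df).

    This is one common variant, assumed here for illustration; libraries
    often add smoothing or normalization on top of it.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return vectors


# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vocab = build_vocab(docs)
bow = [bow_vector(d, vocab) for d in docs]
tfidf = tfidf_vectors(docs, vocab)
```

Each vector has one entry per vocabulary word, so most entries are zero, which is the sparsity the table lists as a disadvantage. In this toy corpus TF-IDF also weights "cat" above "the" in the first document even though "the" occurs twice, because "the" appears in more documents; this is the corpus-level information that plain BOW counts lack.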

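Model	Advantages	Disadvantages
CBOW	Faster than the skip-gram model; represents frequent words well	Ignores morphological information and polysemy; no embeddings for misspelled or rare words
Skip-Gram	Works well with small training data sets; represents infrequent words well	Ignores morphological information and polysemy; no embeddings for misspelled or rare words
PV-DM	Gives good results on many tasks on its own	Needs more memory than PV-DBOW, because both softmax weights and word vectors must be stored
PV-DBOW	Only the softmax weights need to be stored, so it needs less memory; simpler and faster than PV-DM	Needs to be used together with PV-DM to give consistent results across tasks
GloVe	Combines the strengths of word2vec-style context-based representation learning and of matrix factorization methods that exploit global co-occurrence statistics	Ignores morphological information and polysemy; no embeddings for misspelled or rare words
FastText	Encodes morphological information in word vectors; provides embeddings for misspelled and rare words; pre-trained word vectors for 157 languages	Computationally intensive, with memory requirements that grow with corpus size; ignores polysemy
ELMo	Generates context-dependent vector representations and thus accounts for polysemy; provides embeddings for misspelled and rare words	Computationally intensive, so training takes longer
BERT	Transfer learning; strong bidirectional feature fusion; handles information extraction over long sequences better than LSTM; supports the attention mechanism	Many parameters; hard to train; mismatch between pre-training and fine-tuning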
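
To make the differences among CBOW, Skip-Gram, and FastText in the table above more tangible, here is a minimal Python sketch of the training examples each model is built from. It shows only how the inputs are constructed, not the neural training itself, and it is not code from the article. The function names (`skipgram_pairs`, `cbow_examples`, `char_ngrams`), the sample sentence, and the window sizes are hypothetical; the n-gram helper also omits the whole-word token that FastText adds alongside the n-grams.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subwords: character n-grams of the word wrapped in < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Hypothetical sample sentence, tokenized on whitespace.
tokens = "text embedding maps words to vectors".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
grams = char_ngrams("where", n_min=3, n_max=3)
```

Skip-gram produces a separate training pair for every context word, while CBOW pools the context into a single example per position, which is consistent with the table's note that CBOW trains faster. Because a misspelling such as "wher" shares n-grams like `<wh` and `whe` with "where", a subword model can still compose a vector for it, whereas word-level models have no entry for an unseen token.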
