Research on Topic Recognition and Analysis Based on LDA and Move Tagging

doi:10.11871/jfdc.issn.2096-742X.2023.05.009

Abstract


[Objective] This paper presents a topic analysis method that combines the Latent Dirichlet Allocation (LDA) model with move tagging, approaching the task from two dimensions, topic representation word extraction and topic sentence function classification, and examines the method's effectiveness and practicality. [Methods] An LDA model identifies the topics, and a Sentence Transformer model extracts topic phrases. In parallel, a sentence function classification model is built to tag moves, that is, to identify the functional type of each sentence, so that topic content can be analyzed from the perspective of sentence function. [Results] An empirical study on papers in the field of agricultural resources and environment shows that, compared with the traditional LDA model, the identified topic representation words are more readable and easier to interpret, and that combining them with move tagging allows a deeper content analysis of topic sentences. [Limitations] Some of the extended topic phrases are synonymous; the method should be improved by merging topic phrases that share the same meaning. [Conclusions] The proposed method performs well on topic representation word extraction and topic content analysis and can improve the efficiency and depth of text topic mining.

Keywords: LDA model, move tagging, topic phrase, topic analysis

ZHANG Hui, CHUAN Limin, ZHENG Huaiguo, ZHAO Jingjuan, QI Shijie. Research on Topic Recognition and Analysis Based on LDA and Move Tagging[J]. Frontiers of Data and Computing, 2023, 5(5): 107-118, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2023.05.009.


Table 5. SciBERT move classification annotation results

No.

Annotation result

[Background] Rain-fed agriculture in central Spain is mostly water limited. [Background] Reservoir tillage (RT) can increase soil water content, thus helping overcome most factors limiting crop production in this region. [Objective] The aim of this study was to investigate the short-term effects of two tillage practices on soil physical properties and water availability where rain-fed barley was being grown. [Methods] A field experiment was established on a loamy soil for comparing RT and minimum tillage (MT). [Methods] Bulk density and volumetric water content were measured in 5-cm increments to a depth of 30 cm. [Methods] The soil water tension was monitored using a wireless sensor network with sensors at 10, 20, and 30 cm depths. [Results] The results showed that bulk density in RT and MT treatments at all soil depths were statistically similar. [Results] However, soil water tensions in the MT treatment were substantially higher than that in RT in the entire observation period at all depths. [Conclusions] In conclusion, RT could be used to minimise risks from crop failure during the poorer rainy seasons, and it showed increased soil water retention and improved barley yield.

[Background] Bacterial species of the genus Acidithiobacillus have been proved to reduce soil salinity, but their chemical impact on a saline compost has not been reported. [Background] This work is aimed to evaluate the effects of A. thiooxidans on the chemical properties of a saline compost formulation. [Results] When added to a microcosm under nonsterile conditions, a sulfur-supplemented A. thiooxidans inoculum lowered pH to 7.16, reduced exchangeable sodium levels to 97 cmol kg(-1), and increased electrical conductivity to 7.93 dS m(-1) after 720 h. No differences in bacterial fingerprint were observed after adding A. thiooxidans, either supplemented with sulfur or not. [Conclusions] These results suggest that inoculating saline composts with sulfur-oxidizing bacterial strains of the genus Acidithiobacillus could help to decrease salinity-associated parameters, and it has the potential of reducing the impact of prolonged application of organic fertilizers on the soil. [Conclusions] Additionally, it could be a valuable tool to remediate high-salinity soils.
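The bracketed labels in the annotated abstracts above can be turned into structured data, for example to count or filter sentences by function. The following is a minimal sketch, assuming each tagged sentence is prefixed by one label in square brackets and that the sentence text itself contains no square brackets. `parse_moves` and `move_distribution` are illustrative names, not part of the paper's code, and the sample uses English label names for readability.

```python
import re
from collections import Counter

# One bracketed label, then everything up to the next "[" (assumed to open the next label).
TAGGED_SENTENCE = re.compile(r"\[([^\]]+)\]([^\[]*)")


def parse_moves(annotated):
    """Return (label, text) pairs from a string such as '[Label]Sentence. [Label]Sentence.'"""
    return [(label.strip(), text.strip()) for label, text in TAGGED_SENTENCE.findall(annotated)]


def move_distribution(pairs):
    """Count how many tagged segments carry each move label."""
    return Counter(label for label, _ in pairs)


# Three sentences taken from the first annotated abstract.
sample = (
    "[Background]Rain-fed agriculture in central Spain is mostly water limited. "
    "[Background]Reservoir tillage (RT) can increase soil water content, thus helping "
    "overcome most factors limiting crop production in this region. "
    "[Objective]The aim of this study was to investigate the short-term effects of two "
    "tillage practices on soil physical properties and water availability where "
    "rain-fed barley was being grown."
)

pairs = parse_moves(sample)
for label, text in pairs:
    print(f"{label}: {text[:60]}")
print(dict(move_distribution(pairs)))
```

An untagged sentence that follows a tagged one, as in the second abstract, stays attached to the preceding label under this parsing rule.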



Parameter	Setting
Learning rate	2e-5
Epsilon decay strategy	1e-8
Number of iterations	3
Batch size	128
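The settings above can be collected in a small configuration object. This is only a sketch: the source does not say which model or optimizer these values belong to, so the class name `FineTuneConfig`, the field names, and the reading of the epsilon row as an optimizer stability term and of the iteration count as passes over the training set are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the four values in the parameter table."""

    learning_rate: float = 2e-5
    epsilon: float = 1e-8    # assumed: numerical-stability term of the optimizer
    epochs: int = 3          # assumed: the "iterations" row counts passes over the data
    batch_size: int = 128

    def total_steps(self, num_examples):
        """Number of optimizer updates if every epoch sees all examples once."""
        return math.ceil(num_examples / self.batch_size) * self.epochs


config = FineTuneConfig()
# num_examples=1000 is a made-up corpus size used only to show the arithmetic.
print(config, config.total_steps(num_examples=1000))
```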

Topic	Representation words
Topic 0	plant, root, abundance, diversity, rhizosphere, microbial, alter, biomass, decomposition, isolate
Topic 1	model, area, sample, sediment, predict, map, soil erosion, soil properties, erosion, parameter
Topic 2	treatment, concentration, biochar, content, crop, plant, fertilizer, Cd, maize, nutrient
Topic 3	soil quality, SOC, depth, content, plot, aggregate, soil properties, tillage, layer, cover crop
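Word lists like the ones above are what an LDA model yields once the highest-weight words of each topic are read off. The sketch below is a toy collapsed Gibbs sampler in plain Python, included only to show the mechanics. It is not the authors' implementation (the source does not name the LDA software or its settings), and the four tiny documents, `alpha`, `beta`, and the function names are assumptions chosen for illustration.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (vocab, topic-by-word count matrix)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        current = []
        for word in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[word]] += 1
            topic_total[k] += 1
        assignments.append(current)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = index[word]
                k = assignments[d][i]
                # Take the token out of the counts before resampling its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, topic_word


def top_words(vocab, topic_word, n=3):
    """The n highest-count words of each topic."""
    result = []
    for row in topic_word:
        order = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)
        result.append([vocab[v] for v in order[:n]])
    return result


# Toy documents built from words in the table; not the paper's corpus.
docs = [
    ["plant", "root", "rhizosphere", "microbial", "biomass"],
    ["root", "plant", "microbial", "diversity", "rhizosphere"],
    ["erosion", "sediment", "model", "map", "predict"],
    ["model", "erosion", "sediment", "parameter", "map"],
]
vocab, topic_word = fit_lda(docs, num_topics=2)
for topic_id, words in enumerate(top_words(vocab, topic_word)):
    print(f"Topic {topic_id}: {', '.join(words)}")
```

On a real corpus a maintained library would normally replace this sampler; the point here is only how per-topic word counts become a ranked list of representation words.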

Original topic word	Extended topic phrase 1	Extended topic phrase 2	Extended topic phrase 3
plant	plant traits	whole plant effects	plant variables
root	variable root traits	root traits	root interactions
abundance	protist abundance	abundance effects	pathogen abundance
diversity	protist diversity	diversity functional traits	diversity effect
rhizosphere	negative rhizosphere effect	rhizosphere competence	rhizosphere effect
microbial	microbial traits	microbial residual responds	microbial residues
alter	alter plant growth	alter soil water fluxes	alter soil ammonia oxidizer
biomass	system traits biomass	biomass adaptation	biomass parameters
decomposition	decomposition pathways	SOM decomposition	decomposition reacts
isolate	bacterial isolate	irregular isolate	rhizosphere competent isolate
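One plausible way to obtain three extended phrases per topic word is to embed the word and a set of candidate phrases, then keep the candidates closest to the word by cosine similarity. The source states only that a Sentence Transformer model is used, so the ranking rule, the function names, and the made-up three-dimensional vectors below (standing in for real sentence embeddings) are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extend_topic_word(word_vector, candidates, top_k=3):
    """Rank candidate phrases by similarity to the topic word and keep the top_k."""
    ranked = sorted(
        candidates,
        key=lambda phrase: cosine(word_vector, candidates[phrase]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up vectors for illustration only; a real run would use embedding-model output.
root_vector = [1.0, 0.0, 0.0]
candidate_vectors = {
    "root traits": [0.9, 0.1, 0.0],
    "variable root traits": [0.8, 0.3, 0.1],
    "root interactions": [0.7, 0.7, 0.0],
    "soil erosion": [0.0, 0.0, 1.0],
}
print(extend_topic_word(root_vector, candidate_vectors))
# → ['root traits', 'variable root traits', 'root interactions']
```

Ranking purely by similarity tends to surface near-synonyms together, which matches the limitation noted in the abstract; a merging step would have to come after this ranking.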

Label	Precision	Recall	F1 score
Background	0.6731	0.8861	0.765
Objective	0.8684	0.5323	0.66
Methods	0.8726	0.8671	0.8698
Results	0.8257	0.8451	0.8353
Conclusions	0.8404	0.798	0.8187
Average	0.6731	0.8861	0.765
Weighted average	0.8404	0.798	0.8187
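The F1 column can be checked from the precision and recall columns, since F1 is their harmonic mean, and an unweighted (macro) average can be recomputed from the five label rows. The sketch below does both with the per-label values from the table. The English label names and the function names are illustrative, and the weighted average is not reproduced because the table does not give per-label sentence counts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(rows):
    """Unweighted mean of precision, recall, and per-label F1 across labels."""
    n = len(rows)
    precision = sum(p for p, _ in rows.values()) / n
    recall = sum(r for _, r in rows.values()) / n
    f1 = sum(f1_score(p, r) for p, r in rows.values()) / n
    return precision, recall, f1


# (precision, recall) per label, copied from the table above.
per_label = {
    "Background": (0.6731, 0.8861),
    "Objective": (0.8684, 0.5323),
    "Methods": (0.8726, 0.8671),
    "Results": (0.8257, 0.8451),
    "Conclusions": (0.8404, 0.798),
}

for label, (p, r) in per_label.items():
    print(f"{label}\t{f1_score(p, r):.4f}")
print("Macro average\t" + "\t".join(f"{x:.4f}" for x in macro_average(per_label)))
```

Comparing the recomputed macro values with the Average row is a quick consistency check on a transcribed results table.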

Model	Average precision	Average recall	Average F1 score
SciBERT	0.7839	0.7709	0.7719
Improved model	0.8404	0.798	0.8187
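The size of the improvement over SciBERT follows directly from the two rows above. The short sketch below computes the absolute and relative gain per metric; the dictionary keys and the `gains` helper are illustrative names.

```python
scibert = {"precision": 0.7839, "recall": 0.7709, "f1": 0.7719}
improved = {"precision": 0.8404, "recall": 0.798, "f1": 0.8187}


def gains(baseline, candidate):
    """Map each metric to (absolute gain, gain relative to the baseline)."""
    return {
        metric: (
            candidate[metric] - baseline[metric],
            (candidate[metric] - baseline[metric]) / baseline[metric],
        )
        for metric in baseline
    }


for metric, (absolute, relative) in gains(scibert, improved).items():
    print(f"{metric}\t+{absolute:.4f}\t(+{relative:.1%})")
```

With these figures the absolute gains are about 0.057 in precision, 0.027 in recall, and 0.047 in F1.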