Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis

doi:10.11871/jfdc.issn.2096-742X.2021.01.009

Frontiers of Data and Computing ›› 2021, Vol. 3 ›› Issue (1): 112-121.

• Technology and Application •

ZHAI Qingchen^1,3, ZHOU Yuanchun^1, SONG Qiucheng^1, WANG Jianwei^2, MENG Zhen^1,*, ZHANG Yanling^2,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, Henan 450001, China
3. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2020-11-24   Online: 2021-02-20   Published: 2021-02-07
Contact: MENG Zhen, ZHANG Yanling   E-mail: zhaiqingchen@cnic.cn; zyc@cnic.cn; sqc@cnic.cn; wangjw@ztri.com.cn; zhenm99@cnic.cn; zhangyanling@ztri.com.cn

Abstract

[Objective] This study measures and classifies the similarity of tobacco leaves while addressing the loss of discriminative power that distance measures suffer in high-dimensional space. [Methods] Three methods (variance weighting, principal component analysis, and locally linear embedding) are used to reduce the dimensionality of the tobacco leaf attributes and to select indicators. Similarity is then computed from the selected indicators, and the K-means algorithm is used to cluster the tobacco leaves. [Results] Comparing the silhouette coefficients of the clusterings obtained from each method's input indicators shows that the four indicators selected by the variance weighting method (total alkaloids, reducing sugars, potassium, and stalk rate) cluster better than the inputs produced by the other two dimension reduction algorithms. The clustering algorithm divides the tobacco leaves into four categories, and the characteristics of each category are analyzed in terms of the input indicators. The results of the similarity algorithm for three similar counties (Qilin District, Xuanwei County, and Luoping County) are consistent with existing research in the industry. [Conclusions] Using machine learning algorithms, this article uncovers similarities between the characteristic indicators in tobacco leaf sensory data and the place of origin, providing indicators and conclusions of reference value for tobacco production, as well as an algorithmic basis for applying machine learning in the tobacco industry.

Key words: data mining, similarity, clustering, tobacco leaf
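
The comparison described in the abstract (cluster with K-means, then judge each candidate set of input indicators by its silhouette coefficient) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins rather than tobacco measurements, the helper names are hypothetical, Euclidean distance on z-scored columns is assumed, and a deterministic farthest-first seeding replaces whatever initialization the paper used.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Z-score every column so no single indicator dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from all chosen centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(sum(euclidean(points[i], points[j]) for j in idx) / len(idx)
                for other, idx in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Synthetic stand-in data (NOT the paper's measurements): two groups that
# differ on 4 informative columns, plus 3 pure-noise columns.
rng = random.Random(42)
rows = []
for center in (0.0, 5.0):
    for _ in range(10):
        informative = [center + rng.gauss(0, 0.3) for _ in range(4)]
        noise = [rng.gauss(0, 1) for _ in range(3)]
        rows.append(informative + noise)

all_cols = standardize(rows)
selected = standardize([r[:4] for r in rows])

labels_selected = kmeans(selected, k=2)
score_all = silhouette(all_cols, kmeans(all_cols, k=2))
score_selected = silhouette(selected, labels_selected)
print(f"all columns: {score_all:.3f}, selected columns: {score_selected:.3f}")
```

In this toy setting the noise columns inflate within-cluster distances, so the clustering on the informative subset scores a higher silhouette coefficient. This is the same kind of evidence the abstract uses to prefer one set of indicators over another.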

ZHAI Qingchen, ZHOU Yuanchun, SONG Qiucheng, WANG Jianwei, MENG Zhen, ZHANG Yanling. Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis[J]. Frontiers of Data and Computing, 2021, 3(1): 112-121.

Component	Initial eigenvalue (total)	% of variance	Cumulative %
1	4.368	33.598	33.598
2	2.518	19.368	52.967
3	1.071	8.235	61.201
4	0.969	7.454	68.655
5	0.794	6.104	74.759
6	0.680	5.230	79.989
7	0.577	4.439	84.428
8	0.568	4.371	88.799
9	0.484	3.724	92.523
10	0.382	2.940	95.463
11	0.250	1.926	97.389
12	0.226	1.739	99.128
13	0.113	0.872	100.000
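
The percentage columns in the eigenvalue table above can be reproduced from the eigenvalues alone. The 13 eigenvalues sum to 13.000, which is what PCA on standardized variables (a correlation matrix) would produce; that reading is an inference from the numbers, not something this page states. The sketch below recomputes the shares from the rounded eigenvalues, so results can differ from the table in the last decimal place. The function names are hypothetical, and the eigenvalue-greater-than-1 count is only a common rule of thumb, not necessarily the authors' criterion.

```python
# Eigenvalues copied from the table above (rounded to three decimals there).
eigenvalues = [4.368, 2.518, 1.071, 0.969, 0.794, 0.680, 0.577,
               0.568, 0.484, 0.382, 0.250, 0.226, 0.113]


def explained_variance(eigs):
    """Return (% of variance per component, cumulative %)."""
    total = sum(eigs)
    pct = [100.0 * e / total for e in eigs]
    cumulative = []
    running = 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative


def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative % reaches threshold."""
    for n, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return n
    return len(cumulative)


pct, cumulative = explained_variance(eigenvalues)
for n, (e, p, c) in enumerate(zip(eigenvalues, pct, cumulative), start=1):
    print(f"{n:2d}  {e:.3f}  {p:6.3f}  {c:7.3f}")

# Rule-of-thumb count (assumption: not stated as the authors' criterion).
n_above_one = sum(1 for e in eigenvalues if e > 1.0)
print("components with eigenvalue > 1:", n_above_one)
print("components needed for 85% of variance:", components_needed(cumulative, 85.0))
```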

Attribute	Cluster 1	Cluster 2	Cluster 3	Cluster 4
Number of origins	123	10	51	24
Reducing sugars (%)	27.92	29.87	23.63	24.83
Stalk rate (%)	30.84	23.63	32.37	27.45
Potassium (%)	2.02	1.48	2.04	1.66
Total alkaloids (%)	2.22	1.60	2.53	2.33

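Origin	Similar origins
Qilin District	Yiliang County, Xuanwei County, Jianghua Yao Autonomous County, Ningyuan County, Luoping County
Xuanwei County	Baokang County, Luoping County, Luliang County, Qilin District, Tengchong County
Luoping County	Xuanwei City, Lichuan City, Yongxing County, Baokang County, Tengchong County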
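
A list of "similar origins" like the one above can be produced by ranking every other origin by its distance to the query origin and keeping the closest few. The sketch below assumes Euclidean distance on already-scaled indicators, which may differ from the paper's similarity measure. The origin names and values are hypothetical placeholders, not data from the article. It also shows why such lists need not be symmetric, as in the table, where Qilin District lists Luoping County but Luoping County does not list Qilin District.

```python
import math

# Hypothetical, already-scaled indicator vectors (NOT values from the article).
origins = {
    "origin_A": (0.0, 0.0),
    "origin_B": (1.0, 0.2),
    "origin_C": (3.0, 0.1),
    "origin_D": (3.5, 0.0),
    "origin_E": (4.0, 0.3),
}


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name, table, n):
    """Names of the n origins closest to `name`, nearest first."""
    others = [o for o in table if o != name]
    others.sort(key=lambda o: distance(table[name], table[o]))
    return others[:n]


def mutual(a, b, table, n):
    """True when each origin appears in the other's top-n list."""
    return b in nearest(a, table, n) and a in nearest(b, table, n)


print(nearest("origin_A", origins, 2))
print(nearest("origin_C", origins, 2))
print(mutual("origin_A", "origin_C", origins, 2))
print(mutual("origin_C", "origin_D", origins, 2))
```

Here `origin_C` is among the two nearest neighbours of `origin_A`, but `origin_A` is not among those of `origin_C`, because `origin_C` has closer neighbours of its own.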
