Frontiers of Data and Computing ›› 2021, Vol. 3 ›› Issue (1): 112-121.

• Technology and Application •

Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis

ZHAI Qingchen1,3, ZHOU Yuanchun1, SONG Qiucheng1, WANG Jianwei2, MENG Zhen1,*, ZHANG Yanling2,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, Henan 450001, China
3. University of Chinese Academy of Sciences, Beijing 100049, China
• Received: 2020-11-24 Online: 2021-02-20 Published: 2021-02-07
• Contact: MENG Zhen, ZHANG Yanling E-mail: zhaiqingchen@cnic.cn; zyc@cnic.cn; sqc@cnic.cn; wangjw@ztri.com.cn; zhenm99@cnic.cn; zhangyanling@ztri.com.cn

Abstract:

[Objective] This study measures and classifies the similarity of tobacco leaves while addressing the problem that distance measures lose their discriminative power in high-dimensional space. [Methods] Three methods, the variance weighted method, principal component analysis, and locally linear embedding, are used to reduce the dimensionality and filter the attributes of tobacco leaves. Similarity is computed from the selected indicators, and the K-means algorithm is then used to cluster the tobacco leaves. [Results] Comparing the cluster silhouette coefficients of the input indicators obtained by the different methods shows that the four indicators selected by the variance weighted method (total alkaloids, reducing sugars, potassium, and stalk rate) cluster better than the indicators produced by the other two dimension reduction algorithms. The clustering algorithm divides the tobacco leaves into four categories, and the characteristics of the input indicators in each category are analyzed. The results of the similarity algorithm are consistent with existing industry research for three similar areas: Qilin district, Xuanwei county, and Luoping county. [Conclusions] Using machine learning algorithms, this article identifies similarities between the characteristic indicators in the sensory data of tobacco leaves and their place of origin, providing indicators and conclusions of reference value for tobacco industry production. It also provides an algorithmic basis for applying machine learning in the tobacco industry.
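The evaluation step described in the abstract, clustering with K-means and scoring the result with the silhouette coefficient, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmeans` and `silhouette`, the Euclidean distance, the random initialisation, and the two-dimensional toy points are all assumptions made for illustration, and the toy points are not tobacco data.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k of the data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre (Euclidean).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its centre
        if new_centers == centers:  # no centre moved, so we have converged
            break
        centers = new_centers
    return labels, centers


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centers = kmeans(points, k=2)
score = silhouette(points, labels)
print(labels, round(score, 3))
```

A silhouette value close to 1 indicates compact, well-separated clusters. Under the assumptions above, candidate indicator sets produced by different dimension reduction methods could be compared by clustering each set and preferring the one with the higher mean silhouette.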