Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis

doi:10.11871/jfdc.issn.2096-742X.2021.01.009

Frontiers of Data and Computing, 2021, Vol. 3, Issue 1: 112-121.

• Technology and Applications •

ZHAI Qingchen^1,3, ZHOU Yuanchun^1, SONG Qiucheng^1, WANG Jianwei^2, MENG Zhen^1,*, ZHANG Yanling^2,*

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, Henan 450001, China
3. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2020-11-24 Published: 2021-02-20 Released online: 2021-02-07
Corresponding authors: MENG Zhen, ZHANG Yanling
About the authors:
ZHAI Qingchen is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include machine learning, unsupervised learning, and representation learning. In this paper, he is responsible for data processing, model construction, experiments, and paper writing. E-mail: zhaiqingchen@cnic.cn
ZHOU Yuanchun, Ph.D., is a research fellow, doctoral supervisor, and deputy director of the Computer Network Information Center, Chinese Academy of Sciences, where he is also a CAS distinguished research fellow, chairman of the center's Degree Evaluation Committee, and director of the Department of Big Data Technology and Application Development. He is secretary-general of the National and Local Joint Engineering Laboratory of Big Data Analysis and Computing Technology and leads the Scientific Big Data Project of the CAS informatization program. He has published more than 90 SCI/EI-indexed papers. His main research interest is big data analysis and processing. In this paper, he is responsible for designing the research framework of the data mining algorithms. E-mail: zyc@cnic.cn
SONG Qiucheng, of Zhongke Chuangjia Company, has extensive experience in front-end development and data visualization. In this paper, he is responsible for data visualization. E-mail: sqc@cnic.cn
WANG Jianwei, M.S., is a senior agronomist and master's supervisor at the Zhengzhou Tobacco Research Institute of China National Tobacco Corporation. His main research interests are tobacco leaf production technology and big data applications for tobacco leaf quality. In this paper, he is responsible for evaluating the tobacco leaf similarity results from the perspective of the tobacco industry. E-mail: wangjw@ztri.com.cn
MENG Zhen is a senior engineer and master's supervisor at the Computer Network Information Center, Chinese Academy of Sciences, and deputy director of the Data Resources and Application Laboratory in the Department of Big Data Technology and Application Development. Her main research interests are the fusion management and association of multi-source heterogeneous data, and domain-oriented big data analysis models and cloud service technologies. She has published more than 20 SCI/EI-indexed papers. In this paper, she is responsible for the design and guidance of the data mining algorithm research framework. E-mail: zhenm99@cnic.cn
ZHANG Yanling, Ph.D., is a researcher, master's supervisor, and deputy director of the Key Laboratory of Eco-environment and Tobacco Leaf Quality of the tobacco industry (CNTC). Her main research interests are tobacco leaf quality evaluation and the application of big data technology to it. In this paper, she is responsible for the typical application of the innovation platform in tobacco leaf quality evaluation. E-mail: zhangyanling@ztri.com.cn
Funding:
Major Science and Technology Program of China National Tobacco Corporation, "Construction and Application Research of Tobacco Leaf Quality Big Data" (110201901025SJ-04); Key Deployment Project of the Center for Ocean Mega-Science, Chinese Academy of Sciences (COMS2019Q17)

Abstract

[Objective] This work measures the similarity of tobacco leaf producing areas and classifies them, while overcoming the breakdown of distance measures in high-dimensional space. [Methods] Three methods, the variance weight method, principal component analysis (PCA), and locally linear embedding (LLE), are used to reduce the dimensionality of the tobacco leaf attribute indicators and to screen them. Similarity is then computed, and K-means clustering performed, on the reduced data. [Results] Comparing the silhouette coefficients of the clusterings obtained from each method's input indicators shows that the four indicators selected by the variance weight method (total alkaloids, reducing sugar, potassium, and stem ratio) cluster better than the indicators produced by the two dimension reduction algorithms, and they have strong reference value for tobacco leaf quality evaluation. The K-means algorithm divides the producing areas into four classes and yields the attribute characteristics of each class. For the counties and districts represented by Qilin District, Xuanwei County, and Luoping County, the similar origins returned by the similarity algorithm agree with existing research in the industry. [Conclusions] Using machine learning algorithms, the article mines the characteristic indicators in tobacco leaf sensory data and the similarity between origins. It provides indicators and conclusions of reference value for industrial tobacco production, and an algorithmic basis for applying machine learning in the tobacco industry.

Key words: data mining, similarity, clustering, tobacco leaf
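
The comparison described in the abstract, clustering on different sets of input indicators and scoring each clustering by its silhouette coefficient, can be sketched as follows. This is a minimal pure-Python illustration and not the authors' implementation: the toy samples, the function names `kmeans` and `silhouette`, and the deterministic seeding from the first k points are all assumptions made for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm. The first k points seed the centroids, an assumption
    made here for reproducibility rather than a recommended initialisation."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient, (b - a) / max(a, b) averaged over points.
    Requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)        # usual convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy samples: columns 0 and 1 separate two groups, column 2 is noise.
samples = [
    (1.0, 1.1, 5.0), (5.0, 5.2, 0.5), (1.2, 0.9, 0.0),
    (0.9, 1.0, 9.0), (5.1, 4.9, 8.0), (4.8, 5.0, 4.0),
]

def select(rows, cols):
    """Keep only the chosen indicator columns."""
    return [tuple(row[i] for i in cols) for row in rows]

subset = select(samples, [0, 1])      # screened indicators
labels_subset = kmeans(subset, k=2)
labels_all = kmeans(samples, k=2)

score_subset = silhouette(subset, labels_subset)
score_all = silhouette(samples, labels_all)
print(f"screened indicators: {score_subset:.3f}, all indicators: {score_all:.3f}")
```

On this toy data the clustering built from the two informative columns scores higher than the one built from all three, which is the kind of evidence the abstract uses to prefer one indicator set over another. In practice the indicators would normally be standardized first, and a library implementation such as scikit-learn's would replace the hand-written functions.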

ZHAI Qingchen, ZHOU Yuanchun, SONG Qiucheng, WANG Jianwei, MENG Zhen, ZHANG Yanling. Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis[J]. Frontiers of Data and Computing, 2021, 3(1): 112-121.

Figures/Tables (7): Table 1, Figure 1, Figure 2, Table 2, Figure 3, Figure 4, Table 3

Component	Initial eigenvalue (total)	% of variance	Cumulative %
1	4.368	33.598	33.598
2	2.518	19.368	52.967
3	1.071	8.235	61.201
4	0.969	7.454	68.655
5	0.794	6.104	74.759
6	0.680	5.230	79.989
7	0.577	4.439	84.428
8	0.568	4.371	88.799
9	0.484	3.724	92.523
10	0.382	2.940	95.463
11	0.250	1.926	97.389
12	0.226	1.739	99.128
13	0.113	0.872	100.000
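
The percentage and cumulative columns follow directly from the eigenvalues: each eigenvalue is divided by their sum, which here is 13, the number of components. The short sketch below recomputes both columns from the rounded eigenvalues in the table, so its output can differ from the printed values in the third decimal place. The Kaiser criterion (eigenvalue greater than 1) is included only as a common heuristic; the table does not state which cut-off the authors applied.

```python
from itertools import accumulate

# Eigenvalues as listed in the table (13 components).
eigenvalues = [4.368, 2.518, 1.071, 0.969, 0.794, 0.680, 0.577,
               0.568, 0.484, 0.382, 0.250, 0.226, 0.113]

total = sum(eigenvalues)                          # about 13, the number of components
pct_variance = [100 * ev / total for ev in eigenvalues]
cumulative = list(accumulate(pct_variance))       # running sum of the percentages

# Kaiser criterion: a heuristic used here for illustration only.
n_kaiser = sum(ev > 1 for ev in eigenvalues)

for i, (ev, p, c) in enumerate(zip(eigenvalues, pct_variance, cumulative), 1):
    print(f"{i:2d}  {ev:.3f}  {p:6.3f}  {c:7.3f}")
print("components with eigenvalue > 1:", n_kaiser)
```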

Attribute	Cluster 1	Cluster 2	Cluster 3	Cluster 4
Number of origins	123	10	51	24
Reducing sugar (%)	27.92	29.87	23.63	24.83
Stem ratio (%)	30.84	23.63	32.37	27.45
Potassium (%)	2.02	1.48	2.04	1.66
Total alkaloids (%)	2.22	1.60	2.53	2.33
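
As an illustration of how such cluster profiles can be used, the sketch below treats the four rows of means as centroids and assigns a sample to the nearest one, which is the assignment step of K-means. The sample values and the function name `nearest_cluster` are hypothetical. Distances are taken on the raw percentages, an assumption made for simplicity: on this scale reducing sugar and stem ratio dominate, so a real analysis would more likely standardize the indicators first.

```python
import math

# Cluster means from the table, in the order
# (reducing sugar %, stem ratio %, potassium %, total alkaloids %).
centroids = {
    1: (27.92, 30.84, 2.02, 2.22),
    2: (29.87, 23.63, 1.48, 1.60),
    3: (23.63, 32.37, 2.04, 2.53),
    4: (24.83, 27.45, 1.66, 2.33),
}
sizes = {1: 123, 2: 10, 3: 51, 4: 24}   # number of origins per cluster

def nearest_cluster(sample):
    """Return the cluster whose mean is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(sample, centroids[c]))

# Share of all origins that falls in each cluster.
n_origins = sum(sizes.values())
shares = {c: n / n_origins for c, n in sizes.items()}

# Hypothetical sample, not taken from the article.
sample = (24.0, 32.0, 2.0, 2.5)
print("nearest cluster:", nearest_cluster(sample))
print("cluster shares:", {c: round(s, 3) for c, s in shares.items()})
```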

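Origin	Similar origins
Qilin District	Yiliang County, Xuanwei County, Jianghua Yao Autonomous County, Ningyuan County, Luoping County
Xuanwei County	Baokang County, Luoping County, Luliang County, Qilin District, Tengchong County
Luoping County	Xuanwei City, Lichuan City, Yongxing County, Baokang County, Tengchong County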
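
A list of the most similar origins is not necessarily symmetric: one origin can appear in another's list without the reverse holding. The sketch below checks this for the three origins in the table, using shortened romanized names. It assumes that "Xuanwei County" and "Xuanwei City" refer to the same place, and the helper `is_mutual` is a name introduced only for this example.

```python
# Similar origins as listed in the table (shortened romanized names).
# Assumption: "Xuanwei County" and "Xuanwei City" denote the same origin.
similar = {
    "Qilin": ["Yiliang", "Xuanwei", "Jianghua", "Ningyuan", "Luoping"],
    "Xuanwei": ["Baokang", "Luoping", "Luliang", "Qilin", "Tengchong"],
    "Luoping": ["Xuanwei", "Lichuan", "Yongxing", "Baokang", "Tengchong"],
}

def is_mutual(a, b):
    """True when each origin appears in the other's list of similar origins."""
    return b in similar.get(a, []) and a in similar.get(b, [])

pairs = [("Qilin", "Xuanwei"), ("Xuanwei", "Luoping"), ("Qilin", "Luoping")]
mutual = {pair: is_mutual(*pair) for pair in pairs}

for (a, b), flag in mutual.items():
    print(f"{a} <-> {b}: {'mutual' if flag else 'one-way'}")
```

Qilin and Xuanwei list each other, as do Xuanwei and Luoping, while Luoping appears in Qilin's list but Qilin does not appear in Luoping's.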
