Frontiers of Data and Computing ›› 2021, Vol. 3 ›› Issue (1): 112-121.

doi: 10.11871/jfdc.issn.2096-742X.2021.01.009

• Technology and Application •

Application of Data Dimension Reduction and Clustering Algorithm in Tobacco Leaf Similarity Analysis

ZHAI Qingchen¹,³, ZHOU Yuanchun¹, SONG Qiucheng¹, WANG Jianwei², MENG Zhen¹,*, ZHANG Yanling²,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, Henan 450001, China
  3. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-11-24  Published: 2021-02-20  Released online: 2021-02-07
  • Corresponding authors: MENG Zhen, ZHANG Yanling
  • About the authors:
    ZHAI Qingchen is a master's student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include machine learning, unsupervised learning, and representation learning. In this paper, he is responsible for data processing, model construction, experiments, and writing. E-mail: zhaiqingchen@cnic.cn
    ZHOU Yuanchun, Ph.D., is a research fellow, Ph.D. supervisor, and deputy director of the Computer Network Information Center, Chinese Academy of Sciences, and a Distinguished Research Fellow of the Chinese Academy of Sciences. He chairs the center's Degree Evaluation Committee, directs the Department of Big Data Technology and Application Development, serves as secretary-general of the National-Local Joint Engineering Laboratory of Big Data Analysis and Computing Technology, and leads the scientific big data project under the CAS informatization program. He has published more than 90 SCI/EI-indexed papers. His main research interest is big data analysis and processing. In this paper, he is responsible for designing the research framework for the data mining algorithms. E-mail: zyc@cnic.cn
    SONG Qiucheng works at Zhongke Chuangjia Company and has extensive experience in front-end development and data visualization. In this paper, he is responsible for data visualization. E-mail: sqc@cnic.cn
    WANG Jianwei holds a master's degree and is a senior agronomist and master's supervisor at the Zhengzhou Tobacco Research Institute of China National Tobacco Corporation. His main research interests are tobacco leaf production technology and big data applications for tobacco leaf quality. In this paper, he is responsible for evaluating the tobacco leaf similarity results from the perspective of the tobacco industry. E-mail: wangjw@ztri.com.cn
    MENG Zhen is a senior engineer and master's supervisor at the Computer Network Information Center, Chinese Academy of Sciences, and deputy director of the Data Resource and Application Lab in the Department of Big Data Technology and Application Development. Her main research interests are fusion management and linking technologies for multi-source heterogeneous data, and domain-oriented big data analysis models and cloud service technologies. She has published more than 20 SCI/EI-indexed papers. In this paper, she is responsible for the design and guidance of the research framework for the data mining algorithms. E-mail: zhenm99@cnic.cn
    ZHANG Yanling, Ph.D., is a master's supervisor and deputy director of the Key Laboratory of Eco-environment and Tobacco Leaf Quality of the tobacco industry, CNTC. Her main research interests are tobacco leaf quality evaluation and the application of big data technology to tobacco leaf quality evaluation. In this paper, she is responsible for the typical application of the innovation platform in tobacco leaf quality evaluation. E-mail: zhangyanling@ztri.com.cn
  • Funding:
    Major Science and Technology Program of China National Tobacco Corporation, "Construction and Application of Tobacco Leaf Quality Big Data" (110201901025SJ-04); Key Deployment Project of the Center for Ocean Mega-Science, Chinese Academy of Sciences (COMS2019Q17)

Abstract:

[Objective] This study aims to measure the similarity of tobacco leaf producing areas and to classify them, while overcoming the failure of distance metrics in high-dimensional space. [Methods] Three methods, the variance weighting method, principal component analysis (PCA), and locally linear embedding (LLE), are used to reduce the dimensionality of the tobacco leaf attribute indicators and to select among them. The reduced data are then used for similarity calculation and K-means clustering. [Results] Comparing the silhouette coefficients of the clusterings obtained from each method's input indicators shows that the four indicators selected by the variance weighting method (total alkaloids, reducing sugar, potassium, and stem ratio) cluster better than the indicators produced by the two dimensionality reduction algorithms, and that they are a useful reference for tobacco leaf quality evaluation. K-means clustering divides the tobacco-producing areas into four classes and yields the attribute characteristics of each class. The similar producing areas identified by the similarity algorithm for counties and districts represented by Qilin District, Xuanwei County, and Luoping County agree with existing research in the industry. [Conclusions] Using machine learning algorithms, this study mines the characteristic indicators in tobacco leaf sensory data and the similarity patterns among producing areas. It provides indicators and conclusions of reference value for industrial tobacco production, and an algorithmic basis for applying machine learning in the tobacco industry.

Key words: data mining, similarity, clustering, tobacco leaf
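
The comparison described in the abstract (three ways of reducing the indicators, each followed by K-means and scored with the silhouette coefficient) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it assumes scikit-learn, uses synthetic data in place of the tobacco leaf measurements, and interprets the "variance weighting method" as keeping the indicators with the largest variance after min-max scaling, which the abstract does not spell out. The function names, the neighbour count for LLE, and the choice of four components are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def top_variance_features(X, k):
    """Return the (sorted) indices of the k columns with the largest variance.

    Assumption: columns are min-max scaled first so that variances are
    comparable across indicators measured in different units.
    """
    scaled = MinMaxScaler().fit_transform(X)
    order = np.argsort(scaled.var(axis=0))[::-1]  # largest variance first
    return np.sort(order[:k])


def silhouette_for(features, n_clusters=4, seed=0):
    """Cluster with K-means and return the mean silhouette coefficient."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    return float(silhouette_score(features, labels))


def compare_reductions(X, n_keep=4, n_clusters=4, seed=0):
    """Score variance-based selection, PCA, and LLE on the same data."""
    Z = StandardScaler().fit_transform(X)  # zero mean, unit variance
    idx = top_variance_features(X, n_keep)
    candidates = {
        "variance": Z[:, idx],
        "pca": PCA(n_components=n_keep, random_state=seed).fit_transform(Z),
        "lle": LocallyLinearEmbedding(
            n_components=n_keep, n_neighbors=10, random_state=seed
        ).fit_transform(Z),
    }
    scores = {name: silhouette_for(F, n_clusters, seed)
              for name, F in candidates.items()}
    return idx, scores


# Synthetic stand-in data: 4 informative columns that separate 4 groups,
# followed by 8 columns of pure noise (200 samples in total).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [4.0, 4.0, 0.0, 0.0],
                    [4.0, 0.0, 4.0, 4.0],
                    [0.0, 4.0, 4.0, 4.0]])
informative = np.vstack([c + rng.normal(scale=0.5, size=(50, 4))
                         for c in centers])
noise = rng.normal(scale=1.0, size=(200, 8))
X = np.hstack([informative, noise])

selected, scores = compare_reductions(X)
print("selected columns:", selected)
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

With real data, `X` would hold one row per sample (or per producing area) and one column per measured indicator. The silhouette coefficient lies in [-1, 1], and a higher value for one reduction than another is the kind of evidence the abstract refers to when it says the variance-selected indicators cluster better.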