Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (1): 175-185.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.013

doi: 10.11871/jfdc.issn.2096-742X.2025.01.013

• Technology and Applications •

Information Extraction from Chinese Wheat Varieties Journal Based on Large Language Model

WEI Yijin1,2, CHEN Yanqing3, WANG Xiudong4,5,*, FAN Jingchao1,2

  1. Agricultural Information Institute of CAAS, Beijing 100081, China
  2. National Agriculture Science Data Center, Beijing 100081, China
  3. Institute of Crop Sciences, CAAS, Beijing 100081, China
  4. Institute of Agricultural Economics and Development, CAAS, Beijing 100081, China
  5. Center for Strategic Studies, CAAS, Beijing 100081, China
  • Received: 2024-05-23 Online: 2025-02-20 Published: 2025-02-21
  • Corresponding author: *WANG Xiudong (E-mail: wangxiudong@caas.cn)
  • About the authors:
    WEI Yijin is a master's student at the Agricultural Information Institute, Chinese Academy of Agricultural Sciences (CAAS). Her research interests include agricultural information technology.
    In this paper, she is responsible for conducting the experiments and writing the manuscript.
    E-mail: weiyijin0816@163.com
    WANG Xiudong, Ph.D., is a professor at the Institute of Agricultural Economics and Development, Chinese Academy of Agricultural Sciences (CAAS). His research interests include food security and agricultural development strategies.
    In this paper, he is responsible for reviewing and finalizing the manuscript and took part in constructing the wheat germplasm information indicators.
    E-mail: wangxiudong@caas.cn
  • Funding:
    Science and Technology Innovation Program of the Agricultural Information Institute, Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2024-AII)

Abstract:

[Objective] To help turn wheat germplasm resources into advantages for the wheat industry and to enrich the genetic background of wheat, this paper uses a Large Language Model (LLM) and prompt engineering to mine information from the three published volumes of the Chinese Wheat Varieties Journal (《中国小麦品种志》). [Methods] The printed volumes were scanned and processed with OCR and other data-processing steps to obtain wheat variety data. Key extraction indicators oriented to the needs of breeding work were defined, together with the corresponding LLM prompts. Key information in the wheat variety data was then extracted automatically by calling a commercial LLM API, yielding a complete LLM-based workflow for wheat variety information extraction. [Results] Precision, recall, and F1 score were calculated from the number of relations actually present, the number of relations identified, and the number of relations correctly identified. For each of the three published volumes, the extraction scheme achieved a precision above 0.89, a recall above 0.73, and an F1 score above 0.84. [Conclusions] The high precision indicates that the scheme is capable of precise information extraction, while the lower recall shows that some information goes unrecognized. Judged by the F1 score the scheme is feasible overall, but the extraction results still require manual verification and review.

Key words: Large Language Model (LLM), agriculture, wheat, information mining, germplasm resources
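
The abstract states that precision, recall, and F1 are derived from three counts: relations actually present, relations identified, and relations correctly identified. A minimal sketch of that calculation is given below, using the standard definitions of the three metrics. The class name `ExtractionCounts`, the function name `precision_recall_f1`, and the example counts are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionCounts:
    """Counts for one evaluation run (hypothetical container, not from the paper)."""
    actual: int      # relations actually present in the source text
    identified: int  # relations returned by the extraction scheme
    correct: int     # returned relations that are correct


def precision_recall_f1(counts: ExtractionCounts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) using the standard definitions."""
    # Precision: share of identified relations that are correct.
    precision = counts.correct / counts.identified if counts.identified else 0.0
    # Recall: share of actually present relations that were correctly identified.
    recall = counts.correct / counts.actual if counts.actual else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not results from the paper.
    example = ExtractionCounts(actual=100, identified=80, correct=72)
    p, r, f1 = precision_recall_f1(example)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Guarding against zero denominators keeps the function usable on entries where nothing was identified or nothing was present, a case the abstract does not discuss.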