Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (1): 175-185.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.01.013

doi: 10.11871/jfdc.issn.2096-742X.2025.01.013

• Technology and Application •

Information Extraction from Chinese Wheat Varieties Journal Based on Large Language Model

WEI Yijin1,2, CHEN Yanqing3, WANG Xiudong4,5,*, FAN Jingchao1,2

    1. Agriculture Information Institution of CAAS, Beijing 100081, China
    2. National Agriculture Science Data Center, Beijing 100081, China
    3. Institute of Crop Sciences, CAAS, Beijing 100081, China
    4. Institution of Agricultural Economics and Development, CAAS, Beijing 100081, China
    5. Center for Strategic Studies, CAAS, Beijing 100081, China
  • Received: 2024-05-23 Online: 2025-02-20 Published: 2025-02-21

Abstract:

[Objective] To help turn wheat germplasm resources into advantages for the wheat industry and to enrich the genetic background of wheat, this paper studies information mining from the three published volumes of the Chinese Wheat Varieties Journal using a Large Language Model (LLM) and prompt engineering. [Methods] The paper volumes of the Chinese Wheat Varieties Journal were scanned, and OCR and other data processing steps were applied to obtain wheat variety data. Key extraction indices for the wheat variety data, together with the corresponding LLM prompts, were developed around the needs of breeding work. Key information was then extracted automatically from the wheat variety data by calling commercial LLM APIs, yielding a well-established workflow for extracting wheat variety information with LLMs. [Results] Precision, recall, and F1 were calculated from the number of relations that actually exist, the number of relations recognized, and the number of relations correctly recognized in the information extraction task. For the three published volumes of the Chinese Wheat Varieties Journal, the extraction scheme achieved more than 0.89 precision, 0.73 recall, and 0.84 F1. [Conclusions] The high precision of the extraction scheme indicates that it is capable of precise information extraction, while the recall indicates that some information goes unrecognized. Judged by the combined F1 score, the scheme is feasible overall, but manual verification and review of the extraction results are still required.
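
The three evaluation measures named in the Results can be sketched from the three relation counts as follows. This is a minimal illustration that assumes the standard definitions of precision, recall, and F1. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def extraction_metrics(n_actual, n_recognized, n_correct):
    """Compute (precision, recall, f1) from relation counts.

    n_actual     -- relations that actually exist in the text
    n_recognized -- relations the extraction scheme reported
    n_correct    -- reported relations that are correct
    """
    if min(n_actual, n_recognized, n_correct) < 0:
        raise ValueError("counts must be non-negative")
    if n_correct > min(n_actual, n_recognized):
        raise ValueError("n_correct cannot exceed the other two counts")

    # Precision: share of recognized relations that are correct.
    precision = n_correct / n_recognized if n_recognized else 0.0
    # Recall: share of existing relations that were correctly recognized.
    recall = n_correct / n_actual if n_actual else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not the paper's data).
precision, recall, f1 = extraction_metrics(
    n_actual=100, n_recognized=80, n_correct=72
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these made-up counts, a scheme that reports few wrong relations but misses some existing ones shows the same pattern the abstract describes: precision higher than recall, with F1 in between.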

Key words: Large Language Model (LLM), agriculture, wheat, information mining, genetic resources