Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (3): 30-45.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.003

doi: 10.11871/jfdc.issn.2096-742X.2022.03.003

• Special Issue: Advanced Intelligent Computing Platform and Application •

NKCorpus: Extracting High Quality Large Chinese Dataset from Web Data

LI Dongwen, ZHONG Zhenyu, SHEN Junyu, WANG Haotian, SUN Yufei, ZHANG Yuzhi

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2022-02-18 Online: 2022-06-20 Published: 2022-06-20
  • Contact: SUN Yufei E-mail:lidongwen@mail.nankai.edu.cn;zyzhong@mail.nankai.edu.cn;junyu_nk@163.com;1392300702@qq.com;yufei_sun@sina.com;zyz@nankai.edu.cn

Abstract:

[Objective] For large-scale Chinese pre-trained language models and other natural language processing models, collecting and processing large-scale, high-quality Chinese data for model training is essential. A comprehensive framework for constructing such large-scale datasets is therefore required. [Methods] We process data through a pipeline of preprocessing procedures, including language extraction, text cleaning, and deduplication. The performance of the framework is further optimized with parallel computing techniques. [Results] We propose NKCorpus, a comprehensive and efficient dataset construction framework that builds large-scale, high-quality Chinese corpus datasets from massive web data, and use it to construct a high-quality Chinese dataset of about 700 GB. [Conclusions] NKCorpus meets current needs for the efficient construction of large-scale, high-quality Chinese datasets.
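The pipeline stages named in the abstract (language extraction, text cleaning, deduplication) can be sketched as a minimal, illustrative Python chain. This is not the NKCorpus implementation; the CJK-ratio threshold, the regex-based cleaner, and the MD5 exact-match deduplication are all simplifying assumptions made here for illustration.

```python
import hashlib
import re

# Hypothetical threshold: keep documents whose CJK-character ratio exceeds it.
CJK_RATIO_THRESHOLD = 0.5
CJK_RE = re.compile(r'[\u4e00-\u9fff]')

def is_chinese(text: str) -> bool:
    """Language-extraction step: keep text that is mostly Chinese characters."""
    if not text:
        return False
    return len(CJK_RE.findall(text)) / len(text) >= CJK_RATIO_THRESHOLD

def clean(text: str) -> str:
    """Text-cleaning step: strip HTML tags and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

def build_corpus(docs):
    """Pipeline: cleaning -> language extraction -> exact deduplication."""
    seen, out = set(), []
    for doc in docs:
        doc = clean(doc)
        if not is_chinese(doc):
            continue
        digest = hashlib.md5(doc.encode('utf-8')).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        out.append(doc)
    return out
```

In a production setting each stage would be far more elaborate (e.g. trained language identifiers and near-duplicate detection), and, as the abstract notes, the stages would be parallelized across workers rather than run in a single loop.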

Key words: natural language processing, Chinese dataset, dataset construction