Frontiers of Data and Computing, 2022, Vol. 4, Issue (3): 30-45.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.003

doi: 10.11871/jfdc.issn.2096-742X.2022.03.003

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II) •



  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2022-02-18 Online: 2022-06-20 Published: 2022-06-20
  • Corresponding author: SUN Yufei
  • Author biographies:
    LI Dongwen (李东闻) is currently a doctoral student in the College of Software at Nankai University, Tianjin, China. Her current research interests include anomaly detection, deep learning, and natural language processing.
    In this paper, she is mainly responsible for the overall project development and paper writing.
    E-mail: lidongwen@mail.nankai.edu.cn
    ZHONG Zhenyu (钟震宇) is currently a master's student in the College of Software at Nankai University, Tianjin, China. His current research interests include natural language processing, high-performance computing, and AIOps.
    In this paper, he is mainly responsible for the overall project development and paper writing.
    E-mail: zyzhong@mail.nankai.edu.cn
    SHEN Junyu (申峻宇) is currently an undergraduate in the College of Software at Nankai University, Tianjin, China. His current research interest is natural language processing.
    In this paper, he is mainly involved in the project development of data deduplication and paper writing.
    E-mail: junyu_nk@163.com
    WANG Haotian (王昊天) is currently a doctoral student in the College of Software at Nankai University, Tianjin, China. His current research interest is natural language processing.
    In this paper, he is mainly involved in the project development of data deduplication and high-quality data filtering.
    E-mail: 1392300702@qq.com
    SUN Yufei (孙羽菲), Ph.D., is a professor at the College of Software, Nankai University. Her research interests include deep learning, heterogeneous computing, and artificial intelligence.
    In this paper, she is mainly responsible for project guidance and the final compilation and editing of the paper.
    E-mail: yufei_sun@sina.com
    ZHANG Yuzhi (张玉志), Ph.D., is currently a distinguished professor and the dean of the College of Software at Nankai University, Tianjin, China. His research interests include deep learning and other areas of artificial intelligence.
    In this paper, he is mainly responsible for project guidance and the related work investigation.
    E-mail: zyz@nankai.edu.cn

NKCorpus: Extracting High Quality Large Chinese Dataset from Web Data

LI Dongwen, ZHONG Zhenyu, SHEN Junyu, WANG Haotian, SUN Yufei, ZHANG Yuzhi

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received:2022-02-18 Online:2022-06-20 Published:2022-06-20
  • Contact: SUN Yufei





[Objective] Training large-scale Chinese pre-trained language models and other natural language processing models depends on collecting and processing large volumes of high-quality Chinese data, which calls for a comprehensive framework for constructing large-scale datasets. [Methods] We process the data through a pipeline of preprocessing steps, including language extraction, text cleaning, and deduplication, and we optimize the framework's performance with parallel computing techniques. [Results] We propose NKCorpus, a comprehensive and efficient dataset construction framework that builds large-scale, high-quality Chinese corpora from massive web data, and we use it to construct a high-quality Chinese dataset of about 700 GB. [Conclusions] NKCorpus can meet the current need for efficient construction of large-scale, high-quality Chinese datasets.

Key words: natural language processing, Chinese dataset, dataset construction
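
The pipeline named in the abstract (language extraction, text cleaning, deduplication, parallel execution) can be illustrated with a minimal sketch. This is not the NKCorpus implementation: the abstract does not give the algorithms, so everything below is an assumption chosen for illustration, including the character-ratio language heuristic, the 0.5 threshold, exact-hash deduplication, the thread pool, and all function names.

```python
import hashlib
import re
from concurrent.futures import ThreadPoolExecutor

# All names and thresholds here are hypothetical, not taken from NKCorpus.
TAG_RE = re.compile(r"<[^>]+>")         # crude HTML-like tag matcher
SPACE_RE = re.compile(r"\s+")           # runs of whitespace
CJK_RE = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block


def clean_text(raw: str) -> str:
    """Text cleaning: strip tags and collapse whitespace."""
    return SPACE_RE.sub(" ", TAG_RE.sub(" ", raw)).strip()


def chinese_ratio(text: str) -> float:
    """Fraction of non-space characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_RE.match(c)) / len(chars)


def process_one(raw: str, min_ratio: float = 0.5):
    """Clean one document; return None if it does not look Chinese."""
    text = clean_text(raw)
    return text if chinese_ratio(text) >= min_ratio else None


def build_corpus(raw_docs, workers: int = 4):
    """Clean and filter in parallel, then drop exact duplicates in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = list(pool.map(process_one, raw_docs))
    seen, corpus = set(), []
    for text in cleaned:
        if text is None:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            corpus.append(text)
    return corpus


docs = [
    "<p>今天天气很好。</p>",
    "今天天气很好。",
    "This page is written in English.",
    "<div>自然语言处理  需要大规模语料。</div>",
]
corpus = build_corpus(docs)
print(corpus)
```

In this sketch the first two documents collapse into one entry after cleaning, and the English page is filtered out. A production pipeline over web-scale data would more plausibly use a trained language identifier, near-duplicate detection, and process-level or cluster-level parallelism; those choices are outside what the abstract states.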