Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (3): 30-45.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.003

doi: 10.11871/jfdc.issn.2096-742X.2022.03.003

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II) •

NKCorpus: Extracting a Large, High-Quality Chinese Dataset from Massive Web Data

LI Dongwen, ZHONG Zhenyu, SHEN Junyu, WANG Haotian, SUN Yufei, ZHANG Yuzhi

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2022-02-18  Online: 2022-06-20  Published: 2022-06-20
  • Corresponding author: SUN Yufei
  • About the authors:
    LI Dongwen is currently a doctoral student in the College of Software at Nankai University, Tianjin, China. Her current research interests include anomaly detection, deep learning, and natural language processing. In this paper, she is mainly responsible for the overall project development and paper writing. E-mail: lidongwen@mail.nankai.edu.cn

    ZHONG Zhenyu is currently a master’s student in the College of Software at Nankai University, Tianjin, China. His current research interests include natural language processing, high-performance computing, and AIOps. In this paper, he is mainly responsible for the overall project development and paper writing. E-mail: zyzhong@mail.nankai.edu.cn

    SHEN Junyu is currently an undergraduate in the College of Software at Nankai University, Tianjin, China. His current research interest is natural language processing. In this paper, he is mainly involved in the project development of data deduplication and in paper writing. E-mail: junyu_nk@163.com

    WANG Haotian is currently a doctoral student in the College of Software at Nankai University, Tianjin, China. His current research interest is natural language processing. In this paper, he is mainly involved in the project development of data deduplication and high-quality data filtering. E-mail: 1392300702@qq.com

    SUN Yufei, Ph.D., is a professor at the College of Software, Nankai University. Her research interests include deep learning, heterogeneous computing, and artificial intelligence. In this paper, she is mainly responsible for project guidance and the final compilation and editing of the paper. E-mail: yufei_sun@sina.com

    ZHANG Yuzhi, Ph.D., is currently a distinguished professor and the dean of the College of Software at Nankai University, Tianjin, China. His research interests include deep learning and other areas of artificial intelligence. In this paper, he is mainly responsible for project guidance and the related-work investigation. E-mail: zyz@nankai.edu.cn



Abstract:

[Objective] Large-scale, high-quality Chinese datasets are essential for training large Chinese pre-trained language models and other natural language processing models, so a comprehensive framework for constructing such datasets is needed. [Methods] Raw web data are processed through a pipeline of language extraction, text cleaning, and data deduplication, and the framework's efficiency is further optimized with parallel computing techniques. [Results] We propose NKCorpus, a comprehensive and efficient framework that builds large-scale, high-quality Chinese datasets from massive web data, and use it to construct a high-quality Chinese dataset of about 700 GB that can be used directly to train Chinese pre-trained language models. [Conclusions] NKCorpus largely meets the current need for the efficient construction of large-scale, high-quality Chinese datasets.

Key words: natural language processing, Chinese dataset, dataset construction
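
As a rough, non-authoritative illustration of the pipeline the abstract describes (language extraction, text cleaning, deduplication, and per-document parallelism), the sketch below shows one minimal way such a flow could be wired up in Python. Every function name and heuristic here, as well as the use of multiprocessing and exact hash-based deduplication, is an assumption made for illustration; the paper's actual implementation is not shown on this page.

```python
# Illustrative sketch only: a minimal web-text pipeline in the spirit of
# NKCorpus (language extraction -> cleaning -> deduplication), parallelized
# with multiprocessing. Names and heuristics are assumptions, not the
# authors' implementation.
import hashlib
import re
from multiprocessing import Pool
from typing import List, Optional


def is_chinese(text: str, threshold: float = 0.5) -> bool:
    """Language extraction: keep documents that are mostly CJK characters."""
    if not text:
        return False
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text) >= threshold


def clean(text: str) -> str:
    """Text cleaning: drop markup remnants and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text


def process(doc: str) -> Optional[str]:
    """Clean one raw document; return it only if it passes the language filter."""
    doc = clean(doc)
    return doc if is_chinese(doc) else None


def build_dataset(raw_docs: List[str]) -> List[str]:
    # Parallelize the per-document cleaning/filtering across CPU cores.
    with Pool() as pool:
        cleaned = pool.map(process, raw_docs)
    # Exact deduplication via content hashes (a real system might use
    # fuzzier near-duplicate matching; this is the simplest stand-in).
    seen, dataset = set(), []
    for doc in cleaned:
        if doc:
            digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                dataset.append(doc)
    return dataset


if __name__ == "__main__":
    raw = [
        "<p>这是一个测试文档。</p>",
        "Mostly English text here.",
        "这是一个测试文档。",
    ]
    # Keeps exactly one cleaned Chinese document: the English document is
    # filtered out, and the repeated Chinese document is deduplicated.
    print(build_dataset(raw))
```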