Frontiers of Data & Computing ›› 2020, Vol. 2 ›› Issue (1): 117-127.

doi: 10.11871/jfdc.issn.2096-742X.2020.01.010

Special topic: Special Issue on "High-Performance and High-Throughput Computing and Applications"


Challenges of High-Throughput Computing in Genomic Data Analysis for Large-Scale Cohort Studies

Zeng Jingyao1, Yuan Na1, Wei Wenjuan2, Li Gen2,*, Du Zhenglin1,*

  1. National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  2. Genetalks Biotechnology Co., Ltd., Changsha, Hunan 410152, China
  • Received: 2019-11-29 Online: 2020-02-20 Published: 2020-03-28
  • Corresponding authors: Li Gen, Du Zhenglin
  • About the authors:
    Zeng Jingyao, PhD, is an assistant researcher at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences. Her research focuses on genomic analysis of large-scale cohort studies and on the integration and mining of tumor multi-omics data. She also led the design and development of the Chinese Genomic Variation Database.
    She wrote Section 1 and revised the full paper.
    E-mail: zengjy@big.ac.cn
    Yuan Na holds a master's degree in computer science and is an engineer at the Big Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences. She works mainly on genetic variation analysis for large-scale cohort studies and on database development.
    She wrote Section 2.
    E-mail: yuann@big.ac.cn
    Wei Wenjuan, PhD, was a joint master's and doctoral student of Washington University in St. Louis, School of Medicine, and the Shanghai Institute for Biological Sciences, Chinese Academy of Sciences (2008-2014). After graduation, she worked at Arkansas Children's Hospital and North Carolina State University on biostatistics (2014-2015). She then joined Genetalks, where she has worked on genetic testing and genetic data solutions (2016-present). Dr. Wei has broad experience in molecular biology, biostatistics, bioinformatics, and precision medicine.
    She and Li Gen co-wrote Sections 3 and 4.
    E-mail: weiwenjuan@sibcb.ac.cn
    Li Gen is the CIO of Genetalks Inc. He received his PhD in Computer Science and Technology from the National University of Defense Technology. His research interests include high-performance computing for genomic analysis, dynamic compilation optimization, and genomic data compression.
    He and Wei Wenjuan co-wrote Sections 3 and 4.
    Du Zhenglin, PhD, is a senior engineer at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences. His research interests include genomics and bioinformatics.
    He designed the overall structure of the paper and wrote the abstract.
  • Funding:
    National Key R&D Program of China, "Construction of a Precision Medicine Knowledge Base for Disease Research" (2016YFC0901900); National Key R&D Program of China, "Biomedical Application Service Community Based on the National High-Performance Computing Environment" (2016YFB0201700)

Abstract:

[Objective] To advance precision medicine research, countries around the world have launched large-scale population cohort genome sequencing projects, which build population-specific genomic variation maps through whole-genome sequencing of tens of thousands of individuals. The massive genomic data these projects produce place new demands on computing speed and throughput and call for faster, higher-throughput computing platforms to process and interpret the sequence information. However, owing to the characteristics of genomic data and the diversity and complexity of analysis workflows, high-throughput computing (HTC) resources are used inefficiently in large-scale population variant analysis: computation is slow and time-consuming, and data exchange between servers and local storage is inconvenient. Genomic variant analysis therefore needs to be optimized on multiple fronts, through both software and hardware development. This paper analyzes and reviews these optimization methods. [Methods] In an HTC system, the system IO bottleneck is the main cause of low parallel efficiency in genomic variant analysis. Distributed unstructured storage databases and object storage systems are commonly used to improve the large-scale scalability of IO and resolve the IO problems in analysis workflows, while efficient compression algorithms for genomic data reduce the IO and data-transfer load. To accelerate genomic data analysis, algorithms such as neural networks can be used on the software side to optimize analysis methods, and FPGA (field-programmable gate array) or GPU heterogeneous computing can be used on the hardware side to raise data processing speed. [Results] Taken together, these optimizations can substantially improve HTC performance in genomic data analysis: they address the IO wall (storage wall) problem in genomic data processing, improve the utilization of HTC resources, and greatly reduce the computing time of whole-genome variant analysis. [Conclusions] The problems that HTC faces in genomic data analysis can be addressed through software and hardware development and optimization, which significantly improves its computing efficiency in variant analysis for large-scale population cohorts and will promote the broader development of cohort genomics research and applications.

Key words: high-throughput computing, IO performance, genomic variation analysis, heterogeneous acceleration, data compression