Frontiers of Data and Computing ›› 2020, Vol. 2 ›› Issue (1): 117-127.

doi: 10.11871/jfdc.issn.2096-742X.2020.01.010

Special Issue: "High-Performance and High-Throughput Computing and Applications"


Challenges of High-Throughput Computing in Genomic Data Analysis for Large-Scale Cohort Studies

Zeng Jingyao1, Yuan Na1, Wei Wenjuan2, Li Gen2,*, Du Zhenglin1,*

  1. National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  2. Genetalks Biotechnology Co., Ltd., Changsha, Hunan 410152, China
  • Received: 2019-11-29 Online: 2020-02-20 Published: 2020-03-28
  • Contact: Li Gen, Du Zhenglin E-mail: gen.li@genetalks.com; duzhl@big.ac.cn

Abstract:

[Objective] To advance precision medicine research, large-scale population genomic studies have been carried out worldwide, and population-specific genome variation maps have been built by whole-genome sequencing of thousands of individuals. These projects produce massive volumes of genomic data that require high-throughput computing (HTC) to process. However, because of the characteristics of genomic data and the diversity and complexity of processing workflows, HTC resources are not fully utilized in genomic data analysis tasks; as a result, computation is slow and data exchange between servers is inconvenient. HTC platforms therefore need to be optimized for genomic data analysis in both software and hardware. This paper analyzes and summarizes these optimization methods. [Methods] In an HTC system, the system IO bottleneck is the main cause of low parallel efficiency in genomic data processing. Distributed unstructured storage databases and object storage systems are generally used to improve the scalability of large-scale IO and to relieve IO problems in data processing. The IO load can also be reduced by using efficient compression algorithms for genomic data. To accelerate genomic data processing, algorithms such as neural networks can be used to optimize genome analysis methods, and heterogeneous computing with FPGAs or GPUs can be used to speed up data analysis. [Results] The optimizations above can greatly improve HTC performance by addressing the IO wall problem in genomic data analysis and improving the utilization of HTC resources, which greatly reduces the computing time of genome-wide variation analysis. [Conclusions] Software and hardware improvements can significantly increase HTC efficiency and speed in genomic data analysis and can promote the application of high-throughput computing to large-scale cohort studies in the future.

Key words: high-throughput computing, IO performance, genomic variation analysis, heterogeneous acceleration, data compression