Challenges of High-Throughput Computing in Genomic Data Analysis for Large-Scale Cohort Studies

doi:10.11871/jfdc.issn.2096-742X.2020.01.010

Abstract

[Objective] To advance precision medicine research, large-scale population genomic studies have been carried out worldwide, and population-specific genome variation maps have been built by whole-genome sequencing of thousands of individuals. These projects produce massive volumes of genomic data that require high-throughput computing (HTC) to process. However, because of the characteristics of genomic data and the diversity and complexity of processing workflows, HTC resources are not fully utilized in genomic data analysis tasks; as a result, computation is slow and data exchange between servers is cumbersome. It is therefore necessary to optimize HTC platforms for genomic data analysis in both software and hardware. This paper analyzes and summarizes these optimization methods. [Methods] In an HTC system, the system IO bottleneck is the main cause of low parallel efficiency in genomic data processing. Distributed unstructured storage databases and object storage systems are generally used to improve the scalability of large-scale IO and to relieve the IO problems in data processing. The IO load can also be reduced by applying efficient compression algorithms to genomic data. To accelerate genomic data processing, algorithms such as neural networks can be used to optimize genome analysis methods, and heterogeneous computing with FPGAs or GPUs can be used to speed up data analysis. [Results] These optimizations can greatly improve HTC performance by addressing the IO wall problem in genomic data analysis and by raising the utilization of HTC resources, which greatly reduces the computing time of genome-wide variation analysis. [Conclusions] Software and hardware improvements can significantly increase the efficiency and speed of HTC in genomic data analysis and can promote the application of high-throughput computing to large-scale cohort studies in the future.

Keywords: high-throughput computing, IO performance, genomic variation analysis, heterogeneous acceleration, data compression

Zeng Jingyao, Yuan Na, Wei Wenjuan, Li Gen, Du Zhenglin. Challenges of High-Throughput Computing in Genomic Data Analysis for Large-Scale Cohort Studies[J]. Frontiers of Data and Computing, 2020, 2(1): 117-127.
