Current Status and Prospects of Genomics Data Analysis Methods

doi: 10.11871/jfdc.issn.2096-742X.2020.02.001

Frontiers of Data and Computing, 2020, Vol. 2, Issue (2): 1-19.

Special topic: Special issue on "Data Analysis Technology and Application"

Chen Meili^1, Ma Yingke^1, Li Rujiao^1,*, Bao Yiming^1,2,*

^1. National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences, Beijing 100101, China
^2. School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2020-01-21 Online: 2020-04-20 Published: 2020-06-03
Corresponding authors: Li Rujiao, Bao Yiming

About the authors:

Chen Meili, PhD, is a research assistant at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include the integration and mining of genomics, transcriptomics, and other omics data. For this paper she surveyed the literature and drafted the manuscript. She and Ma Yingke contributed equally. E-mail: chenml@big.ac.cn

Ma Yingke, PhD, is a research assistant at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include database development and bioinformatics software development. For this paper she surveyed the literature and drafted the manuscript. She and Chen Meili contributed equally. E-mail: mayk@big.ac.cn

Li Rujiao, PhD, is a senior engineer at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include omics big data integration and mining. In 2019 she was selected for the CAS Key Technology Talent Program. For this paper she surveyed the literature and drafted and revised the manuscript.

Bao Yiming is the director of the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences, where he is a professor and doctoral supervisor. His research interests include biological big data integration and mining, virus genome annotation, and virus evolution and classification. He received a B.S. degree in biochemistry from Peking University, Beijing, China, in 1987, and a Ph.D. degree in genetics from the John Innes Centre (through the University of East Anglia), UK, in 1994. He is currently the deputy director of the National Institute of Data Science in Health and Medicine, University of Chinese Academy of Sciences, and a member of the Computational Biology and Bioinformatics Committee of the Chinese Society of Biotechnology. For this paper he conceived and designed the overall article and revised the manuscript.

Funding:
National Key Research and Development Program of China, "International Life Omics Data Sharing Program" (2016YFE0206600); National Key Research and Development Program of China, "Compatibility and Integration of Disease Omics Data" (2017YFC0908403); Strategic Priority Research Program of the Chinese Academy of Sciences (Category B), "Precision Health Research on the Chinese Population Driven by Multidimensional Big Data" (XDB38000000); Informatization Program of the Chinese Academy of Sciences, "Big-Data-Driven Innovation Demonstration Platform for Bioinformatics" (XXH13505-05); Pioneer "Hundred Talents Program" of the Chinese Academy of Sciences

Abstract

[Objective] This review comprehensively describes the current status and future trends of genomics data analysis methods, providing a reference for algorithm research and tool development for omics data analysis in precision medicine, precision breeding, biosafety, biodiversity, and molecular evolution. [Results] Genomics data analysis mainly comprises the analysis of genomic, transcriptomic, and epigenomic data. The main challenges it currently faces are that the data are massive, multidimensional, and heterogeneous. This review elaborates on the current status, applications, open problems, and challenges of algorithm and tool development for genomics data analysis. [Conclusions] The future direction of algorithm and tool development for genomics data analysis is to make full use of advanced technologies such as artificial intelligence, statistical models, and knowledge graphs, and to continuously optimize and develop more advanced algorithms and more robust models that combine high fault tolerance, high accuracy, and high efficiency with low consumption of computing resources, so as to meet the demands of analyzing massive, multidimensional, and heterogeneous genomics big data.

Key words: genome, transcriptome, epigenome, big data analysis, multi-source heterogeneous data integration

Chen Meili, Ma Yingke, Li Rujiao, Bao Yiming. Current Status and Prospects of Genomics Data Analysis Methods[J]. Frontiers of Data and Computing, 2020, 2(2): 1-19.
