Frontiers of Data and Computing (数据与计算发展前沿), 2020, Vol. 2, Issue (2): 1-19.

doi: 10.11871/jfdc.issn.2096-742X.2020.02.001

Special topic: Special Issue on "Data Analysis Technology and Application"

Current Status and Prospects of Genomics Data Analysis Methods

Chen Meili1, Ma Yingke1, Li Rujiao1,*, Bao Yiming1,2,*

  1. National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences, Beijing 100101, China
  2. School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2020-01-21 Online: 2020-04-20 Published: 2020-06-03
  • Corresponding authors: Li Rujiao, Bao Yiming
  • About the authors:
    Chen Meili, PhD, is an assistant researcher at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include the integration and mining of genomics, transcriptomics, and other omics data. For this paper she surveyed the literature and drafted the manuscript. She and Ma Yingke contributed equally.
    E-mail: chenml@big.ac.cn
    Ma Yingke, PhD, is an assistant researcher at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include database development and bioinformatics software development. For this paper she surveyed the literature and drafted the manuscript. She and Chen Meili contributed equally.
    E-mail: mayk@big.ac.cn
    Li Rujiao, PhD, is a senior engineer at the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences. Her research interests include omics big data integration and mining. In 2019 she was selected for the CAS Key Technology Talent Program. For this paper she surveyed the literature and drafted and revised the manuscript.
    Bao Yiming is the director of the National Genomics Data Center, Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences, where he is also a professor and doctoral supervisor. His research interests include biological big data integration and mining, virus genome annotation, and virus evolution and classification. He received a B.S. degree in biochemistry from Peking University, Beijing, China, in 1987, and a Ph.D. degree in genetics from the John Innes Centre (through the University of East Anglia), UK, in 1994. He is currently the deputy director of the National Institute of Data Science in Health and Medicine, University of Chinese Academy of Sciences, and a member of the Computational Biology and Bioinformatics Committee of the Chinese Society of Biotechnology. For this paper he conceived and designed the article and revised the manuscript.
  • Funding:
    National Key R&D Program of China, "International Life Omics Data Sharing Program" (2016YFE0206600); National Key R&D Program of China, "Compatibility and Integration of Disease Omics Data" (2017YFC0908403); Strategic Priority Research Program of the Chinese Academy of Sciences (Category B), "Multi-dimensional Big Data-Driven Precision Health Research in the Chinese Population" (XDB38000000); Informatization Program of the Chinese Academy of Sciences, "Big Data-Driven Innovation Demonstration Platform for Bioinformatics" (XXH13505-05); Pioneer Initiative "Hundred Talents Program" of the Chinese Academy of Sciences

Abstract:

[Objective] This review surveys the current status and future development trends of genomics data analysis methods, as a reference for research on algorithms and the development of tools for omics data analysis in precision medicine, precision breeding, biosafety, biodiversity, and molecular evolution. [Results] Genomics data analysis mainly covers genomic, transcriptomic, and epigenomic data. At present it faces challenges primarily because the data are massive, multidimensional, and heterogeneous. This review describes in detail the current status, applications, open problems, and challenges of algorithm and tool development for genomics data analysis. [Conclusions] The future direction of algorithm and tool development for genomics data analysis is to make full use of advanced technologies such as artificial intelligence, statistical models, and knowledge graphs, and to continuously optimize and develop more advanced algorithms and more robust models that combine high fault tolerance, high accuracy, high efficiency, and low consumption of computing resources, so as to meet the demands of analyzing massive, multidimensional, and heterogeneous genomics big data.

Key words: genome, transcriptome, epigenome, big data analysis, multi-source heterogeneous data integration