数据与计算发展前沿 ›› 2024, Vol. 6 ›› Issue (4): 59-76.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.005

doi: 10.11871/jfdc.issn.2096-742X.2024.04.005

• 专刊:面向国家科学数据中心的基础软件栈及系统 • 上一篇    下一篇

基于参考的基因序列压缩算法综述

蔡佳威1,2(),胡川1,2,王华进1,2,沈志宏1,2,*()   

  1. 1.中国科学院计算机网络信息中心,北京 100083
    2.中国科学院大学,北京 100049
  • 收稿日期:2024-01-15 出版日期:2024-08-20 发布日期:2024-08-20
  • 通讯作者: *沈志宏(E-mail: bluejoe@cnic.cn
  • 作者简介:蔡佳威,中国科学院计算机网络信息中心,中国科学院大学,硕士研究生,主要研究方向为图数据库、数据管理。
    本文承担工作为:文献调研、实验数据收集、实验设计、论文写作。
    CAI Jiawei is a Master’s student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His main research interests are graph databases and data management.
    In this paper, he undertakes the following tasks: literature review, experimental data collection, experimental design, and paper writing.
    E-mail: jwcai@cnic.cn|沈志宏,中国科学院计算机网络信息中心,大数据技术与应用发展部主任,同时担任大数据分析与计算技术国家地方联合工程实验室总工程师。正研级高级工程师,博士生导师,CCF会员。主要研究方向为科学大数据、图数据管理技术。主持开发了大数据流水线PiFlow、异构数据融合管理系统PandaDB等开源软件。
    本文承担工作为:论文整体框架设计,研究指导。
    SHEN Zhihong, Ph.D., professor, doctoral supervisor. He is the director of the Department of Big Data Technology and Application Development at the Computer Network Information Center, Chinese Academy of Sciences. He simultaneously serves as the Chief Engineer of the National-Local Joint Engineering Laboratory for Big Data Analysis and Computing Technology. He holds the position of Senior Engineer at the research level, doctoral supervisor, and member of China Computer Federation (CCF). His main research focuses on scientific big data and graph data management technology. He has led the development of open-source software such as the big data pipeline PiFlow and the heterogeneous data fusion management system PandaDB.
    In this paper, he undertakes the following tasks: the overall framework design and research guidance of the thesis.
    E-mail: bluejoe@cnic.cn
  • 基金资助:
    国家重点研发计划项目“面向国家科学数据中心的基础软件栈及系统”(2021YFF0704200);中国科学院“十四五”网信专项工程建设项目“科学大数据工程(三期)”(CAS-WX2022GC-02)

A Survey on Gene Sequence Compression Algorithms Based on Reference Sequences

CAI Jiawei1,2(),HU Chuan1,2,WANG Huajin1,2,SHEN Zhihong1,2,*()   

  1. 1. Computer Network Information Center, The Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2024-01-15 Online:2024-08-20 Published:2024-08-20

摘要:

【背景】 在过去的二十年里,DNA测序技术持续发展,海量生物序列数据的产生给数据存储、管理和传输带来了严峻的挑战。【目的】 本文主要总结近十五年基于参考的基因序列压缩算法,以寻求加速生物数据共享和降低存储成本的方法。【方法】 本文从算法的发展角度出发,按照不同算法所使用的关键技术和针对压缩优化的方案进行分类。通过实验验证当前主流算法的性能,揭示当前基于参考的压缩算法所存在的问题。提出一些值得探讨的研究方向,并对未来的研究方向进行了展望。【结果】 本文分析了已有基于参考的基因序列压缩算法使用的技术,包括基于单核苷酸多态性、检测最大精确匹配、分段/分块处理和基于LZ77等技术。并对几种较著名的算法进行了复现,发现这些算法倾向于在基准数据集上表现出高压缩比,但在普通数据集上的压缩比普遍不高。【结论】 目前已有的基于参考的基因序列压缩算法在理论上可以加速数据传输效率、节约存储成本,但是实用性存疑。须继续改进公共子序列匹配方式以提升对普通数据集的支持,增加预处理参考序列步骤以降低匹配时间开销。

关键词: 参考序列, 基因压缩, DNA序列

Abstract:

[Background] Over the past two decades, DNA sequencing technologies have continued to advance, leading to the generation of massive biological data and posing significant challenges to data storage, management, and transmission. [Objective] This paper aims to provide a comprehensive survey of reference-based gene sequence compression algorithms developed in the last fifteen years, seeking methods to expedite the sharing of biological data and reduce storage costs. [Methods] The paper classifies algorithms based on their development perspective, categorizing them according to the key technologies employed and optimization strategies for compression. Performance verification experiments are conducted to reveal existing issues with current reference-based compression algorithms. The paper also proposes some research directions for further exploration and offers insights into future research. [Results] The analysis covers the technologies utilized by existing reference-based gene compression algorithms, including those based on single nucleotide polymorphisms, detection of maximum exact matches, segment/block processing, and LZ77-based techniques. Several well-known algorithms are reproduced, revealing their tendency to exhibit high compression ratios on benchmark datasets but generally lower compression ratios on ordinary datasets. [Conclusions] Theoretically, currently available reference-based gene sequence compression algorithms have the potential to accelerate data transmission efficiency and save storage costs. However, their practicality remains questionable. Further improvements are needed in matching common subsequences to enhance support to ordinary datasets and to reduce matching time overhead by introducing preprocessing steps for reference sequences.

Key words: reference sequences, gene compression, DNA sequences