Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (4): 59-76.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.04.005

doi: 10.11871/jfdc.issn.2096-742X.2024.04.005

• Special Issue: Fundamental Software Stack and Systems for National Scientific Data Centers •

A Survey on Gene Sequence Compression Algorithms Based on Reference Sequences

CAI Jiawei 1,2, HU Chuan 1,2, WANG Huajin 1,2, SHEN Zhihong 1,2,*

  1. Computer Network Information Center, The Chinese Academy of Sciences, Beijing 100083, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-01-15 Online: 2024-08-20 Published: 2024-08-20

Abstract:

[Background] Over the past two decades, continued advances in DNA sequencing technologies have generated massive volumes of biological data, posing significant challenges for data storage, management, and transmission. [Objective] This paper surveys the reference-based gene sequence compression algorithms developed over the last fifteen years, seeking methods that speed up the sharing of biological data and reduce storage costs. [Methods] The paper classifies the algorithms from the perspective of their development, grouping them by the key technologies they employ and by their compression optimization strategies. Performance verification experiments are conducted to reveal problems with current reference-based compression algorithms, and directions for future research are proposed. [Results] The analysis covers the technologies used by existing reference-based gene compression algorithms, including those based on single nucleotide polymorphisms, detection of maximal exact matches, segment/block processing, and LZ77-based techniques. Several well-known algorithms are reproduced; they tend to achieve high compression ratios on benchmark datasets but generally lower compression ratios on ordinary datasets. [Conclusions] In theory, currently available reference-based gene sequence compression algorithms can speed up data transmission and reduce storage costs. However, their practicality remains questionable. Further improvements are needed in matching common subsequences, to better support ordinary datasets, and in preprocessing reference sequences, to reduce matching time overhead.

Key words: reference sequences, gene compression, DNA sequences