Dynamic Alignment-Free and Reference-Free Read Compression

Holley G, Wittler R, Stoye J, Hach F (2018)

No full text has been uploaded. Publication record only!
Journal article | Published | English
Abstract / Notes
The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
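The abstract's central idea, storing highly redundant reads as a de Bruijn graph that can be extended later, can be illustrated with a minimal sketch. This is not DARRC's implementation. The class `KmerGraph`, its methods, the choice of k = 4, and the restriction to the symbols A, C, G, T (rather than the full IUPAC code) are all assumptions made for illustration, and the "guided" part of the graph, which lets the original reads be recovered, is omitted entirely.

```python
from typing import Iterator, List, Set


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class KmerGraph:
    """Hypothetical node-centric de Bruijn graph kept as a set of k-mers.

    Edges are implicit: k-mer a points to k-mer b when the last k-1
    symbols of a equal the first k-1 symbols of b.
    """

    def __init__(self, k: int) -> None:
        self.k = k
        self.nodes: Set[str] = set()

    def add_read(self, read: str) -> int:
        """Insert a read's k-mers and return how many were new."""
        before = len(self.nodes)
        self.nodes.update(kmers(read, self.k))
        return len(self.nodes) - before

    def successors(self, kmer: str) -> List[str]:
        """List the stored k-mers that overlap this one by k-1 symbols."""
        suffix = kmer[1:]
        # Assumption: plain ACGT alphabet, not the full IUPAC code.
        return [suffix + c for c in "ACGT" if suffix + c in self.nodes]


graph = KmerGraph(k=4)
print(graph.add_read("ACGTACGA"))   # 5: every k-mer of the first read is new
print(graph.add_read("CGTACGAT"))   # 1: only CGAT was not already stored
print(graph.successors("ACGA"))     # ['CGAT']
```

The second read shares four of its five k-mers with the first, so the graph grows by a single node. In this toy setting, that is why a collection of near-identical genomes costs little extra to store, and why new reads can be added without rebuilding what is already there.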
21st Annual International Conference on Research in Computational Molecular Biology (RECOMB)
Hong Kong, HONG KONG


Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. JOURNAL OF COMPUTATIONAL BIOLOGY. 2018;25(7):825-836.
Holley, G., Wittler, R., Stoye, J., & Hach, F. (2018). Dynamic Alignment-Free and Reference-Free Read Compression. JOURNAL OF COMPUTATIONAL BIOLOGY, 25(7), 825-836. doi:10.1089/cmb.2018.0068

1 citation in Europe PMC

Data provided by Europe PubMed Central.

33 References

Data provided by Europe PubMed Central.

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G., BMC Bioinformatics 16(), 2015
PMID: 26370285

Compression of FASTQ and SAM format sequencing data.
Bonfield JK, Mahoney MV., PLoS ONE 8(3), 2013
PMID: 23533605

Burrows, Digital SRC Research Report (), 1994
How to apply de Bruijn graphs to genome assembly.
Compeau PE, Pevzner PA, Tesler G., Nat. Biotechnol. 29(11), 2011
PMID: 22068540
Data compression for sequencing data.
Deorowicz S, Grabowski S., Algorithms Mol Biol 8(1), 2013
PMID: 24252160
Closure of the NCBI SRA and implications for the long-term future of genomics data storage.
Lipman D, Flicek P, Salzberg S, Gerstein M, Knight R., Genome Biol. 12(3), 2011
PMID: 21418618
Disk-based compression of data from genome sequencing.
Grabowski S, Deorowicz S, Roguski L., Bioinformatics 31(9), 2014
PMID: 25536966
SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Hach F, Numanagic I, Alkan C, Sahinalp SC., Bioinformatics 28(23), 2012
PMID: 23047557
Sequence squeeze: an open contest for sequence compression.
Holland RC, Lynch N., Gigascience 2(1), 2013
PMID: 23596984
Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.
Holley G, Wittler R, Stoye J., Algorithms Mol Biol 11(), 2016
PMID: 27087830


Compression of next-generation sequencing reads aided by highly efficient de novo assembly.
Jones DC, Ruzzo WL, Peng X, Katze MG., Nucleic Acids Res. 40(22), 2012
PMID: 22904078
Reference-based compression of short-read sequences using path encoding.
Kingsford C, Patro R., Bioinformatics 31(12), 2015
PMID: 25649622
Insights from 20 years of bacterial genome sequencing.
Land M, Hauser L, Jun SR, Nookaew I, Leuze MR, Ahn TH, Karpinets T, Lund O, Kora G, Wassenaar T, Poudel S, Ussery DW., Funct. Integr. Genomics 15(2), 2015
PMID: 25722247
Compressive genomics.
Loh PR, Baym M, Berger B., Nat. Biotechnol. 30(7), 2012
PMID: 22781691
Comparison of high-throughput sequencing data compression tools.
Numanagic I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC., Nat. Methods 13(12), 2016
PMID: 27776113
Data-dependent bucketing improves reference-free compression of sequencing reads.
Patro R, Kingsford C., Bioinformatics 31(17), 2015
PMID: 25910696
Reducing storage requirements for biological sequence comparison.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA., Bioinformatics 20(18), 2004
PMID: 15256412
DSRC 2--Industry-oriented compression of FASTQ files.
Roguski L, Deorowicz S., Bioinformatics 30(15), 2014
PMID: 24747219
Fast lossless compression via cascading Bloom filters.
Rozov R, Shamir R, Halperin E., BMC Bioinformatics 15 Suppl 9(), 2014
PMID: 25252952


Using cascading Bloom filters to improve the memory usage for de Brujin graphs.
Salikhov K, Sacomoto G, Kucherov G., Algorithms Mol Biol 9(1), 2014
PMID: 24565280
Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM., Proc. Natl. Acad. Sci. U.S.A. 102(39), 2005
PMID: 16172379


Entropy-scaling search of massive biological data.
Yu YW, Daniels NM, Danko DC, Berger B., Cell Syst 1(2), 2015
PMID: 26436140
The MaSuRCA genome assembler.
Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA., Bioinformatics 29(21), 2013
PMID: 23990416





PMID: 30011247