Dynamic Alignment-Free and Reference-Free Read Compression

Holley, Guillaume; Wittler, Roland; Stoye, Jens; Hach, Faraz

Holley G, Wittler R, Stoye J, Hach F (2018)
JOURNAL OF COMPUTATIONAL BIOLOGY 25(7): 825-836.

Journal article | Published | English

DOI

https://doi.org/10.1089/cmb.2018.0068

Authors

Holley, Guillaume (UniBi); Wittler, Roland (UniBi); Stoye, Jens (UniBi); Hach, Faraz

Department

Faculty of Technology > Genome Informatics Group
Center for Biotechnology > J. Stoye Research Group
Faculty of Technology > International Research Training Group DiDy (GRK 1906)

Abstract / Note

The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
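To make the graph idea in the abstract concrete, the following is a minimal Python sketch of a plain (not guided) de Bruijn graph that is built from reads and then updated with further reads. It is not DARRC's data structure or encoding. The function name `add_reads`, the dictionary-of-sets representation, the example reads, and the choice k = 4 are all illustrative assumptions. It only shows why highly similar reads share most of their k-mers, so that adding a new, similar sequence reuses nodes already in the graph.

```python
def add_reads(graph, reads, k=4):
    """Insert every k-mer of each read into `graph` (hypothetical helper).

    `graph` maps a k-mer to the set of k-mers that follow it in some read.
    Returns how many k-mers were not yet present in the graph.
    """
    new_kmers = 0
    for read in reads:
        prev = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in graph:
                graph[kmer] = set()   # new node
                new_kmers += 1
            if prev is not None:
                graph[prev].add(kmer)  # edge between consecutive k-mers
            prev = kmer
    return new_kmers


graph = {}
# First batch: two overlapping toy reads.
first = add_reads(graph, ["ACGTACGA", "CGTACGAT"])
# Later batch: a similar read is added to the existing graph in place.
second = add_reads(graph, ["ACGTACGT"])
```

In this toy run the first batch creates 6 nodes, while the later read adds no new node and only one new edge (from `TACG` to `ACGT`). The update happens on the existing graph, which is the intuition behind extending an archive without rebuilding it.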

Keywords

guided de Bruijn graph; high throughput sequencing; sequence compression

Publication year

2018

Journal title

JOURNAL OF COMPUTATIONAL BIOLOGY

Volume

25

Issue

7

Page(s)

825-836

Conference

21st Annual International Conference on Research in Computational Molecular Biology (RECOMB)

Conference location

Hong Kong

ISSN

1066-5277

eISSN

1557-8666

Page URI

https://pub.uni-bielefeld.de/record/2930268

Cite

Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. JOURNAL OF COMPUTATIONAL BIOLOGY. 2018;25(7):825-836.

Data provided by European Bioinformatics Institute (EBI)

1 citation in Europe PMC

Data provided by Europe PubMed Central.

BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs.
Wang R, Li J, Bai Y, Zang T, Wang Y., PeerJ 6, 2018
PMID: 30364599
