Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Cox, Anthony J; Bauer, Markus J; Jakobi, Tobias; Rosone, Giovanna

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012)
Bioinformatics 28(11): 1415-1419.

Zeitschriftenaufsatz | Veröffentlicht | Englisch

Download

Es wurden keine Dateien hochgeladen. Nur Publikationsnachweis!

DOI

https://doi.org/10.1093/bioinformatics/bts173

Autor*in

Cox, Anthony J; Bauer, Markus J; Jakobi, Tobias; Rosone, Giovanna

Einrichtung

Centrum für Biotechnologie > Graduate Center > Graduate Cluster Industrial Biotechnology

Abstract / Bemerkung

MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work (Bauer et al. (2011)), we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FMindex (Ferragina and Manzini (2000)) on large-scale DNA sequence collections. AVAILABILITY: Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at git@github.com:BEETL/BEETL.git. CONTACT: acox@illumina.com.

Erscheinungsjahr

2012

Zeitschriftentitel

Bioinformatics

Band

Ausgabe

Seite(n)

1415-1419

ISSN

1367-4803

eISSN

1460-2059

Page URI

https://pub.uni-bielefeld.de/record/2496829

Zitieren

Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012;28(11):1415-1419.

Cox, A. J., Bauer, M. J., Jakobi, T., & Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), 1415-1419. doi:10.1093/bioinformatics/bts173

Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. 2012. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28 (11): 1415-1419.

Cox, A. J., Bauer, M. J., Jakobi, T., and Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28, 1415-1419.

Cox, A.J., et al., 2012. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), p 1415-1419.

A.J. Cox, et al., “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”, Bioinformatics, vol. 28, 2012, pp. 1415-1419.

Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 28, 1415-1419 (2012).

Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28.11 (2012): 1415-1419.

Daten bereitgestellt von European Bioinformatics Institute (EBI)

44 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

Tackling the Challenges of FASTQ Referential Compression.
Guerra A, Lotero J, Aedo JÉ, Isaza S., Bioinform Biol Insights 13(), 2019
PMID: 30792576

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
Chandak S, Tatwawadi K, Weissman T., Bioinformatics 34(4), 2018
PMID: 29444237

Optimal compressed representation of high throughput sequence data via light assembly.
Ginart AA, Hui J, Zhu K, Numanagić I, Courtade TA, Sahinalp SC, Tse DN., Nat Commun 9(1), 2018
PMID: 29422526

FaStore: a space-saving solution for raw sequencing data.
Roguski L, Ochoa I, Hernaez M, Deorowicz S., Bioinformatics 34(16), 2018
PMID: 29617939

BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs.
Wang R, Li J, Bai Y, Zang T, Wang Y., PeerJ 6(), 2018
PMID: 30364599

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes.
Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM., Genome Res 27(2), 2017
PMID: 27986821

E2FM: an encrypted and compressed full-text index for collections of genomic sequences.
Montecuollo F, Schmid G, Tagliaferri R., Bioinformatics 33(18), 2017
PMID: 28498928

Advances in high throughput DNA sequence data compression.
Sardaraz M, Tahir M, Ikram AA., J Bioinform Comput Biol 14(3), 2016
PMID: 26846812

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.
Holley G, Wittler R, Stoye J., Algorithms Mol Biol 11(), 2016
PMID: 27087830

deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding.
Liu B, Zhu D, Wang Y., Bioinformatics 32(12), 2016
PMID: 27307614

GeneCodeq: quality score compression and improved genotyping using a Bayesian framework.
Greenfield DL, Stegle O, Rrustemi A., Bioinformatics 32(20), 2016
PMID: 27354700

A new algorithm for "the LCS problem" with application in compressing genome resequencing data.
Beal R, Afrin T, Farheen A, Adjeroh D., BMC Genomics 17 Suppl 4(), 2016
PMID: 27556803

Comparison of high-throughput sequencing data compression tools.
Numanagić I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC., Nat Methods 13(12), 2016
PMID: 27776113

High-throughput DNA sequence data compression.
Zhu Z, Zhang Y, Ji Z, He S, Yang X., Brief Bioinform 16(1), 2015
PMID: 24300111

Disk-based compression of data from genome sequencing.
Grabowski S, Deorowicz S, Roguski Ł., Bioinformatics 31(9), 2015
PMID: 25536966

Ultrafast SNP analysis using the Burrows-Wheeler transform of short-read data.
Kimura K, Koike A., Bioinformatics 31(10), 2015
PMID: 25609790

Reference-based compression of short-read sequences using path encoding.
Kingsford C, Patro R., Bioinformatics 31(12), 2015
PMID: 25649622

MAFCO: a compression tool for MAF files.
Matos LM, Neves AJ, Pratas D, Pinho AJ., PLoS One 10(3), 2015
PMID: 25816229

Data-dependent bucketing improves reference-free compression of sequencing reads.
Patro R, Kingsford C., Bioinformatics 31(17), 2015
PMID: 25910696

LFQC: a lossless compression algorithm for FASTQ files.
Nicolae M, Pathak S, Rajasekaran S., Bioinformatics 31(20), 2015
PMID: 26093148

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G., BMC Bioinformatics 16(), 2015
PMID: 26370285

Analysis of genomic rearrangements by using the Burrows-Wheeler transform of short-read data.
Kimura K, Koike A., BMC Bioinformatics 16 Suppl 18(), 2015
PMID: 26678411

Adaptive reference-free compression of sequence quality scores.
Janin L, Rosone G, Cox AJ., Bioinformatics 30(1), 2014
PMID: 23661694

Using Genome Query Language to uncover genetic variation.
Kozanitis C, Heiberg A, Varghese G, Bafna V., Bioinformatics 30(1), 2014
PMID: 23751181

Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
Bonham-Carter O, Steele J, Bastola D., Brief Bioinform 15(6), 2014
PMID: 23904502

MFCompress: a compression tool for FASTA and multi-FASTA data.
Pinho AJ, Pratas D., Bioinformatics 30(1), 2014
PMID: 24132931

Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies.
Giancarlo R, Rombo SE, Utro F., Brief Bioinform 15(3), 2014
PMID: 24347576

XS: a FASTQ read simulator.
Pratas D, Pinho AJ, Rodrigues JM., BMC Res Notes 7(), 2014
PMID: 24433564

Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction.
Tran NH, Chen X., BMC Res Notes 7(), 2014
PMID: 24886411

BEETL-fastq: a searchable compressed archive for DNA reads.
Janin L, Schulz-Trieglaff O, Cox AJ., Bioinformatics 30(19), 2014
PMID: 24950811

Fast construction of FM-index for long sequence reads.
Li H., Bioinformatics 30(22), 2014
PMID: 25107872

Merging of multi-string BWTs with applications.
Holt J, McMillan L., Bioinformatics 30(24), 2014
PMID: 25172922

DeeZ: reference-based compression by local assembly.
Hach F, Numanagić I, Sahinalp SC., Nat Methods 11(11), 2014
PMID: 25357237

Compression of next-generation sequencing quality scores using memetic algorithm.
Zhou J, Ji Z, Zhu Z, He S., BMC Bioinformatics 15 Suppl 15(), 2014
PMID: 25474747

NGC: lossless and lossy compression of aligned high-throughput sequencing data.
Popitsch N, von Haeseler A., Nucleic Acids Res 41(1), 2013
PMID: 23066097

Compression of FASTQ and SAM format sequencing data.
Bonfield JK, Mahoney MV., PLoS One 8(3), 2013
PMID: 23533605

Genome compression: a novel approach for large collections.
Deorowicz S, Danek A, Grabowski S., Bioinformatics 29(20), 2013
PMID: 23969136

Data compression for sequencing data.
Deorowicz S, Grabowski S., Algorithms Mol Biol 8(1), 2013
PMID: 24252160

DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.
Li P, Wang S, Kim J, Xiong H, Ohno-Machado L, Jiang X., PLoS One 8(11), 2013
PMID: 24282536

SRComp: short read sequence compression using burstsort and Elias omega coding.
Selva JJ, Chen X., PLoS One 8(12), 2013
PMID: 24349065

Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads.
Gautier L, Lund O., PLoS One 8(12), 2013
PMID: 24391826

Compression of next-generation sequencing reads aided by highly efficient de novo assembly.
Jones DC, Ruzzo WL, Peng X, Katze MG., Nucleic Acids Res 40(22), 2012
PMID: 22904078

SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Hach F, Numanagic I, Alkan C, Sahinalp SC., Bioinformatics 28(23), 2012
PMID: 23047557

Bioinformatics clouds for big data manipulation.
Dai L, Gao X, Guo Y, Xiao J, Zhang Z., Biol Direct 7(), 2012
PMID: 23190475

17 References

Daten bereitgestellt von Europe PubMed Central.

AUTHOR UNKNOWN, CPM 6661(), 2011

DNACompress: fast and effective DNA sequence compression.
Chen X, Li M, Ma B, Tromp J., Bioinformatics 18(12), 2002
PMID: 12490460

Compression of DNA sequence reads in FASTQ format.
Deorowicz S, Grabowski S., Bioinformatics 27(6), 2011
PMID: 21252073

Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA., PLoS Genet. 7(9), 2011
PMID: 21935354

AUTHOR UNKNOWN, J ACM 52(), 2005

Efficient storage of high throughput DNA sequencing data using reference-based compression.
Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E., Genome Res. 21(5), 2011
PMID: 21245279

Textual data compression in computational biology: a synopsis.
Giancarlo R, Scaturro D, Utro F., Bioinformatics 25(13), 2009
PMID: 19251772

AUTHOR UNKNOWN, INF PROCESS MANAGE 30(), 1994

Compressing genomic sequence fragments using SlimGene.
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G., J. Comput. Biol. 18(3), 2011
PMID: 21385043

Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R., Bioinformatics 25(14), 2009
PMID: 19451168

AUTHOR UNKNOWN, CPM 3537(), 2005

AUTHOR UNKNOWN, COMPUT APPL BIOSCI CABIOS 9(), 1993

AUTHOR UNKNOWN, Biochimie 78(), 1996

Efficient construction of an assembly string graph using the FM-index.
Simpson JT, Durbin R., Bioinformatics 26(12), 2010
PMID: 20529929

Efficient de novo assembly of large genomes using compressed data structures.
Simpson JT, Durbin R., Genome Res. 22(3), 2011
PMID: 22156294

G-SQZ: compact encoding of genomic sequence and quality data.
Tembe W, Lowey J, Suh E., Bioinformatics 26(17), 2010
PMID: 20605925

ReCoil - an algorithm for compression of extremely large datasets of dna data.
Yanovsky V., Algorithms Mol Biol 6(), 2011
PMID: 21988957

Export

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

Quellen

PMID: 22556365
PubMed | Europe PMC

arXiv: 1205.0192

Suchen in

Google Scholar

PUB - Publikationen an der Universität Bielefeld