Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012)
Bioinformatics 28(11): 1415-1419.

Zeitschriftenaufsatz | Veröffentlicht | Englisch
Es wurden keine Dateien hochgeladen. Nur Publikationsnachweis!
Cox, Anthony J; Bauer, Markus J; Jakobi, Tobias; Rosone, Giovanna
Abstract / Bemerkung
MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work (Bauer et al. (2011)), we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FMindex (Ferragina and Manzini (2000)) on large-scale DNA sequence collections. AVAILABILITY: Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at CONTACT:
Page URI


Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012;28(11):1415-1419.
Cox, A. J., Bauer, M. J., Jakobi, T., & Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), 1415-1419. doi:10.1093/bioinformatics/bts173
Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. 2012. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28 (11): 1415-1419.
Cox, A. J., Bauer, M. J., Jakobi, T., and Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28, 1415-1419.
Cox, A.J., et al., 2012. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), p 1415-1419.
A.J. Cox, et al., “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”, Bioinformatics, vol. 28, 2012, pp. 1415-1419.
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 28, 1415-1419 (2012).
Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28.11 (2012): 1415-1419.

44 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

Tackling the Challenges of FASTQ Referential Compression.
Guerra A, Lotero J, Aedo JÉ, Isaza S., Bioinform Biol Insights 13(), 2019
PMID: 30792576
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
Chandak S, Tatwawadi K, Weissman T., Bioinformatics 34(4), 2018
PMID: 29444237
Optimal compressed representation of high throughput sequence data via light assembly.
Ginart AA, Hui J, Zhu K, Numanagić I, Courtade TA, Sahinalp SC, Tse DN., Nat Commun 9(1), 2018
PMID: 29422526
FaStore: a space-saving solution for raw sequencing data.
Roguski L, Ochoa I, Hernaez M, Deorowicz S., Bioinformatics 34(16), 2018
PMID: 29617939
Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes.
Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM., Genome Res 27(2), 2017
PMID: 27986821
E2FM: an encrypted and compressed full-text index for collections of genomic sequences.
Montecuollo F, Schmid G, Tagliaferri R., Bioinformatics 33(18), 2017
PMID: 28498928
Advances in high throughput DNA sequence data compression.
Sardaraz M, Tahir M, Ikram AA., J Bioinform Comput Biol 14(3), 2016
PMID: 26846812
Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.
Holley G, Wittler R, Stoye J., Algorithms Mol Biol 11(), 2016
PMID: 27087830
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework.
Greenfield DL, Stegle O, Rrustemi A., Bioinformatics 32(20), 2016
PMID: 27354700
A new algorithm for "the LCS problem" with application in compressing genome resequencing data.
Beal R, Afrin T, Farheen A, Adjeroh D., BMC Genomics 17 Suppl 4(), 2016
PMID: 27556803
Comparison of high-throughput sequencing data compression tools.
Numanagić I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC., Nat Methods 13(12), 2016
PMID: 27776113
High-throughput DNA sequence data compression.
Zhu Z, Zhang Y, Ji Z, He S, Yang X., Brief Bioinform 16(1), 2015
PMID: 24300111
Disk-based compression of data from genome sequencing.
Grabowski S, Deorowicz S, Roguski Ł., Bioinformatics 31(9), 2015
PMID: 25536966
Reference-based compression of short-read sequences using path encoding.
Kingsford C, Patro R., Bioinformatics 31(12), 2015
PMID: 25649622
MAFCO: a compression tool for MAF files.
Matos LM, Neves AJ, Pratas D, Pinho AJ., PLoS One 10(3), 2015
PMID: 25816229
Data-dependent bucketing improves reference-free compression of sequencing reads.
Patro R, Kingsford C., Bioinformatics 31(17), 2015
PMID: 25910696
LFQC: a lossless compression algorithm for FASTQ files.
Nicolae M, Pathak S, Rajasekaran S., Bioinformatics 31(20), 2015
PMID: 26093148
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G., BMC Bioinformatics 16(), 2015
PMID: 26370285
Analysis of genomic rearrangements by using the Burrows-Wheeler transform of short-read data.
Kimura K, Koike A., BMC Bioinformatics 16 Suppl 18(), 2015
PMID: 26678411
Adaptive reference-free compression of sequence quality scores.
Janin L, Rosone G, Cox AJ., Bioinformatics 30(1), 2014
PMID: 23661694
Using Genome Query Language to uncover genetic variation.
Kozanitis C, Heiberg A, Varghese G, Bafna V., Bioinformatics 30(1), 2014
PMID: 23751181
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
Bonham-Carter O, Steele J, Bastola D., Brief Bioinform 15(6), 2014
PMID: 23904502
MFCompress: a compression tool for FASTA and multi-FASTA data.
Pinho AJ, Pratas D., Bioinformatics 30(1), 2014
PMID: 24132931
XS: a FASTQ read simulator.
Pratas D, Pinho AJ, Rodrigues JM., BMC Res Notes 7(), 2014
PMID: 24433564
BEETL-fastq: a searchable compressed archive for DNA reads.
Janin L, Schulz-Trieglaff O, Cox AJ., Bioinformatics 30(19), 2014
PMID: 24950811
Fast construction of FM-index for long sequence reads.
Li H., Bioinformatics 30(22), 2014
PMID: 25107872
Merging of multi-string BWTs with applications.
Holt J, McMillan L., Bioinformatics 30(24), 2014
PMID: 25172922
DeeZ: reference-based compression by local assembly.
Hach F, Numanagić I, Sahinalp SC., Nat Methods 11(11), 2014
PMID: 25357237
Compression of next-generation sequencing quality scores using memetic algorithm.
Zhou J, Ji Z, Zhu Z, He S., BMC Bioinformatics 15 Suppl 15(), 2014
PMID: 25474747
NGC: lossless and lossy compression of aligned high-throughput sequencing data.
Popitsch N, von Haeseler A., Nucleic Acids Res 41(1), 2013
PMID: 23066097
Compression of FASTQ and SAM format sequencing data.
Bonfield JK, Mahoney MV., PLoS One 8(3), 2013
PMID: 23533605
Genome compression: a novel approach for large collections.
Deorowicz S, Danek A, Grabowski S., Bioinformatics 29(20), 2013
PMID: 23969136
Data compression for sequencing data.
Deorowicz S, Grabowski S., Algorithms Mol Biol 8(1), 2013
PMID: 24252160
DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.
Li P, Wang S, Kim J, Xiong H, Ohno-Machado L, Jiang X., PLoS One 8(11), 2013
PMID: 24282536
Compression of next-generation sequencing reads aided by highly efficient de novo assembly.
Jones DC, Ruzzo WL, Peng X, Katze MG., Nucleic Acids Res 40(22), 2012
PMID: 22904078
SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Hach F, Numanagic I, Alkan C, Sahinalp SC., Bioinformatics 28(23), 2012
PMID: 23047557
Bioinformatics clouds for big data manipulation.
Dai L, Gao X, Guo Y, Xiao J, Zhang Z., Biol Direct 7(), 2012
PMID: 23190475

17 References

Daten bereitgestellt von Europe PubMed Central.

AUTHOR UNKNOWN, CPM 6661(), 2011
DNACompress: fast and effective DNA sequence compression.
Chen X, Li M, Ma B, Tromp J., Bioinformatics 18(12), 2002
PMID: 12490460
Compression of DNA sequence reads in FASTQ format.
Deorowicz S, Grabowski S., Bioinformatics 27(6), 2011
PMID: 21252073
Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA., PLoS Genet. 7(9), 2011
PMID: 21935354

Efficient storage of high throughput DNA sequencing data using reference-based compression.
Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E., Genome Res. 21(5), 2011
PMID: 21245279
Textual data compression in computational biology: a synopsis.
Giancarlo R, Scaturro D, Utro F., Bioinformatics 25(13), 2009
PMID: 19251772

Compressing genomic sequence fragments using SlimGene.
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G., J. Comput. Biol. 18(3), 2011
PMID: 21385043
Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R., Bioinformatics 25(14), 2009
PMID: 19451168

AUTHOR UNKNOWN, CPM 3537(), 2005


AUTHOR UNKNOWN, Biochimie 78(), 1996
Efficient construction of an assembly string graph using the FM-index.
Simpson JT, Durbin R., Bioinformatics 26(12), 2010
PMID: 20529929
G-SQZ: compact encoding of genomic sequence and quality data.
Tembe W, Lowey J, Suh E., Bioinformatics 26(17), 2010
PMID: 20605925

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

PMID: 22556365
PubMed | Europe PMC

arXiv: 1205.0192

Suchen in

Google Scholar