Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012)
Bioinformatics 28(11): 1415-1419.

Download
No fulltext has been uploaded. References only!
Journal Article | Original Article | Published | English

No fulltext has been uploaded

Author
; ; ;
Abstract
MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work (Bauer et al. (2011)), we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FMindex (Ferragina and Manzini (2000)) on large-scale DNA sequence collections. AVAILABILITY: Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at git@github.com:BEETL/BEETL.git. CONTACT: acox@illumina.com.
Publishing Year
ISSN
eISSN
PUB-ID

Cite this

Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012;28(11):1415-1419.
Cox, A. J., Bauer, M. J., Jakobi, T., & Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), 1415-1419. doi:10.1093/bioinformatics/bts173
Cox, A. J., Bauer, M. J., Jakobi, T., and Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28, 1415-1419.
Cox, A.J., et al., 2012. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), p 1415-1419.
A.J. Cox, et al., “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”, Bioinformatics, vol. 28, 2012, pp. 1415-1419.
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 28, 1415-1419 (2012).
Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28.11 (2012): 1415-1419.
This data publication is cited in the following publications:
This publication cites the following data publications:

41 Citations in Europe PMC

Data provided by Europe PubMed Central.

High-throughput DNA sequence data compression.
Zhu Z, Zhang Y, Ji Z, He S, Yang X., Brief Bioinform 16(1), 2015
PMID: 24300111
Adaptive reference-free compression of sequence quality scores.
Janin L, Rosone G, Cox AJ., Bioinformatics 30(1), 2014
PMID: 23661694
Using Genome Query Language to uncover genetic variation.
Kozanitis C, Heiberg A, Varghese G, Bafna V., Bioinformatics 30(1), 2014
PMID: 23751181
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
Bonham-Carter O, Steele J, Bastola D., Brief Bioinform 15(6), 2014
PMID: 23904502
MFCompress: a compression tool for FASTA and multi-FASTA data.
Pinho AJ, Pratas D., Bioinformatics 30(1), 2014
PMID: 24132931
NGC: lossless and lossy compression of aligned high-throughput sequencing data.
Popitsch N, von Haeseler A., Nucleic Acids Res 41(1), 2013
PMID: 23066097
Compression of FASTQ and SAM format sequencing data.
Bonfield JK, Mahoney MV., PLoS One 8(3), 2013
PMID: 23533605
Genome compression: a novel approach for large collections.
Deorowicz S, Danek A, Grabowski S., Bioinformatics 29(20), 2013
PMID: 23969136
Data compression for sequencing data.
Deorowicz S, Grabowski S., Algorithms Mol Biol 8(1), 2013
PMID: 24252160
DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.
Li P, Wang S, Kim J, Xiong H, Ohno-Machado L, Jiang X., PLoS One 8(11), 2013
PMID: 24282536
Compression of next-generation sequencing reads aided by highly efficient de novo assembly.
Jones DC, Ruzzo WL, Peng X, Katze MG., Nucleic Acids Res 40(22), 2012
PMID: 22904078
SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Hach F, Numanagic I, Alkan C, Sahinalp SC., Bioinformatics 28(23), 2012
PMID: 23047557
Bioinformatics clouds for big data manipulation.
Dai L, Gao X, Guo Y, Xiao J, Zhang Z., Biol Direct 7(), 2012
PMID: 23190475

17 References

Data provided by Europe PubMed Central.


AUTHOR UNKNOWN, CPM 6661(), 2011
DNACompress: fast and effective DNA sequence compression.
Chen X, Li M, Ma B, Tromp J., Bioinformatics 18(12), 2002
PMID: 12490460
Compression of DNA sequence reads in FASTQ format.
Deorowicz S, Grabowski S., Bioinformatics 27(6), 2011
PMID: 21252073
Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA., PLoS Genet. 7(9), 2011
PMID: 21935354

AUTHOR UNKNOWN, J ACM 52(), 2005
Efficient storage of high throughput DNA sequencing data using reference-based compression.
Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E., Genome Res. 21(5), 2011
PMID: 21245279
Textual data compression in computational biology: a synopsis.
Giancarlo R, Scaturro D, Utro F., Bioinformatics 25(13), 2009
PMID: 19251772

AUTHOR UNKNOWN, INF PROCESS MANAGE 30(), 1994
Compressing genomic sequence fragments using SlimGene.
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G., J. Comput. Biol. 18(3), 2011
PMID: 21385043
Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R., Bioinformatics 25(14), 2009
PMID: 19451168

AUTHOR UNKNOWN, CPM 3537(), 2005

AUTHOR UNKNOWN, COMPUT APPL BIOSCI CABIOS 9(), 1993

AUTHOR UNKNOWN, Biochimie 78(), 1996
Efficient construction of an assembly string graph using the FM-index.
Simpson JT, Durbin R., Bioinformatics 26(12), 2010
PMID: 20529929
G-SQZ: compact encoding of genomic sequence and quality data.
Tembe W, Lowey J, Suh E., Bioinformatics 26(17), 2010
PMID: 20605925

AUTHOR UNKNOWN, ALGOR MOL BIOL 6(), 2011

Export

0 Marked Publications

Open Data PUB

Web of Science

View record in Web of Science®

Sources

PMID: 22556365
PubMed | Europe PMC

arXiv 1205.0192

Search this title in

Google Scholar