Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012)
Bioinformatics 28(11): 1415-1419.

Journal Article | Published | English

No fulltext has been uploaded

Author
; ; ;
Abstract
MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work (Bauer et al. (2011)), we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets. RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FMindex (Ferragina and Manzini (2000)) on large-scale DNA sequence collections. AVAILABILITY: Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at git@github.com:BEETL/BEETL.git. CONTACT: acox@illumina.com.
Publishing Year
ISSN
eISSN
PUB-ID

Cite this

Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012;28(11):1415-1419.
Cox, A. J., Bauer, M. J., Jakobi, T., & Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), 1415-1419.
Cox, A. J., Bauer, M. J., Jakobi, T., and Rosone, G. (2012). Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28, 1415-1419.
Cox, A.J., et al., 2012. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11), p 1415-1419.
A.J. Cox, et al., “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”, Bioinformatics, vol. 28, 2012, pp. 1415-1419.
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 28, 1415-1419 (2012).
Cox, Anthony J, Bauer, Markus J, Jakobi, Tobias, and Rosone, Giovanna. “Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform”. Bioinformatics 28.11 (2012): 1415-1419.
This data publication is cited in the following publications:
This publication cites the following data publications:

31 Citations in Europe PMC

Data provided by Europe PubMed Central.

Analysis of genomic rearrangements by using the Burrows-Wheeler transform of short-read data.
Kimura K, Koike A., BMC Bioinformatics 16 Suppl 18(), 2015
PMID: 26678411
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G., BMC Bioinformatics 16(), 2015
PMID: 26370285
MAFCO: a compression tool for MAF files.
Matos LM, Neves AJ, Pratas D, Pinho AJ., PLoS ONE 10(3), 2015
PMID: 25816229
LFQC: a lossless compression algorithm for FASTQ files.
Pathak S, Rajasekaran S., Bioinformatics (), 2014
PMID: 25344499
DNA-COMPACT: DNA COMpression based on a pattern-aware contextual modeling technique.
Li P, Wang S, Kim J, Xiong H, Ohno-Machado L, Jiang X., PLoS ONE 8(11), 2013
PMID: 24282536

17 References

Data provided by Europe PubMed Central.


AUTHOR UNKNOWN, CPM 6661(), 2011
DNACompress: fast and effective DNA sequence compression.
Chen X, Li M, Ma B, Tromp J., Bioinformatics 18(12), 2002
PMID: 12490460
Compression of DNA sequence reads in FASTQ format.
Deorowicz S, Grabowski S., Bioinformatics 27(6), 2011
PMID: 21252073
Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, Cornejo OE, Knowles JW, Woon M, Sangkuhl K, Gong L, Thorn CF, Hebert JM, Capriotti E, David SP, Pavlovic A, West A, Thakuria JV, Ball MP, Zaranek AW, Rehm HL, Church GM, West JS, Bustamante CD, Snyder M, Altman RB, Klein TE, Butte AJ, Ashley EA., PLoS Genet. 7(9), 2011
PMID: 21935354

AUTHOR UNKNOWN, J ACM 52(), 2005
Efficient storage of high throughput DNA sequencing data using reference-based compression.
Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E., Genome Res. 21(5), 2011
PMID: 21245279
Textual data compression in computational biology: a synopsis.
Giancarlo R, Scaturro D, Utro F., Bioinformatics 25(13), 2009
PMID: 19251772

AUTHOR UNKNOWN, INF PROCESS MANAGE 30(), 1994
Compressing genomic sequence fragments using SlimGene.
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G., J. Comput. Biol. 18(3), 2011
PMID: 21385043
Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R., Bioinformatics 25(14), 2009
PMID: 19451168

AUTHOR UNKNOWN, CPM 3537(), 2005

AUTHOR UNKNOWN, COMPUT APPL BIOSCI CABIOS 9(), 1993

AUTHOR UNKNOWN, Biochimie 78(), 1996
Efficient construction of an assembly string graph using the FM-index.
Simpson JT, Durbin R., Bioinformatics 26(12), 2010
PMID: 20529929
G-SQZ: compact encoding of genomic sequence and quality data.
Tembe W, Lowey J, Suh E., Bioinformatics 26(17), 2010
PMID: 20605925

AUTHOR UNKNOWN, ALGOR MOL BIOL 6(), 2011

Export

0 Marked Publications

Open Data PUB

Web of Science

View record in Web of Science®

Sources

PMID: 22556365
PubMed | Europe PMC

arXiv 1205.0192

Search this title in

Google Scholar