Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley G, Wittler R, Stoye J (2016)
Algorithms for Molecular Biology 11(1): 3.

Zeitschriftenaufsatz | Veröffentlicht | Englisch
 
Download
OA 1.81 MB
Abstract / Bemerkung
Background High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. Results In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. Conclusion We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. Availability https://​www.​github.​com/​GuillaumeHolley/​BloomFilterTrie.
Stichworte
Pan-genome Similar genomes Population genomics Colored de bruijn graph Bloom filter Compression Trie Index Succinct data structure
Erscheinungsjahr
2016
Zeitschriftentitel
Algorithms for Molecular Biology
Band
11
Ausgabe
1
Art.-Nr.
3
ISSN
1748-7188
eISSN
1748-7188
Finanzierungs-Informationen
Open-Access-Publikationskosten wurden durch die Deutsche Forschungsgemeinschaft und die Universität Bielefeld gefördert.
Page URI
https://pub.uni-bielefeld.de/record/2900129

Zitieren

Holley G, Wittler R, Stoye J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 2016;11(1): 3.
Holley, G., Wittler, R., & Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11(1), 3. doi:10.1186/s13015-016-0066-8
Holley, Guillaume, Wittler, Roland, and Stoye, Jens. 2016. “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”. Algorithms for Molecular Biology 11 (1): 3.
Holley, G., Wittler, R., and Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology 11:3.
Holley, G., Wittler, R., & Stoye, J., 2016. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11(1): 3.
G. Holley, R. Wittler, and J. Stoye, “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”, Algorithms for Molecular Biology, vol. 11, 2016, : 3.
Holley, G., Wittler, R., Stoye, J.: Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 11, : 3 (2016).
Holley, Guillaume, Wittler, Roland, and Stoye, Jens. “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”. Algorithms for Molecular Biology 11.1 (2016): 3.
Alle Dateien verfügbar unter der/den folgenden Lizenz(en):
Copyright Statement:
Dieses Objekt ist durch das Urheberrecht und/oder verwandte Schutzrechte geschützt. [...]
Volltext(e)
Access Level
OA Open Access
Zuletzt Hochgeladen
2019-09-06T09:18:34Z
MD5 Prüfsumme
adcad0f5a33fafb421e5e0c679690a6d


7 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

Alignment-free inference of hierarchical and reticulate phylogenomic relationships.
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA., Brief Bioinform 20(2), 2019
PMID: 28673025
Genome-wide somatic variant calling using localized colored de Bruijn graphs.
Narzisi G, Corvelo A, Arora K, Bergmann EA, Shah M, Musunuri R, Emde AK, Robine N, Vacic V, Zody MC., Commun Biol 1(), 2018
PMID: 30271907
SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform.
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y., BMC Bioinformatics 19(1), 2018
PMID: 29720081
Dynamic Alignment-Free and Reference-Free Read Compression.
Holley G, Wittler R, Stoye J, Hach F., J Comput Biol 25(7), 2018
PMID: 30011247
SeqOthello: querying RNA-seq experiments at scale.
Yu Y, Liu J, Liu X, Zhang Y, Magner E, Lehnert E, Qian C, Liu J., Genome Biol 19(1), 2018
PMID: 30340508
High-speed and high-ratio referential genome compression.
Liu Y, Peng H, Wong L, Li J., Bioinformatics 33(21), 2017
PMID: 28651329

26 References

Daten bereitgestellt von Europe PubMed Central.

De novo assembly and genotyping of variants using colored de Bruijn graphs.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G., Nat. Genet. 44(2), 2012
PMID: 22231483

AUTHOR UNKNOWN, 0

AUTHOR UNKNOWN, 0
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.
Cox AJ, Bauer MJ, Jakobi T, Rosone G., Bioinformatics 28(11), 2012
PMID: 22556365
An experimental study of an opportunistic index
Ferragina P, Manzini G., 2001
Space/time trade-offs in hash coding with allowable errors
Bloom BH., 1970

AUTHOR UNKNOWN, 0
Trie Memory
Fredking E., 1960
Burst tries: a fast, efficient data structure for string keys
Heinz S, Zobel J, Williams HE., 2002
A framework for variation discovery and genotyping using next-generation DNA sequencing data.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ., Nat. Genet. 43(5), 2011
PMID: 21478889
PanCake: a data structure for pangenomes
Ernst C, Rahmann S., 2013
The fragment assembly string graph
Myers EW., 2005
Building a pan-genome reference for a population.
Nguyen N, Hickey G, Zerbino DR, Raney B, Earl D, Armstrong J, Kent WJ, Haussler D, Paten B., J. Comput. Biol. 22(5), 2015
PMID: 25565268
Cactus graphs for genome comparisons.
Paten B, Diekhans M, Earl D, John JS, Ma J, Suh B, Haussler D., J. Comput. Biol. 18(3), 2011
PMID: 21385048
Short read alignment with populations of genomes.
Huang L, Popic V, Batzoglou S., Bioinformatics 29(13), 2013
PMID: 23813006
Fast and accurate short read alignment with Burrows-Wheeler transform.
Li H, Durbin R., Bioinformatics 25(14), 2009
PMID: 19451168
RCSI: Scalable similarity search in thousand(s) of genomes
Wandelt S, Starlinger J, Bux M, Leser U., 2013
MRCSI: compressing and searching string collections with multiple references
Wandelt S, Leser U., 2015
SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips.
Marcus S, Lee H, Schatz MC., Bioinformatics 30(24), 2014
PMID: 25398610
Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform.
Baier U, Beller T, Ohlebusch E., Bioinformatics 32(4), 2015
PMID: 26504144

AUTHOR UNKNOWN, 0
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.
Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT., Proc. Natl. Acad. Sci. U.S.A. 109(33), 2012
PMID: 22847406

AUTHOR UNKNOWN, 0
Informed and automated k-mer size selection for genome assembly.
Chikhi R, Medvedev P., Bioinformatics 30(1), 2013
PMID: 23732276

AUTHOR UNKNOWN, 0

AUTHOR UNKNOWN, 0
Export

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®
Quellen

PMID: 27087830
PubMed | Europe PMC

Suchen in

Google Scholar