Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley G, Wittler R, Stoye J (2016)
Algorithms for Molecular Biology 11: 3.

Download
OA 1.81 MB
Zeitschriftenaufsatz | Veröffentlicht | Englisch
Abstract / Bemerkung
Background High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. Results In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. Conclusion We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. Availability https://​www.​github.​com/​GuillaumeHolley/​BloomFilterTrie.
Erscheinungsjahr
Zeitschriftentitel
Algorithms for Molecular Biology
Band
11
Artikelnummer
3
eISSN
Finanzierungs-Informationen
Article Processing Charge funded by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.
PUB-ID

Zitieren

Holley G, Wittler R, Stoye J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 2016;11: 3.
Holley, G., Wittler, R., & Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11, 3. doi:10.1186/s13015-016-0066-8
Holley, G., Wittler, R., and Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology 11:3.
Holley, G., Wittler, R., & Stoye, J., 2016. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11: 3.
G. Holley, R. Wittler, and J. Stoye, “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”, Algorithms for Molecular Biology, vol. 11, 2016, : 3.
Holley, G., Wittler, R., Stoye, J.: Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 11, : 3 (2016).
Holley, Guillaume, Wittler, Roland, and Stoye, Jens. “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”. Algorithms for Molecular Biology 11 (2016): 3.
Alle Dateien verfügbar unter der/den folgenden Lizenz(en):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Volltext(e)
Access Level
OA Open Access
Zuletzt Hochgeladen
2016-06-30T06:30:13Z

2 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

High-speed and high-ratio referential genome compression.
Liu Y, Peng H, Wong L, Li J., Bioinformatics 33(21), 2017
PMID: 28651329
Alignment-free inference of hierarchical and reticulate phylogenomic relationships.
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA., Brief Bioinform (), 2017
PMID: 28673025

26 References

Daten bereitgestellt von Europe PubMed Central.


AUTHOR UNKNOWN, 0

Export

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

Quellen

PMID: 27087830
PubMed | Europe PMC

Suchen in

Google Scholar