Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Holley, Guillaume; Wittler, Roland; Stoye, Jens

Holley G, Wittler R, Stoye J (2016)
Algorithms for Molecular Biology 11(1): 3.

Journal article | Published | English

DOI

https://doi.org/10.1186/s13015-016-0066-8

URN

urn:nbn:de:0070-pub-29001293

Author(s)

Holley, Guillaume (UniBi); Wittler, Roland (UniBi); Stoye, Jens (UniBi)

Department

Centrum für Biotechnologie > Institut für Bioinformatik
Technische Fakultät > AG Genominformatik
Technische Fakultät > Int. Graduiertenkolleg DiDy (GRK 1906)
Centrum für Biotechnologie > Arbeitsgruppe J. Stoye

Abstract / Notes

Background: High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts all colored k-mers (strings of length k) from the sequences and stores them in vertices.

Results: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure stores and compresses a set of colored k-mers and supports efficient traversal of the graph. The BFT was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory.

Conclusion: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance than another state-of-the-art data structure.

Availability: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
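To make the notion of colored k-mers in the abstract concrete, the following is a minimal sketch that builds a colored k-mer set from a few toy sequences and looks up graph successors. It uses a plain Python dictionary, not the Bloom Filter Trie itself; the function names `colored_kmers` and `successors`, the genome labels, and the toy sequences are illustrative assumptions, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict

def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it.

    `genomes` is a hypothetical dict of {genome name: DNA string}.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)

def successors(kmer, colors):
    """Return stored k-mers whose first k-1 characters equal the last k-1 of `kmer`."""
    suffix = kmer[1:]
    return [suffix + c for c in "ACGT" if suffix + c in colors]

# Toy input: two very similar "genomes" that differ in one position.
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
cdbg = colored_kmers(genomes, k=3)

print(sorted(cdbg["ACG"]))      # colors of a k-mer shared by both genomes
print(sorted(cdbg["GTA"]))      # colors of a k-mer unique to g1
print(successors("CGT", cdbg))  # the graph branches where the genomes differ
```

A dictionary of sets like this keeps every k-mer and color explicitly, which is exactly the redundancy a succinct structure such as the BFT is designed to compress.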

Keywords

Pan-genome; Similar genomes; Population genomics; Colored de Bruijn graph; Bloom filter; Compression; Trie; Index; Succinct data structure

Year of publication

2016

Journal title

Algorithms for Molecular Biology

Volume

11

Issue

1

Article No.

3

ISSN

1748-7188

eISSN

1748-7188

Funding information

Open access publication costs were funded by the Deutsche Forschungsgemeinschaft and Bielefeld University.

Page URI

https://pub.uni-bielefeld.de/record/2900129

Cite

Holley G, Wittler R, Stoye J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 2016;11(1): 3.

Holley, G., Wittler, R., & Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11(1), 3. doi:10.1186/s13015-016-0066-8

Holley, Guillaume, Wittler, Roland, and Stoye, Jens. 2016. “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”. Algorithms for Molecular Biology 11 (1): 3.

Holley, G., Wittler, R., and Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology 11:3.

Holley, G., Wittler, R., & Stoye, J., 2016. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology, 11(1): 3.

G. Holley, R. Wittler, and J. Stoye, “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”, Algorithms for Molecular Biology, vol. 11, 2016, Art. no. 3.

Holley, G., Wittler, R., Stoye, J.: Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology. 11, 3 (2016).

Holley, Guillaume, Wittler, Roland, and Stoye, Jens. “Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage”. Algorithms for Molecular Biology 11.1 (2016): 3.

All files available under the following license(s):

Copyright Statement:

This object is protected by copyright and/or related rights. [...]

Full text(s)

Name

holley.13015-016-0066-8.pdf 1.81 MB

Access Level

Open Access

Last uploaded

2019-09-06T09:18:34Z

MD5 checksum

adcad0f5a33fafb421e5e0c679690a6d

Data provided by the European Bioinformatics Institute (EBI)

7 citations in Europe PMC

Data provided by Europe PubMed Central.

Alignment-free inference of hierarchical and reticulate phylogenomic relationships.
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA., Brief Bioinform 20(2), 2019
PMID: 28673025

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.
Solomon B, Kingsford C., J Comput Biol 25(7), 2018
PMID: 29641248

Genome-wide somatic variant calling using localized colored de Bruijn graphs.
Narzisi G, Corvelo A, Arora K, Bergmann EA, Shah M, Musunuri R, Emde AK, Robine N, Vacic V, Zody MC., Commun Biol 1, 2018
PMID: 30271907

SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform.
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y., BMC Bioinformatics 19(1), 2018
PMID: 29720081

Dynamic Alignment-Free and Reference-Free Read Compression.
Holley G, Wittler R, Stoye J, Hach F., J Comput Biol 25(7), 2018
PMID: 30011247

SeqOthello: querying RNA-seq experiments at scale.
Yu Y, Liu J, Liu X, Zhang Y, Magner E, Lehnert E, Qian C, Liu J., Genome Biol 19(1), 2018
PMID: 30340508

High-speed and high-ratio referential genome compression.
Liu Y, Peng H, Wong L, Li J., Bioinformatics 33(21), 2017
PMID: 28651329

