Efficient q-gram filters for finding all epsilon-matches over a given length

Rasmussen, Kim R.; Stoye, Jens; Myers, Eugene W.

Efficient q-gram filters for finding all epsilon-matches over a given length

Rasmussen KR, Stoye J, Myers EW (2006)
JOURNAL OF COMPUTATIONAL BIOLOGY 13(2): 296-308.

Zeitschriftenaufsatz | Veröffentlicht | Englisch

Download

Es wurden keine Dateien hochgeladen. Nur Publikationsnachweis!

DOI

https://doi.org/10.1089/cmb.2006.13.296

Autor*in

Rasmussen, Kim R.; Stoye, Jens^UniBi ; Myers, Eugene W.

Einrichtung

Technische Fakultät > AG Genominformatik
Centrum für Biotechnologie > Institut für Bioinformatik
Centrum für Biotechnologie > Arbeitsgruppe J. Stoye

Abstract / Bemerkung

Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.

Stichworte

local alignment searching; q-grams; filter; clustering; EST; sequence assembly

Erscheinungsjahr

2006

Zeitschriftentitel

JOURNAL OF COMPUTATIONAL BIOLOGY

Band

Ausgabe

Seite(n)

296-308

ISSN

1066-5277

eISSN

1557-8666

Page URI

https://pub.uni-bielefeld.de/record/1599570

Zitieren

Rasmussen KR, Stoye J, Myers EW. Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY. 2006;13(2):296-308.

Rasmussen, K. R., Stoye, J., & Myers, E. W. (2006). Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY, 13(2), 296-308. https://doi.org/10.1089/cmb.2006.13.296

Rasmussen, Kim R., Stoye, Jens, and Myers, Eugene W. 2006. “Efficient q-gram filters for finding all epsilon-matches over a given length”. JOURNAL OF COMPUTATIONAL BIOLOGY 13 (2): 296-308.

Rasmussen, K. R., Stoye, J., and Myers, E. W. (2006). Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY 13, 296-308.

Rasmussen, K.R., Stoye, J., & Myers, E.W., 2006. Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY, 13(2), p 296-308.

K.R. Rasmussen, J. Stoye, and E.W. Myers, “Efficient q-gram filters for finding all epsilon-matches over a given length”, JOURNAL OF COMPUTATIONAL BIOLOGY, vol. 13, 2006, pp. 296-308.

Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY. 13, 296-308 (2006).

Rasmussen, Kim R., Stoye, Jens, and Myers, Eugene W. “Efficient q-gram filters for finding all epsilon-matches over a given length”. JOURNAL OF COMPUTATIONAL BIOLOGY 13.2 (2006): 296-308.

Daten bereitgestellt von European Bioinformatics Institute (EBI)

38 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

Systematic comparative study of computational methods for T-cell receptor sequencing data analysis.
Afzal S, Gil-Farina I, Gabriel R, Ahmad S, von Kalle C, Schmidt M, Fronza R., Brief Bioinform 20(1), 2019
PMID: 29028876

ImtRDB: a database and software for mitochondrial imperfect interspersed repeats annotation.
Shamanskiy VA, Timonina VN, Popadin KY, Gunbin KV., BMC Genomics 20(suppl 3), 2019
PMID: 31284879

Linking temporal medical records using non-protected health information data.
Bonomi L, Jiang X., Stat Methods Med Res 27(11), 2018
PMID: 29298592

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.
Solomon B, Kingsford C., J Comput Biol 25(7), 2018
PMID: 29641248

Short Read Mapping: An Algorithmic Tour.
Canzar S, Salzberg SL., Proc IEEE Inst Electr Electron Eng 105(3), 2017
PMID: 28502990

GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping.
Alser M, Hassan H, Xin H, Ergin O, Mutlu O, Alkan C., Bioinformatics 33(21), 2017
PMID: 28575161

PopIns: population-scale detection of novel sequence insertions.
Kehr B, Melsted P, Halldórsson BV., Bioinformatics 32(7), 2016
PMID: 25926346

Optimal seed solver: optimizing seed selection in read mapping.
Xin H, Nahar S, Zhu R, Emmons J, Pekhimenko G, Kingsford C, Alkan C, Mutlu O., Bioinformatics 32(11), 2016
PMID: 26568624

rHAT: fast alignment of noisy long reads with regional hashing.
Liu B, Guan D, Teng M, Wang Y., Bioinformatics 32(11), 2016
PMID: 26568628

Fast search of thousands of short-read sequencing experiments.
Solomon B, Kingsford C., Nat Biotechnol 34(3), 2016
PMID: 26854477

Circular sequence comparison: algorithms and applications.
Grossi R, Iliopoulos CS, Mercas R, Pisanti N, Pissis SP, Retha A, Vayani F., Algorithms Mol Biol 11(), 2016
PMID: 27168761

Alignment of Next-Generation Sequencing Reads.
Reinert K, Langmead B, Weese D, Evers DJ., Annu Rev Genomics Hum Genet 16(), 2015
PMID: 25939052

IMSEQ--a fast and error aware approach to immunogenetic sequence analysis.
Kuchenbecker L, Nienen M, Hecht J, Neumann AU, Babel N, Reinert K, Robinson PN., Bioinformatics 31(18), 2015
PMID: 25987567

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM., Nat Biotechnol 33(6), 2015
PMID: 26006009

BitMapper: an efficient all-mapper based on bit-vector computing.
Cheng H, Jiang H, Yang J, Xu Y, Shang Y., BMC Bioinformatics 16(), 2015
PMID: 26063651

Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone.
Trappe K, Emde AK, Ehrlich HC, Reinert K., Bioinformatics 30(24), 2014
PMID: 25028727

Massively parallel read mapping on GPUs with the q-group index and PEANUT.
Köster J, Rahmann S., PeerJ 2(), 2014
PMID: 25289191

A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances.
Noé L, Noé L, Martin DE., J Comput Biol 21(12), 2014
PMID: 25393923

Critical role of bioinformatics in translating huge amounts of next-generation sequencing data into personalized medicine.
Hong H, Zhang W, Shen J, Su Z, Ning B, Han T, Perkins R, Shi L, Tong W., Sci China Life Sci 56(2), 2013
PMID: 23393026

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
Liao Y, Smyth GK, Shi W., Nucleic Acids Res 41(10), 2013
PMID: 23558742

Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis.
Cordero F, Beccuti M, Arigoni M, Donatelli S, Calogero RA., PLoS One 7(2), 2012
PMID: 22363693

RazerS 3: faster, fully sensitive read mapping.
Weese D, Holtgrewe M, Reinert K., Bioinformatics 28(20), 2012
PMID: 22923295

Long read alignment based on maximal exact match seeds.
Liu Y, Schmidt B., Bioinformatics 28(18), 2012
PMID: 22962447

Normalized N50 assembly metric using gap-restricted co-linear chaining.
Mäkinen V, Salmela L, Ylinen J., BMC Bioinformatics 13(), 2012
PMID: 23031320

Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing.
Misra S, Agrawal A, Liao WK, Choudhary A., Bioinformatics 27(2), 2011
PMID: 21088030

SHRiMP2: sensitive yet practical SHort Read Mapping.
David M, Dzamba M, Lister D, Ilie L, Brudno M., Bioinformatics 27(7), 2011
PMID: 21278192

Considering transposable element diversification in de novo annotation approaches.
Flutre T, Duprat E, Feuillet C, Quesneville H., PLoS One 6(1), 2011
PMID: 21304975

STELLAR: fast and exact local alignments.
Kehr B, Weese D, Reinert K., BMC Bioinformatics 12 Suppl 9(), 2011
PMID: 22151882

MicroRazerS: rapid alignment of small RNA reads.
Emde AK, Grunert M, Weese D, Reinert K, Sperling SR., Bioinformatics 26(1), 2010
PMID: 19880369

Phylogenetic comparative assembly.
Husemann P, Stoye J., Algorithms Mol Biol 5(), 2010
PMID: 20047659

Fast and SNP-tolerant detection of complex variants and splicing in short reads.
Wu TD, Nacu S., Bioinformatics 26(7), 2010
PMID: 20147302

Lossless filter for multiple repeats with bounded edit distance.
Peterlongo P, Sacomoto GA, do Lago AP, Pisanti N, Sagot MF., Algorithms Mol Biol 4(), 2009
PMID: 19183438

Massive parallel bisulfite sequencing of CG-rich DNA fragments reveals that methylation of many X-chromosomal CpG islands in female blood DNA is incomplete.
Zeschnigk M, Martin M, Betzl G, Kalbe A, Sirsch C, Buiting K, Gross S, Fritzilas E, Frey B, Rahmann S, Horsthemke B., Hum Mol Genet 18(8), 2009
PMID: 19223391

A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads.
Rausch T, Koren S, Denisov G, Weese D, Emde AK, Döring A, Reinert K., Bioinformatics 25(9), 2009
PMID: 19269990

RazerS--fast read mapping with sensitivity control.
Weese D, Emde AK, Rausch T, Döring A, Reinert K., Genome Res 19(9), 2009
PMID: 19592482

Sense from sequence reads: methods for alignment and assembly.
Flicek P, Birney E., Nat Methods 6(11 suppl), 2009
PMID: 19844229

Locked nucleic acid-based in situ detection of microRNAs in mouse tissue sections.
Obernosterer G, Martinez J, Alenius M., Nat Protoc 2(6), 2007
PMID: 17571058

PILER-CR: fast and accurate identification of CRISPR repeats.
Edgar RC., BMC Bioinformatics 8(), 2007
PMID: 17239253

14 References

Daten bereitgestellt von Europe PubMed Central.

Basic local alignment search tool.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., J. Mol. Biol. 215(3), 1990
PMID: 2231712

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ., Nucleic Acids Res. 25(17), 1997
PMID: 9254694

AUTHOR UNKNOWN, 0

BLAT--the BLAST-like alignment tool.
Kent WJ., Genome Res. 12(4), 2002
PMID: 11932250

PatternHunter: faster and more sensitive homology search.
Ma B, Tromp J, Li M., Bioinformatics 18(3), 2002
PMID: 11934743

AUTHOR UNKNOWN, 0

A table-driven, full-sensitivity similarity search algorithm.
Myers G, Durbin R., J. Comput. Biol. 10(2), 2003
PMID: 12804086

SSAHA: a fast search method for large DNA databases.
Ning Z, Cox AJ, Mullikin JC., Genome Res. 11(10), 2001
PMID: 11591649

Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
Pearson WR., Genomics 11(3), 1991
PMID: 1774068

Improved tools for biological sequence comparison.
Pearson WR, Lipman DJ., Proc. Natl. Acad. Sci. U.S.A. 85(8), 1988
PMID: 3162770

Identification of common molecular subsequences.
Smith TF, Waterman MS., J. Mol. Biol. 147(1), 1981
PMID: 7265238

AUTHOR UNKNOWN, 0

Export

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

Quellen

PMID: 16597241
PubMed | Europe PMC

Suchen in

Google Scholar

PUB - Publikationen an der Universität Bielefeld