Efficient q-gram filters for finding all epsilon-matches over a given length

Rasmussen KR, Stoye J, Myers EW (2006)

Journal Article | Published | English

No fulltext has been uploaded

; ;
Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.
Publishing Year

Cite this

Rasmussen KR, Stoye J, Myers EW. Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY. 2006;13(2):296-308.
Rasmussen, K. R., Stoye, J., & Myers, E. W. (2006). Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY, 13(2), 296-308.
Rasmussen, K. R., Stoye, J., and Myers, E. W. (2006). Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY 13, 296-308.
Rasmussen, K.R., Stoye, J., & Myers, E.W., 2006. Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY, 13(2), p 296-308.
K.R. Rasmussen, J. Stoye, and E.W. Myers, “Efficient q-gram filters for finding all epsilon-matches over a given length”, JOURNAL OF COMPUTATIONAL BIOLOGY, vol. 13, 2006, pp. 296-308.
Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient q-gram filters for finding all epsilon-matches over a given length. JOURNAL OF COMPUTATIONAL BIOLOGY. 13, 296-308 (2006).
Rasmussen, Kim R., Stoye, Jens, and Myers, Eugene W. “Efficient q-gram filters for finding all epsilon-matches over a given length”. JOURNAL OF COMPUTATIONAL BIOLOGY 13.2 (2006): 296-308.
This data publication is cited in the following publications:
This publication cites the following data publications:

25 Citations in Europe PMC

Data provided by Europe PubMed Central.

BitMapper: an efficient all-mapper based on bit-vector computing.
Cheng H, Jiang H, Yang J, Xu Y, Shang Y., BMC Bioinformatics 16(), 2015
PMID: 26063651
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
Berlin K, Koren S, Chin CS, Drake JP, Landolin JM, Phillippy AM., Nat. Biotechnol. 33(6), 2015
PMID: 26006009
IMSEQ--a fast and error aware approach to immunogenetic sequence analysis.
Kuchenbecker L, Nienen M, Hecht J, Neumann AU, Babel N, Reinert K, Robinson PN., Bioinformatics 31(18), 2015
PMID: 25987567
Alignment of Next-Generation Sequencing Reads.
Reinert K, Langmead B, Weese D, Evers DJ., Annu Rev Genomics Hum Genet 16(), 2015
PMID: 25939052
Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone.
Trappe K, Emde AK, Ehrlich HC, Reinert K., Bioinformatics 30(24), 2014
PMID: 25028727
The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
Liao Y, Smyth GK, Shi W., Nucleic Acids Res. 41(10), 2013
PMID: 23558742
Normalized N50 assembly metric using gap-restricted co-linear chaining.
Makinen V, Salmela L, Ylinen J., BMC Bioinformatics 13(), 2012
PMID: 23031320
Long read alignment based on maximal exact match seeds.
Liu Y, Schmidt B., Bioinformatics 28(18), 2012
PMID: 22962447
RazerS 3: faster, fully sensitive read mapping.
Weese D, Holtgrewe M, Reinert K., Bioinformatics 28(20), 2012
PMID: 22923295
Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis.
Cordero F, Beccuti M, Arigoni M, Donatelli S, Calogero RA., PLoS ONE 7(2), 2012
PMID: 22363693
STELLAR: fast and exact local alignments.
Kehr B, Weese D, Reinert K., BMC Bioinformatics 12 Suppl 9(), 2011
PMID: 22151882
Considering transposable element diversification in de novo annotation approaches.
Flutre T, Duprat E, Feuillet C, Quesneville H., PLoS ONE 6(1), 2011
PMID: 21304975
SHRiMP2: sensitive yet practical SHort Read Mapping.
David M, Dzamba M, Lister D, Ilie L, Brudno M., Bioinformatics 27(7), 2011
PMID: 21278192
Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing.
Misra S, Agrawal A, Liao WK, Choudhary A., Bioinformatics 27(2), 2011
PMID: 21088030
Phylogenetic comparative assembly.
Husemann P, Stoye J., Algorithms Mol Biol 5(), 2010
PMID: 20047659
MicroRazerS: rapid alignment of small RNA reads.
Emde AK, Grunert M, Weese D, Reinert K, Sperling SR., Bioinformatics 26(1), 2010
PMID: 19880369
RazerS--fast read mapping with sensitivity control.
Weese D, Emde AK, Rausch T, Doring A, Reinert K., Genome Res. 19(9), 2009
PMID: 19592482
Lossless filter for multiple repeats with bounded edit distance.
Peterlongo P, Sacomoto GA, do Lago AP, Pisanti N, Sagot MF., Algorithms Mol Biol 4(), 2009
PMID: 19183438
A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads.
Rausch T, Koren S, Denisov G, Weese D, Emde AK, Doring A, Reinert K., Bioinformatics 25(9), 2009
PMID: 19269990
Massive parallel bisulfite sequencing of CG-rich DNA fragments reveals that methylation of many X-chromosomal CpG islands in female blood DNA is incomplete.
Zeschnigk M, Martin M, Betzl G, Kalbe A, Sirsch C, Buiting K, Gross S, Fritzilas E, Frey B, Rahmann S, Horsthemke B., Hum. Mol. Genet. 18(8), 2009
PMID: 19223391
Locked nucleic acid-based in situ detection of microRNAs in mouse tissue sections.
Obernosterer G, Martinez J, Alenius M., Nat Protoc 2(6), 2007
PMID: 17571058
PILER-CR: fast and accurate identification of CRISPR repeats.
Edgar RC., BMC Bioinformatics 8(), 2007
PMID: 17239253

14 References

Data provided by Europe PubMed Central.

Basic local alignment search tool.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., J. Mol. Biol. 215(3), 1990
PMID: 2231712
Approximate string-matching with q-grams and maximal matches
Ukkonen, Theoretical Computer Science 92(1), 1992
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ., Nucleic Acids Res. 25(17), 1997
PMID: 9254694
A sublinear algorithm for approximate keyword searching
Myers, Algorithmica 12(4-5), 1994
Improved tools for biological sequence comparison.
Pearson WR, Lipman DJ., Proc. Natl. Acad. Sci. U.S.A. 85(8), 1988
PMID: 3162770
PatternHunter: faster and more sensitive homology search.
Ma B, Tromp J, Li M., Bioinformatics 18(3), 2002
PMID: 11934743
BLAT--the BLAST-like alignment tool.
Kent WJ., Genome Res. 12(4), 2002
PMID: 11932250
SSAHA: a fast search method for large DNA databases.
Ning Z, Cox AJ, Mullikin JC., Genome Res. 11(10), 2001
PMID: 11591649
Sparse dynamic programming I: linear cost functions
Eppstein, Journal of the ACM 39(3), 1992
A fast bit-vector algorithm for approximate string matching based on dynamic programming
Myers, Journal of the ACM 46(3), 1999
A table-driven, full-sensitivity similarity search algorithm.
Myers G, Durbin R., J. Comput. Biol. 10(2), 2003
PMID: 12804086
Identification of common molecular subsequences.
Smith TF, Waterman MS., J. Mol. Biol. 147(1), 1981
PMID: 7265238
Sublinear approximate string matching and biological applications
Chang, Algorithmica 12(4-5), 1994


0 Marked Publications

Open Data PUB

Web of Science

View record in Web of Science®


PMID: 16597241
PubMed | Europe PMC

Search this title in

Google Scholar