Fast index based algorithms and software for matching position specific scoring matrices

Beckstette M, Homann R, Giegerich R, Kurtz S (2006)
BMC Bioinformatics 7(1): 389.

Download
OA
Zeitschriftenaufsatz | Veröffentlicht | Englisch
Autor
; ; ;
Abstract / Bemerkung
Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A|^m + m - 1, where m is the length of the PSSM and A a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
Erscheinungsjahr
Zeitschriftentitel
BMC Bioinformatics
Band
7
Zeitschriftennummer
1
Seite
389
ISSN
PUB-ID

Zitieren

Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006;7(1):389.
Beckstette, M., Homann, R., Giegerich, R., & Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7(1), 389. doi:10.1186/1471-2105-7-389
Beckstette, M., Homann, R., Giegerich, R., and Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 7, 389.
Beckstette, M., et al., 2006. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7(1), p 389.
M. Beckstette, et al., “Fast index based algorithms and software for matching position specific scoring matrices”, BMC Bioinformatics, vol. 7, 2006, pp. 389.
Beckstette, M., Homann, R., Giegerich, R., Kurtz, S.: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 7, 389 (2006).
Beckstette, Michael, Homann, Robert, Giegerich, Robert, and Kurtz, Stefan. “Fast index based algorithms and software for matching position specific scoring matrices”. BMC Bioinformatics 7.1 (2006): 389.
Alle Dateien verfügbar unter der/den folgenden Lizenz(en):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Volltext(e)
Access Level
OA Open Access
Zuletzt Hochgeladen
1970-01-01T00:00:00Z

63 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

A probabilistic method for small RNA flowgram matching.
Vacic V, Jin H, Zhu JK, Lonardi S., Pac Symp Biocomput (), 2008
PMID: 18229677
ReXSpecies--a tool for the analysis of the evolution of gene regulation across species.
Struckmann S, Araúzo-Bravo MJ, Schöler HR, Reinbold RA, Fuellen G., BMC Evol Biol 8(), 2008
PMID: 18410675
Efficient representation and P-value computation for high-order Markov motifs.
da Fonseca PG, Guimarães KS, Sagot MF., Bioinformatics 24(16), 2008
PMID: 18689819
CoryneRegNet 3.0--an interactive systems biology platform for the analysis of gene regulatory networks in corynebacteria and Escherichia coli.
Baumbach J, Wittkop T, Rademacher K, Rahmann S, Brinkrolf K, Tauch A., J Biotechnol 129(2), 2007
PMID: 17229482
Computing exact P-values for DNA motifs.
Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang MQ., Bioinformatics 23(5), 2007
PMID: 17237046
Statistical significance of cis-regulatory modules.
Schones DE, Smith AD, Zhang MQ., BMC Bioinformatics 8(), 2007
PMID: 17241466
Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing.
Wittkop T, Baumbach J, Lobo FP, Rahmann S., BMC Bioinformatics 8(), 2007
PMID: 17941985
Reduced alphabet motif methodology for GPCR annotation.
Gangal R, Kumar KK., J Biomol Struct Dyn 25(3), 2007
PMID: 17937491
Efficient and accurate P-value computation for Position Weight Matrices.
Touzet H, Varré JS., Algorithms Mol Biol 2(), 2007
PMID: 18072973
Fast index based algorithms and software for matching position specific scoring matrices.
Beckstette M, Homann R, Giegerich R, Kurtz S., BMC Bioinformatics 7(), 2006
PMID: 16930469

36 References

Daten bereitgestellt von Europe PubMed Central.

Simple Linear Work Suffix Array Construction
Kärkkäinen J, Sanders P., 2003
Linear-time Longest-Common-Prefix Computation in Suffix Arrays and its Applications
Kasai T, Lee G, Arimura H, Arikawa S, Park K., 2001
A Combinatorial Problem
de N., 1946
Reduction of protein sequence complexity by residue grouping.
Li T, Fan K, Wang J, Wang W., Protein Eng. 16(5), 2003
PMID: 12826723
Simplified amino acid alphabets for protein fold recognition and implications for folding.
Murphy LR, Wallqvist A, Levy RM., Protein Eng. 13(3), 2000
PMID: 10775656

Castillo G., 1988

Embrechts P, Klüppelberg C, Mikosch T., 1997
Approximations to profile score distributions
Goldstein L, Waterman M., 1994
The Vmatch large scale sequence analysis software
Kurtz S., 2005
CisML: an XML-based format for sequence motif detection software.
Haverty PM, Weng Z., Bioinformatics 20(11), 2004
PMID: 15001475
A H+-gated urea channel: the link between Helicobacter pylori urease and gastric colonization.
Weeks DL, Eskandari S, Scott DR, Sachs G., Science 287(5452), 2000
PMID: 10642549

Export

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

Quellen

PMID: 16930469
PubMed | Europe PMC

Suchen in

Google Scholar