Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Beckstette M, Homann R, Giegerich R, Kurtz S (2009)
Bioinformatics 25(24): 3251-3258.

Journal Article | Published | English

Abstract
Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.

Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs, as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrix family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores, and fast fragment chaining. The resulting method is well suited to prefiltering the set of sequences to be searched in subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet results were obtained in a fraction of the runtime, with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduced the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup by a factor of 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup by a factor of 92.
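To make the idea of an exact p-value for a PSSM match score concrete, the following is a minimal sketch, not the authors' implementation. It assumes integer-valued scores and independent, identically distributed background residues, and it uses the textbook dynamic programming over partial scores instead of the indexed search described in the abstract. The function names (`score_distribution`, `p_value`, `threshold_for`, `scan`), the two-letter toy alphabet and the toy matrix are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of the total match score of an integer-valued PSSM.

    pssm: list of dicts, one per matrix position, mapping residue -> int score.
    background: dict mapping residue -> probability (assumed i.i.d. residues).
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}  # before any position is read, the score is 0
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for residue, p_res in background.items():
                nxt[partial + column[residue]] += prob * p_res
        dist = dict(nxt)
    return dist


def p_value(dist, score):
    """Probability that a random window scores at least `score`."""
    return sum(p for s, p in dist.items() if s >= score)


def threshold_for(dist, max_p):
    """Smallest score whose p-value is <= max_p, or None if there is none."""
    tail, best = 0.0, None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > max_p:
            break
        best = s
    return best


def scan(pssm, sequence, threshold):
    """Naive sliding-window scan; returns (start, score) for each match."""
    m = len(pssm)
    hits = []
    for i in range(len(sequence) - m + 1):
        s = sum(pssm[j][sequence[i + j]] for j in range(m))
        if s >= threshold:
            hits.append((i, s))
    return hits


# Toy example (hypothetical alphabet {A, B} and matrix of length 2).
toy_background = {"A": 0.5, "B": 0.5}
toy_pssm = [{"A": 2, "B": -1}, {"A": -1, "B": 3}]
toy_dist = score_distribution(toy_pssm, toy_background)
toy_threshold = threshold_for(toy_dist, 0.25)
toy_hits = scan(toy_pssm, "BABAB", toy_threshold)
```

In a prefilter setting, a sequence would be passed on to the more expensive pHMM search only if it contains matches at or above the threshold derived from a chosen p-value cutoff. How matches from several PSSMs of one family are combined by fragment chaining is not specified in the abstract and is not sketched here.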
4 Citations in Europe PMC

Data provided by Europe PubMed Central.

UProC: tools for ultra-fast protein domain classification.
Meinicke P., Bioinformatics 31(9), 2015
PMID: 25540185
Genome-wide profiling of Hfq-binding RNAs uncovers extensive post-transcriptional rewiring of major stress response and symbiotic regulons in Sinorhizobium meliloti.
Torres-Quesada O, Reinkensmeier J, Schluter JP, Robledo M, Peregrina A, Giegerich R, Toro N, Becker A, Jimenez-Zurdo JI., RNA Biol 11(5), 2014
PMID: 24786641
Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns.
Meyer F, Kurtz S, Beckstette M., BMC Bioinformatics 14(), 2013
PMID: 23865810
Structator: fast index-based search for RNA sequence-structure patterns.
Meyer F, Kurtz S, Backofen R, Will S, Beckstette M., BMC Bioinformatics 12(), 2011
PMID: 21619640

Sources

PMID: 19828575