Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Beckstette M, Homann R, Giegerich R, Kurtz S (2009)
Bioinformatics 25(24): 3251-3258.

Journal article | Published | English
 
Abstract / Note
Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.

Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs, as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet results could be obtained in a fraction of the runtime, with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup by a factor of 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup by a factor of 92.
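The prefiltering idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the paper combines full-text indexing and fragment chaining, whereas this example uses a plain sliding window and a single PSSM. All names (`score_distribution`, `p_value`, `best_match_score`, `prefilter`) are hypothetical, and the sketch assumes integer PSSM scores and an i.i.d. background distribution over residues. It shows how an exact p-value for a PSSM match score can be obtained by dynamic programming over the score distribution, and how that p-value can decide which sequences are passed on to a slower search.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of total PSSM scores under an i.i.d. background.

    pssm: list of dicts, one per column, mapping residue -> integer score.
    background: dict mapping residue -> probability (assumed to sum to 1).
    Returns a dict mapping each reachable total score to its probability.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        # Extend every partial score by every residue of this column.
        for score, prob in dist.items():
            for residue, q in background.items():
                nxt[score + column[residue]] += prob * q
        dist = dict(nxt)
    return dist


def p_value(dist, threshold):
    """Probability that a random window scores at least `threshold`."""
    return sum(prob for score, prob in dist.items() if score >= threshold)


def best_match_score(pssm, sequence):
    """Best score over all windows; None if the sequence is too short.

    Naive sliding window, used here in place of index-based matching.
    """
    m = len(pssm)
    scores = [
        sum(pssm[j][sequence[i + j]] for j in range(m))
        for i in range(len(sequence) - m + 1)
    ]
    return max(scores) if scores else None


def prefilter(pssm, sequences, background, max_p):
    """Names of sequences whose best match has p-value <= max_p."""
    dist = score_distribution(pssm, background)
    kept = []
    for name, seq in sequences:
        best = best_match_score(pssm, seq)
        if best is not None and p_value(dist, best) <= max_p:
            kept.append(name)
    return kept
```

In a pipeline of the kind the abstract describes, only the sequences returned by such a filter would be handed to the slower pHMM search. The sketch assumes every residue in a sequence has an entry in each PSSM column.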
Year of publication
2009
Journal title
Bioinformatics
Volume
25
Issue
24
Page(s)
3251-3258
ISSN
1367-4803
eISSN
1460-2059
Page URI
https://pub.uni-bielefeld.de/record/1589458

Cite

Beckstette M, Homann R, Giegerich R, Kurtz S. Significant speedup of database searches with HMMs by search space reduction with PSSM family models. Bioinformatics. 2009;25(24):3251-3258.
Beckstette, M., Homann, R., Giegerich, R., & Kurtz, S. (2009). Significant speedup of database searches with HMMs by search space reduction with PSSM family models. Bioinformatics, 25(24), 3251-3258. https://doi.org/10.1093/bioinformatics/btp593

6 citations in Europe PMC

Data provided by Europe PubMed Central.

A novel Zn2 Cys6 transcription factor BcGaaR regulates D-galacturonic acid utilization in Botrytis cinerea.
Zhang L, Lubbers RJ, Simon A, Stassen JH, Vargas Ribera PR, Viaud M, van Kan JA., Mol Microbiol 100(2), 2016
PMID: 26691528
Two temporal functions of Glass: Ommatidium patterning and photoreceptor differentiation.
Liang X, Mahato S, Hemmerich C, Zelhof AC., Dev Biol 414(1), 2016
PMID: 27105580
UProC: tools for ultra-fast protein domain classification.
Meinicke P., Bioinformatics 31(9), 2015
PMID: 25540185
Genome-wide profiling of Hfq-binding RNAs uncovers extensive post-transcriptional rewiring of major stress response and symbiotic regulons in Sinorhizobium meliloti.
Torres-Quesada O, Reinkensmeier J, Schlüter JP, Robledo M, Peregrina A, Giegerich R, Toro N, Becker A, Jiménez-Zurdo JI., RNA Biol 11(5), 2014
PMID: 24786641
Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns.
Meyer F, Kurtz S, Beckstette M., BMC Bioinformatics 14, 2013
PMID: 23865810
Structator: fast index-based search for RNA sequence-structure patterns.
Meyer F, Kurtz S, Backofen R, Will S, Beckstette M., BMC Bioinformatics 12, 2011
PMID: 21619640

Sources

PMID: 19828575