Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Beckstette, Michael; Homann, Robert; Giegerich, Robert; Kurtz, Stefan

Beckstette M, Homann R, Giegerich R, Kurtz S (2009)
Bioinformatics 25(24): 3251-3258.

Journal article | Published | English

Download

No files have been uploaded. Publication record only.

DOI

https://doi.org/10.1093/bioinformatics/btp593

Authors

Beckstette, Michael (UniBi); Homann, Robert (UniBi); Giegerich, Robert (UniBi); Kurtz, Stefan

Department

Faculty of Technology > BIBI
Faculty of Technology > Practical Computer Science Group
Center for Biotechnology > Institute for Bioinformatics
Center for Biotechnology > R. Giegerich Research Group
Faculty of Technology > Genome Informatics Group

Abstract / Note

Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.

Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet, results could be obtained in a fraction of runtime with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92.
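The abstract names exact p-value computation of PSSM match scores as one building block but does not describe the procedure. As an illustration only, the following Python sketch shows one common way to obtain such a p-value for an integer-valued matrix: dynamic programming over the score distribution, column by column. The function names, the two-letter alphabet, and the matrix values are hypothetical and are not taken from the paper or its software.

```python
from collections import defaultdict
from itertools import product


def match_score(pssm, segment):
    """Score a segment: sum the matrix entry of each residue at its position."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


def score_distribution(pssm, background):
    """Exact score distribution for segments drawn i.i.d. from `background`.

    Processes the matrix column by column, keeping one probability per
    reachable integer score instead of enumerating every possible segment.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for residue, freq in background.items():
                nxt[partial + column[residue]] += prob * freq
        dist = dict(nxt)
    return dist


def p_value(dist, threshold):
    """Probability that a random segment scores at least `threshold`."""
    return sum(prob for s, prob in dist.items() if s >= threshold)


# Toy two-letter alphabet and a length-2 matrix (illustrative values only).
background = {"A": 0.5, "B": 0.5}
pssm = [{"A": 2, "B": -1}, {"A": 0, "B": 3}]

dist = score_distribution(pssm, background)

# Cross-check against brute-force enumeration of all four segments.
brute = sum(
    background[x] * background[y]
    for x, y in product(background, repeat=2)
    if match_score(pssm, x + y) >= 2
)
print(p_value(dist, 2), brute)
```

In this toy case three of the four equally likely segments reach a score of 2 or more, so both computations give 0.75. For a real protein matrix the table of reachable scores stays small compared with the number of possible segments, which is what makes an exact computation of this kind feasible.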

Year of publication

2009

Journal title

Bioinformatics

Volume

25

Issue

24

Page(s)

3251-3258

ISSN

1367-4803

eISSN

1460-2059

Page URI

https://pub.uni-bielefeld.de/record/1589458


Data provided by the European Bioinformatics Institute (EBI)

6 citations in Europe PMC

Data provided by Europe PubMed Central.

A novel Zn2 Cys6 transcription factor BcGaaR regulates D-galacturonic acid utilization in Botrytis cinerea.
Zhang L, Lubbers RJ, Simon A, Stassen JH, Vargas Ribera PR, Viaud M, van Kan JA., Mol Microbiol 100(2), 2016
PMID: 26691528

Two temporal functions of Glass: Ommatidium patterning and photoreceptor differentiation.
Liang X, Mahato S, Hemmerich C, Zelhof AC., Dev Biol 414(1), 2016
PMID: 27105580

UProC: tools for ultra-fast protein domain classification.
Meinicke P., Bioinformatics 31(9), 2015
PMID: 25540185

Genome-wide profiling of Hfq-binding RNAs uncovers extensive post-transcriptional rewiring of major stress response and symbiotic regulons in Sinorhizobium meliloti.
Torres-Quesada O, Reinkensmeier J, Schlüter JP, Robledo M, Peregrina A, Giegerich R, Toro N, Becker A, Jiménez-Zurdo JI., RNA Biol 11(5), 2014
PMID: 24786641

Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns.
Meyer F, Kurtz S, Beckstette M., BMC Bioinformatics 14, 2013
PMID: 23865810

Structator: fast index-based search for RNA sequence-structure patterns.
Meyer F, Kurtz S, Backofen R, Will S, Beckstette M., BMC Bioinformatics 12, 2011
PMID: 21619640



Sources

PMID: 19828575
