# Fast index based algorithms and software for matching position specific scoring matrices

Beckstette M, Homann R, Giegerich R, Kurtz S (2006)

BMC Bioinformatics 7(1): 389.

Download

*Journal Article*|

*Published*|

*English*

Author

Department

Abstract

Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A|^m + m - 1, where m is the length of the PSSM and A a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.

Publishing Year

ISSN

PUB-ID

### Cite this

Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices.

*BMC Bioinformatics*. 2006;7(1):389.Beckstette, M., Homann, R., Giegerich, R., & Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices.

*BMC Bioinformatics*,*7*(1), 389. doi:10.1186/1471-2105-7-389Beckstette, M., Homann, R., Giegerich, R., and Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices.

*BMC Bioinformatics*7, 389.Beckstette, M., et al., 2006. Fast index based algorithms and software for matching position specific scoring matrices.

*BMC Bioinformatics*, 7(1), p 389. M. Beckstette, et al., “Fast index based algorithms and software for matching position specific scoring matrices”,

*BMC Bioinformatics*, vol. 7, 2006, pp. 389. Beckstette, M., Homann, R., Giegerich, R., Kurtz, S.: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 7, 389 (2006).

Beckstette, Michael, Homann, Robert, Giegerich, Robert, and Kurtz, Stefan. “Fast index based algorithms and software for matching position specific scoring matrices”.

*BMC Bioinformatics*7.1 (2006): 389.
This data publication is cited in the following publications:

This publication cites the following data publications:

### 55 Citations in Europe PMC

Data provided by Europe PubMed Central.

Parametric bootstrapping for biological sequence motifs.

O'Neill PK, Erill I.,

PMID: 27716039

O'Neill PK, Erill I.,

*BMC Bioinformatics*17(1), 2016PMID: 27716039

Zinc regulates a key transcriptional pathway for epileptogenesis via metal-regulatory transcription factor 1.

van Loo KM, Schaub C, Pitsch J, Kulbida R, Opitz T, Ekstein D, Dalal A, Urbach H, Beck H, Yaari Y, Schoch S, Becker AJ.,

PMID: 26498180

van Loo KM, Schaub C, Pitsch J, Kulbida R, Opitz T, Ekstein D, Dalal A, Urbach H, Beck H, Yaari Y, Schoch S, Becker AJ.,

*Nat Commun*6(), 2015PMID: 26498180

CMRegNet-An interspecies reference database for corynebacterial and mycobacterial regulatory networks.

Abreu VA, Almeida S, Tiwari S, Hassan SS, Mariano D, Silva A, Baumbach J, Azevedo V, Rottger R.,

PMID: 26062809

Abreu VA, Almeida S, Tiwari S, Hassan SS, Mariano D, Silva A, Baumbach J, Azevedo V, Rottger R.,

*BMC Genomics*16(), 2015PMID: 26062809

Combining docking site and phosphosite predictions to find new substrates: identification of smoothelin-like-2 (SMTNL2) as a c-Jun N-terminal kinase (JNK) substrate.

Gordon EA, Whisenant TC, Zeller M, Kaake RM, Gordon WM, Krotee P, Patel V, Huang L, Baldi P, Bardwell L.,

PMID: 23981301

Gordon EA, Whisenant TC, Zeller M, Kaake RM, Gordon WM, Krotee P, Patel V, Huang L, Baldi P, Bardwell L.,

*Cell. Signal.*25(12), 2013PMID: 23981301

MotifAdjuster: a tool for computational reassessment of transcription factor binding site annotations.

Keilwagen J, Baumbach J, Kohl TA, Grosse I.,

PMID: 19409082

Keilwagen J, Baumbach J, Kohl TA, Grosse I.,

*Genome Biol.*10(5), 2009PMID: 19409082

### 36 References

Data provided by Europe PubMed Central.

Simple Linear Work Suffix Array Construction

Kärkkäinen J, Sanders P., 2003

Kärkkäinen J, Sanders P., 2003

Linear-time Longest-Common-Prefix Computation in Suffix Arrays and its Applications

Kasai T, Lee G, Arimura H, Arikawa S, Park K., 2001

Kasai T, Lee G, Arimura H, Arikawa S, Park K., 2001

A Combinatorial Problem

de N., 1946

de N., 1946

Reduction of protein sequence complexity by residue grouping.

Li T, Fan K, Wang J, Wang W.,

PMID: 12826723

Li T, Fan K, Wang J, Wang W.,

*Protein Eng.*16(5), 2003PMID: 12826723

Simplified amino acid alphabets for protein fold recognition and implications for folding.

Murphy LR, Wallqvist A, Levy RM.,

PMID: 10775656

Murphy LR, Wallqvist A, Levy RM.,

*Protein Eng.*13(3), 2000PMID: 10775656

Castillo G., 1988

Embrechts P, Klüppelberg C, Mikosch T., 1997

Approximations to profile score distributions

Goldstein L, Waterman M., 1994

Goldstein L, Waterman M., 1994

The Vmatch large scale sequence analysis software

Kurtz S., 2005

Kurtz S., 2005

CisML: an XML-based format for sequence motif detection software.

Haverty PM, Weng Z.,

PMID: 15001475

Haverty PM, Weng Z.,

*Bioinformatics*20(11), 2004PMID: 15001475

A H+-gated urea channel: the link between Helicobacter pylori urease and gastric colonization.

Weeks DL, Eskandari S, Scott DR, Sachs G.,

PMID: 10642549

Weeks DL, Eskandari S, Scott DR, Sachs G.,

*Science*287(5452), 2000PMID: 10642549

### Export

0 Marked Publications### Web of Science

View record in Web of Science®### Sources

PMID: 16930469

PubMed | Europe PMC