Fast index based algorithms and software for matching position specific scoring matrices

Beckstette M, Homann R, Giegerich R, Kurtz S (2006)
BMC Bioinformatics 7(1): 389.

Zeitschriftenaufsatz | Veröffentlicht | Englisch
Abstract / Bemerkung
Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A|^m + m - 1, where m is the length of the PSSM and A a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
BMC Bioinformatics
Page URI


Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006;7(1): 389.
Beckstette, M., Homann, R., Giegerich, R., & Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7(1), 389.
Beckstette, Michael, Homann, Robert, Giegerich, Robert, and Kurtz, Stefan. 2006. “Fast index based algorithms and software for matching position specific scoring matrices”. BMC Bioinformatics 7 (1): 389.
Beckstette, M., Homann, R., Giegerich, R., and Kurtz, S. (2006). Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 7:389.
Beckstette, M., et al., 2006. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics, 7(1): 389.
M. Beckstette, et al., “Fast index based algorithms and software for matching position specific scoring matrices”, BMC Bioinformatics, vol. 7, 2006, : 389.
Beckstette, M., Homann, R., Giegerich, R., Kurtz, S.: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 7, : 389 (2006).
Beckstette, Michael, Homann, Robert, Giegerich, Robert, and Kurtz, Stefan. “Fast index based algorithms and software for matching position specific scoring matrices”. BMC Bioinformatics 7.1 (2006): 389.
Alle Dateien verfügbar unter der/den folgenden Lizenz(en):
Copyright Statement:
Dieses Objekt ist durch das Urheberrecht und/oder verwandte Schutzrechte geschützt. [...]
Access Level
OA Open Access
Zuletzt Hochgeladen
MD5 Prüfsumme

66 Zitationen in Europe PMC

Daten bereitgestellt von Europe PubMed Central.

Fast sequence analysis based on diamond sampling.
Gao L, Bao W, Zhang H, Yuan CA, Huang DS., PLoS One 13(6), 2018
PMID: 29953448
Dempster-Shafer Theory for the Prediction of Auxin-Response Elements (AuxREs) in Plant Genomes.
Sghaier N, Ben Ayed R, Ben Marzoug R, Rebai A., Biomed Res Int 2018(), 2018
PMID: 30515394
Improved genome assembly of American alligator genome reveals conserved architecture of estrogen signaling.
Rice ES, Kohno S, John JS, Pham S, Howard J, Lareau LF, O'Connell BL, Hickey G, Armstrong J, Deran A, Fiddes I, Platt RN, Gresham C, McCarthy F, Kern C, Haan D, Phan T, Schmidt C, Sanford JR, Ray DA, Paten B, Guillette LJ, Green RE., Genome Res 27(5), 2017
PMID: 28137821
Improving Gene Regulatory Network Inference by Incorporating Rates of Transcriptional Changes.
Desai JS, Sartor RC, Lawas LM, Jagadish SVK, Doherty CJ., Sci Rep 7(1), 2017
PMID: 29222512
Function and evolution of local repeats in the Firre locus.
Hacisuleyman E, Shukla CJ, Weiner CL, Rinn JL., Nat Commun 7(), 2016
PMID: 27009974
Parametric bootstrapping for biological sequence motifs.
O'Neill PK, Erill I., BMC Bioinformatics 17(1), 2016
PMID: 27716039
Identification and characterization of VpsR and VpsT binding sites in Vibrio cholerae.
Zamorano-Sánchez D, Fong JC, Kilic S, Erill I, Yildiz FH., J Bacteriol 197(7), 2015
PMID: 25622616
CMRegNet-An interspecies reference database for corynebacterial and mycobacterial regulatory networks.
Abreu VA, Almeida S, Tiwari S, Hassan SS, Mariano D, Silva A, Baumbach J, Azevedo V, Röttger R., BMC Genomics 16(), 2015
PMID: 26062809
Zinc regulates a key transcriptional pathway for epileptogenesis via metal-regulatory transcription factor 1.
van Loo KM, Schaub C, Pitsch J, Kulbida R, Opitz T, Ekstein D, Dalal A, Urbach H, Beck H, Yaari Y, Schoch S, Becker AJ., Nat Commun 6(), 2015
PMID: 26498180
Adaptable probabilistic mapping of short reads using position specific scoring matrices.
Kerpedjiev P, Frellsen J, Lindgreen S, Krogh A., BMC Bioinformatics 15(), 2014
PMID: 24717095
Genes and signaling networks regulated during zebrafish optic vesicle morphogenesis.
Yin J, Morrissey ME, Shine L, Kennedy C, Higgins DG, Kennedy BN., BMC Genomics 15(), 2014
PMID: 25266257
Transposable elements modulate human RNA abundance and splicing via specific RNA-protein interactions.
Kelley DR, Hendrickson DG, Tenen D, Rinn JL., Genome Biol 15(12), 2014
PMID: 25572935
High-resolution detection of DNA binding sites of the global transcriptional regulator GlxR in Corynebacterium glutamicum.
Jungwirth B, Sala C, Kohl TA, Uplekar S, Baumbach J, Cole ST, Pühler A, Tauch A., Microbiology 159(pt 1), 2013
PMID: 23103979
Global mapping of transcription start sites and promoter motifs in the symbiotic α-proteobacterium Sinorhizobium meliloti 1021.
Schlüter JP, Reinkensmeier J, Barnett MJ, Lang C, Krol E, Giegerich R, Long SR, Becker A., BMC Genomics 14(), 2013
PMID: 23497287
Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns.
Meyer F, Kurtz S, Beckstette M., BMC Bioinformatics 14(), 2013
PMID: 23865810
Combining docking site and phosphosite predictions to find new substrates: identification of smoothelin-like-2 (SMTNL2) as a c-Jun N-terminal kinase (JNK) substrate.
Gordon EA, Whisenant TC, Zeller M, Kaake RM, Gordon WM, Krotee P, Patel V, Huang L, Baldi P, Bardwell L., Cell Signal 25(12), 2013
PMID: 23981301
Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers.
Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MT, Azam S, Fan G, Whaley AM, Farmer AD, Sheridan J, Iwata A, Tuteja R, Penmetsa RV, Wu W, Upadhyaya HD, Yang SP, Shah T, Saxena KB, Michael T, McCombie WR, Yang B, Zhang G, Yang H, Wang J, Wang J, Spillane C, Cook DR, May GD, Xu X, Jackson SA., Nat Biotechnol 30(1), 2012
PMID: 22057054
CoryneRegNet 6.0--Updated database content, new analysis methods and novel features focusing on community demands.
Pauling J, Röttger R, Tauch A, Azevedo V, Baumbach J., Nucleic Acids Res 40(database issue), 2012
PMID: 22080556
On the trail of EHEC/EAEC--unraveling the gene regulatory networks of human pathogenic Escherichia coli bacteria.
Pauling J, Röttger R, Neuner A, Salgado H, Collado-Vides J, Kalaghatgi P, Azevedo V, Tauch A, Pühler A, Baumbach J., Integr Biol (Camb) 4(7), 2012
PMID: 22318347
Designing filters for fast-known NcRNA identification.
Sun Y, Buhler J, Yuan C., IEEE/ACM Trans Comput Biol Bioinform 9(3), 2012
PMID: 22084145
Phenylacetic acid catabolism and its transcriptional regulation in Corynebacterium glutamicum.
Chen X, Kohl TA, Rückert C, Rodionov DA, Li LH, Ding JY, Kalinowski J, Liu SJ., Appl Environ Microbiol 78(16), 2012
PMID: 22685150
Large-scale development of cost-effective single-nucleotide polymorphism marker assays for genetic mapping in pigeonpea and comparative mapping in legumes.
Saxena RK, Penmetsa RV, Upadhyaya HD, Kumar A, Carrasquilla-Garcia N, Schlueter JA, Farmer A, Whaley AM, Sarma BK, May GD, Cook DR, Varshney RK., DNA Res 19(6), 2012
PMID: 23103470
Finding significant matches of position weight matrices in linear time.
Pizzi C, Rastas P, Ukkonen E., IEEE/ACM Trans Comput Biol Bioinform 8(1), 2011
PMID: 21071798
Prediction and validation of promoters involved in the abscisic acid response in Physcomitrella patens.
Timmerhaus G, Hanke ST, Buchta K, Rensing SA., Mol Plant 4(4), 2011
PMID: 21398384
Defining the transcriptome assembly and its use for genome dynamics and transcriptome profiling studies in pigeonpea (Cajanus cajan L.).
Dubey A, Farmer A, Schlueter J, Cannon SB, Abernathy B, Tuteja R, Woodward J, Shah T, Mulasmanovic B, Kudapa H, Raju NL, Gothalwal R, Pande S, Xiao Y, Town CD, Singh NK, May GD, Jackson S, Varshney RK., DNA Res 18(3), 2011
PMID: 21565938
Structator: fast index-based search for RNA sequence-structure patterns.
Meyer F, Kurtz S, Backofen R, Will S, Beckstette M., BMC Bioinformatics 12(), 2011
PMID: 21619640
STEME: efficient EM to find motifs in large data sets.
Reid JE, Wernisch L., Nucleic Acids Res 39(18), 2011
PMID: 21785132
PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species.
Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T, Hix D, Mane SP, Mao C, Nordberg EK, Scott M, Schulman JR, Snyder EE, Sullivan DE, Wang C, Warren A, Williams KP, Xue T, Yoo HS, Zhang C, Zhang Y, Will R, Kenyon RW, Sobral BW., Infect Immun 79(11), 2011
PMID: 21896772
TFRank: network-based prioritization of regulatory associations underlying transcriptional responses.
Gonçalves JP, Francisco AP, Mira NP, Teixeira MC, Sá-Correia I, Oliveira AL, Madeira SC., Bioinformatics 27(22), 2011
PMID: 21965816
Genome sequence of the palaeopolyploid soybean.
Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, Gill N, Joshi T, Libault M, Sethuraman A, Zhang XC, Shinozaki K, Nguyen HT, Wing RA, Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA., Nature 463(7278), 2010
PMID: 20075913
A genome-wide survey of sRNAs in the symbiotic nitrogen-fixing alpha-proteobacterium Sinorhizobium meliloti.
Schlüter JP, Reinkensmeier J, Daschkey S, Evguenieva-Hackenberg E, Janssen S, Jänicke S, Becker JD, Giegerich R, Becker A., BMC Genomics 11(), 2010
PMID: 20398411
ChIP-seq and functional analysis of the SOX2 gene in colorectal cancers.
Fang X, Yu W, Li L, Shao J, Zhao N, Chen Q, Ye Z, Lin SC, Zheng S, Lin B., OMICS 14(4), 2010
PMID: 20726797
Regulation of the 18 kDa heat shock protein in Mycobacterium ulcerans: an alpha-crystallin orthologue that promotes biofilm formation.
Pidot SJ, Porter JL, Tobias NJ, Anderson J, Catmull D, Seemann T, Kidd S, Davies JK, Reynolds E, Dashper S, Stinear TP., Mol Microbiol 78(5), 2010
PMID: 21091506
The complete genome sequence of Corynebacterium pseudotuberculosis FRC41 isolated from a 12-year-old girl with necrotizing lymphadenitis reveals insights into gene-regulatory networks contributing to virulence.
Trost E, Ott L, Schneider J, Schröder J, Jaenicke S, Goesmann A, Husemann P, Stoye J, Dorella FA, Rocha FS, Soares Sde C, D'Afonseca V, Miyoshi A, Ruiz J, Silva A, Azevedo V, Burkovski A, Guiso N, Join-Lambert OF, Kayal S, Tauch A., BMC Genomics 11(), 2010
PMID: 21192786
Genetic makeup of the Corynebacterium glutamicum LexA regulon deduced from comparative transcriptomics and in vitro DNA band shift assays.
Jochmann N, Kurze AK, Czaja LF, Brinkrolf K, Brune I, Hüser AT, Hansmeier N, Pühler A, Borovok I, Tauch A., Microbiology 155(pt 5), 2009
PMID: 19372162
MotifAdjuster: a tool for computational reassessment of transcription factor binding site annotations.
Keilwagen J, Baumbach J, Kohl TA, Grosse I., Genome Biol 10(5), 2009
PMID: 19409082
D-MATRIX: a web tool for constructing weight matrix of conserved DNA motifs.
Sen N, Mishra M, Khan F, Meena A, Sharma A., Bioinformation 3(10), 2009
PMID: 19759861
MOODS: fast search for position weight matrix matches in DNA sequences.
Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E., Bioinformatics 25(23), 2009
PMID: 19773334
Discovery of regulatory elements is improved by a discriminatory approach.
Valen E, Sandelin A, Winther O, Krogh A., PLoS Comput Biol 5(11), 2009
PMID: 19911049
Mycolactone gene expression is controlled by strong SigA-like promoters with utility in studies of Mycobacterium ulcerans and buruli ulcer.
Tobias NJ, Seemann T, Pidot SJ, Porter JL, Marsollier L, Marion E, Letournel F, Zakir T, Azuolas J, Wallace JR, Hong H, Davies JK, Howden BP, Johnson PD, Jenkin GA, Stinear TP., PLoS Negl Trop Dis 3(11), 2009
PMID: 19936295
A probabilistic method for small RNA flowgram matching.
Vacic V, Jin H, Zhu JK, Lonardi S., Pac Symp Biocomput (), 2008
PMID: 18229677
ReXSpecies--a tool for the analysis of the evolution of gene regulation across species.
Struckmann S, Araúzo-Bravo MJ, Schöler HR, Reinbold RA, Fuellen G., BMC Evol Biol 8(), 2008
PMID: 18410675
Efficient representation and P-value computation for high-order Markov motifs.
da Fonseca PG, Guimarães KS, Sagot MF., Bioinformatics 24(16), 2008
PMID: 18689819
Determining significance of pairwise co-occurrences of events in bursty sequences.
Haiminen N, Mannila H, Terzi E., BMC Bioinformatics 9(), 2008
PMID: 18691400
Performance evaluation of DNA motif discovery programs.
Singh CP, Khan F, Mishra BN, Chauhan DS., Bioinformation 3(5), 2008
PMID: 19255635
CoryneRegNet 3.0--an interactive systems biology platform for the analysis of gene regulatory networks in corynebacteria and Escherichia coli.
Baumbach J, Wittkop T, Rademacher K, Rahmann S, Brinkrolf K, Tauch A., J Biotechnol 129(2), 2007
PMID: 17229482
Computing exact P-values for DNA motifs.
Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang MQ., Bioinformatics 23(5), 2007
PMID: 17237046
Statistical significance of cis-regulatory modules.
Schones DE, Smith AD, Zhang MQ., BMC Bioinformatics 8(), 2007
PMID: 17241466
Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing.
Wittkop T, Baumbach J, Lobo FP, Rahmann S., BMC Bioinformatics 8(), 2007
PMID: 17941985
Reduced alphabet motif methodology for GPCR annotation.
Gangal R, Kumar KK., J Biomol Struct Dyn 25(3), 2007
PMID: 17937491
Efficient and accurate P-value computation for Position Weight Matrices.
Touzet H, Varré JS., Algorithms Mol Biol 2(), 2007
PMID: 18072973
Fast index based algorithms and software for matching position specific scoring matrices.
Beckstette M, Homann R, Giegerich R, Kurtz S., BMC Bioinformatics 7(), 2006
PMID: 16930469

36 References

Daten bereitgestellt von Europe PubMed Central.

Profile analysis: detection of distantly related proteins.
Gribskov M, McLachlan AD, Eisenberg D., Proc. Natl. Acad. Sci. U.S.A. 84(13), 1987
PMID: 3474607
Recent improvements to the PROSITE database.
Hulo N, Sigrist CJ, Le Saux V, Langendijk-Genevaux PS, Bordoli L, Gattiker A, De Castro E, Bucher P, Bairoch A., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681377
PRINTS and its automatic supplement, prePRINTS.
Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C., Nucleic Acids Res. 31(1), 2003
PMID: 12520033
Increased coverage of protein families with the blocks database servers.
Henikoff JG, Greene EA, Pietrokovski S, Henikoff S., Nucleic Acids Res. 28(1), 2000
PMID: 10592233
Minimal-risk scoring matrices for sequence analysis
Wu T, Nevill-Manning C, Brutlag D., 1999
JASPAR: an open-access database for eukaryotic transcription factor binding profiles.
Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681366
TRANSFAC: transcriptional regulation, from patterns to profiles.
Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E., Nucleic Acids Res. 31(1), 2003
PMID: 12520026
FingerPRINTScan: intelligent searching of the PRINTS motif database.
Scordis P, Flower DR, Attwood TK., Bioinformatics 15(10), 1999
PMID: 10705433
MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data.
Quandt K, Frech K, Karas H, Wingender E, Werner T., Nucleic Acids Res. 23(23), 1995
PMID: 8532532
The efficient computation of position-specific match scores with the fast fourier transform.
Rajasekaran S, Jin X, Spouge JL., J. Comput. Biol. 9(1), 2002
PMID: 11911793
Using sequence compression to speedup probabilistic profile matching.
Freschi V, Bogliolo A., Bioinformatics 21(10), 2005
PMID: 15713733
Fast probabilistic analysis of sequence function using scoring matrices.
Wu TD, Nevill-Manning CG, Brutlag DL., Bioinformatics 16(3), 2000
PMID: 10869016
Accelerating Protein Classification Using Suffix Trees
Dorohonceanu B, Nevill-Manning C., 2000
Some string matching problems from Bioinformatics which still need better solutions
Gonnet H., 2004
Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks.
Tatusov RL, Altschul SF, Koonin EV., Proc. Natl. Acad. Sci. U.S.A. 91(25), 1994
PMID: 7991589
Using substitution probabilities to improve position-specific scoring matrices
Henikoff J, Henikoff S., 1996
MATCH: A tool for searching transcription factor binding sites in DNA sequences.
Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E., Nucleic Acids Res. 31(13), 2003
PMID: 12824369
Replacing Suffix Trees with Enhanced Suffix Arrays
Abouelhoda M, Kurtz S, Ohlebusch E., 2004
Reducing the Space Requirement of Suffix Trees
Kurtz S., 1999
A Comparison of Imperative and Purely Functional Suffix Tree Constructions
Giegerich R, Kurtz S., 1995
Dynamic programming algorithms for two statistical problems in computational biology
Rahmann S., 2003
On the Power of Profiles for Transcription Factor Binding Site Detection
Rahmann S, Müller T, Vingron M., 2003
Fast index based algorithms and software for matching position specific scoring matrices.
Beckstette M, Homann R, Giegerich R, Kurtz S., BMC Bioinformatics 7(), 2006
PMID: 16930469
PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays
Beckstette M, Strothmann D, Homann R, Giegerich R, Kurtz S., 2004
Simple Linear Work Suffix Array Construction
Kärkkäinen J, Sanders P., 2003
Linear-time Longest-Common-Prefix Computation in Suffix Arrays and its Applications
Kasai T, Lee G, Arimura H, Arikawa S, Park K., 2001
A Combinatorial Problem
de N., 1946
Reduction of protein sequence complexity by residue grouping.
Li T, Fan K, Wang J, Wang W., Protein Eng. 16(5), 2003
PMID: 12826723
Simplified amino acid alphabets for protein fold recognition and implications for folding.
Murphy LR, Wallqvist A, Levy RM., Protein Eng. 13(3), 2000
PMID: 10775656

Castillo G., 1988

Embrechts P, Klüppelberg C, Mikosch T., 1997
Approximations to profile score distributions
Goldstein L, Waterman M., 1994
The Vmatch large scale sequence analysis software
Kurtz S., 2005
CisML: an XML-based format for sequence motif detection software.
Haverty PM, Weng Z., Bioinformatics 20(11), 2004
PMID: 15001475
A H+-gated urea channel: the link between Helicobacter pylori urease and gastric colonization.
Weeks DL, Eskandari S, Scott DR, Sachs G., Science 287(5452), 2000
PMID: 10642549

Markieren/ Markierung löschen
Markierte Publikationen

Open Data PUB

Web of Science

Dieser Datensatz im Web of Science®

PMID: 16930469
PubMed | Europe PMC

Suchen in

Google Scholar