# Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Wolfsheimer S, Herms I, Rahmann S, Hartmann AK (2011) *BMC Bioinformatics* 12(1): 47.

*Journal Article* | *Original Article* | *Published* | *English*

Abstract

Background: Molecular database search tools need statistical models to assess the significance of the resulting hits. The classical approach asks how probable it is to observe a given score by pure chance. Asymptotic theories for this question are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific sequence compositions. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of hidden Markov models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated.

Results: In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and are therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model and of a hidden Markov sequence model, each combined with a position-based scoring scheme, against the classical approach.

Conclusions: The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
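The idea of reaching the tail of the score distribution by biased sampling can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy four-letter alphabet with uniform i.i.d. letters, a fixed query, a simple match/mismatch/linear-gap scoring scheme, and a single-temperature Metropolis chain with weight exp(S/T) in place of the generalized ensembles mentioned in the abstract. All function names and parameter values are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"  # toy alphabet (assumption; the paper deals with proteins)


def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            v = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(v)
            best = max(best, v)
        prev = cur
    return best


def sample_biased(query, length, temperature, steps, rng):
    """Metropolis chain over subject sequences, biased by exp(S / T).

    Returns a histogram {score: visits} collected in the biased ensemble.
    """
    y = [rng.choice(ALPHABET) for _ in range(length)]
    s = local_score(query, y)
    hist = {}
    for _ in range(steps):
        i = rng.randrange(length)
        old = y[i]
        y[i] = rng.choice(ALPHABET)  # symmetric proposal under uniform letters
        s_new = local_score(query, y)
        if s_new >= s or rng.random() < math.exp((s_new - s) / temperature):
            s = s_new  # accept the move
        else:
            y[i] = old  # reject and restore the previous letter
        hist[s] = hist.get(s, 0) + 1
    return hist


def reweight(hist, temperature):
    """Undo the bias: P(s) is proportional to H_T(s) * exp(-s / T)."""
    raw = {s: n * math.exp(-s / temperature) for s, n in hist.items()}
    z = sum(raw.values())
    return {s: w / z for s, w in sorted(raw.items())}


if __name__ == "__main__":
    rng = random.Random(0)
    hist = sample_biased("ACGTACGTAC", length=20, temperature=1.0,
                         steps=20000, rng=rng)
    for score, prob in reweight(hist, 1.0).items():
        print(score, prob)
```

A single temperature only covers part of the score range well, so the normalisation here is rough. Covering the whole distribution would require combining several ensembles, which this sketch leaves out.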

### Cite this

Wolfsheimer S, Herms I, Rahmann S, Hartmann AK (2011) Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. *BMC Bioinformatics* 12(1): 47. doi:10.1186/1471-2105-12-47

### Sources

PMID: 21291566
