Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling
Wolfsheimer S, Herms I, Rahmann S, Hartmann AK (2011)
BMC Bioinformatics 12(1): 47.
Journal article
| Published | English
Authors
Wolfsheimer, Stefan;
Herms, Inke (UniBi);
Rahmann, Sven;
Hartmann, Alexander K.
Abstract / Note
Background: Molecular database search tools need statistical models to assess the significance of the resulting hits. The classical approach asks how probable it is to observe a certain score by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as large-scale database homology searches for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of hidden Markov models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated.
Results: In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and are therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model, in connection with a position-based scoring scheme, against the classical approach.
Conclusions: The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
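The idea of combining Markov chain Monte Carlo with importance sampling, as described in the abstract, can be illustrated with a small toy sketch. It is not the authors' implementation. It assumes a fixed query, uniform i.i.d. subject sequences over a four-letter alphabet, and a simple match/mismatch/linear-gap scoring scheme; the function names, parameters, and the single fixed "temperature" are all hypothetical choices made for illustration. Subjects are sampled with a bias towards high scores (weight proportional to exp(S/T)), and the bias is removed afterwards by reweighting.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumption: toy alphabet with uniform i.i.d. letters


def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with linear gap costs."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def sample_biased(query, length, temperature, steps, rng):
    """Metropolis chain over subjects, biased with weight exp(S / T)."""
    y = [rng.choice(ALPHABET) for _ in range(length)]
    s = local_score(query, y)
    scores = []
    for _ in range(steps):
        # Symmetric proposal: redraw one position uniformly at random.
        pos = rng.randrange(length)
        old = y[pos]
        y[pos] = rng.choice(ALPHABET)
        s_new = local_score(query, y)
        # Accept with probability min(1, exp((s_new - s) / T)).
        if s_new >= s or rng.random() < math.exp((s_new - s) / temperature):
            s = s_new
        else:
            y[pos] = old  # reject: restore the previous letter
        scores.append(s)
    return scores


def reweight(scores, temperature):
    """Remove the bias: p(s) is proportional to histogram(s) * exp(-s / T)."""
    hist = Counter(scores)
    raw = {s: n * math.exp(-s / temperature) for s, n in hist.items()}
    norm = sum(raw.values())
    return {s: w / norm for s, w in sorted(raw.items())}


if __name__ == "__main__":
    rng = random.Random(0)
    scores = sample_biased("ACGTACGT", 12, 1.0, 5000, rng)
    for s, p in reweight(scores, 1.0).items():
        print(s, p)
```

A single biased run only yields reliable estimates in the score range it actually visits. Covering the whole distribution, including the bulk and the far tail, would in practice require combining several biases or a generalized-ensemble scheme, which this sketch does not attempt.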
Year of publication
2011
Journal title
BMC Bioinformatics
Volume
12
Issue
1
Article no.
47
ISSN
1471-2105
Page URI
https://pub.uni-bielefeld.de/record/2425310
Cite
Wolfsheimer S, Herms I, Rahmann S, Hartmann AK. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics. 2011;12(1): 47.
Wolfsheimer, S., Herms, I., Rahmann, S., & Hartmann, A. K. (2011). Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics, 12(1), 47. https://doi.org/10.1186/1471-2105-12-47
Wolfsheimer, Stefan, Herms, Inke, Rahmann, Sven, and Hartmann, Alexander K. 2011. “Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling”. BMC Bioinformatics 12 (1): 47.
Wolfsheimer, S., Herms, I., Rahmann, S., and Hartmann, A. K. (2011). Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics 12:47.
Wolfsheimer, S., et al., 2011. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics, 12(1): 47.
S. Wolfsheimer, et al., “Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling”, BMC Bioinformatics, vol. 12, 2011, 47.
Wolfsheimer, S., Herms, I., Rahmann, S., Hartmann, A.K.: Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics. 12, 47 (2011).
Wolfsheimer, Stefan, Herms, Inke, Rahmann, Sven, and Hartmann, Alexander K. “Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling”. BMC Bioinformatics 12.1 (2011): 47.
Data provided by the European Bioinformatics Institute (EBI)
9 citations in Europe PMC
Data provided by Europe PubMed Central.
Statistical significance based on length and position of the local score in a model of i.i.d. sequences.
Lagnoux A, Mercier S, Vallois P., Bioinformatics 33(5), 2017
PMID: 28035025
Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences.
Wang W, Sun L, Zhang S, Zhang H, Shi J, Xu T, Li K., BMC Bioinformatics 18(1), 2017
PMID: 28606086
Identification of non-random sequence properties in groups of signature peptides obtained in random sequence peptide microarray experiments.
Kuznetsov IB., Biopolymers 106(3), 2016
PMID: 27037995
Discovery of prognostic biomarkers for predicting lung cancer metastasis using microarray and survival data.
Huang HL, Wu YC, Su LJ, Huang YJ, Charoenkwan P, Chen WL, Lee HC, Chu WC, Ho SY., BMC Bioinformatics 16, 2015
PMID: 25881029
PR2ALIGN: a stand-alone software program and a web-server for protein sequence alignment using weighted biochemical properties of amino acids.
Kuznetsov IB, McDuffie M., BMC Res Notes 8, 2015
PMID: 25947299
Predicting Neuroinflammation in Morphine Tolerance for Tolerance Therapy from Immunostaining Images of Rat Spinal Cord.
Lin SL, Chang FL, Ho SY, Charoenkwan P, Wang KW, Huang HL., PLoS One 10(10), 2015
PMID: 26437460
Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes.
Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H., PLoS One 9(1), 2014
PMID: 24475169
SCMHBP: prediction and analysis of heme binding proteins using propensity scores of dipeptides.
Liou YF, Charoenkwan P, Srinivasulu Y, Vasylenko T, Lai SC, Lee HC, Chen YH, Huang HL, Ho SY., BMC Bioinformatics 15 Suppl 16, 2014
PMID: 25522279
Pharmacophore Alignment Search Tool (PhAST): Significance Assessment of Chemical Similarity.
Hähnke V, Rupp M, Hartmann AK, Schneider G., Mol Inform 32(7), 2013
PMID: 27481770
46 References
Data provided by Europe PubMed Central.
AUTHOR UNKNOWN, 2005
AUTHOR UNKNOWN, 1998
Identification of common molecular subsequences.
Smith TF, Waterman MS., J. Mol. Biol. 147(1), 1981
PMID: 7265238
A tutorial on hidden Markov models and selected applications in speech recognition
AUTHOR UNKNOWN, 1989
Basic local alignment search tool.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ., J. Mol. Biol. 215(3), 1990
PMID: 2231712
AUTHOR UNKNOWN, 2009
A new approach to sequence comparison: normalized sequence alignment.
Arslan AN, Egecioglu O, Pevzner PA., Bioinformatics 17(4), 2001
PMID: 11301301
The Universal Protein Resource (UniProt).
Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS., Nucleic Acids Res. 33(Database issue), 2005
PMID: 15608167
Amino acid substitution matrices from protein blocks.
Henikoff S, Henikoff JG., Proc. Natl. Acad. Sci. U.S.A. 89(22), 1992
PMID: 1438297
Exact distribution for the local score of one i.i.d. random sequence.
Mercier S, Daudin JJ., J. Comput. Biol. 8(4), 2001
PMID: 11571073
Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
Karlin S, Altschul SF., Proc. Natl. Acad. Sci. U.S.A. 87(6), 1990
PMID: 2315319
AUTHOR UNKNOWN, 1958
Large Deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments
AUTHOR UNKNOWN, 2004
Rapid and accurate estimates of statistical significance for sequence data base searches.
Waterman MS, Vingron M., Proc. Natl. Acad. Sci. U.S.A. 91(11), 1994
PMID: 8197109
The estimation of statistical parameters for local alignment score distributions.
Altschul SF, Bundschuh R, Olsen R, Hwa T., Nucleic Acids Res. 29(2), 2001
PMID: 11139604
Sampling rare events: statistics of local sequence alignments.
Hartmann AK., Phys Rev E Stat Nonlin Soft Matter Phys 65(5 Pt 2), 2002
PMID: 12059642
Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail.
Wolfsheimer S, Burghardt B, Hartmann AK., Algorithms Mol Biol 2, 2007
PMID: 17625018
The compositional adjustment of amino acid substitution matrices.
Yu YK, Wootton JC, Altschul SF., Proc. Natl. Acad. Sci. U.S.A. 100(26), 2003
PMID: 14663142
The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions.
Yu YK, Altschul SF., Bioinformatics 21(7), 2004
PMID: 15509610
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ., Nucleic Acids Res. 25(17), 1997
PMID: 9254694
AUTHOR UNKNOWN, 2003
Non-symmetric score matrices and the detection of homologous transmembrane proteins
AUTHOR UNKNOWN, 2001
A probabilistic model of local sequence alignment that simplifies statistical significance estimation.
Eddy SR., PLoS Comput. Biol. 4(5), 2008
PMID: 18516236
A hidden Markov model for predicting transmembrane helices in protein sequences
AUTHOR UNKNOWN, 1998
Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.
Krogh A, Larsson B, von Heijne G, Sonnhammer EL., J. Mol. Biol. 305(3), 2001
PMID: 11152613
Monte Carlo Sampling Methods Using Markov Chains and Their Applications
AUTHOR UNKNOWN, 1970
AUTHOR UNKNOWN, 1999
AUTHOR UNKNOWN, 2008
Multicanonical ensemble: A new approach to simulate first-order phase transitions.
Berg BA, Neuhaus T., Phys. Rev. Lett. 68(1), 1992
PMID: 10045099
Transition Matrix Monte Carlo Reweighting and Dynamics
AUTHOR UNKNOWN, 1999
Transition matrix Monte Carlo method
AUTHOR UNKNOWN, 1999
Monte Carlo algorithms based on the number of potential moves
AUTHOR UNKNOWN, 2000
Efficient, multiple-range random walk algorithm to calculate the density of states.
Wang F, Landau DP., Phys. Rev. Lett. 86(10), 2001
PMID: 11289852
Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram.
Wang F, Landau DP., Phys Rev E Stat Nonlin Soft Matter Phys 64(5 Pt 2), 2001
PMID: 11736008
Error estimates on averages of correlated data
AUTHOR UNKNOWN, 1989
On orthogonal and symplectic matrix ensembles
AUTHOR UNKNOWN, 1996
Exact asymptotic results for the Bernoulli matching model of sequence alignment.
Majumdar SN, Nechaev S., Phys Rev E Stat Nonlin Soft Matter Phys 72(2 Pt 1), 2005
PMID: 16196539
Exact solution of the Bernoulli matching model of sequence alignment
AUTHOR UNKNOWN, 2008
Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem.
Sardiu ME, Alves G, Yu YK., Phys Rev E Stat Nonlin Soft Matter Phys 72(6 Pt 1), 2005
PMID: 16485984
The ASTRAL Compendium in 2004.
Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681391
AUTHOR UNKNOWN, 1976
Performance limitations of flat-histogram methods.
Dayal P, Trebst S, Wessel S, Wurtz D, Troyer M, Sabhapandit S, Coppersmith SN., Phys. Rev. Lett. 92(9), 2004
PMID: 15089505
Optimizing the ensemble for equilibration in broad-histogram Monte Carlo simulations.
Trebst S, Huse DA, Troyer M., Phys Rev E Stat Nonlin Soft Matter Phys 70(4 Pt 2), 2004
PMID: 15600559
Significance of gapped sequence alignments.
Newberg LA., J. Comput. Biol. 15(9), 2008
PMID: 18973434
Sources
PMID: 21291566