IsoSVM - Distinguishing isoforms and paralogs on the protein level

Spitzer M, Lorkowski S, Cullen P, Sczyrba A, Fuellen G (2006)
BMC Bioinformatics 7(1).

Journal Article | Published | English
Abstract
Background: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known if two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowing the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful since paralogs and orthologs feature substitutions with respect to each other, as opposed to isoforms, which do not.
Results: The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input of the SVM classifier, it is possible to automatically identify isoforms with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application we use IsoSVM to estimate that a set of Xenopus laevis EST clusters consists of approximately 81% cases where sequences are each other's paralogs and 19% cases where sequences are each other's isoforms. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution.
Conclusion: We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM.
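The contrast drawn in the Background (paralogs and orthologs show substitutions in aligned positions, isoforms do not) can be illustrated with a small sketch. It is not the IsoSVM feature set, which the abstract does not spell out; the function name `alignment_profile`, the toy sequences, and the idea of summarizing an alignment by a substitution fraction are assumptions made purely for illustration.

```python
def alignment_profile(a: str, b: str) -> dict:
    """Count identical, substituted, and gapped columns of a pairwise alignment.

    `a` and `b` are two already-aligned protein sequences of equal length,
    with '-' marking gaps. Hypothetical helper, not part of IsoSVM.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = substituted = gapped = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column empty in both sequences carries no signal
        if x == "-" or y == "-":
            gapped += 1  # residue present in only one sequence
        elif x == y:
            identical += 1
        else:
            substituted += 1  # both residues present but different
    aligned = identical + substituted
    return {
        "identical": identical,
        "substituted": substituted,
        "gapped": gapped,
        # Fraction of residue-vs-residue columns that differ.
        "substitution_fraction": substituted / aligned if aligned else 0.0,
    }


# Toy pair resembling isoforms: one contiguous block is missing, no substitutions.
isoform_like = alignment_profile("MKTAYIAKQRQISFVK", "MKTAY-----QISFVK")

# Toy pair resembling paralogs: full-length alignment with scattered substitutions.
paralog_like = alignment_profile("MKTAYIAKQRQISFVK", "MRTAYLAKQKQLSFIK")
```

In this toy setting the isoform-like pair differs only by a gap block, while the paralog-like pair differs only by substitutions. A classifier such as the SVM described in the Results would be trained on numeric features of this general kind rather than on a hand-set threshold.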

Cite this

Spitzer M, Lorkowski S, Cullen P, Sczyrba A, Fuellen G. IsoSVM - Distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics. 2006;7(1).
Spitzer, M., Lorkowski, S., Cullen, P., Sczyrba, A., & Fuellen, G. (2006). IsoSVM - Distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics, 7(1).
Spitzer, M., Lorkowski, S., Cullen, P., Sczyrba, A., and Fuellen, G. (2006). IsoSVM - Distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics 7.
Spitzer, M., et al., 2006. IsoSVM - Distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics, 7(1).
M. Spitzer, et al., “IsoSVM - Distinguishing isoforms and paralogs on the protein level”, BMC Bioinformatics, vol. 7, 2006.
Spitzer, M., Lorkowski, S., Cullen, P., Sczyrba, A., Fuellen, G.: IsoSVM - Distinguishing isoforms and paralogs on the protein level. BMC Bioinformatics. 7, (2006).
Spitzer, M, Lorkowski, S, Cullen, P, Sczyrba, Alexander, and Fuellen, Georg. “IsoSVM - Distinguishing isoforms and paralogs on the protein level”. BMC Bioinformatics 7.1 (2006).
Access Level: Open Access

2 Citations in Europe PMC

Data provided by Europe PubMed Central.

Ancient dynamin segments capture early stages of host-mitochondrial integration.
Purkanti R, Thattai M., Proc. Natl. Acad. Sci. U.S.A. 112(9), 2015
PMID: 25691734

Sources

PMID: 16519805