Large scale hierarchical clustering of protein sequences

Krause, Antje; Stoye, Jens; Vingron, Martin

Krause A, Stoye J, Vingron M (2005)
BMC Bioinformatics 6(1): 15.

Journal article | Published | English

DOI

https://doi.org/10.1186/1471-2105-6-15

URN

urn:nbn:de:0070-pub-17751684

Authors

Krause, Antje; Stoye, Jens (UniBi); Vingron, Martin

Department

Center for Biotechnology > Research Group J. Stoye
Faculty of Technology > Genome Informatics Group
Center for Biotechnology > Institute for Bioinformatics

Abstract / Note

Background: Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to.
Results: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/.
Conclusions: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
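
The abstract describes grouping sequences into nested superfamily and family clusters on a similarity graph. The following is only a minimal illustrative sketch of that nesting idea, not the authors' method: it uses plain single-linkage thresholding with two fixed cutoffs, whereas the paper states that its partitioning is derived from the topology of the data itself. The function names, the toy scores, and both thresholds (`loose`, `strict`) are assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


def two_level_clustering(nodes, scores, loose, strict):
    """Return {superfamily: [families]} from pairwise similarity scores.

    scores maps (a, b) pairs to a similarity value (hypothetical scale).
    Edges with score >= loose define superfamilies; edges with
    score >= strict (strict >= loose) define families inside them.
    """
    loose_edges = [e for e, s in scores.items() if s >= loose]
    hierarchy = {}
    for superfamily in connected_components(nodes, loose_edges):
        members = set(superfamily)
        strict_edges = [
            (a, b) for (a, b), s in scores.items()
            if s >= strict and a in members and b in members
        ]
        hierarchy[tuple(superfamily)] = connected_components(
            superfamily, strict_edges
        )
    return hierarchy


# Toy data: made-up sequence identifiers and similarity scores.
nodes = ["a", "b", "c", "d", "e", "f"]
scores = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "d"): 0.8, ("e", "f"): 0.95}
result = two_level_clustering(nodes, scores, loose=0.3, strict=0.7)
```

In this toy input the weak `b`-`c` link joins two tight pairs into one superfamily while leaving them as separate families. Because every strict edge is also a loose edge, families always nest inside superfamilies.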

Keywords

Protein Clustering; Clustering

Publication year

2005

Journal title

BMC Bioinformatics

Volume

6

Issue

1

Article no.

15

ISSN

1471-2105

Page URI

https://pub.uni-bielefeld.de/record/1775168

Cite

Krause A, Stoye J, Vingron M. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics. 2005;6(1): 15.

Krause, A., Stoye, J., & Vingron, M. (2005). Large scale hierarchical clustering of protein sequences. BMC Bioinformatics, 6(1), 15. https://doi.org/10.1186/1471-2105-6-15

Krause, Antje, Stoye, Jens, and Vingron, Martin. 2005. “Large scale hierarchical clustering of protein sequences”. BMC Bioinformatics 6 (1): 15.

Krause, A., Stoye, J., and Vingron, M. (2005). Large scale hierarchical clustering of protein sequences. BMC Bioinformatics 6:15.

Krause, A., Stoye, J., & Vingron, M., 2005. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics, 6(1): 15.

A. Krause, J. Stoye, and M. Vingron, “Large scale hierarchical clustering of protein sequences”, BMC Bioinformatics, vol. 6, no. 1, 2005, Art. no. 15.

Krause, A., Stoye, J., Vingron, M.: Large scale hierarchical clustering of protein sequences. BMC Bioinformatics. 6, 15 (2005).

Krause, Antje, Stoye, Jens, and Vingron, Martin. “Large scale hierarchical clustering of protein sequences”. BMC Bioinformatics 6.1 (2005): 15.

All files are available under the following license(s):

Copyright Statement:

This object is protected by copyright and/or related rights. [...]

Full text(s)

Name

BMC_Bioinformatics_2005_Krause.pdf

Access Level

Open Access

Last uploaded

2019-09-06T08:48:16Z

MD5 checksum

473a25334a1f7370968ee41ee08aec11
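
The checksum above can be used to confirm that a downloaded copy of the file is intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names and the assumption that the PDF was saved under its listed name in the working directory are illustrative and not part of this record.

```python
import hashlib

# MD5 value as listed in this record.
EXPECTED_MD5 = "473a25334a1f7370968ee41ee08aec11"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Assuming the file was saved locally as `BMC_Bioinformatics_2005_Krause.pdf`, calling `matches_checksum("BMC_Bioinformatics_2005_Krause.pdf", EXPECTED_MD5)` should return `True` for an unmodified copy. MD5 is adequate for detecting accidental corruption but not for security purposes.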

Data provided by the European Bioinformatics Institute (EBI)

28 citations in Europe PMC

Data provided by Europe PubMed Central.

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y., PLoS Comput Biol 13(4), 2017
PMID: 28437450

Biological function derived from predicted structures in CASP11.
Huwe PJ, Xu Q, Shapovalov MV, Modi V, Andrake MD, Dunbrack RL., Proteins 84 Suppl 1, 2016
PMID: 27181425

Clustering analysis of proteins from microbial genomes at multiple levels of resolution.
Zaslavsky L, Ciufo S, Fedorov B, Tatusova T., BMC Bioinformatics 17 Suppl 8, 2016
PMID: 27586436

Topological analysis of the Escherichia coli WcaJ protein reveals a new conserved configuration for the polyisoprenyl-phosphate hexose-1-phosphate transferase family.
Furlong SE, Ford A, Albarnez-Rodriguez L, Valvano MA., Sci Rep 5, 2015
PMID: 25776537

Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage.
Hüffner F, Komusiewicz C, Liebtrau A, Niedermeier R., IEEE/ACM Trans Comput Biol Bioinform 11(3), 2014
PMID: 26356014

A protocol for species delineation of public DNA databases, applied to the Insecta.
Chesters D, Zhu CD., Syst Biol 63(5), 2014
PMID: 24929897

Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V., Sci Rep 4, 2014
PMID: 25355642

Environmental shaping of codon usage and functional adaptation across microbial communities.
Roller M, Lucić V, Nagy I, Perica T, Vlahovicek K., Nucleic Acids Res 41(19), 2013
PMID: 23921637

A laboratory information management system for DNA barcoding workflows.
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V., Integr Biol (Camb) 4(7), 2012
PMID: 22344310

GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res 40(19), 2012
PMID: 22790981

Meta-analysis of general bacterial subclades in whole-genome phylogenies using tree topology profiling.
Meinel T, Krause A., Evol Bioinform Online 8, 2012
PMID: 22915837

Systematic and searchable classification of cytochrome P450 proteins encoded by fungal and oomycete genomes.
Moktali V, Park J, Fedorova-Abrams ND, Park B, Choi J, Lee YH, Kang S., BMC Genomics 13, 2012
PMID: 23033934

Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810

Application of clustering analyses to the diagnosis of Huntington disease in mice and other diseases with well-defined group boundaries.
Nikas JB, Low WC., Comput Methods Programs Biomed 104(3), 2011
PMID: 21529982

Multicoil2: predicting coiled coils and their oligomerization states from sequence in the twilight zone.
Trigg J, Gutwin K, Keating AE, Berger B., PLoS One 6(8), 2011
PMID: 21901122

Ortho2ExpressMatrix--a web server that interprets cross-species gene expression data by gene family information.
Meinel T, Schweiger MR, Ludewig AH, Chenna R, Krobitsch S, Herwig R., BMC Genomics 12, 2011
PMID: 21970648

Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS One 5(10), 2010
PMID: 20976221

Protein function annotation by homology-based inference.
Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A., Genome Biol 10(2), 2009
PMID: 19226439

A roadmap of clustering algorithms: finding a match for a biomedical application.
Andreopoulos B, An A, Wang X, Schroeder M., Brief Bioinform 10(3), 2009
PMID: 19240124

S-layer, surface-accessible, and concanavalin A binding proteins of Methanosarcina acetivorans and Methanosarcina mazei.
Francoleon DR, Boontheung P, Yang Y, Kin U, Ytterberg AJ, Denny PA, Denny PC, Loo JA, Gunsalus RP, Loo RR., J Proteome Res 8(4), 2009
PMID: 19228054

Partitioning clustering algorithms for protein sequence data sets.
Fayech S, Essoussi N, Limam M., BioData Min 2(1), 2009
PMID: 19341454

Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015

MACHOS: Markov clusters of homologous subsequences.
Wong S, Ragan MA., Bioinformatics 24(13), 2008
PMID: 18586748

Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing.
Wittkop T, Baumbach J, Lobo FP, Rahmann S., BMC Bioinformatics 8, 2007
PMID: 17941985

Automated learning of generative models for subcellular location: building blocks for systems biology.
Zhao T, Murphy RF., Cytometry A 71(12), 2007
PMID: 17972315

Reciprocal illumination in the gene content tree of life.
Lienau EK, DeSalle R, Rosenfeld JA, Planet PJ., Syst Biol 55(3), 2006
PMID: 16861208

A limited universe of membrane protein families and folds.
Oberai A, Ihm Y, Kim S, Bowie JU., Protein Sci 15(7), 2006
PMID: 16815920

Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?
Birkholtz LM, Bastien O, Wells G, Grando D, Joubert F, Kasam V, Zimmermann M, Ortet P, Jacq N, Saïdani N, Roy S, Hofmann-Apitius M, Breton V, Louw AI, Maréchal E., Malar J 5, 2006
PMID: 17112376

25 References

Data provided by Europe PubMed Central.

ProtoMap: automatic classification of protein sequences and hierarchy of protein families.
Yona G, Linial N, Linial M., Nucleic Acids Res. 28(1), 2000
PMID: 10592179

ProtoNet: hierarchical classification of the protein space.
Sasson O, Vaaknin A, Fleischer H, Portugaly E, Bilu Y, Linial N, Linial M., Nucleic Acids Res. 31(1), 2003
PMID: 12520020

Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters.
Kriventseva EV, Servant F, Apweiler R., Nucleic Acids Res. 31(1), 2003
PMID: 12520029

iProClass: an integrated, comprehensive and annotated protein classification database.
Wu CH, Xiao C, Hou Z, Huang H, Barker WC., Nucleic Acids Res. 29(1), 2001
PMID: 11125047

PIRSF: family classification system at the Protein Information Resource.
Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, Suzek BE, Arminski L, Chen Y, Zhang J, Cardenas JL, Chung S, Castro-Alvear J, Dinkov G, Barker WC., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681371

Graph-based clustering for finding distant relationships in a large set of protein sequences.
Kawaji H, Takenaka Y, Matsuda H., Bioinformatics 20(2), 2004
PMID: 14734316

An efficient algorithm for large-scale detection of protein families.
Enright AJ, Van Dongen S, Ouzounis CA., Nucleic Acids Res. 30(7), 2002
PMID: 11917018

Towards a covering set of protein family profiles.
Heger A, Holm L., Prog. Biophys. Mol. Biol. 73(5), 2000
PMID: 11063778

Domains, motifs and clusters in the protein universe.
Liu J, Rost B., Curr Opin Chem Biol 7(1), 2003
PMID: 12547420

A set-theoretic approach to database searching and clustering.
Krause A, Vingron M., Bioinformatics 14(5), 1998
PMID: 9682056

The Pfam protein families database.
Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681378

The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M., Nucleic Acids Res. 31(1), 2003
PMID: 12520024

Ensembl 2004.
Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, Curwen V, Cutts T, Down T, Durbin R, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Kahari A, Jekosch K, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark C, Clamp M, Hubbard T., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681459

The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community.
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P., Nucleic Acids Res. 31(1), 2003
PMID: 12519987

Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms.
Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, Hong EL, Issel-Tarver L, Nash R, Sethuraman A, Starr B, Theesfeld CL, Andrada R, Binkley G, Dong Q, Lane C, Schroeder M, Botstein D, Cherry JM., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681421

The genome sequence of Schizosaccharomyces pombe.
Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, Sgouros J, Peat N, Hayles J, Baker S, Basham D, Bowman S, Brooks K, Brown D, Brown S, Chillingworth T, Churcher C, Collins M, Connor R, Cronin A, Davis P, Feltwell T, Fraser A, Gentles S, Goble A, Hamlin N, Harris D, Hidalgo J, Hodgson G, Holroyd S, Hornsby T, Howarth S, Huckle EJ, Hunt S, Jagels K, James K, Jones L, Jones M, Leather S, McDonald S, McLean J, Mooney P, Moule S, Mungall K, Murphy L, Niblett D, Odell C, Oliver K, O'Neil S, Pearson D, Quail MA, Rabbinowitsch E, Rutherford K, Rutter S, Saunders D, Seeger K, Sharp S, Skelton J, Simmonds M, Squares R, Squares S, Stevens K, Taylor K, Taylor RG, Tivey A, Walsh S, Warren T, Whitehead S, Woodward J, Volckaert G, Aert R, Robben J, Grymonprez B, Weltjens I, Vanstreels E, Rieger M, Schafer M, Muller-Auer S, Gabel C, Fuchs M, Dusterhoft A, Fritzc C, Holzer E, Moestl D, Hilbert H, Borzym K, Langer I, Beck A, Lehrach H, Reinhardt R, Pohl TM, Eger P, Zimmermann W, Wedler H, Wambutt R, Purnelle B, Goffeau A, Cadieu E, Dreano S, Gloux S, Lelaure V, Mottier S, Galibert F, Aves SJ, Xiang Z, Hunt C, Moore K, Hurst SM, Lucas M, Rochet M, Gaillardin C, Tallada VA, Garzon A, Thode G, Daga RR, Cruzado L, Jimenez J, Sanchez M, del Rey F, Benito J, Dominguez A, Revuelta JL, Moreno S, Armstrong J, Forsburg SL, Cerutti L, Lowe T, McCombie WR, Paulsen I, Potashkin J, Shpakovski GV, Ussery D, Barrell BG, Nurse P, Cerrutti L., Nature 415(6874), 2002
PMID: 11859360

The SYSTERS Protein Family Database in 2005.
Meinel T, Krause A, Luz H, Vingron M, Staub E., Nucleic Acids Res. 33(Database issue), 2005
PMID: 15608183

Network analysis. The structure of the Web.
Kleinberg J, Lawrence S., Science 294(5548), 2001
PMID: 11729296

CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis
Sharan R, Shamir R., 2000

Identification of common molecular subsequences.
Smith TF, Waterman MS., J. Mol. Biol. 147(1), 1981
PMID: 7265238

Paracel
Author unknown, no date

An Algorithm for Clustering cDNAs for Gene Expression Analysis
Hartuv E, Schmitt A, Lange J, Meier-Evert S, Lehrach H, Shamir R., 1999

LEDA: A Platform for Combinatorial and Geometric Computing
Mehlhorn K, Näher S., 1995

The ENZYME database in 2000.
Bairoch A., Nucleic Acids Res. 28(1), 2000
PMID: 10592255

Nouvelles recherches sur la distribution florale
Jaccard P., 1908

Sources

PMID: 15663796