Large scale hierarchical clustering of protein sequences

Krause A, Stoye J, Vingron M (2005)
BMC Bioinformatics 6(1): 15.

Journal article | Published | English
Krause, Antje; Stoye, Jens; Vingron, Martin
Abstract / Note
Background: Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to.
Results: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering which is being made available for querying and browsing at
Conclusions: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
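To make the idea of a graph-based, two-level grouping concrete, here is a minimal sketch. It is not the authors' algorithm: it substitutes plain threshold-based connected components (single linkage) for the method the paper describes. The function names, the similarity scores, and both thresholds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(g) for g in groups.values())


def two_level_clustering(nodes, scores, coarse, fine):
    """Cluster at a loose and a strict similarity threshold.

    scores maps (a, b) pairs to a similarity value; 'coarse' and 'fine'
    are illustrative thresholds, not values taken from the paper.
    """
    def level(threshold):
        edges = [pair for pair, s in scores.items() if s >= threshold]
        return connected_components(nodes, edges)

    return {"coarse": level(coarse), "fine": level(fine)}


# Toy data: five hypothetical sequences and three pairwise similarities.
nodes = ["p1", "p2", "p3", "p4", "p5"]
scores = {("p1", "p2"): 0.9, ("p2", "p3"): 0.4, ("p4", "p5"): 0.8}
result = two_level_clustering(nodes, scores, coarse=0.3, fine=0.7)
print(result["coarse"])  # [['p1', 'p2', 'p3'], ['p4', 'p5']]
print(result["fine"])    # [['p1', 'p2'], ['p3'], ['p4', 'p5']]
```

Because every fine-level edge is also a coarse-level edge, each fine cluster is contained in exactly one coarse cluster, which is what makes the result a hierarchy. Pure single linkage is known to chain unrelated groups together through weak links, which is one reason real methods look at the graph topology rather than a fixed cutoff.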
Protein Clustering; Clustering
BMC Bioinformatics


All files available under the following license(s):
Copyright Statement:
This object is protected by copyright and/or related rights. [...]
Access Level
OA Open Access

28 citations in Europe PMC

Data provided by Europe PubMed Central.

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y., PLoS Comput Biol 13(4), 2017
PMID: 28437450
Biological function derived from predicted structures in CASP11.
Huwe PJ, Xu Q, Shapovalov MV, Modi V, Andrake MD, Dunbrack RL., Proteins 84 Suppl 1, 2016
PMID: 27181425
Clustering analysis of proteins from microbial genomes at multiple levels of resolution.
Zaslavsky L, Ciufo S, Fedorov B, Tatusova T., BMC Bioinformatics 17 Suppl 8, 2016
PMID: 27586436
Partitioning Biological Networks into Highly Connected Clusters with Maximum Edge Coverage.
Hüffner F, Komusiewicz C, Liebtrau A, Niedermeier R., IEEE/ACM Trans Comput Biol Bioinform 11(3), 2014
PMID: 26356014
Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V., Sci Rep 4, 2014
PMID: 25355642
Environmental shaping of codon usage and functional adaptation across microbial communities.
Roller M, Lucić V, Nagy I, Perica T, Vlahovicek K., Nucleic Acids Res 41(19), 2013
PMID: 23921637
A laboratory information management system for DNA barcoding workflows.
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V., Integr Biol (Camb) 4(7), 2012
PMID: 22344310
GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res 40(19), 2012
PMID: 22790981
Systematic and searchable classification of cytochrome P450 proteins encoded by fungal and oomycete genomes.
Moktali V, Park J, Fedorova-Abrams ND, Park B, Choi J, Lee YH, Kang S., BMC Genomics 13, 2012
PMID: 23033934
Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810
Ortho2ExpressMatrix--a web server that interprets cross-species gene expression data by gene family information.
Meinel T, Schweiger MR, Ludewig AH, Chenna R, Krobitsch S, Herwig R., BMC Genomics 12, 2011
PMID: 21970648
Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS One 5(10), 2010
PMID: 20976221
Protein function annotation by homology-based inference.
Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A., Genome Biol 10(2), 2009
PMID: 19226439
A roadmap of clustering algorithms: finding a match for a biomedical application.
Andreopoulos B, An A, Wang X, Schroeder M., Brief Bioinform 10(3), 2009
PMID: 19240124
S-layer, surface-accessible, and concanavalin A binding proteins of Methanosarcina acetivorans and Methanosarcina mazei.
Francoleon DR, Boontheung P, Yang Y, Kin U, Ytterberg AJ, Denny PA, Denny PC, Loo JA, Gunsalus RP, Loo RR., J Proteome Res 8(4), 2009
PMID: 19228054
Partitioning clustering algorithms for protein sequence data sets.
Fayech S, Essoussi N, Limam M., BioData Min 2(1), 2009
PMID: 19341454
Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015
MACHOS: Markov clusters of homologous subsequences.
Wong S, Ragan MA., Bioinformatics 24(13), 2008
PMID: 18586748
Large scale clustering of protein sequences with FORCE -A layout based heuristic for weighted cluster editing.
Wittkop T, Baumbach J, Lobo FP, Rahmann S., BMC Bioinformatics 8, 2007
PMID: 17941985
Reciprocal illumination in the gene content tree of life.
Lienau EK, DeSalle R, Rosenfeld JA, Planet PJ., Syst Biol 55(3), 2006
PMID: 16861208
A limited universe of membrane protein families and folds.
Oberai A, Ihm Y, Kim S, Bowie JU., Protein Sci 15(7), 2006
PMID: 16815920
Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?
Birkholtz LM, Bastien O, Wells G, Grando D, Joubert F, Kasam V, Zimmermann M, Ortet P, Jacq N, Saïdani N, Roy S, Hofmann-Apitius M, Breton V, Louw AI, Maréchal E., Malar J 5, 2006
PMID: 17112376



PMID: 15663796