Large scale hierarchical clustering of protein sequences

Krause A, Stoye J, Vingron M (2005)
BMC Bioinformatics 6(1).

Journal Article | Published | English
Background: Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. Despite the sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.

Results: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering which is being made available for querying and browsing.

Conclusions: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
Access Level: Open Access







PMID: 15663796