Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

Wittkop T, Baumbach J, Lobo FP, Rahmann S (2007)
BMC Bioinformatics 8(1).

Journal Article | Published | English
Abstract
Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed.
Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.
Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
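Illustration (not part of the published abstract): the weighted cluster editing model named above can be sketched as a cost function. The sketch assumes the common formulation in which two objects are joined by an edge when their similarity exceeds a threshold, and in which adding or deleting an edge costs the absolute difference between the similarity and the threshold. The function name `editing_cost`, the default similarity of 0.0 for unlisted pairs, and the toy data are assumptions made for this example. This is not the FORCE heuristic itself, which is implemented in Java and searches for a low-cost clustering; the code only scores a clustering that is already given.

```python
from itertools import combinations


def editing_cost(similarity, clusters, threshold):
    """Cost of editing the thresholded similarity graph into `clusters`.

    similarity: dict mapping an unordered pair (u, v) to a similarity score;
                pairs that are not listed default to 0.0 (an assumption).
    clusters:   iterable of disjoint sets of node names.
    threshold:  a pair is an edge when its similarity exceeds this value.
    """
    # Map every node to the index of the cluster that contains it.
    label = {}
    for idx, cluster in enumerate(clusters):
        for node in cluster:
            label[node] = idx

    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold
        same_cluster = label[u] == label[v]
        # An edge between clusters must be deleted; a missing edge
        # inside a cluster must be added. Both cost |s - threshold|.
        if has_edge != same_cluster:
            cost += abs(s - threshold)
    return cost


# Hypothetical toy data: four sequences and their pairwise similarities.
similarity = {("a", "b"): 9.0, ("b", "c"): 8.0, ("a", "c"): 2.0, ("c", "d"): 6.0}
cost_three_plus_one = editing_cost(similarity, [{"a", "b", "c"}, {"d"}], 5.0)
cost_two_plus_two = editing_cost(similarity, [{"a", "b"}, {"c", "d"}], 5.0)
```

Under this model, the clustering with the lower cost is the better transitive approximation of the thresholded similarity graph, so candidate clusterings of the same data can be compared by their cost.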

Cite this

Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1).
Access Level: Open Access


22 Citations in Europe PMC

Data provided by Europe PubMed Central.

Evaluation and improvements of clustering algorithms for detecting remote homologous protein families.
Bernardes JS, Vieira FR, Costa LM, Zaverucha G., BMC Bioinformatics 16, 2015
PMID: 25651949
Networks' Characteristics Matter for Systems Biology.
Rider AK, Milenkovic T, Siwo GH, Pinapati RS, Emrich SJ, Ferdig MT, Chawla NV., Netw Sci (Camb Univ Press) 2(2), 2014
PMID: 26500772
Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szoke S, Wiwie C, Baumbach J, Cardinali G, Rottger R, Robert V., Sci Rep 4, 2014
PMID: 25355642
GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res. 40(19), 2012
PMID: 22790981
DEFOG: discrete enrichment of functionally organized genes.
Wittkop T, Berman AE, Fleisch KM, Mooney SD., Integr Biol (Camb) 4(7), 2012
PMID: 22706384
clusterMaker: a multi-algorithm clustering plugin for Cytoscape.
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE., BMC Bioinformatics 12, 2011
PMID: 22070249
Genome sequence of a mesophilic hydrogenotrophic methanogen Methanocella paludicola, the first cultivated representative of the order Methanocellales.
Sakai S, Takaki Y, Shimamura S, Sekine M, Tajima T, Kosugi H, Ichikawa N, Tasumi E, Hiraki AT, Shimizu A, Kato Y, Nishiko R, Mori K, Fujita N, Imachi H, Takai K., PLoS ONE 6(7), 2011
PMID: 21829548
Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis.
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M., BMC Bioinformatics 12, 2011
PMID: 21612636
Ultra-fast sequence clustering from similarity networks with SiLiX.
Miele V, Penel S, Duret L., BMC Bioinformatics 12, 2011
PMID: 21513511
Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Bocker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810
Discovery and annotation of small proteins using genomics, proteomics, and computational approaches.
Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA., Genome Res. 21(4), 2011
PMID: 21367939
PolyQ: a database describing the sequence and domain context of polyglutamine repeats in proteins.
Robertson AL, Bate MA, Androulakis SG, Bottomley SP, Buckle AM., Nucleic Acids Res. 39(Database issue), 2011
PMID: 21059684
Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS ONE 5(10), 2010
PMID: 20976221
Partitioning biological data with transitivity clustering.
Wittkop T, Emig D, Lange S, Rahmann S, Albrecht M, Morris JH, Bocker S, Stoye J, Baumbach J., Nat. Methods 7(6), 2010
PMID: 20508635
Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015
Genetic makeup of the Corynebacterium glutamicum LexA regulon deduced from comparative transcriptomics and in vitro DNA band shift assays.
Jochmann N, Kurze AK, Czaja LF, Brinkrolf K, Brune I, Huser AT, Hansmeier N, Puhler A, Borovok I, Tauch A., Microbiology (Reading, Engl.) 155(Pt 5), 2009
PMID: 19372162
Towards the integrated analysis, visualization and reconstruction of microbial gene regulatory networks.
Baumbach J, Tauch A, Rahmann S., Brief. Bioinformatics 10(1), 2009
PMID: 19074493
Force feature spaces for visualization and classification.
Veljkovic D, Robbins KA., Int Conf Digit Signal Process Proc 2008, 2008
PMID: 20676225



Sources

PMID: 17941985