Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

Wittkop T, Baumbach J, Lobo FP, Rahmann S (2007)
BMC Bioinformatics 8(1): 396.

Journal article | Published | English
Full text available for this record
Abstract / Note
Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that must be clustered efficiently, at increased speed and with constant or increased accuracy. Results: We argue that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
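To make the objective behind weighted cluster editing concrete, the following is a minimal sketch of how the edit cost of a candidate clustering could be scored. It is not the FORCE layout heuristic itself, and it is not taken from the FORCE source code. It assumes the usual formulation in which two objects are joined by an edge when their similarity exceeds a threshold, and in which adding or deleting an edge costs the absolute difference between similarity and threshold. The function name `cluster_editing_cost`, the dictionary-based input, the default similarity of 0.0 for unlisted pairs, and the toy values are all illustrative assumptions.

```python
from itertools import combinations


def cluster_editing_cost(similarity, clusters, threshold):
    """Score a clustering under a weighted cluster editing objective.

    Assumptions (not from the record above): an edge exists between u and v
    when similarity > threshold, and changing the edge state of a pair costs
    |similarity - threshold|. Pairs missing from `similarity` count as 0.0.
    """
    # Map every object to the index of the cluster that contains it.
    cluster_of = {
        node: idx for idx, members in enumerate(clusters) for node in members
    }
    cost = 0.0
    for u, v in combinations(list(cluster_of), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold
        same_cluster = cluster_of[u] == cluster_of[v]
        # A cost is paid when an edge crosses clusters (deletion) or when
        # an edge is missing inside a cluster (insertion).
        if has_edge != same_cluster:
            cost += abs(s - threshold)
    return cost


# Toy similarity values, purely illustrative.
similarity = {
    ("a", "b"): 9.0,
    ("b", "c"): 8.0,
    ("a", "c"): 3.0,
    ("c", "d"): 5.5,
}
threshold = 5.0

candidate_1 = [{"a", "b", "c"}, {"d"}]
candidate_2 = [{"a", "b"}, {"c", "d"}]
print(cluster_editing_cost(similarity, candidate_1, threshold))
print(cluster_editing_cost(similarity, candidate_2, threshold))
```

A heuristic for this model searches for the clustering with the lowest such cost. In the toy data, the first candidate is cheaper because it only inserts the weak a-c edge and deletes the barely above-threshold c-d edge.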
Year of publication
2007
Journal title
BMC Bioinformatics
Volume
8
Issue
1
Page
396
ISSN
PUB-ID

Cite

Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1):396.
Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1), 396. doi:10.1186/1471-2105-8-396
Wittkop, T., Baumbach, J., Lobo, F. P., and Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics 8, 396.
Wittkop, T., et al., 2007. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1), p 396.
T. Wittkop, et al., “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”, BMC Bioinformatics, vol. 8, 2007, pp. 396.
Wittkop, T., Baumbach, J., Lobo, F.P., Rahmann, S.: Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 8, 396 (2007).
Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8.1 (2007): 396.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Full text(s)
Access Level
OA Open Access

24 citations in Europe PMC

Data provided by Europe PubMed Central.

Evaluation and improvements of clustering algorithms for detecting remote homologous protein families.
Bernardes JS, Vieira FR, Costa LM, Zaverucha G., BMC Bioinformatics 16, 2015
PMID: 25651949
Comparing the performance of biomedical clustering methods.
Wiwie C, Baumbach J, Röttger R., Nat Methods 12(11), 2015
PMID: 26389570
Networks' Characteristics Matter for Systems Biology.
Rider AK, Milenković T, Siwo GH, Pinapati RS, Emrich SJ, Ferdig MT, Chawla NV., Netw Sci (Camb Univ Press) 2(2), 2014
PMID: 26500772
Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V., Sci Rep 4, 2014
PMID: 25355642
A laboratory information management system for DNA barcoding workflows.
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V., Integr Biol (Camb) 4(7), 2012
PMID: 22344310
DEFOG: discrete enrichment of functionally organized genes.
Wittkop T, Berman AE, Fleisch KM, Mooney SD., Integr Biol (Camb) 4(7), 2012
PMID: 22706384
GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res 40(19), 2012
PMID: 22790981
PolyQ: a database describing the sequence and domain context of polyglutamine repeats in proteins.
Robertson AL, Bate MA, Androulakis SG, Bottomley SP, Buckle AM., Nucleic Acids Res 39(database issue), 2011
PMID: 21059684
Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810
Discovery and annotation of small proteins using genomics, proteomics, and computational approaches.
Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA., Genome Res 21(4), 2011
PMID: 21367939
Ultra-fast sequence clustering from similarity networks with SiLiX.
Miele V, Penel S, Duret L., BMC Bioinformatics 12, 2011
PMID: 21513511
Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis.
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M., BMC Bioinformatics 12, 2011
PMID: 21612636
Genome sequence of a mesophilic hydrogenotrophic methanogen Methanocella paludicola, the first cultivated representative of the order Methanocellales.
Sakai S, Takaki Y, Shimamura S, Sekine M, Tajima T, Kosugi H, Ichikawa N, Tasumi E, Hiraki AT, Shimizu A, Kato Y, Nishiko R, Mori K, Fujita N, Imachi H, Takai K., PLoS One 6(7), 2011
PMID: 21829548
clusterMaker: a multi-algorithm clustering plugin for Cytoscape.
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE., BMC Bioinformatics 12, 2011
PMID: 22070249
Partitioning biological data with transitivity clustering.
Wittkop T, Emig D, Lange S, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J., Nat Methods 7(6), 2010
PMID: 20508635
Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS One 5(10), 2010
PMID: 20976221
Genetic makeup of the Corynebacterium glutamicum LexA regulon deduced from comparative transcriptomics and in vitro DNA band shift assays.
Jochmann N, Kurze AK, Czaja LF, Brinkrolf K, Brune I, Hüser AT, Hansmeier N, Pühler A, Borovok I, Tauch A., Microbiology 155(pt 5), 2009
PMID: 19372162
Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015
Force feature spaces for visualization and classification.
Veljkovic D, Robbins KA., Int Conf Digit Signal Process Proc 2008, 2008
PMID: 20676225


Sources

PMID: 17941985