Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing
Wittkop T, Baumbach J, Lobo FP, Rahmann S (2007)
BMC Bioinformatics 8(1): 396.
Journal article
| Published | English
Author(s)
Wittkop, Tobias;
Baumbach, Jan;
Lobo, Francisco P.;
Rahmann, Sven
Abstract / Note
Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed.
Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.
Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
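The abstract names weighted cluster editing as the underlying model but does not spell out the objective. As a minimal sketch, assuming the common formulation in which a pair counts as an edge when its similarity exceeds a threshold, and in which deleting or inserting an edge costs the distance of that similarity from the threshold, the cost of a candidate clustering could be computed as below. The function name, the example similarities, the threshold, and the choice to treat missing pairs as similarity 0 are all illustrative assumptions; this is not the FORCE heuristic itself, which additionally relies on a graph layout.

```python
from itertools import combinations


def cluster_editing_cost(similarity, threshold, clusters):
    """Cost of editing the thresholded similarity graph into the given clusters.

    similarity: dict mapping an unordered pair (u, v) to a similarity value.
    threshold:  pairs with similarity above this value count as edges (assumption).
    clusters:   list of disjoint sets covering all objects.
    """
    # Map every object to the index of the cluster that contains it.
    label = {}
    for idx, cluster in enumerate(clusters):
        for obj in cluster:
            label[obj] = idx

    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        # Pairs without a recorded similarity are treated as 0.0 (assumption).
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        same_cluster = label[u] == label[v]
        if same_cluster and s < threshold:
            cost += threshold - s  # a missing edge has to be inserted
        elif not same_cluster and s > threshold:
            cost += s - threshold  # an existing edge has to be deleted
    return cost


# Hypothetical similarities for four objects; not data from the paper.
similarity = {
    ("a", "b"): 9.0,
    ("a", "c"): 8.0,
    ("b", "c"): 2.0,
    ("c", "d"): 6.0,
}
threshold = 5.0

cost_one = cluster_editing_cost(similarity, threshold, [{"a", "b", "c"}, {"d"}])
cost_two = cluster_editing_cost(similarity, threshold, [{"a", "b"}, {"c", "d"}])
print(cost_one, cost_two)
```

Under this cost model, a cluster editing method searches for the partition with the lowest total cost; the sketch only scores two fixed partitions and makes no attempt to search.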
Publication year
2007
Journal title
BMC Bioinformatics
Volume
8
Issue
1
Article no.
396
ISSN
1471-2105
Page URI
https://pub.uni-bielefeld.de/record/1784011
Cite
Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1): 396.
Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1), 396. https://doi.org/10.1186/1471-2105-8-396
Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. 2007. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8 (1): 396.
Wittkop, T., Baumbach, J., Lobo, F. P., and Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics 8:396.
Wittkop, T., et al., 2007. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1): 396.
T. Wittkop, et al., “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”, BMC Bioinformatics, vol. 8, 2007: 396.
Wittkop, T., Baumbach, J., Lobo, F.P., Rahmann, S.: Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 8: 396 (2007).
Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8.1 (2007): 396.
All files available under the following license(s):
Copyright Statement:
This object is protected by copyright and/or related rights. [...]
Full text(s)
Name
Access Level
Open Access
Last uploaded
2019-09-06T08:48:53Z
MD5 checksum
c3b69cf1dee5125ff903c896dd7c7fb9
Sources
PMID: 17941985