Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

Wittkop, Tobias; Baumbach, Jan; Lobo, Francisco P.; Rahmann, Sven

Wittkop T, Baumbach J, Lobo FP, Rahmann S (2007)
BMC Bioinformatics 8(1): 396.

Journal article | Published | English

DOI

https://doi.org/10.1186/1471-2105-8-396

URN

urn:nbn:de:0070-pub-17840118

Author(s)

Wittkop, Tobias; Baumbach, Jan; Lobo, Francisco P.; Rahmann, Sven

Institution

Centrum für Biotechnologie

Abstract / Note

Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that have to be clustered efficiently, with constant or increased accuracy and at increased speed.

Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet.

Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets, is available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
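To make the weighted cluster editing model named in the abstract more concrete, here is a minimal Python sketch of its objective function. It is not the FORCE implementation (which is written in Java and uses a layout-based heuristic to search for good clusterings); it only scores a given clustering. It assumes the common formulation in which pairs with similarity above a threshold count as edges, and inserting or deleting an edge costs the absolute difference between its similarity and the threshold. The function name `clustering_cost`, the toy proteins A to D, their similarity values, and the threshold are all hypothetical.

```python
from itertools import combinations


def clustering_cost(similarity, threshold, clusters):
    """Edit cost of turning the thresholded similarity graph into `clusters`.

    similarity: dict mapping frozenset({u, v}) to a pairwise similarity score
    threshold:  pairs scoring above this value are treated as edges (assumption)
    clusters:   iterable of disjoint sets of node names
    """
    # Map every node to the index of the cluster that contains it.
    cluster_of = {}
    for index, members in enumerate(clusters):
        for node in members:
            cluster_of[node] = index

    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        # Pairs without a recorded score are assumed to have similarity 0.
        s = similarity.get(frozenset((u, v)), 0.0)
        is_edge = s > threshold
        same_cluster = cluster_of[u] == cluster_of[v]
        # An edge between clusters must be deleted; a missing edge inside
        # a cluster must be inserted. Either edit costs |s - threshold|.
        if is_edge != same_cluster:
            cost += abs(s - threshold)
    return cost


# Hypothetical toy data: four proteins with made-up similarity scores.
toy_similarity = {
    frozenset(("A", "B")): 9,
    frozenset(("A", "C")): 8,
    frozenset(("B", "C")): 4,
    frozenset(("C", "D")): 6,
}

cost_abc_d = clustering_cost(toy_similarity, 5, [{"A", "B", "C"}, {"D"}])
cost_ab_cd = clustering_cost(toy_similarity, 5, [{"A", "B"}, {"C", "D"}])
print(cost_abc_d, cost_ab_cd)
```

In this toy example, grouping A, B, and C together is cheaper than splitting into {A, B} and {C, D}, because it needs only two low-cost edits (insert B-C, delete C-D) rather than deleting the strong A-C edge. A heuristic such as FORCE searches for a clustering with low total cost; the sketch above only evaluates candidates.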

Year of publication

2007

Journal title

BMC Bioinformatics

Volume

8

Issue

1

Article no.

396

ISSN

1471-2105

Page URI

https://pub.uni-bielefeld.de/record/1784011

Cite

Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1): 396.

Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1), 396. https://doi.org/10.1186/1471-2105-8-396

Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. 2007. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8 (1): 396.

Wittkop, T., Baumbach, J., Lobo, F. P., and Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics 8:396.

Wittkop, T., et al., 2007. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1): 396.

T. Wittkop, et al., “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”, BMC Bioinformatics, vol. 8, 2007, 396.

Wittkop, T., Baumbach, J., Lobo, F.P., Rahmann, S.: Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 8, 396 (2007).

Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8.1 (2007): 396.

All files available under the following license(s):

Copyright Statement:

This object is protected by copyright and/or related rights. [...]

Full text(s)

Name

baumb.pdf

Access Level

Open Access

Last uploaded

2019-09-06T08:48:53Z

MD5 checksum

c3b69cf1dee5125ff903c896dd7c7fb9

25 citations in Europe PMC

Data provided by Europe PubMed Central.

Guiding biomedical clustering with ClustEval.
Wiwie C, Baumbach J, Röttger R., Nat Protoc 13(6), 2018
PMID: 29844526

Evaluation and improvements of clustering algorithms for detecting remote homologous protein families.
Bernardes JS, Vieira FR, Costa LM, Zaverucha G., BMC Bioinformatics 16, 2015
PMID: 25651949

Comparing the performance of biomedical clustering methods.
Wiwie C, Baumbach J, Röttger R., Nat Methods 12(11), 2015
PMID: 26389570

Networks' Characteristics Matter for Systems Biology.
Rider AK, Milenković T, Siwo GH, Pinapati RS, Emrich SJ, Ferdig MT, Chawla NV., Netw Sci (Camb Univ Press) 2(2), 2014
PMID: 26500772

Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V., Sci Rep 4, 2014
PMID: 25355642

A laboratory information management system for DNA barcoding workflows.
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V., Integr Biol (Camb) 4(7), 2012
PMID: 22344310

DEFOG: discrete enrichment of functionally organized genes.
Wittkop T, Berman AE, Fleisch KM, Mooney SD., Integr Biol (Camb) 4(7), 2012
PMID: 22706384

GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res 40(19), 2012
PMID: 22790981

PolyQ: a database describing the sequence and domain context of polyglutamine repeats in proteins.
Robertson AL, Bate MA, Androulakis SG, Bottomley SP, Buckle AM., Nucleic Acids Res 39(database issue), 2011
PMID: 21059684

Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution.
Apeltsin L, Morris JH, Babbitt PC, Ferrin TE., Bioinformatics 27(3), 2011
PMID: 21118823

Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810

Discovery and annotation of small proteins using genomics, proteomics, and computational approaches.
Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA., Genome Res 21(4), 2011
PMID: 21367939

Ultra-fast sequence clustering from similarity networks with SiLiX.
Miele V, Penel S, Duret L., BMC Bioinformatics 12, 2011
PMID: 21513511

Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis.
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M., BMC Bioinformatics 12, 2011
PMID: 21612636

Genome sequence of a mesophilic hydrogenotrophic methanogen Methanocella paludicola, the first cultivated representative of the order Methanocellales.
Sakai S, Takaki Y, Shimamura S, Sekine M, Tajima T, Kosugi H, Ichikawa N, Tasumi E, Hiraki AT, Shimizu A, Kato Y, Nishiko R, Mori K, Fujita N, Imachi H, Takai K., PLoS One 6(7), 2011
PMID: 21829548

clusterMaker: a multi-algorithm clustering plugin for Cytoscape.
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE., BMC Bioinformatics 12, 2011
PMID: 22070249

Partitioning biological data with transitivity clustering.
Wittkop T, Emig D, Lange S, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J., Nat Methods 7(6), 2010
PMID: 20508635

On the power and limits of evolutionary conservation--unraveling bacterial gene regulatory networks.
Baumbach J., Nucleic Acids Res 38(22), 2010
PMID: 20699275

Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS One 5(10), 2010
PMID: 20976221

Towards the integrated analysis, visualization and reconstruction of microbial gene regulatory networks.
Baumbach J, Tauch A, Rahmann S., Brief Bioinform 10(1), 2009
PMID: 19074493

Integrated analysis and reconstruction of microbial transcriptional gene regulatory networks using CoryneRegNet.
Baumbach J, Wittkop T, Kleindt CK, Tauch A., Nat Protoc 4(6), 2009
PMID: 19498379

Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms.
Baumbach J, Rahmann S, Tauch A., BMC Syst Biol 3, 2009
PMID: 19146695

Genetic makeup of the Corynebacterium glutamicum LexA regulon deduced from comparative transcriptomics and in vitro DNA band shift assays.
Jochmann N, Kurze AK, Czaja LF, Brinkrolf K, Brune I, Hüser AT, Hansmeier N, Pühler A, Borovok I, Tauch A., Microbiology 155(pt 5), 2009
PMID: 19372162

Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015

Force feature spaces for visualization and classification.
Veljkovic D, Robbins KA., Int Conf Digit Signal Process Proc 2008, 2008
PMID: 20676225

Sources

PMID: 17941985

PUB - Publications at Bielefeld University