Evaluation data for "On the use of sequence-quality information in OTU clustering"

License: https://creativecommons.org/licenses/by-nc/4.0/

Müller, Robert; Nebel, Markus

Müller R, Nebel M (2021)
Bielefeld University.

Data publication

DOI

https://doi.org/10.4119/unibi/2951742

Creator

Müller, Robert (UniBi); Nebel, Markus (UniBi)

Institution

Technische Fakultät > AG Algorithmik und Bioinformatik

Abstract / Note

Prepared data sets and aggregated results of the comparative evaluation of [GeFaST](https://github.com/romueller/gefast)'s quality-aware clustering and refinement methods performed in "On the use of sequence-quality information in OTU clustering" (submitted).

GeFaST is compared to DADA2, USEARCH, VSEARCH, UPARSE and Swarm on two collections of data sets described in [*DADA2: High-resolution sample inference from Illumina amplicon data*](https://doi.org/10.1038/nmeth.3869) (Callahan et al.) and [*Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering*](https://doi.org/10.1186/s40168-015-0105-6) (Franzén et al.).

The provided files make it possible to repeat the evaluation, either by rerunning the tools or by reanalysing the existing results only. The evaluation repository is available [here](https://github.com/romueller/gefast-qa-evaluation).

*Input files:*

**callahan_data.tar.bz2**: Read files and ground truths of the Callahan data sets used by GeFaST, USEARCH, VSEARCH, UPARSE and Swarm. This archive should be extracted in the `analyses/model_supported_callahan/`, `analyses/quality_weighted_callahan/`, `analyses/swarm_callahan/`, `analyses/uvsearch_callahan/` and `analyses/performance/` subfolders of the evaluation repository.

**dada2_callahan_data.tar.bz2**: Read files and ground truths of the Callahan data sets used by DADA2. This archive should be extracted in the `analyses/dada2_callahan/` subfolder of the evaluation repository. The workflow of DADA2 differs from that of the other examined tools and thus requires slightly different ground-truth files to assess the quality of the reconstructed clusters.

**franzen_data.tar.bz2**: Read files and ground truths of the Franzén data sets used by all tools. This archive should be extracted in the `analyses/dada2_franzen/`, `analyses/model_supported_franzen/`, `analyses/quality_weighted_franzen/`, `analyses/swarm_franzen/` and `analyses/uvsearch_franzen/` subfolders of the evaluation repository. Since the origin of the *in silico* sequenced amplicons is known, the same ground truths can also be used for DADA2.
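The extraction of the input archives into their target subfolders can be sketched in Python as follows. This is only an illustration, not part of the evaluation repository: the names `INPUT_ARCHIVE_TARGETS` and `extract_input_archive` are hypothetical, the mapping is copied from the descriptions above, and it is assumed that each archive is unpacked directly inside every listed subfolder of a local clone of the repository.

```python
import tarfile
from pathlib import Path

# Archive name -> target subfolders, as listed in the descriptions above.
INPUT_ARCHIVE_TARGETS = {
    "callahan_data.tar.bz2": [
        "analyses/model_supported_callahan",
        "analyses/quality_weighted_callahan",
        "analyses/swarm_callahan",
        "analyses/uvsearch_callahan",
        "analyses/performance",
    ],
    "dada2_callahan_data.tar.bz2": ["analyses/dada2_callahan"],
    "franzen_data.tar.bz2": [
        "analyses/dada2_franzen",
        "analyses/model_supported_franzen",
        "analyses/quality_weighted_franzen",
        "analyses/swarm_franzen",
        "analyses/uvsearch_franzen",
    ],
}


def extract_input_archive(archive_path, repo_root):
    """Extract one input archive into every subfolder listed for it.

    archive_path: path to a downloaded archive (its file name selects the targets).
    repo_root: path to a local clone of the evaluation repository (assumption).
    Returns the list of directories the archive was extracted into.
    """
    archive_path = Path(archive_path)
    extracted = []
    for relative in INPUT_ARCHIVE_TARGETS[archive_path.name]:
        destination = Path(repo_root) / relative
        destination.mkdir(parents=True, exist_ok=True)
        # Re-open the archive for each target so every extraction starts fresh.
        with tarfile.open(archive_path, mode="r:bz2") as tar:
            tar.extractall(path=destination)
        extracted.append(destination)
    return extracted
```

A call such as `extract_input_archive("franzen_data.tar.bz2", "gefast-qa-evaluation")` would then unpack the archive five times, once per listed subfolder; the clone directory name is an assumption.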

*Aggregated results:*

**dada2_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of DADA2 on the Callahan data sets. This archive should be extracted in the `analyses/dada2_callahan/` subfolder of the evaluation repository.

**dada2_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of DADA2 on the Franzén data sets. This archive should be extracted in the `analyses/dada2_franzen/` subfolder of the evaluation repository.

**model_supported_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of the model-supported clustering and refinement methods of GeFaST on the Callahan data sets. This archive should be extracted in the `analyses/model_supported_callahan/` subfolder of the evaluation repository.

**model_supported_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of the model-supported clustering and refinement methods of GeFaST on the Franzén data sets. This archive should be extracted in the `analyses/model_supported_franzen/` subfolder of the evaluation repository.

**performance_evaluation.tar.bz2**: Aggregated results (clustering quality, runtime, memory consumption) of the different runs of GeFaST, USEARCH, VSEARCH, UPARSE and Swarm on the largest Callahan data set (hmp_single). This archive should be extracted in the `analyses/performance/` subfolder of the evaluation repository.

**quality_weighted_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of GeFaST involving quality-weighted alignments on the Callahan data sets. This archive should be extracted in the `analyses/quality_weighted_callahan/` subfolder of the evaluation repository.

**quality_weighted_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of GeFaST involving quality-weighted alignments on the Franzén data sets. This archive should be extracted in the `analyses/quality_weighted_franzen/` subfolder of the evaluation repository.

**swarm_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of Swarm on the Callahan data sets. This archive should be extracted in the `analyses/swarm_callahan/` subfolder of the evaluation repository.

**swarm_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of Swarm on the Franzén data sets. This archive should be extracted in the `analyses/swarm_franzen/` subfolder of the evaluation repository.

**uvsearch_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of USEARCH, VSEARCH and UPARSE on the Callahan data sets. This archive should be extracted in the `analyses/uvsearch_callahan/` subfolder of the evaluation repository.

**uvsearch_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of USEARCH, VSEARCH and UPARSE on the Franzén data sets. This archive should be extracted in the `analyses/uvsearch_franzen/` subfolder of the evaluation repository.
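The eleven aggregated-result archives above follow one naming pattern: an archive named `<name>_evaluation.tar.bz2` belongs in the `analyses/<name>/` subfolder. A minimal Python sketch of that rule is given below; the function name `evaluation_target` is hypothetical and the pattern is inferred from the list above, not taken from the evaluation repository.

```python
EVALUATION_SUFFIX = "_evaluation.tar.bz2"


def evaluation_target(archive_name):
    """Return the repository subfolder for an aggregated-result archive.

    The rule (strip the common suffix, prepend 'analyses/') is inferred
    from the archive descriptions and is an assumption of this sketch.
    """
    if not archive_name.endswith(EVALUATION_SUFFIX):
        raise ValueError(f"not an aggregated-result archive: {archive_name}")
    stem = archive_name[: -len(EVALUATION_SUFFIX)]
    return f"analyses/{stem}/"
```

For instance, `evaluation_target("performance_evaluation.tar.bz2")` yields `analyses/performance/`, matching the description of that archive. The input archives (`*_data.tar.bz2`) do not follow this pattern and are rejected.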

Year of publication

2021

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Page URI

https://pub.uni-bielefeld.de/record/2951742

Cite

Müller R, Nebel M. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University; 2021.

Müller, R., & Nebel, M. (2021). Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University. https://doi.org/10.4119/unibi/2951742

Müller, Robert, and Nebel, Markus. 2021. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University.

Müller, R., and Nebel, M. (2021). Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University.

Müller, R., & Nebel, M., 2021. Evaluation data for "On the use of sequence-quality information in OTU clustering", Bielefeld University.

R. Müller and M. Nebel, Evaluation data for "On the use of sequence-quality information in OTU clustering", Bielefeld University, 2021.

Müller, R., Nebel, M.: Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University (2021).

Müller, Robert, and Nebel, Markus. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University, 2021.

All files are available under the following license(s):

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):

https://creativecommons.org/licenses/by-nc/4.0/deed.de
https://creativecommons.org/licenses/by-nc/4.0/legalcode.de

Full text(s)

| Name | Size | Access level | Last uploaded | MD5 checksum |
|---|---|---|---|---|
| callahan_data.tar.bz2 | 212.94 MB | Open Access | 2021-03-18T14:39:25Z | ef5278068515d5c7baebf4e3d5d35f49 |
| dada2_callahan_data.tar.bz2 | 219.35 MB | Open Access | 2021-03-18T14:39:55Z | d0910082a80737803ad0698260af9d91 |
| franzen_data.tar.bz2 | 80.02 MB | Open Access | 2021-03-18T14:40:06Z | 8b533b2bba640c7633a891e0ac421601 |
| dada2_callahan_evaluation.tar.bz2 | 1.52 KB | Open Access | 2021-03-18T14:40:23Z | 9e7c61047f4b03aab0707814885e2c4f |
| dada2_franzen_evaluation.tar.bz2 | 6.59 KB | Open Access | 2021-03-18T14:40:34Z | ec847fa4038a82f0aadfa6ae8a0adb55 |
| model_supported_callahan_evaluation.tar.bz2 | 106.54 KB | Open Access | 2021-03-18T14:40:53Z | 3dbf0466213d488bfb8a8eab87b623a0 |
| model_supported_franzen_evaluation.tar.bz2 | 1.03 MB | Open Access | 2021-03-18T14:41:03Z | 667df48fe99b3e58dcf13fee3117ac43 |
| performance_evaluation.tar.bz2 | 22.82 KB | Open Access | 2021-03-18T14:41:26Z | a6c0f781d9932e513f7bcbbfce8635e1 |
| quality_weighted_callahan_evaluation.tar.bz2 | 4.91 MB | Open Access | 2021-03-18T14:41:37Z | 1ffb13bda81837ec5d12cd4994690980 |
| quality_weighted_franzen_evaluation.tar.bz2 | 29.28 MB | Open Access | 2021-03-18T14:41:48Z | 35e5a9d2e63ff92da7f4a338fe5aa3d3 |
| swarm_callahan_evaluation.tar.bz2 | 3.50 KB | Open Access | 2021-03-18T14:41:54Z | 91ad4a7209028bba429ad909b03fbb26 |
| swarm_franzen_evaluation.tar.bz2 | 37.71 KB | Open Access | 2021-03-18T14:42:02Z | 8b312fe3d02bd994aa2874ba7a4246c7 |
| uvsearch_callahan_evaluation.tar.bz2 | 15.49 KB | Open Access | 2021-03-18T14:42:11Z | 88f67d2aee0dd153788ea3970ff3c26c |
| uvsearch_franzen_evaluation.tar.bz2 | 214.92 KB | Open Access | 2021-03-18T14:42:20Z | 3ef373193d6b39a4fbb065bd6b46f659 |
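
The MD5 checksums listed above can be used to check that a downloaded archive is intact before extracting it. The following Python sketch shows one way to do this with the standard library; the function names `md5_of_file` and `matches_checksum` and the local file path in the usage note are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks
    so that large archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming `callahan_data.tar.bz2` was saved to the current directory, `matches_checksum("callahan_data.tar.bz2", "ef5278068515d5c7baebf4e3d5d35f49")` should return `True` for an undamaged download. MD5 is suitable here for detecting transfer errors, not as protection against deliberate tampering.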
