Evaluation data for "On the use of sequence-quality information in OTU clustering"
Müller R, Nebel M (2021)
Bielefeld University.
Data publication
Abstract / Note
Prepared data sets and aggregated results of the comparative evaluation of [GeFaST](https://github.com/romueller/gefast)'s quality-aware clustering and refinement methods performed in "On the use of sequence-quality information in OTU clustering" (submitted).
GeFaST is compared to DADA2, USEARCH, VSEARCH, UPARSE and Swarm on two collections of data sets described in [*DADA2: High-resolution sample inference from Illumina amplicon data*](https://doi.org/10.1038/nmeth.3869) (Callahan et al.) and [*Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering*](https://doi.org/10.1186/s40168-015-0105-6) (Franzén et al.).
The provided files make it possible to repeat the evaluation, either by rerunning the tools or by just reanalysing the results. The evaluation repository is available [here](https://github.com/romueller/gefast-qa-evaluation).
*Input files:*
**callahan_data.tar.bz2**: Read files and ground truths of the Callahan data sets used by GeFaST, USEARCH, VSEARCH, UPARSE and Swarm. This archive should be extracted in the `analyses/model_supported_callahan/`, `analyses/quality_weighted_callahan/`, `analyses/swarm_callahan/`, `analyses/uvsearch_callahan/` and `analyses/performance/` subfolders of the evaluation repository.
**dada2_callahan_data.tar.bz2**: Read files and ground truths of the Callahan data sets used by DADA2. This archive should be extracted in the `analyses/dada2_callahan/` subfolder of the evaluation repository. The workflow of DADA2 differs from that of the other examined tools and therefore requires slightly different ground-truth files to assess the quality of the reconstructed clusters.
**franzen_data.tar.bz2**: Read files and ground truths of the Franzén data sets used by all tools. This archive should be extracted in the `analyses/dada2_franzen/`, `analyses/model_supported_franzen/`, `analyses/quality_weighted_franzen/`, `analyses/swarm_franzen/` and `analyses/uvsearch_franzen/` subfolders of the evaluation repository. Since the origin of *in silico* sequenced amplicons is known, the ground truths can also be used for DADA2.
*Aggregated results:*
**dada2_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of DADA2 on the Callahan data sets. This archive should be extracted in the `analyses/dada2_callahan/` subfolder of the evaluation repository.
**dada2_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of DADA2 on the Franzén data sets. This archive should be extracted in the `analyses/dada2_franzen/` subfolder of the evaluation repository.
**model_supported_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of the model-supported clustering and refinement methods of GeFaST on the Callahan data sets. This archive should be extracted in the `analyses/model_supported_callahan/` subfolder of the evaluation repository.
**model_supported_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of the model-supported clustering and refinement methods of GeFaST on the Franzén data sets. This archive should be extracted in the `analyses/model_supported_franzen/` subfolder of the evaluation repository.
**performance_evaluation.tar.bz2**: Aggregated results (clustering quality, runtime, memory consumption) of the different runs of GeFaST, USEARCH, VSEARCH, UPARSE and Swarm on the largest Callahan data set (hmp_single). This archive should be extracted in the `analyses/performance/` subfolder of the evaluation repository.
**quality_weighted_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of GeFaST involving quality-weighted alignments on the Callahan data sets. This archive should be extracted in the `analyses/quality_weighted_callahan/` subfolder of the evaluation repository.
**quality_weighted_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of GeFaST involving quality-weighted alignments on the Franzén data sets. This archive should be extracted in the `analyses/quality_weighted_franzen/` subfolder of the evaluation repository.
**swarm_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of Swarm on the Callahan data sets. This archive should be extracted in the `analyses/swarm_callahan/` subfolder of the evaluation repository.
**swarm_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of Swarm on the Franzén data sets. This archive should be extracted in the `analyses/swarm_franzen/` subfolder of the evaluation repository.
**uvsearch_callahan_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of USEARCH, VSEARCH and UPARSE on the Callahan data sets. This archive should be extracted in the `analyses/uvsearch_callahan/` subfolder of the evaluation repository.
**uvsearch_franzen_evaluation.tar.bz2**: Aggregated results (clustering quality, number of clusters) of the different runs of USEARCH, VSEARCH and UPARSE on the Franzén data sets. This archive should be extracted in the `analyses/uvsearch_franzen/` subfolder of the evaluation repository.
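The placement of the archives described above can be scripted. The following is a minimal sketch, not part of the evaluation repository: the function names `target_dirs` and `extract_archive` are hypothetical, and it assumes that each archive is meant to be unpacked directly inside the listed subfolder (the internal layout of the archives is not described here). The target folders of the input archives are copied from the descriptions above; for the `*_evaluation.tar.bz2` archives, the folder name matches the archive name without its suffix.

```python
import tarfile
from pathlib import Path

# Target subfolders of the three input archives, as stated in the descriptions.
INPUT_TARGETS = {
    "callahan_data.tar.bz2": [
        "analyses/model_supported_callahan",
        "analyses/quality_weighted_callahan",
        "analyses/swarm_callahan",
        "analyses/uvsearch_callahan",
        "analyses/performance",
    ],
    "dada2_callahan_data.tar.bz2": ["analyses/dada2_callahan"],
    "franzen_data.tar.bz2": [
        "analyses/dada2_franzen",
        "analyses/model_supported_franzen",
        "analyses/quality_weighted_franzen",
        "analyses/swarm_franzen",
        "analyses/uvsearch_franzen",
    ],
}

EVALUATION_SUFFIX = "_evaluation.tar.bz2"


def target_dirs(archive_name):
    """Return the repository subfolders an archive should be extracted in."""
    if archive_name in INPUT_TARGETS:
        return INPUT_TARGETS[archive_name]
    if archive_name.endswith(EVALUATION_SUFFIX):
        # e.g. swarm_franzen_evaluation.tar.bz2 -> analyses/swarm_franzen
        return ["analyses/" + archive_name[: -len(EVALUATION_SUFFIX)]]
    raise ValueError(f"unknown archive: {archive_name}")


def extract_archive(archive_path, repo_root):
    """Extract one archive into every target subfolder below repo_root."""
    archive_path = Path(archive_path)
    extracted = []
    for rel in target_dirs(archive_path.name):
        dest = Path(repo_root) / rel
        dest.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive_path, "r:bz2") as tar:
            tar.extractall(dest)
        extracted.append(dest)
    return extracted
```

For example, `extract_archive("franzen_data.tar.bz2", "gefast-qa-evaluation")` would unpack the archive five times, once per Franzén analysis folder, assuming the repository was cloned into a folder of that name.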
Year of publication
2021
Copyright and licenses
Page URI
https://pub.uni-bielefeld.de/record/2951742
Cite
Müller R, Nebel M. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University; 2021.
Müller, R., & Nebel, M. (2021). Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University. https://doi.org/10.4119/unibi/2951742
Müller, Robert, and Nebel, Markus. 2021. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University.
Müller, R., and Nebel, M. (2021). Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University.
Müller, R., & Nebel, M., 2021. Evaluation data for "On the use of sequence-quality information in OTU clustering", Bielefeld University.
R. Müller and M. Nebel, Evaluation data for "On the use of sequence-quality information in OTU clustering", Bielefeld University, 2021.
Müller, R., Nebel, M.: Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University (2021).
Müller, Robert, and Nebel, Markus. Evaluation data for "On the use of sequence-quality information in OTU clustering". Bielefeld University, 2021.
All files are available under the following license(s):
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Full text(s)
| Name | Size | Access level | Last uploaded | MD5 checksum |
|---|---|---|---|---|
| callahan_data.tar.bz2 | 212.94 MB | Open Access | 2021-03-18T14:39:25Z | ef5278068515d5c7baebf4e3d5d35f49 |
| dada2_callahan_data.tar.bz2 | 219.35 MB | Open Access | 2021-03-18T14:39:55Z | d0910082a80737803ad0698260af9d91 |
| franzen_data.tar.bz2 | 80.02 MB | Open Access | 2021-03-18T14:40:06Z | 8b533b2bba640c7633a891e0ac421601 |
| | | Open Access | 2021-03-18T14:40:23Z | 9e7c61047f4b03aab0707814885e2c4f |
| dada2_franzen_evaluation.tar.bz2 | 6.59 KB | Open Access | 2021-03-18T14:40:34Z | ec847fa4038a82f0aadfa6ae8a0adb55 |
| | | Open Access | 2021-03-18T14:40:53Z | 3dbf0466213d488bfb8a8eab87b623a0 |
| | | Open Access | 2021-03-18T14:41:03Z | 667df48fe99b3e58dcf13fee3117ac43 |
| performance_evaluation.tar.bz2 | 22.82 KB | Open Access | 2021-03-18T14:41:26Z | a6c0f781d9932e513f7bcbbfce8635e1 |
| | | Open Access | 2021-03-18T14:41:37Z | 1ffb13bda81837ec5d12cd4994690980 |
| | | Open Access | 2021-03-18T14:41:48Z | 35e5a9d2e63ff92da7f4a338fe5aa3d3 |
| | | Open Access | 2021-03-18T14:41:54Z | 91ad4a7209028bba429ad909b03fbb26 |
| swarm_franzen_evaluation.tar.bz2 | 37.71 KB | Open Access | 2021-03-18T14:42:02Z | 8b312fe3d02bd994aa2874ba7a4246c7 |
| uvsearch_callahan_evaluation.tar.bz2 | 15.49 KB | Open Access | 2021-03-18T14:42:11Z | 88f67d2aee0dd153788ea3970ff3c26c |
| uvsearch_franzen_evaluation.tar.bz2 | 214.92 KB | Open Access | 2021-03-18T14:42:20Z | 3ef373193d6b39a4fbb065bd6b46f659 |
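The MD5 checksums listed for the files can be used to check that a download is complete and unmodified before extracting it. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are hypothetical, and the dictionary only contains the three input archives whose names and checksums are stated together in this record.

```python
import hashlib

# Checksums of the input archives, copied from the file listing of this record.
EXPECTED_MD5 = {
    "callahan_data.tar.bz2": "ef5278068515d5c7baebf4e3d5d35f49",
    "dada2_callahan_data.tar.bz2": "d0910082a80737803ad0698260af9d91",
    "franzen_data.tar.bz2": "8b533b2bba640c7633a891e0ac421601",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

A call such as `verify("franzen_data.tar.bz2", EXPECTED_MD5["franzen_data.tar.bz2"])` returns `True` only if the local file matches the published checksum. Note that MD5 guards against transfer errors but is not a safeguard against deliberate tampering.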