De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing
Kang X (2024)
Bielefeld: Universität Bielefeld.
Bielefeld E-Dissertation | English
Abstract
In recent decades, rapid progress in DNA sequencing technologies has greatly advanced metagenomics research. However, a critical bottleneck in current bioinformatics is the assembly of metagenomes at strain-level resolution, which is indispensable for gaining deep insight into microbial communities and for addressing key biological questions. Despite the availability of existing assembly tools, several challenges remain unresolved. This thesis addresses four research questions. First, do different 16S rRNA gene analysis workflows significantly influence the final biological conclusions? Second, how can the resolution of metagenome assembly be improved so that strain-specific variation is retained when working with next-generation sequencing (NGS) data? Third, how can sequencing errors in third-generation sequencing (TGS) reads be corrected effectively by leveraging NGS reads? Finally, how can the strengths of NGS and TGS be better integrated to enable strain-aware metagenome assembly?
To evaluate the robustness of mainstream bioinformatics pipelines for analyzing 16S rRNA gene amplicon data, we applied four commonly used pipelines (VSEARCH, USEARCH, mothur, and UNOISE3) to five public datasets on microbiota changes related to smoking, oral cancer, dental caries, diabetes, and periodontitis. Although the pipelines produced broadly similar results, differences in P-values between pipelines sometimes led to inconsistent biological conclusions.
To retain strain-specific variants in metagenomic assembly, we developed a new de novo assembler, StrainXpress, which uses an overlap graph (OG) approach to reconstruct microbial genomes at strain-level resolution. Evaluation on benchmark datasets shows that StrainXpress outperforms current state-of-the-art methods, with an average improvement of 26.75% in reconstructing strain-specific sequences.
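As a rough illustration of the overlap graph idea only, the toy sketch below links reads whose suffix exactly matches the prefix of another read. It is not the StrainXpress implementation, which the abstract does not describe in detail; the function names, the exact-match criterion, and the `min_overlap` parameter are assumptions made for this example.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads: list[str], min_overlap: int = 3) -> dict[int, list[tuple[int, int]]]:
    """Directed graph: node = read index, edge (j, length) = read i overlaps read j."""
    graph: dict[int, list[tuple[int, int]]] = {i: [] for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_overlap)
        if length:
            graph[i].append((j, length))
    return graph


# Hypothetical toy reads, chosen so that they chain 0 -> 1 -> 2.
toy_reads = ["ACGTAC", "TACGGA", "GGATTC"]
toy_graph = build_overlap_graph(toy_reads)
```

Real assemblers tolerate sequencing errors in overlaps and avoid the all-pairs comparison used here; the sketch only shows the data structure that results.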
To correct sequencing errors in error-prone long reads, we developed a hybrid correction method called HERO. It aligns high-quality short reads to long reads and constructs an overlap graph to correct the long reads. On benchmark datasets, HERO outperforms other leading methods, achieving average improvements of 65% (range: 27% to 95%) in indel error correction and 20% (range: 4% to 61%) in mismatch error correction.
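The general principle of hybrid correction, using accurate short reads to fix bases in a noisy long read, can be sketched with a simple per-position majority vote. This is a deliberately simpler technique than HERO's overlap-graph approach and handles substitutions only; the gap-free `(offset, short_read)` alignment format, the function name, and the `min_support` threshold are all assumptions for illustration.

```python
from collections import Counter


def correct_long_read(long_read: str,
                      alignments: list[tuple[int, str]],
                      min_support: int = 2) -> str:
    """Replace each long-read base with the majority base of the short reads covering it.

    `alignments` holds (start_offset, short_read) pairs that are assumed to be
    gap-free placements of short reads on the long read (no indels).
    A base is only changed when the majority base has at least `min_support` votes.
    """
    columns = [Counter() for _ in long_read]
    for start, short_read in alignments:
        for k, base in enumerate(short_read):
            pos = start + k
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    corrected = []
    for original, counts in zip(long_read, columns):
        if counts:
            base, support = counts.most_common(1)[0]
            corrected.append(base if support >= min_support else original)
        else:
            # No short-read coverage: keep the long-read base unchanged.
            corrected.append(original)
    return "".join(corrected)


# Hypothetical example: the long read carries a substitution error at index 3.
noisy = "ACGTTCGA"
short_read_alignments = [(0, "ACGAT"), (2, "GATCG"), (3, "ATCGA")]
fixed = correct_long_read(noisy, short_read_alignments)
```

In a strain-aware setting a plain majority vote would risk erasing genuine minority-strain variants, which is one motivation for graph-based methods over this kind of pileup consensus.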
Furthermore, we have developed HyLight, a strain-aware hybrid assembly approach that utilizes a strain-resolved OG to accurately reconstruct individual strain genomes within a microbial community. Across all datasets, HyLight reconstructs 19.93% more strain-specific variants on average than current state-of-the-art approaches.
Year
2024
Page(s)
142
Page URI
https://pub.uni-bielefeld.de/record/2991432
Cite
Kang X. De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing. Bielefeld: Universität Bielefeld; 2024.
Kang, X. (2024). De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing. Bielefeld: Universität Bielefeld.
Kang, Xiongbin. 2024. De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing. Bielefeld: Universität Bielefeld.
Kang, X., 2024. De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing, Bielefeld: Universität Bielefeld.
X. Kang, De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing, Bielefeld: Universität Bielefeld, 2024.
Kang, X.: De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing. Universität Bielefeld, Bielefeld (2024).
Kang, Xiongbin. De Novo Assembly of Genomes in Microbiomes: From Amplicons to Whole-Genome Shotgun Sequencing. Bielefeld: Universität Bielefeld, 2024.
All files are available under the following license(s):
Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)
Full text(s)
Name
Dissertation Xiongbin Kang.pdf
12.14 MB
Access Level
Open Access
Last uploaded
2024-07-21T19:45:17Z
MD5 checksum
96d045caedacb5ee7ea829352089848c
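A downloaded copy can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the assumption that the PDF was saved under its listed file name in the current directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksum and file name as listed on this record page.
EXPECTED_MD5 = "96d045caedacb5ee7ea829352089848c"
PDF_PATH = Path("Dissertation Xiongbin Kang.pdf")  # assumed local download location


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__" and PDF_PATH.exists():
    print("checksum OK" if matches_checksum(PDF_PATH, EXPECTED_MD5) else "checksum mismatch")
```

MD5 is adequate for detecting a corrupted or truncated download, but it is not a safeguard against deliberate tampering.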