Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data

Huang, Liren

Huang L (2019)
Bielefeld: Universität Bielefeld.

Bielefeld E-Dissertation | English

DOI

https://doi.org/10.4119/unibi/2936599

URN

urn:nbn:de:0070-pub-29365999

Author

Huang, Liren (UniBi)

Reviewer / Supervisor

Sczyrba, Alexander (UniBi)

Department

Faculty of Technology > International Research Training Group DiDy (GRK 1906)
Faculty of Technology > Computational Metagenomics

Abstract / Note

The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Storing and processing large amounts of sequencing data requires considerable hardware resources and efficient software that can fully utilize these resources. Nowadays, both industrial enterprises and nonprofit institutes provide robust and easily accessible cloud services for studies in life science. To facilitate genomic data analyses on such powerful computing resources, distributed bioinformatics tools are needed. However, most existing tools scale poorly on distributed computing clouds. Thus, in this thesis, I developed a cloud-based bioinformatics framework that mainly addresses two computational challenges: (i) the runtime-intensive sequence mapping process and (ii) the memory-intensive de novo genome assembly process.

For sequence mapping, I have natively implemented an Apache Spark-based distributed sequence mapping tool called Sparkhit. It uses the q-gram filter and the pigeonhole principle to accelerate fragment recruitment and short-read mapping. These algorithms are implemented in Spark's extended MapReduce model. Sparkhit runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing.
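To illustrate the two filtering ideas named above, here is a minimal single-machine sketch in plain Python. It is not Sparkhit's implementation: the function names, the diagonal-voting scheme, and the restriction to mismatches only (no indels) are assumptions made for this example. The q-gram lemma says that a read of length n with at most k mismatches shares at least n + 1 - q(k + 1) q-grams with its true reference window; the pigeonhole principle says that if the read is cut into k + 1 segments, at least one segment matches exactly.

```python
from collections import defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    return index


def qgram_threshold(read_len, q, max_mismatches):
    """q-gram lemma: minimum number of q-grams a read shares with its
    true window when it has at most max_mismatches mismatches."""
    return read_len + 1 - q * (max_mismatches + 1)


def pigeonhole_seeds(read, max_mismatches):
    """Cut the read into max_mismatches + 1 non-overlapping segments.
    At least one segment must be free of mismatches."""
    parts = max_mismatches + 1
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for p in range(parts):
        length = base + (1 if p < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_offsets(read, index, q, max_mismatches):
    """Filter step: keep reference offsets that collect enough q-gram votes."""
    threshold = qgram_threshold(len(read), q, max_mismatches)
    if threshold < 1:
        raise ValueError("q is too large for this read length and mismatch budget")
    votes = defaultdict(int)
    for i in range(len(read) - q + 1):
        for pos in index.get(read[i:i + q], ()):
            votes[pos - i] += 1  # pos - i is the implied start of the read
    return sorted(off for off, n in votes.items() if n >= threshold)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, q, max_mismatches):
    """Verify step: check each surviving candidate with a full comparison."""
    hits = []
    for off in candidate_offsets(read, index, q, max_mismatches):
        if off < 0 or off + len(read) > len(reference):
            continue
        distance = hamming(read, reference[off:off + len(read)])
        if distance <= max_mismatches:
            hits.append((off, distance))
    return hits
```

The filter is cheap relative to verification, so most reference positions are discarded before any full comparison is made. In a distributed setting, reads could be processed independently per partition against a shared index, but how Sparkhit arranges this is described in the thesis itself, not here.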

For de novo genome assembly, I have invented a new data structure called Reflexible Distributed K-mer (RDK) and natively implemented a distributed genome assembler called Reflexiv. Reflexiv is built on top of the Apache Spark platform, uses Spark's Resilient Distributed Dataset (RDD) to distribute large numbers of k-mers across the cluster, and assembles the genome recursively. As a result, Reflexiv runs 8–17 times faster than the Ray assembler and 5–18 times faster than the ABySS assembler on clusters deployed in the de.NBI cloud.
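The general idea of spreading k-mers across partitions and then extending them into contigs can be sketched as follows. This is a toy illustration in plain Python and does not reproduce the RDK data structure or Reflexiv's recursive algorithm, whose details are not given in the abstract. The hash-partitioning scheme, the function names, and the omission of reverse complements and error correction are all assumptions of this example.

```python
import zlib
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def partition_of(kmer, num_partitions):
    """Stable hash partitioner (Python's built-in hash is salted per process)."""
    return zlib.crc32(kmer.encode()) % num_partitions


def distributed_kmer_count(reads, k, num_partitions):
    """Count k-mers so that each distinct k-mer lives in exactly one partition,
    mimicking a map -> shuffle-by-key -> reduce pass over a cluster."""
    partitions = [Counter() for _ in range(num_partitions)]
    for read in reads:
        for kmer in kmers(read, k):
            partitions[partition_of(kmer, num_partitions)][kmer] += 1
    return partitions


def successors(kmer, kmer_set):
    """k-mers that overlap the given k-mer by k - 1 bases on the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmer_set]


def predecessors(kmer, kmer_set):
    """k-mers that overlap the given k-mer by k - 1 bases on the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmer_set]


def extend_right(seed, kmer_set):
    """Extend a seed k-mer one base at a time while the path is unambiguous:
    exactly one successor, and that successor has exactly one predecessor."""
    contig, current, seen = seed, seed, {seed}
    while True:
        nxt = successors(current, kmer_set)
        if len(nxt) != 1 or nxt[0] in seen:
            return contig
        if len(predecessors(nxt[0], kmer_set)) != 1:
            return contig
        current = nxt[0]
        seen.add(current)
        contig += current[-1]
```

A short usage example with three overlapping reads from a made-up 13-base sequence:

```python
reads = ["ATGGCGTG", "GCGTGCAA", "GTGCAATC"]
partitions = distributed_kmer_count(reads, k=4, num_partitions=3)
all_kmers = set().union(*partitions)
contig = extend_right("ATGG", all_kmers)
```

Because each k-mer is owned by one partition, the memory needed per worker shrinks as partitions are added, which is the property that matters for the memory-intensive assembly step.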

In addition, I have incorporated a variety of analytical methods into the framework. I have also developed a tool wrapper to distribute external tools and Docker containers on the Spark cluster. As a large-scale genomic use case, my framework processed 100 terabytes of data across four genomic projects on the Amazon cloud in 21 hours. Furthermore, an analysis of the entire Human Microbiome Project shotgun sequencing data was completed in 2 hours, presenting an approach to easily associate large amounts of public datasets with reference data.
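The tool-wrapper idea, splitting the input into partitions and streaming each one through an external program, can be sketched on a single machine with the Python standard library. This is only an illustration in the spirit of Spark's `RDD.pipe`; it is not the wrapper developed in the thesis, and the function names and the thread-pool stand-in for cluster workers are assumptions of this example.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Cut the records into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    chunks, start = [], 0
    for p in range(num_partitions):
        end = start + size + (1 if p < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def pipe_partition(command, records):
    """Send one partition to the external command on stdin, one record per
    line, and return the lines the command writes to stdout."""
    if not records:
        return []
    result = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external tool exits with an error
    )
    return result.stdout.splitlines()


def pipe_all(command, records, num_partitions=4):
    """Run the external command once per partition, in parallel, and
    concatenate the outputs in partition order."""
    chunks = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        outputs = pool.map(lambda chunk: pipe_partition(command, chunk), chunks)
    return [line for out in outputs for line in out]


# Hypothetical stand-in for an external tool: upper-cases each input line.
EXAMPLE_TOOL = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
```

Calling `pipe_all(EXAMPLE_TOOL, ["acgt", "ttga", "ccat"], num_partitions=2)` runs the stand-in tool twice, once per partition. Any command that reads records from stdin and writes results to stdout could take its place, including a `docker run -i ...` invocation, provided the tool treats each record independently.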

Thus, my work contributes to the interdisciplinary research of life science and distributed cloud computing by improving existing methods with a new data structure, new algorithms, and robust distributed implementations.

Year

2019

Copyright / Licenses

Creative Commons Public Domain Dedication (CC0 1.0)

Page URI

https://pub.uni-bielefeld.de/record/2936599

Cite

Huang L. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld; 2019.

Huang, L. (2019). Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld. https://doi.org/10.4119/unibi/2936599

Huang, Liren. 2019. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld.

Huang, L. (2019). Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld.

Huang, L., 2019. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data, Bielefeld: Universität Bielefeld.

L. Huang, Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data, Bielefeld: Universität Bielefeld, 2019.

Huang, L.: Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Universität Bielefeld, Bielefeld (2019).

Huang, Liren. Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld, 2019.

All files are available under the following license(s):

Creative Commons Public Domain Dedication (CC0 1.0):

https://creativecommons.org/publicdomain/zero/1.0/deed.de
https://creativecommons.org/publicdomain/zero/1.0/legalcode

Full text(s)

Name

Thesis_Liren_Huang.pdf 11.23 MB

Access Level

Open Access

Last uploaded

2019-09-06T09:19:08Z

MD5 checksum

fcb2883d5cf01274dbf92e10755cccf7
