Data-Intensive Biocomputing in the Cloud

Meeramohideen Mohamed, Nabeel

dc.contributor.author	Meeramohideen Mohamed, Nabeel	en
dc.contributor.committeechair	Feng, Wu-chun	en
dc.contributor.committeemember	Butt, Ali R.	en
dc.contributor.committeemember	Lin, Heshan	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2013-09-26T08:00:28Z	en
dc.date.available	2013-09-26T08:00:28Z	en
dc.date.issued	2013-09-25	en
dc.description.abstract	Next-generation sequencing (NGS) technologies have made it possible to rapidly sequence the human genome, heralding a new era of health-care innovations based on personalized genetic information. However, these NGS technologies generate data at a rate that far outstrips Moore's Law. As a consequence, analyzing this exponentially increasing data deluge requires enormous computational and storage resources, resources that many life-science institutions do not have access to. As such, cloud computing has emerged as an obvious, but still nascent, solution. This thesis intends to investigate and design an efficient framework for running and managing large-scale data-intensive scientific applications in the cloud. Based on lessons learned from our parallel implementation of a genome analysis pipeline in the cloud, we aim to provide a framework for users to run such data-intensive scientific workflows using a hybrid setup of client and cloud resources. We first present SeqInCloud, our highly scalable parallel implementation of a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), on the Windows Azure HDInsight cloud platform. Together with a parallel implementation of GATK on Hadoop, we evaluate the potential of using cloud computing for large-scale DNA analysis and present a detailed study on efficiently utilizing cloud resources for running data-intensive, life-science applications. Based on our experience from running SeqInCloud on Azure, we present CloudFlow, a feature-rich workflow manager for running MapReduce-based bioinformatics pipelines utilizing both client and cloud resources.
CloudFlow, built on top of an existing MapReduce-based workflow manager called Cloudgene, provides features not offered by existing MapReduce-based workflow managers, such as simultaneous use of client and cloud resources, automatic data-dependency handling between client and cloud resources, and the flexibility of implementing user-defined plugins for data transformations. Overall, our work aims to increase the adoption of cloud resources for running data-intensive scientific workloads.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:1460	en
dc.identifier.uri	http://hdl.handle.net/10919/23847	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Cloud Computing	en
dc.subject	Next Generation Sequencing	en
dc.subject	MapReduce	en
dc.subject	GATK	en
dc.subject	Workflow	en
dc.title	Data-Intensive Biocomputing in the Cloud	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Meeramohideen_Mohamed_N_T_2013.pdf
Size: 5.11 MB
Format: Adobe Portable Document Format

Collections

Masters Theses