Data-Intensive Biocomputing in the Cloud

dc.contributor.author: Meeramohideen Mohamed, Nabeel [en]
dc.contributor.committeechair: Feng, Wu-chun [en]
dc.contributor.committeemember: Butt, Ali R. [en]
dc.contributor.committeemember: Lin, Heshan [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2013-09-26T08:00:28Z [en]
dc.date.available: 2013-09-26T08:00:28Z [en]
dc.date.issued: 2013-09-25 [en]
dc.description.abstract: Next-generation sequencing (NGS) technologies have made it possible to rapidly sequence the human genome, heralding a new era of health-care innovations based on personalized genetic information. However, these NGS technologies generate data at a rate that far outstrips Moore's Law. As a consequence, analyzing this exponentially increasing data deluge requires enormous computational and storage resources, resources that many life science institutions do not have access to. As such, cloud computing has emerged as an obvious, but still nascent, solution. This thesis investigates and designs an efficient framework for running and managing large-scale, data-intensive scientific applications in the cloud. Based on lessons learned from our parallel implementation of a genome analysis pipeline in the cloud, we aim to provide a framework for users to run such data-intensive scientific workflows using a hybrid setup of client and cloud resources. We first present SeqInCloud, our highly scalable parallel implementation of a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), on the Windows Azure HDInsight cloud platform. Together with a parallel implementation of GATK on Hadoop, we evaluate the potential of using cloud computing for large-scale DNA analysis and present a detailed study on efficiently utilizing cloud resources for running data-intensive, life-science applications. Based on our experience from running SeqInCloud on Azure, we present CloudFlow, a feature-rich workflow manager for running MapReduce-based bioinformatics pipelines utilizing both client and cloud resources.
CloudFlow, built on top of an existing MapReduce-based workflow manager called Cloudgene, provides features that are not offered by existing MapReduce-based workflow managers, such as simultaneous use of client and cloud resources, automatic data-dependency handling between client and cloud resources, and the flexibility of implementing user-defined plugins for data transformations. In general, our work aims to increase the adoption of cloud resources for running data-intensive scientific workloads. [en]
dc.description.degree: Master of Science [en]
dc.format.medium: ETD [en]
dc.identifier.other: vt_gsexam:1460 [en]
dc.identifier.uri: http://hdl.handle.net/10919/23847 [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Cloud Computing [en]
dc.subject: Next Generation Sequencing [en]
dc.subject: MapReduce [en]
dc.subject: GATK [en]
dc.subject: Workflow [en]
dc.title: Data-Intensive Biocomputing in the Cloud [en]
dc.type: Thesis [en]
thesis.degree.discipline: Computer Science and Applications [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: masters [en]
thesis.degree.name: Master of Science [en]

Files

Original bundle
Name: Meeramohideen_Mohamed_N_T_2013.pdf
Size: 5.11 MB
Format: Adobe Portable Document Format
