Evaluating MapReduce System Performance: A Simulation Approach

dc.contributor.author: Wang, Guanying [en]
dc.contributor.committeechair: Butt, Ali R. [en]
dc.contributor.committeemember: Cameron, Kirk W. [en]
dc.contributor.committeemember: Feng, Wu-chun [en]
dc.contributor.committeemember: Nikolopoulos, Dimitrios S. [en]
dc.contributor.committeemember: Pandey, Prashant [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2014-03-14T20:15:47Z [en]
dc.date.adate: 2012-09-13 [en]
dc.date.available: 2014-03-14T20:15:47Z [en]
dc.date.issued: 2012-08-27 [en]
dc.date.rdate: 2012-09-13 [en]
dc.date.sdate: 2012-08-28 [en]
dc.description.abstract: The scale of data generated and processed is exploding in the Big Data era. The MapReduce system, popularized by open-source Hadoop, is a powerful tool for handling this data explosion and is widely employed in many areas involving large-scale data. In many circumstances, hypothetical MapReduce systems must be evaluated, e.g., to provision a new MapReduce system to meet a certain performance goal, to upgrade a currently running system to meet increasing business demands, or to evaluate a novel network topology, new scheduling algorithms, or resource arrangement schemes. The traditional trial-and-error solution involves a time-consuming and costly process in which a real cluster is first built and then benchmarked. In this dissertation, we propose to evaluate hypothetical MapReduce systems using simulation. This simulation approach offers significantly lower turnaround time and lower cost than experiments. Simulation cannot entirely replace experiments, but it can be used as a preliminary step to reveal potential flaws and gain critical insights. We studied MapReduce systems in detail and developed a comprehensive performance model for MapReduce, including sub-task phase-level performance models for both map and reduce tasks and a model for resource contention between multiple processes running concurrently. Based on the performance model, we developed a comprehensive simulator for MapReduce, MRPerf. MRPerf is the first full-featured MapReduce simulator. It supports both workload simulation and resource contention, and it still offers the most complete features among all MapReduce simulators to date. Using MRPerf, we conducted two case studies to evaluate scheduling algorithms in MapReduce and shared storage in MapReduce, without building real clusters.
Furthermore, to integrate simulation and performance prediction more closely into MapReduce systems and leverage predictions to improve system performance, we developed an online prediction framework for MapReduce, which periodically runs simulations within a live Hadoop MapReduce system. The framework can predict task execution within a window in the near future. These predictions can be used by other components in MapReduce systems to improve performance. Our results show that the framework achieves high prediction accuracy and incurs negligible overhead. We present two potential use cases: prefetching and a dynamically adapting scheduler. [en]
dc.description.degree: Ph. D. [en]
dc.identifier.other: etd-08282012-152556 [en]
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-08282012-152556/ [en]
dc.identifier.uri: http://hdl.handle.net/10919/28820 [en]
dc.publisher: Virginia Tech [en]
dc.relation.haspart: Wang_G_D_2012.pdf [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: performance prediction [en]
dc.subject: performance modeling [en]
dc.subject: Simulation [en]
dc.subject: MapReduce [en]
dc.subject: Hadoop [en]
dc.title: Evaluating MapReduce System Performance: A Simulation Approach [en]
dc.type: Dissertation [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: doctoral [en]
thesis.degree.name: Ph. D. [en]

Files

Original bundle
Name: Wang_G_D_2012.pdf
Size: 1.08 MB
Format: Adobe Portable Document Format