DLRL Cluster

Date
2014-05-09
Abstract

The Digital Library Research Laboratory (DLRL) is a group that researches and implements a full-stack Hadoop cluster for data storage and analysis. The DLRL Cluster project focuses on learning and teaching the technologies behind the cluster itself. To accomplish this, we were given three primary goals.

First, we were to create tutorials to teach new users how to use Mahout, HBase, Hive, and Impala. The idea was to have basic tutorials that would give users introductory coverage of these modern technologies, including what they are, what they’re used for, and the fundamentals of how they’re used. We met this goal by creating an in-depth tutorial for each technology. Each tutorial contains step-by-step instructions for getting started, along with pictures that let users follow along and compare their progress to confirm that each step succeeded.

Second, we were to use these tools to demonstrate their capabilities on real data from the IDEAL project. Rather than showing a demo firsthand to each new user of the system, we created a short (5 to 10 minute) demo video for each technology, so that users can see for themselves how to use the software to accomplish tasks. With a video, users can pause and go back at their leisure to better familiarize themselves with the commands and interfaces involved.

Finally, we were to apply the knowledge gained from researching these technologies to the actual cluster. We took a real, large dataset from the DLRL cluster and ran it through each technology. We generated reports focusing on efficiency and performance, and produced a result dataset for use in data analysis.

Description
Demonstration and comparison of the capabilities of Hadoop tools: HBase, Hive, Impala, and Mahout. Contains tutorials and video demos. This project was made possible by National Science Foundation support, under IIS-1319578, of the Integrated Digital Event Archiving and Library (IDEAL) grant.
Keywords
Mahout, Impala, HBase, Hive, Hadoop, IDEAL