CS4624 IDEAL Spreadsheet

Abstract

The IDEAL proposal encompasses a large technology infrastructure intended for use by people of varying backgrounds. The analysts and researchers who are familiar with the data presented through the IDEAL project may not be familiar with the means of accessing it from the different resources. The purpose of this project is to give non-technical personnel the ability to access that data in an easy-to-use and intuitive way.

The data this project focuses on are tweets, photos, and webpages stored in Web ARChive (WARC) files. These WARC files range in size from a few gigabytes to several hundred gigabytes, making a manual search for specific information nearly impossible. Instead, we use a Cloudera VM as a prototype of the cluster used in IDEAL and demonstrate how to load WARC files for Hadoop processing. This setup allows parallel big data processing with several software tools, supporting database and full-text searching, text extraction, and various machine learning applications.

Our project goal, to present relevant data in an attractive, useful, and intuitive way, was achieved through the creation of a web-based, spreadsheet-like service. The exact usage is described in greater detail below, but the overarching plan was to provide the user with an easy-to-use spreadsheet that takes input from the user and returns the relevant data in spreadsheet cells. Additional functionality requested by the client for special jobs, such as ‘all images’ or ‘word count’, led to further features.

To summarize, this project provides a web service that gives IDEAL researchers the means to retrieve relevant information from WARC files in an intuitive and effective manner. The project called for several technologies and frameworks, which are elaborated on below, and it paves the way for further development in support of the IDEAL project mission.

Description

This project connects the IDEAL database to a spreadsheet interface. The source code developed is in the zip file provided. Our clients were Mohamed Magdy, a Ph.D. student at Virginia Polytechnic Institute and State University, and the Integrated Digital Event Archiving and Library (IDEAL) Team, supported through NSF grant IIS-1319578.

Keywords

IDEAL, Spreadsheet, CS4624, Hadoop

Citation