IDEAL Pages

Abstract

The purpose of this project was to simplify the process of unzipping web archive files, parsing them, and indexing them into a Solr instance. These tasks are a subset of a larger project called IDEAL, whose main goal is to build an 11-node cluster that will take roughly 10TB of webpages collected from various events and ingest, filter, analyze, and provide convenient access to them through a user interface. The IDEAL Pages portion is critical to the overall success of that mission: before the desired services can be provided, the data must be successfully delivered to Solr. Working with nearly 10TB of compressed data, however, is a difficult task. Our primary objective was to create a Python script that converts the web archive files into raw text files containing the text found on the archived web pages. We accomplished this by combining existing tools with software developed during the project. A tool for working with web archive files, Hanzo Warc Tools, was incorporated into the Python script to unpack the files. To extract the text from the HTML we used Beautiful Soup, producing files that could be easily indexed into Solr. A key element in making the IDEAL project timely, however, is distributing this process with Hadoop, which can run the same process on multiple machines concurrently and thereby reduce the overall runtime of a task. To accomplish this, it was necessary to split the script into multiple pieces and run them through Hadoop using Map/Reduce. Using one of the IDEAL Project machines and Cloudera, it was possible to work with Hadoop and Map/Reduce. The project resulted in an efficient way to process web archive files, and it remains extensible so that distribution of the tasks involved can be further optimized.
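
As a rough illustration of the text-extraction and map steps described above, the following is a minimal sketch and not the project's actual script. It makes several assumptions: Python's standard-library `html.parser` stands in for Beautiful Soup so that the example has no dependencies, the names `TextExtractor`, `html_to_text`, and `map_records` are hypothetical, and the tab-separated `url<TAB>html` record layout is an assumed input format for a streaming-style map step rather than one documented by the project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents.

    Stand-in for Beautiful Soup (assumption): the project used Beautiful
    Soup for this step; the standard library is used here only to keep
    the sketch dependency-free.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so each page becomes one line of text.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML document as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def map_records(lines):
    """Hypothetical map step: 'url<TAB>html' lines in, 'url<TAB>text' lines out."""
    for line in lines:
        url, sep, html = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip malformed records that have no tab separator
        yield url + "\t" + html_to_text(html)
```

Because each record is handled independently, a map step of this shape can be run on many machines at once, which is the property the Hadoop distribution relies on. In a streaming setup, `map_records` would read its input lines from standard input and print each result.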

Description
The IDEAL Pages project is part of the IDEAL (NSF IIS - 1319578: Integrated Digital Event Archiving and Library) project and specifically focuses on the processing and indexing of Web archives.
Keywords
Digital Archives, Solr, Hadoop, Web Archives, IDEAL Project
Citation