This Python script processes a directory of Web archives (WARC files). The WARC files are decompressed, their HTML files are identified, and the text of each HTML file is extracted and indexed into a Solr instance. See the User's Manual in the Final Report for instructions on running this script. (4.772Kb)
A Mapper script for the MapReduce framework that reads the contents of a Web archive's index and prints the HTML files it contains to standard output. (194 bytes)
A Mapper script for the MapReduce framework that reads a list of file locations for HTML files, extracts the text from those files, and indexes it into a Solr instance. (794 bytes)
The purpose of this project was to simplify the process of unzipping web archive files, parsing them, and indexing them into a Solr instance. These tasks are a subset of a larger project called IDEAL, whose main goal is to build an 11-node cluster that will take roughly 10TB of webpages collected from various events and ingest, filter, analyze, and provide convenient access to them through a user interface. The IDEAL Pages portion is critical to the overall success of that mission: before any of the desired services can be provided, the data must be successfully delivered to Solr. Working with nearly 10TB of compressed data, however, is a difficult task. Our primary objective was to create a Python script that converts all of the web archive files into raw text files containing the text found on the archived web pages. The task was accomplished using several existing tools together with software developed during the project. Hanzo Warc Tools, a tool for working with web archive files, was incorporated into the Python script to unpack the files. To extract the text from the HTML we used Beautiful Soup, which produced files that could be easily indexed into Solr. A key element in making the IDEAL project timely, however, is distributing this process with Hadoop. Hadoop makes it possible to run the same process on multiple machines concurrently, reducing the overall runtime of a task. To accomplish this, it was necessary to split the script into multiple pieces and run them through Hadoop using Map/Reduce. Using one of the IDEAL Project machines and Cloudera, it was possible to work with Hadoop and Map/Reduce. The project resulted in an efficient way to process web archive files, and it remains extendable so that the distribution of the tasks involved can be further optimized.
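The text-extraction Mapper step described above can be illustrated with a minimal sketch. This is not the project's script: it assumes the Mapper receives one file path per line, it substitutes Python's standard-library `html.parser` for Beautiful Soup so that the example has no external dependencies, and it leaves out the Solr indexing step entirely. The names `TextExtractor`, `extract_text`, `map_lines`, and `read_file` are hypothetical, as is the tab-separated output format.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace (including newlines and tabs)
        # so each record fits on a single output line.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def read_file(path):
    """Read a file as text, replacing bytes that are not valid UTF-8."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def map_lines(lines, reader=read_file):
    """Yield 'path<TAB>text' for every HTML path found in `lines`.

    `lines` is any iterable of strings (for a streaming Mapper, sys.stdin);
    `reader` maps a path to its HTML content (assumed interface).
    """
    for line in lines:
        path = line.strip()
        if not path.lower().endswith((".html", ".htm")):
            continue  # ignore images, stylesheets, and other non-HTML files
        yield f"{path}\t{extract_text(reader(path))}"
```

Run as a streaming Mapper, the sketch would print each record from `map_lines(sys.stdin)` to standard output. Keeping the file reader as a parameter lets the extraction logic be exercised without touching the file system.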