IDEAL Pages

Aly, Mustafa; Gulotta, Gasper

dc.contributor.author	Aly, Mustafa	en
dc.contributor.author	Gulotta, Gasper	en
dc.date.accessioned	2014-05-09T21:01:10Z	en
dc.date.available	2014-05-09T21:01:10Z	en
dc.date.issued	2014-05-09	en
dc.description	The IDEAL Pages project is part of the IDEAL (NSF IIS - 1319578: Integrated Digital Event Archiving and Library) project and specifically focuses on the processing and indexing of Web archives.	en
dc.description.abstract	The purpose of this project was to simplify the process of unzipping web archive files, parsing them, and indexing them into a Solr instance. These tasks are a subset of a larger project called IDEAL, whose main goal is to build an 11-node cluster that will take roughly 10TB of webpages collected from various events and ingest, filter, analyze, and provide convenient access to them through a user interface. The IDEAL Pages portion is critical to the overall success of that mission: before any of the desired services can be provided, the data must be successfully delivered to Solr. Working with nearly 10TB of compressed data, however, is a difficult task. Our primary objective was to create a Python script that converts all of the web archive files into raw text files containing the text found on the archived web pages. We accomplished this by combining several existing tools with software developed during the project. Hanzo Warc Tools, a tool for working with web archive files, was incorporated into the Python script to unpack the files. To extract the text from the HTML, we used Beautiful Soup, which produced files that could be easily indexed into Solr. A key element in making the IDEAL project timely is distributing this process with Hadoop, which can run the same process on multiple machines concurrently to reduce the overall runtime of a task. To accomplish this, it was necessary to split the script into multiple pieces and run them through Hadoop using Map/Reduce. Using one of the IDEAL Project machines and Cloudera, it was possible to work with Hadoop and Map/Reduce. The project resulted in an efficient way to process web archive files, and it remains extensible so that distribution of the tasks involved can be further optimized.	en
dc.description.sponsorship	Mohamed Magdy	en
dc.description.sponsorship	NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL).	en
dc.identifier.uri	http://hdl.handle.net/10919/47938	en
dc.language.iso	en_US	en
dc.rights	Creative Commons Attribution-NonCommercial 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	en
dc.subject	Digital Archives	en
dc.subject	Solr	en
dc.subject	Hadoop	en
dc.subject	Web Archives	en
dc.subject	IDEAL Project	en
dc.title	IDEAL Pages	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en
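
The abstract above describes extracting page text from HTML and preparing it for indexing into Solr. The following is a minimal sketch of those two steps, not the code in processWarcDir.py. It uses Python's standard-library html.parser as a stand-in for Beautiful Soup so that it runs without extra packages. The function names and the Solr field names (id, url, content) are assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # stripped text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def solr_add_payload(doc_id, url, text):
    """Build a JSON array holding one document.

    The field names are assumptions; a real Solr schema may differ.
    """
    return json.dumps([{"id": doc_id, "url": url, "content": text}])


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b></p></body></html>"
    print(solr_add_payload("doc-1", "http://example.org/", html_to_text(page)))
```

The returned JSON string is shaped like the body of a Solr JSON update request. Unpacking the WARC files and sending the request to a Solr instance are left out of this sketch.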

Files

Original bundle

Now showing 1 - 5 of 9

Name: FinalPresAlyAndGulotta.pdf
Size: 446.77 KB
Format: Adobe Portable Document Format
Description: PDF version of the final presentation given at the project's conclusion

Name: FinalPresAlyAndGulotta.pptx
Size: 444.9 KB
Format: Microsoft PowerPoint XML
Description: PowerPoint version of the final presentation given at the project's conclusion

Name: MidtermPresAlyAndGulotta.pdf
Size: 111.89 KB
Format: Adobe Portable Document Format
Description: PDF version of the midterm presentation, which gave a project update

Name: MidtermPresAlyAndGulotta.pptx
Size: 43.87 KB
Format: Microsoft PowerPoint XML
Description: PowerPoint version of the midterm presentation, which gave a project update

Name: processWarcDir.py
Size: 4.77 KB
Format: Unknown data format
Description: This Python script processes a directory containing Web archives (WARC files). The WARC files are decompressed and their HTML files are identified; the text of each HTML file is then extracted and indexed into a Solr instance. See the User's Manual in the Final Report for instructions on running this script.

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed to upon submission
Description:

Collections

CS4624: Multimedia, Hypertext, and Information Access