IDEAL Pages

dc.contributor.author: Aly, Mustafa [en]
dc.contributor.author: Gulotta, Gasper [en]
dc.date.accessioned: 2014-05-09T21:01:10Z [en]
dc.date.available: 2014-05-09T21:01:10Z [en]
dc.date.issued: 2014-05-09 [en]
dc.description: The IDEAL Pages project is part of the IDEAL (NSF IIS - 1319578: Integrated Digital Event Archiving and Library) project and specifically focuses on the processing and indexing of Web archives. [en]
dc.description.abstract: The purpose of this project was to simplify the process of unzipping web archive files, parsing them, and indexing them into a Solr instance. These tasks are a subset of a larger project called IDEAL, whose main goal is to build an 11-node cluster that will ingest, filter, and analyze roughly 10TB of webpages collected from various events, and provide convenient access to them through a user interface. The IDEAL Pages portion is critical to the overall success of that project: before any services can be provided, the data must be successfully delivered to Solr. Working with nearly 10TB of compressed data, however, is a difficult task. Our primary objective was to create a Python script that converts the web archive files into raw text files containing the text found on the archived web pages. We accomplished this using several existing tools together with software developed during the project. Hanzo Warc Tools, a tool for working with web archive files, was incorporated into the Python script to unpack the files. To extract the text from the HTML we used a program called Beautiful Soup, producing files that could be easily indexed into Solr. A key element in making the IDEAL project timely is distributing this process with Hadoop, which can run the same process on multiple machines concurrently and so reduce the overall runtime of a task. To accomplish this, it was necessary to split the script into multiple pieces and run them through Hadoop using Map/Reduce. Using one of the IDEAL project machines and Cloudera, it was possible to work with Hadoop and Map/Reduce. The project resulted in an efficient way to process web archive files, and it remains extensible so that the distribution of the tasks involved can be optimized further. [en]
dc.description.sponsorship: Mohamed Magdy [en]
dc.description.sponsorship: NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL). [en]
dc.identifier.uri: http://hdl.handle.net/10919/47938 [en]
dc.language.iso: en_US [en]
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/ [en]
dc.subject: Digital Archives [en]
dc.subject: Solr [en]
dc.subject: Hadoop [en]
dc.subject: Web Archives [en]
dc.subject: IDEAL Project [en]
dc.title: IDEAL Pages [en]
dc.type: Presentation [en]
dc.type: Software [en]
dc.type: Technical report [en]
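
The unpacking step described in the abstract can be illustrated with a small sketch. This is not the project's code: the project used Hanzo Warc Tools, whereas the sketch below reads WARC records with the Python standard library only, so that it runs without extra dependencies. The function names (`iter_warc_records`, `iter_html_responses`, `html_pages_in_warc`) are hypothetical, and the parser assumes well-formed records with a `Content-Length` header.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # an empty line (or EOF) ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", "0")))
        yield headers, block


def is_html(http_head):
    """Return True if the HTTP header section declares a text/html body."""
    for line in http_head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            return value.strip().lower().startswith(b"text/html")
    return False


def iter_html_responses(stream):
    """Yield (url, html_bytes) for each archived HTTP response that is HTML."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if sep and is_html(http_head):
            yield headers.get("warc-target-uri", ""), body


def html_pages_in_warc(path):
    """Open a .warc or .warc.gz file (hypothetical path) and list its HTML pages."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return list(iter_html_responses(stream))
```

Reading through `gzip.open` handles archives stored as several concatenated gzip members, which is a common layout for `.warc.gz` files. A production pipeline would more likely rely on a dedicated WARC library, as the project did.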

Files

Original bundle
Now showing 1 - 5 of 9
Name: FinalPresAlyAndGulotta.pdf
Size: 446.77 KB
Format: Adobe Portable Document Format
Description: PDF version of the final presentation, given at the conclusion of the project
Name: FinalPresAlyAndGulotta.pptx
Size: 444.9 KB
Format: Microsoft PowerPoint XML
Description: PowerPoint version of the final presentation, given at the conclusion of the project
Name: MidtermPresAlyAndGulotta.pdf
Size: 111.89 KB
Format: Adobe Portable Document Format
Description: PDF version of the midterm presentation, which provided a project update
Name: MidtermPresAlyAndGulotta.pptx
Size: 43.87 KB
Format: Microsoft PowerPoint XML
Description: PowerPoint version of the midterm presentation, which provided a project update
Name: processWarcDir.py
Size: 4.77 KB
Format: Unknown data format
Description: This Python script processes a directory containing Web archives (WARC files). The WARC files are decompressed, their HTML files are identified, and the text of each HTML file is then extracted and indexed into a Solr instance. See the User's Manual in the Final Report for instructions on executing this script.
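
The extraction and indexing step described for processWarcDir.py can be sketched as follows. This is an illustration under stated assumptions, not the script itself: the project used Beautiful Soup, while this sketch substitutes the standard library's `html.parser` so it has no dependencies. The names `TextExtractor`, `html_to_text`, and `to_solr_update`, and the Solr field names `id` and `content`, are hypothetical. The sketch only builds a JSON array of documents of the kind Solr's JSON update handler accepts; sending it to a Solr instance is left out.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse all runs of whitespace.
        # Note: a word split by inline tags ends up with a space inside it.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_solr_update(url, html):
    """Build a JSON update body; the field names are assumptions."""
    return json.dumps([{"id": url, "content": html_to_text(html)}])
```

For example, `to_solr_update("http://example.com/", "<p>Hi</p>")` produces a one-document JSON array whose `content` field holds the extracted text. Using the page URL as the document `id` is a design choice made for this sketch; the actual schema used by the project is described in its Final Report.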
License bundle
Now showing 1 - 1 of 1
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: