ArchiveSpark - MS Independent Study Final Submission

Date
2016-12-13
Publisher
Virginia Tech
Abstract

This project builds on the work done at the Internet Archive by researcher Vinay Goel and by Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech, IDEAL and GETAR) on ArchiveSpark, a framework for efficient Web archive access, extraction, and derivation.

The main goal of the project is to evaluate ArchiveSpark quantitatively and qualitatively against mainstream Web archive processing solutions, and to extend it as necessary for processing test collections. The work also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of the project. These efforts focus on a comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) across a variety of environments and setups. The evaluation aims to compare the performance improvements that ArchiveSpark provides and to identify the shortcomings and tradeoffs of using it under varying scenarios.

Keywords
ArchiveSpark, Big data, IDEAL, GETAR, HBase, WARC, Internet Archive, Spark, Web Archiving, IMLS, CDX