ArchiveSpark - MS Independent Study Final Submission
dc.contributor.author | Galad, Andrej | en |
dc.contributor.department | Digital Library Research Laboratory | en |
dc.contributor.department | Computer Science | en |
dc.date.accessioned | 2017-04-23T19:35:57Z | en |
dc.date.available | 2017-04-23T19:35:57Z | en |
dc.date.issued | 2016-12-13 | en |
dc.description.abstract | This project expands upon the work at the Internet Archive of researcher Vinay Goel and of Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL and GETAR) on the ArchiveSpark project, a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to quantitatively and qualitatively evaluate ArchiveSpark against mainstream Web archive processing solutions and to extend it as necessary for processing the testing collections. The work also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of this project. The primary focus is a comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) in a variety of environments and setups, in order to comparatively analyze the performance improvements that ArchiveSpark brings and to understand the shortcomings and tradeoffs of its usage under varying scenarios. | en |
dc.description.notes | Included are the final report (PDF + Word), the final presentation (PPTX + PDF), the ArchiveSpark demo in the form of a Jupyter Notebook, and the software developed during this project. | en |
dc.description.sponsorship | IMLS LG-71-16-0037-16: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse | en |
dc.description.sponsorship | NSF IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR) | en |
dc.description.sponsorship | NSF IIS-1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL) | en |
dc.identifier.uri | http://hdl.handle.net/10919/77457 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | ArchiveSpark | en |
dc.subject | Big data | en |
dc.subject | IDEAL | en |
dc.subject | GETAR | en |
dc.subject | HBase | en |
dc.subject | WARC | en |
dc.subject | Internet Archive | en |
dc.subject | Spark | en |
dc.subject | Web Archiving | en |
dc.subject | IMLS | en |
dc.subject | CDX | en |
dc.title | ArchiveSpark - MS Independent Study Final Submission | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |