ArchiveSpark - MS Independent Study Final Submission

dc.contributor.author: Galad, Andrej [en]
dc.contributor.department: Digital Library Research Laboratory [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2017-04-23T19:35:57Z [en]
dc.date.available: 2017-04-23T19:35:57Z [en]
dc.date.issued: 2016-12-13 [en]
dc.description.abstract: This project builds on the work at the Internet Archive by researcher Vinay Goel and by Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL and GETAR) on ArchiveSpark, a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to evaluate ArchiveSpark quantitatively and qualitatively against mainstream Web archive processing solutions and to extend it as needed for processing the test collections. The work also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of the project. Its primary focus is a comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) in a variety of environments and setups, in order to compare the performance improvements that ArchiveSpark offers and to understand the shortcomings and tradeoffs of using it under varying scenarios. [en]
dc.description.notes: Included are the final report (PDF + Word), the final presentation (PPTX + PDF), the ArchiveSpark demo in the form of a Jupyter Notebook, and the software developed during this project. [en]
dc.description.sponsorship: IMLS LG-71-16-0037-16: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse [en]
dc.description.sponsorship: NSF IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR) [en]
dc.description.sponsorship: NSF IIS-1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.identifier.uri: http://hdl.handle.net/10919/77457 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: ArchiveSpark [en]
dc.subject: Big data [en]
dc.subject: IDEAL [en]
dc.subject: GETAR [en]
dc.subject: HBase [en]
dc.subject: WARC [en]
dc.subject: Internet Archive [en]
dc.subject: Spark [en]
dc.subject: Web Archiving [en]
dc.subject: IMLS [en]
dc.subject: CDX [en]
dc.title: ArchiveSpark - MS Independent Study Final Submission [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]

Files

Original bundle
Now showing 1 - 5 of 6
Name: ArchiveSpark.zip
Size: 2.22 MB
Format:

Name: ArchiveSpark_Demo.ipynb
Size: 20.86 KB
Format: Unknown data format

Name: ArchiveSpark-FINAL.pdf
Size: 920.33 KB
Format: Adobe Portable Document Format

Name: ArchiveSpark-FINAL.docx
Size: 917.58 KB
Format: Microsoft Word XML

Name: ArchiveSpark.pptx
Size: 669.73 KB
Format: Microsoft Powerpoint XML

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission