The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

dc.contributor.author: Wang, Xinyue
dc.contributor.author: Xie, Zhiwu
dc.date.accessioned: 2020-05-27T14:34:22Z
dc.date.available: 2020-05-27T14:34:22Z
dc.date.issued: 2020-08
dc.description.abstract: The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest to reuse these archives as big data sources for statistical and analytical research, the speed to turn these data into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workload. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gain of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement for Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
dc.description.sponsorship: IMLS: LG-71-16-0037-16
dc.description.sponsorship: NSF: IIS-1619028
dc.description.sponsorship: NSF: IIS-1619371
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3383583.3398542
dc.identifier.uri: http://hdl.handle.net/10919/98565
dc.language.iso: en
dc.publisher: ACM
dc.relation.ispartof: ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20)
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: web archiving
dc.subject: file format
dc.subject: storage management
dc.subject: big data analysis
dc.title: The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle
dc.type: Article - Refereed
dc.type: Conference proceeding
dc.type.dcmitype: Text

Files

Original bundle
Name: fp210.pdf
Size: 901.38 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission