Large Web Archive Collection Infrastructure and Services

dc.contributor.authorWang, Xinyueen
dc.contributor.committeechairFox, Edward A.en
dc.contributor.committeechairXie, Zhiwuen
dc.contributor.committeememberLourentzou, Isminien
dc.contributor.committeememberEldardiry, Hodaen
dc.contributor.committeememberChen, Jiangpingen
dc.contributor.departmentComputer Science and Applicationsen
dc.date.accessioned2023-01-21T09:00:28Zen
dc.date.available2023-01-21T09:00:28Zen
dc.date.issued2023-01-20en
dc.description.abstractThe web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. A growing number of web archive initia- tives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information recovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently. In this research, we propose to build web archive infrastructure that can support efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. Upon the Hadoop big data processing platform with components like Apache Spark and HBase, we propose to replace the WARC (web archive) data format with a columnar data format Parquet to facilitate more efficient reuse. Such a columnar data format can provide the same features as WARC for long-term preservation. In addition, the columnar data format introduces the potential for better com- putational efficiency and data reuse flexibility. The experiments show that this proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. This design can also serve web archive data for a web browsing service. Unlike the conventional web hosting design for large data, this design primarily works on top of the raw large data in file systems to provide a hybrid environment around web archive reuse. In addition to the standard web archive data, we also integrate Twitter data into our design as part of web archive resources. Twitter is a prominent source of data for researchers in a vari- ety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.en
dc.description.abstractgeneralThe web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information discovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently. In this research, we propose to build a web archive big data processing infrastructure that can support efficient and scalable web archive reuse like quantitative data analysis and browsing services. We adopt industry frameworks and tools to establish a platform that can provide high-performance computation for web archive initiatives and users. We propose to convert the standard web archive data file format to a columnar data format for efficient future reuse. Our experiments show that our proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. Our design can also serve an efficient web browsing service without adopting a sophisticated web hosting architecture. In addition to the standard web archive data, we also integrate Twitter data into our design as a unique web archive resource. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.en
dc.description.degreeDoctor of Philosophyen
dc.format.mediumETDen
dc.identifier.othervt_gsexam:36218en
dc.identifier.urihttp://hdl.handle.net/10919/113345en
dc.language.isoenen
dc.publisherVirginia Techen
dc.rightsCreative Commons Attribution-NonCommercial 4.0 Internationalen
dc.rights.urihttp://creativecommons.org/licenses/by-nc/4.0/en
dc.subjectWeb Archiveen
dc.subjectDigital Libraryen
dc.subjectBig Data Infrastructureen
dc.titleLarge Web Archive Collection Infrastructure and Servicesen
dc.typeDissertationen
thesis.degree.disciplineComputer Science and Applicationsen
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen
thesis.degree.leveldoctoralen
thesis.degree.nameDoctor of Philosophyen

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Wang_X_D_2023.pdf
Size:
3.96 MB
Format:
Adobe Portable Document Format