Large Web Archive Collection Infrastructure and Services
The web has become the primary carrier of human knowledge in the information age. Because much web content is ephemeral, preserving it is vital to preserving human knowledge and memory. Web archives are created to preserve the current web and make it available for future reuse, and a growing number of web archive initiatives are actively engaged in web archiving. Standards such as WARC, a format for storing archived content, have been established to standardize the preservation of web archive data. Beyond preservation, web archive data is also used as a source for research and for recovering lost information. However, reusing web archive data is inherently challenging because of its scale and the need for big data tools to serve and analyze it efficiently.
In this research, we propose building web archive infrastructure that supports efficient and scalable reuse of web archive data using big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. Building on the Hadoop big data processing platform with components like Apache Spark and HBase, we propose replacing the WARC (Web ARChive) data format with the columnar data format Parquet to facilitate more efficient reuse. Such a columnar format can provide the same features as WARC for long-term preservation while also offering the potential for better computational efficiency and data reuse flexibility. Our experiments show that the proposed design can significantly improve the performance of quantitative data analysis tasks for common web archive data usage. The design can also serve web archive data for a web browsing service. Unlike the conventional web hosting design for large data, it works primarily on top of the raw large data in file systems to provide a hybrid environment for web archive reuse.
In addition to standard web archive data, we integrate Twitter data into our design as part of the web archive resources. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. However, Twitter data is typically collected with non-standardized tools that differ from collection to collection. We aggregate Twitter data from different sources and integrate it into the proposed design for reuse. By storing social media data in a web-archive-like Parquet format, we overcome the data loading bottleneck and greatly increase the processing performance of workloads on that data.
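The intuition behind the columnar layout proposed above can be illustrated with a small sketch. It does not use Parquet, Spark, or any real WARC parser; the record fields (`url`, `status`, `mime`, `payload`) and the helper names `to_columns` and `count_by` are hypothetical and chosen only for illustration. It shows, under those assumptions, how pivoting row-oriented records into per-field columns lets an analytic query read a single small column while leaving the large payload column untouched.

```python
from collections import defaultdict

# Hypothetical, heavily simplified capture records. Real WARC records carry
# many more headers; these field names are assumptions for illustration.
records = [
    {"url": "http://example.org/", "status": 200,
     "mime": "text/html", "payload": b"<html>home</html>"},
    {"url": "http://example.org/a.png", "status": 200,
     "mime": "image/png", "payload": b"\x89PNG..."},
    {"url": "http://example.org/missing", "status": 404,
     "mime": "text/html", "payload": b"<html>not found</html>"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field.

    Assumes every row has the same fields, so all columns stay aligned.
    """
    columns = defaultdict(list)
    for row in rows:
        for field, value in row.items():
            columns[field].append(value)
    return dict(columns)


def count_by(columns, field):
    """Count distinct values in one column without touching other columns."""
    counts = {}
    for value in columns[field]:
        counts[value] = counts.get(value, 0) + 1
    return counts


columns = to_columns(records)

# A typical quantitative question: how many captures per MIME type?
# Only the small "mime" column is scanned; "payload" is never read.
mime_counts = count_by(columns, "mime")
print(mime_counts)
```

In a row-oriented container, answering the same question means walking every record, payload included. A columnar file format applies this idea on disk, which is one plausible reason for the analysis speedups the abstract reports.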