Efficient Web Archive Searching

Abstract

The field of web archive searching is at a turning point. In its early years, organizations used only the URL as the key for searching a dataset, which was inefficient but acceptable. In recent years, as the volume of data in web archives has grown, these ordinary search methods have gradually been replaced by more efficient ones. This project addresses the theoretical and methodological implications of choosing suitable hashing algorithms and running them locally, with the eventual aim of improving the time complexity of web archive searching. It also presents the design and implementation of several hashing algorithms that convert URLs to a sortable, shortened format, and demonstrates the corresponding improvement in search efficiency with benchmark results.
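
The idea of converting URLs to a sortable, shortened format can be illustrated with a minimal sketch. It is not the project's implementation: the choice of a truncated SHA-1 digest, the 16-character key length, and the names `url_key`, `build_index`, and `lookup` are all assumptions made for illustration. Fixed-length keys can be kept in sorted order, so a lookup becomes a binary search rather than a linear scan over raw URLs.

```python
import bisect
import hashlib


def url_key(url: str, length: int = 16) -> str:
    """Map a URL to a fixed-length hex key.

    Assumption: truncated SHA-1 stands in for whichever hashing
    algorithm the project actually evaluates.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return digest[:length]


def build_index(urls):
    """Return a list of (key, url) pairs sorted by key."""
    return sorted((url_key(u), u) for u in urls)


def lookup(index, url):
    """Binary-search the sorted index; return True if the URL is present."""
    entry = (url_key(url), url)
    i = bisect.bisect_left(index, entry)
    return i < len(index) and index[i] == entry


# Hypothetical URLs, used only to exercise the sketch.
index = build_index([
    "http://example.org/a",
    "http://example.org/b",
    "http://example.com/archive/page.html",
])
found = lookup(index, "http://example.org/b")
missing = lookup(index, "http://example.org/zzz")
```

Storing the original URL next to its key lets the lookup tell a true match apart from a collision between truncated hashes; a shorter key saves space but makes such collisions more likely.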

Keywords
Internet Archive, Short URL, Web Archive, searching efficiency, WARC records, Database, Digital Library