VT Web Archive Project
Date
2014-05-09
Abstract
VTWebArchive is a project to archive, organize, and make publicly available historical versions of content hosted on vt.edu domains. The system combines several open source software packages into a public tool for searching and discovering earlier versions of content hosted on Virginia Tech websites. These packages include Heritrix, a highly customizable web crawler, as well as the Apache Tomcat web server and the Wayback Machine front-end.
Description
In addition to the report and presentation files, this repository includes a Heritrix configuration file, 'Heritrix Configuration.xml'. This file contains a customized configuration for crawling the vt.edu domain.
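To illustrate what restricting a crawl to the vt.edu domain means in practice, the following is a minimal Python sketch of a scope check. It is not taken from 'Heritrix Configuration.xml' and does not use Heritrix; the function name `in_scope` and the constant `ALLOWED_DOMAIN` are hypothetical, and the rule that only http and https URLs on vt.edu or its subdomains are in scope is an assumption made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: the crawl is limited to vt.edu and its subdomains.
ALLOWED_DOMAIN = "vt.edu"


def in_scope(url: str, domain: str = ALLOWED_DOMAIN) -> bool:
    """Return True if url is an http(s) URL on the domain or one of its subdomains."""
    parts = urlsplit(url)
    # Skip non-web schemes such as mailto: or ftp:.
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is already lowercased by urlsplit; drop a trailing dot if present.
    host = (parts.hostname or "").rstrip(".")
    # Match the domain itself or any subdomain, but not look-alikes
    # such as notvt.edu or vt.edu.example.com.
    return host == domain or host.endswith("." + domain)
```

A crawler loop would call `in_scope` on each discovered link and queue only those that return True, so that `http://www.vt.edu/about` is followed while `http://vt.edu.example.com/` is not.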
Support has been provided through:
1) Virginia Tech's Information Technology organization;
2) Qatar National Research Fund Project No. NPRP 4-029-1-007;
3) NSF IIS-1319578: Integrated Digital Event Archiving and Library (IDEAL).
Keywords
Archive, Internet archive, Heritrix, Wayback, Crawl, Crawler, Wayback Machine, WARC, Website archive, vt.edu, IDEAL, Qatar