CS4624: Environment - Virginia Water Resources Research Center (VWRRC) PDF Documents to VTechWorks

Abstract

Virginia Tech has many groups engaged in work related to the environment. In an effort to alleviate server strain for the Virginia Water Resources Research Center (VWRRC), we have begun to archive over 300 PDF documents into VTechWorks. This will make more than five decades of Virginia Tech’s water research more searchable and accessible than ever before. This permanent archive supports searching and browsing by issue date, author, title, subject, series, and more. It may lead to other efforts in support of the College of Natural Resources and Environment.

Description
This submission describes our work to move over 300 VWRRC PDF documents to VTechWorks. Most of the code is written in Java and uses two third-party libraries: OpenCloud for metadata tagging and JSoup for metadata procurement. Additionally, Apache PDFBox was used to extract text from PDF documents dating back to the 1970s.
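To illustrate the tagging step, here is a minimal sketch of how candidate keywords might be ranked by frequency once text has been pulled out of a PDF. It is an assumption-laden stand-in written with the Java standard library only. It does not use the OpenCloud, JSoup, or PDFBox APIs and is not the project's actual code. The class name `KeywordTagger`, the method `topKeywords`, the stop-word list, and the three-letter minimum are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: ranks words in already-extracted document text
// so the most frequent ones can be proposed as subject keywords.
class KeywordTagger {
    // Assumed, deliberately short stop-word list; a real run would need a longer one.
    private static final Set<String> STOP_WORDS =
        Set.of("the", "and", "for", "with", "from", "this", "that");

    // Returns up to `limit` words, most frequent first; ties are broken alphabetically.
    static List<String> topKeywords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase the text and split on anything that is not a letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // Skip empty tokens, very short words, and stop words.
            if (token.length() < 3 || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });

        List<String> keywords = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (keywords.size() >= limit) {
                break;
            }
            keywords.add(entry.getKey());
        }
        return keywords;
    }
}
```

Calling `KeywordTagger.topKeywords(extractedText, 10)` on the text of one report would give a short list of candidate subject terms, which a person could then review before they are entered as metadata.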
Keywords
Water, links, vwrrc, pdf conversion, jsoup, opencloud, tag cloud, html parsing, resources
Citation