Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019
Abstract
In the fall of 2019, the class "CS 5604: Information Storage and Retrieval" was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations (ETDs) maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work should help future researchers study the processed natural-language data using deep learning technologies and address the challenges of extracting information from ETDs.
The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used deep learning and other tools such as GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; segmenting documents; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. Finally, we developed a system that automates all of these tasks. The system also validates the output metadata, thereby helping to ensure the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.
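As an illustration of the metadata validation step described above, the following is a minimal sketch, not the CME team's actual implementation. The field names in REQUIRED_FIELDS, the year range, and the functions validate_metadata and partition_records are all hypothetical, since the abstract does not specify the metadata schema or the validation rules.

```python
from typing import Dict, List, Tuple

# Hypothetical required fields; the real ETD metadata schema is not given here.
REQUIRED_FIELDS = ("title", "author", "degree", "year")


def validate_metadata(record: Dict) -> List[str]:
    """Return a list of problems found in one ETD metadata record."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # Treat absent values and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing field: {name}")
    year = record.get("year")
    # Assumed plausibility window for the year; adjust to the real corpus.
    if year not in (None, "") and not (isinstance(year, int) and 1900 <= year <= 2100):
        problems.append(f"implausible year: {year!r}")
    return problems


def partition_records(records: List[Dict]) -> Tuple[List[Dict], List[Tuple[Dict, List[str]]]]:
    """Split records into those safe to index and those needing review."""
    valid, rejected = [], []
    for record in records:
        problems = validate_metadata(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


good = {"title": "A Study", "author": "B. Author", "degree": "PhD", "year": 2019}
bad = {"title": " ", "author": "C. Author", "degree": "MS", "year": 1492}
valid, rejected = partition_records([good, bad])
```

In a pipeline like the one described, only the records in valid would be passed on for ingestion into Elasticsearch, while rejected records would be kept with their problem lists for inspection.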