CS5604 Fall 2020: Electronic Thesis and Dissertation (ETD) Team

The Fall 2020 CS5604 (Information Storage and Retrieval) class, led by Dr. Edward Fox, is building an information retrieval and analysis system that supports electronic theses and dissertations, tweets, and webpages. We are the Electronic Thesis and Dissertation Collection Management (ETD) team. The Virginia Tech Library maintains a corpus of 19,779 theses and 14,691 dissertations within the VTechWorks system. Our task was to research this data, implement data preprocessing and extraction techniques, index the data using Elasticsearch, and use machine learning techniques to classify each ETD. These efforts were made in concert with teams working to process other content areas and to build content-agnostic infrastructure. Prior work toward these tasks had been done in previous semesters of CS5604 and by students advised by Dr. Fox, and it serves as the foundation for our own work. Of particular note are the works of Sampanna Kahu, Palakh Jude, and the Fall 2019 CS5604 CME team, which we used either as part of our pipeline or as the starting point for our work. Our team divided the creation of an ETD IR system into five subtasks: verify the metadata produced by past teams, ingest text, index using Elasticsearch, extract figures and tables, and classify full documents and chapters. Each member of the team was assigned a role, and a timeline was created to keep everyone on track. Text ingestion was done via ITCore and is accessible through an API. Our team did not perform any base metadata extraction, since the Fall 2019 CS5604 CME team had already done so; however, we did verify the quality of the data. Verification was done by hand and showed that most of the errors found in metadata from previous semesters were minor, but there were a few errors that could have led to misclassification.
However, since those major errors were few and far between, we decided that, given the length of the project, we could continue to use this metadata, and we added improved metadata extraction to our future goals. For figure and table extraction, we incorporated the work of Kahu. For classification, we first implemented the work of Jude, who had previously worked on chapter classification. In addition, we created another classifier that is more accurate. Both methods are accessible through APIs, and the latter classifier is also available as a microservice. An Elasticsearch service was also created as a single point of contact between the pipeline and Elasticsearch. It acts as the final stage of the pipeline: processing is complete when a document has been indexed into Elasticsearch. The final deliverable is a pipeline of containerized services that can be used to store and index ETDs. Staging of the pipeline was handled by the integration team, using Airflow and a reasoner engine to control resource management and data flow. The entire pipeline is then accessible via a website created by the frontend team. Because Airflow defines the application pipeline based on dependencies between services, and our chapter extraction service failed to build due to MySQL dependencies, we were unable to deploy an end-to-end Airflow system. All other services have been unit tested on Git Runner, containerized, and deployed to cloud.cs.vt.edu following a CI/CD pipeline. Future work includes expanding the available models for classification; expanding the available options for the extraction of text, figures, tables, and chapters; and adding more features that may be useful to researchers interested in leveraging this pipeline. Another improvement would be to address errors in the metadata, such as those inherited from previous teams.

Keywords

classification, deep learning, Elasticsearch indexing, electronic theses and dissertations, ETD, metadata retrieval, natural language processing

Persistent link

http://hdl.handle.net/10919/101511

Collections

CS5604: Information Retrieval
