Increasing Accessibility of Electronic Theses and Dissertations (ETDs) Through Chapter-level Classification

Date

2020-07-07

Publisher

Virginia Tech

Abstract

Great progress has been made in applying advances in natural language processing and machine learning to mine data from journals, conference proceedings, and other digital library documents. However, these advances do not extend well to book-length documents such as electronic theses and dissertations (ETDs). ETDs contain extensive research data, and stakeholders, including researchers, librarians, students, and educators, can benefit from increased access to this corpus. Working with this corpus is challenging because of the wide range of disciplines covered and the use of domain-specific language, and prior systems are not tuned to it. This research aims to increase the accessibility of ETDs by automatically classifying the chapters of an ETD using machine learning and deep learning techniques. This work uses an ETD-centric target classification system. It demonstrates the use of custom-trained word and document embeddings to generate better vector representations of this corpus. It also describes a methodology that leverages extractive summaries of ETD chapters to aid the classification process. Our findings indicate that custom embeddings and summarization techniques can improve classifier performance. The chapter-level labels generated by this research help identify the level of interdisciplinarity in the corpus. The automatic classifiers could also be used in a search engine interface that helps users find the most relevant chapters.
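
The summarize-then-classify idea in the abstract can be illustrated with a minimal sketch. This is not the method used in the thesis: it assumes a simple word-frequency extractive summarizer and a bag-of-words nearest-centroid classifier in place of the custom embeddings and deep learning models the abstract mentions. The function names, the discipline labels, and the toy chapters are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extractive_summary(chapter, k=2):
    """Keep the k sentences whose words are most frequent in the chapter.

    Assumption: a plain word-frequency score stands in for whatever
    summarizer the thesis actually uses.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chapter) if s.strip()]
    freq = Counter(tokenize(chapter))

    def score(sentence):
        toks = tokenize(sentence)
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Re-emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(ranked[:k]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_chapters):
    """Build one bag-of-words centroid per label from chapter summaries."""
    centroids = {}
    for text, label in labeled_chapters:
        centroids.setdefault(label, Counter()).update(tokenize(extractive_summary(text)))
    return centroids


def classify(chapter, centroids):
    """Assign the label whose centroid is closest to the chapter summary."""
    vec = Counter(tokenize(extractive_summary(chapter)))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training data; real ETD chapters are far longer.
TRAIN = [
    ("Neural networks learn features from data. We train the network with "
     "gradient descent. The network improves with more data.", "Computer Science"),
    ("The bridge carries heavy loads. Steel beams support the bridge deck. "
     "Concrete piers anchor the bridge.", "Civil Engineering"),
]

centroids = train_centroids(TRAIN)
print(classify("We train a deep network on image data. The network reaches high accuracy.", centroids))
```

Summarizing before vectorizing shortens each chapter to its most representative sentences, which is one plausible reason summaries could help a classifier on book-length text. Swapping the count vectors for trained word or document embeddings would bring the sketch closer to the approach the abstract describes.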

Keywords

Electronic Theses and Dissertations, Classification, Machine learning, Deep learning (Machine learning), Natural Language Processing
