Classification and extraction of information from ETD documents

Date

2020-01-30

Publisher

Virginia Tech

Abstract

In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for performing multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.

Keywords

electronic theses and dissertations, ETDs, natural language processing, classification, citation analysis, document layout analysis, digital libraries

Citation