Classification and extraction of information from ETD documents

In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well-adapted to long documents such as electronic theses and dissertations (ETDs). The report describes three areas of study into improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for performing multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning.

Keywords

electronic theses and dissertations, ETDs, natural language processing, classification, citation analysis, document layout analysis, digital libraries

Persistent link

http://hdl.handle.net/10919/96645

Collections

CS6604: Digital Libraries

Full item page

Classification and extraction of information from ETD documents

Files

TR Number

Date

Authors

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

Citation

Persistent link

Collections