
dc.contributor.author: Aromando, John [en]
dc.contributor.author: Banerjee, Bipasha [en]
dc.contributor.author: Ingram, William A. [en]
dc.contributor.author: Jude, Palakh [en]
dc.contributor.author: Kahu, Sampanna [en]
dc.date.accessioned: 2020-02-01T00:37:09Z [en]
dc.date.available: 2020-02-01T00:37:09Z [en]
dc.date.issued: 2020-01-30 [en]
dc.identifier.uri: http://hdl.handle.net/10919/96645 [en]
dc.description.abstract: In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well-adapted to long documents such as electronic theses and dissertations (ETDs). The report describes three areas of study into improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for performing multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning. [en]
dc.description.sponsorship: IMLS: LG-37-19-0078-19 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution-ShareAlike 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/ [en]
dc.subject: electronic theses and dissertations [en]
dc.subject: ETDs [en]
dc.subject: natural language processing [en]
dc.subject: classification [en]
dc.subject: citation analysis [en]
dc.subject: document layout analysis [en]
dc.subject: digital libraries [en]
dc.title: Classification and extraction of information from ETD documents [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.description.notes: Contents: ETD_report.pdf, ETD_report.zip, ETD_presentation.pdf, ETD_presentation.pptx [en]


