Classification and extraction of information from ETD documents

dc.contributor.author: Aromando, John [en]
dc.contributor.author: Banerjee, Bipasha [en]
dc.contributor.author: Ingram, William A. [en]
dc.contributor.author: Jude, Palakh [en]
dc.contributor.author: Kahu, Sampanna [en]
dc.date.accessioned: 2020-02-01T00:37:09Z [en]
dc.date.available: 2020-02-01T00:37:09Z [en]
dc.date.issued: 2020-01-30 [en]
dc.description.abstract: In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well-adapted to long documents such as electronic theses and dissertations (ETDs). The report describes three areas of study into improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs. We describe the machine learning and natural language processing tools we use for performing multi-label classification of ETD documents. We show how references from ETDs can be parsed into their component parts (e.g., title, author, date) using deep neural networks. Finally, we show that figures can be accurately extracted from a collection of born-digital and scanned ETDs using deep learning. [en]
dc.description.notes: # Contents * ETD_report.pdf * ETD_report.zip * ETD_presentation.pdf * ETD_presentation.pptx [en]
dc.description.sponsorship: IMLS: LG-37-19-0078-19 [en]
dc.identifier.uri: http://hdl.handle.net/10919/96645 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution-ShareAlike 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/ [en]
dc.subject: electronic theses and dissertations [en]
dc.subject: ETDs [en]
dc.subject: natural language processing [en]
dc.subject: classification [en]
dc.subject: citation analysis [en]
dc.subject: document layout analysis [en]
dc.subject: digital libraries [en]
dc.title: Classification and extraction of information from ETD documents [en]
dc.type: Presentation [en]
dc.type: Report [en]

Files

Original bundle
Name:
ETD_report.pdf
Size:
15.15 MB
Format:
Adobe Portable Document Format
Name:
ETD_report.zip
Size:
19.31 MB
Format:
Name:
ETD_presentation.pdf
Size:
39.52 MB
Format:
Adobe Portable Document Format
Name:
ETD_presentation.pptx
Size:
11.42 MB
Format:
Microsoft Powerpoint XML
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission
Description: