Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

dc.contributor.author: Choudhury, Muntabir [en]
dc.contributor.author: Jayanetti, Himarsha R. [en]
dc.contributor.author: Wu, Jian [en]
dc.contributor.author: Ingram, William A. [en]
dc.contributor.author: Fox, Edward [en]
dc.date.accessioned: 2024-01-22T13:06:34Z [en]
dc.date.available: 2024-01-22T13:06:34Z [en]
dc.date.issued: 2021-09-27 [en]
dc.description.abstract: Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved an F1 measure of 81.3%-96% on seven metadata fields. The data and source code are publicly available on Google Drive and in a GitHub repository. [en]
dc.description.notes: Yes, full paper (Peer reviewed?) [en]
dc.description.version: Submitted version [en]
dc.format.extent: Pages 230-233 [en]
dc.format.extent: 4 page(s) [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.doi: https://doi.org/10.1109/jcdl52503.2021.00066 [en]
dc.identifier.eissn: 2575-8152 [en]
dc.identifier.isbn: 9781665417709 [en]
dc.identifier.issn: 2575-7865 [en]
dc.identifier.orcid: Ingram, William [0000-0002-8307-8844] [en]
dc.identifier.orcid: Fox, Edward [0000-0003-1447-6870] [en]
dc.identifier.uri: https://hdl.handle.net/10919/117431 [en]
dc.identifier.volume: 2021-September [en]
dc.language.iso: en [en]
dc.publisher: IEEE [en]
dc.relation.uri: http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000760315700026&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=930d57c9ac61a043676db62af60056c1 [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Digital Libraries [en]
dc.subject: Optical Character Recognition [en]
dc.subject: Text Mining [en]
dc.subject: Metadata Extraction [en]
dc.subject: CRF [en]
dc.subject: BiLSTM-CRF [en]
dc.title: Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations [en]
dc.title.serial: 2021 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2021) [en]
dc.type: Conference proceeding [en]
dc.type.dcmitype: Text [en]
dc.type.other: Proceedings Paper [en]
dc.type.other: Meeting [en]
dc.type.other: Book in series [en]
pubs.finish-date: 2021-09-30 [en]
pubs.organisational-group: /Virginia Tech [en]
pubs.organisational-group: /Virginia Tech/Engineering [en]
pubs.organisational-group: /Virginia Tech/Engineering/Computer Science [en]
pubs.organisational-group: /Virginia Tech/Library [en]
pubs.organisational-group: /Virginia Tech/All T&R Faculty [en]
pubs.organisational-group: /Virginia Tech/Engineering/COE T&R Faculty [en]
pubs.organisational-group: /Virginia Tech/Library/Library assessment administrators [en]
pubs.organisational-group: /Virginia Tech/Library/Dean's office [en]
pubs.organisational-group: /Virginia Tech/Library/Information Technology [en]
pubs.organisational-group: /Virginia Tech/Graduate students [en]
pubs.organisational-group: /Virginia Tech/Graduate students/Doctoral students [en]
pubs.start-date: 2021-09-27 [en]

Files

Original bundle
Name: AutomaticMetadataExtraction-arXiv.pdf
Size: 250.08 KB
Format: Adobe Portable Document Format
Description: Submitted version