Building datasets to support information extraction and structure parsing from electronic theses and dissertations

Ingram, William A.; Wu, Jian; Kahu, Sampanna Yashwant; Manzoor, Javaid Akbar; Banerjee, Bipasha; Ahuja, Aman; Choudhury, Muntabir Hasan; Salsabil, Lamia; Shields, Winston; Fox, Edward A.

dc.contributor.author	Ingram, William A.	en
dc.contributor.author	Wu, Jian	en
dc.contributor.author	Kahu, Sampanna Yashwant	en
dc.contributor.author	Manzoor, Javaid Akbar	en
dc.contributor.author	Banerjee, Bipasha	en
dc.contributor.author	Ahuja, Aman	en
dc.contributor.author	Choudhury, Muntabir Hasan	en
dc.contributor.author	Salsabil, Lamia	en
dc.contributor.author	Shields, Winston	en
dc.contributor.author	Fox, Edward A.	en
dc.date.accessioned	2025-11-24T18:34:02Z	en
dc.date.available	2025-11-24T18:34:02Z	en
dc.date.issued	2024-06-01	en
dc.description.abstract	Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods in tasks that are specifically suited to identify and extract key elements of ETD documents. We explain how we construct the datasets by manually labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs.	en
dc.description.sponsorship	Institute of Museum and Library Services [LG-37-19-0078-19]; Institute of Museum and Library Services; John Pratt (ODU); Amazon Web Services	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1007/s00799-024-00395-4	en
dc.identifier.eissn	1432-1300	en
dc.identifier.issn	1432-5012	en
dc.identifier.issue	2	en
dc.identifier.uri	https://hdl.handle.net/10919/139737	en
dc.identifier.volume	25	en
dc.language.iso	en	en
dc.publisher	Springer	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Electronic theses and dissertations	en
dc.subject	Document structure analysis	en
dc.subject	Information extraction	en
dc.subject	Scholarly text mining	en
dc.subject	Benchmark datasets	en
dc.title	Building datasets to support information extraction and structure parsing from electronic theses and dissertations	en
dc.title.serial	International Journal on Digital Libraries	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en

Files

Original bundle

Name: IngramBuilding.pdf
Size: 2.66 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Scholarly Works, University Libraries
Scholarly Works, Computer Science
Scholarly Works, Electrical and Computer Engineering