A New Annotation Method and Dataset for Layout Analysis of Long Documents

Ahuja, Aman; Dinh, Kevin; Dinh, Brian; Ingram, William A.; Fox, Edward A.

dc.contributor.author	Ahuja, Aman	en
dc.contributor.author	Dinh, Kevin	en
dc.contributor.author	Dinh, Brian	en
dc.contributor.author	Ingram, William A.	en
dc.contributor.author	Fox, Edward A.	en
dc.date.accessioned	2023-05-01T18:38:06Z	en
dc.date.available	2023-05-01T18:38:06Z	en
dc.date.issued	2023-05	en
dc.date.updated	2023-05-01T07:58:26Z	en
dc.description.abstract	Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the noise introduced by the capture process, and (c) the imbalanced distribution of element types in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This method leverages the knowledge of existing trained models to assist human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem by guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images, primarily consisting of rare classes, that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.	en
dc.description.version	Published version	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1145/3543873.3587609	en
dc.identifier.uri	http://hdl.handle.net/10919/114867	en
dc.language.iso	en	en
dc.publisher	ACM	en
dc.relation.ispartof	WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023	en
dc.rights	Creative Commons Attribution-NonCommercial 4.0 International	en
dc.rights.holder	The author(s)	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	en
dc.title	A New Annotation Method and Dataset for Layout Analysis of Long Documents	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en

Files

Original bundle

Name: 3543873.3587609.pdf
Size: 1.81 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Journal Articles, Association for Computing Machinery (ACM)
Scholarly Works, Computer Science
Scholarly Works, University Libraries