Browsing by Author "Dinh, Kevin"
- A New Annotation Method and Dataset for Layout Analysis of Long Documents
  Ahuja, Aman; Dinh, Kevin; Dinh, Brian; Ingram, William A.; Fox, Edward A. (ACM, 2023-05)
  Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise present resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object detection based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object detection based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
- Object Detection
  Dinh, Kevin; Dinh, Brian; Leavitt, Andrew; Tran, Annie (Virginia Tech, 2022-12-15)
  For this project, our team took 20,000 image samples from ETDs and annotated them using a Python package called PyLabel. PyLabel is an open-source Python library used to label PDFs. PyLabel can also take a trained dataset and use it for AI-aided annotations. We also created a pipeline that divides the dataset into equal pieces, from which a user can select the number of samples they want to annotate. Old sample data is then cleared out and replaced with new sample data that contains classes with low accuracy. Finally, we saved the annotations as YOLOv7 .txt files, which are accumulated to retrain the model, first with 10,000 annotated images and finally with 20,000 annotated pages. With these annotated pages, we conducted an experiment timing how long it takes to annotate a page, to see how the average annotation time per page improved as the different models were trained. We concluded that annotation aided by the model trained with 10,000 pages was significantly faster than annotation aided by the original model.
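
The pipeline steps described in the abstracts above (dividing the dataset into equal pieces, steering annotators toward rare classes, and saving annotations in YOLO text format) can be illustrated with a minimal Python sketch. This is not the authors' code and does not use PyLabel. The function names, the input shapes, and the class names in the usage example are all hypothetical. The one detail taken as given is the standard YOLO label layout: one line per box, `class_id x_center y_center width height`, with coordinates normalized to the image size.

```python
def split_into_batches(items, n_batches):
    """Split items into n_batches consecutive pieces of nearly equal size.

    Hypothetical helper: earlier batches absorb the remainder, so sizes
    differ by at most one.
    """
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


def rank_pages_by_rare_classes(predictions, rare_classes):
    """Order pages so those with the most predicted rare-class objects come first.

    predictions: dict mapping a page id to a list of predicted class names
    (an assumed shape for model output, not a real library format).
    """
    def rare_count(page):
        return sum(1 for name in predictions[page] if name in rare_classes)

    # sorted() is stable, so pages with equal counts keep their input order.
    return sorted(predictions, key=rare_count, reverse=True)


def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    pages = [f"page_{i:03d}.png" for i in range(10)]
    print([len(b) for b in split_into_batches(pages, 3)])

    preds = {"p1": ["paragraph"], "p2": ["equation", "algorithm"]}
    print(rank_pages_by_rare_classes(preds, {"equation", "algorithm"}))

    print(to_yolo_line(2, (100, 200, 300, 400), 1000, 1000))
```

Ranking by a simple count of predicted rare-class objects is only one possible heuristic; the source describes the replacement samples as containing "classes with low accuracy" without specifying the selection rule.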