Figure Extraction from Scanned Electronic Theses and Dissertations

dc.contributor.author: Kahu, Sampanna Yashwant [en]
dc.contributor.committeechair: Fox, Edward A. [en]
dc.contributor.committeemember: Diehl, William J. [en]
dc.contributor.committeemember: Abbott, A. Lynn [en]
dc.contributor.department: Electrical and Computer Engineering [en]
dc.date.accessioned: 2020-09-30T08:00:25Z [en]
dc.date.available: 2020-09-30T08:00:25Z [en]
dc.date.issued: 2020-09-29 [en]
dc.description.abstract: The ability to extract figures and tables from scientific documents supports key use cases such as semantic parsing, summarization, and indexing. Although a few methods have been developed to extract figures and tables from scientific documents, their performance on scanned documents is considerably lower than on born-digital ones. To close this gap, we propose methods to effectively extract figures and tables from Electronic Theses and Dissertations (ETDs) that outperform existing methods by a considerable margin. Our contribution towards this goal is three-fold. (a) We propose a system/model for improving the performance of existing methods on scanned scientific documents for figure and table extraction. (b) We release a new dataset containing 10,182 labelled page-images spanning 70 scanned ETDs, with 3.3k manually annotated bounding boxes for figures and tables. (c) Lastly, we release our entire code and the trained model weights to enable further research (https://github.com/SampannaKahu/deepfigures-open). [en]
dc.description.abstractgeneral: Portable Document Format (PDF) is one of the most popular document formats. However, parsing PDF files is not a trivial task. One use case for parsing PDF files is the search functionality on websites hosting scholarly documents (e.g., IEEE Xplore). The ability to extract figures and tables from a scholarly document helps this use case, among others. Deep learning methods exist that extract figures from scholarly documents. However, a large number of scholarly documents, especially those published before the advent of computers, have been scanned from paper copies into PDF. In particular, we focus on scanned PDF versions of long documents, such as Electronic Theses and Dissertations (ETDs). No experiments have yet evaluated the efficacy of the above-mentioned methods on this scanned corpus. This work explores and attempts to improve the performance of these existing methods on scanned ETDs. A new gold standard dataset for figure extraction from scanned ETDs is created and released as part of this work. Finally, the entire source code and trained model weights are made open-source to aid further research in this field. [en]
dc.description.degree: Master of Science [en]
dc.format.medium: ETD [en]
dc.identifier.other: vt_gsexam:27273 [en]
dc.identifier.uri: http://hdl.handle.net/10919/100113 [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Figure Extraction [en]
dc.subject: Deep learning (Machine learning) [en]
dc.subject: Computer Vision [en]
dc.subject: Digital Libraries [en]
dc.title: Figure Extraction from Scanned Electronic Theses and Dissertations [en]
dc.type: Thesis [en]
thesis.degree.discipline: Computer Engineering [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: masters [en]
thesis.degree.name: Master of Science [en]

Files

Original bundle
Name: Kahu_SY_T_2020.pdf
Size: 6.89 MB
Format: Adobe Portable Document Format

Name: Kahu_SY_T_2020_support_5.pdf
Size: 6.27 MB
Format: Adobe Portable Document Format
Description: Supporting documents

Name: Kahu_SY_T_2020_support_3.zip
Size: 563.15 KB
Format:
Description: Supporting documents

Name: Kahu_SY_T_2020_support_4.pptx
Size: 10.04 MB
Format: Microsoft Powerpoint XML
Description: Supporting documents

Name: Kahu_SY_T_2020_support_1.pdf
Size: 36.97 KB
Format: Adobe Portable Document Format
Description: Supporting documents

Collections