Figure Extraction Website

This project aimed to extract figures from theses and dissertations, index them, and support searching of those figures. The Figure Extraction Website addresses the problem of users having to page through each PDF file to find figures that match their interests. Instead, curators or users upload PDF files from their computer and then search the extracted figures by keyword. Figures can be searched by their caption text or by the words within the figures. Two open source tools, PDFFigures2 and PDFPlumber, are used to extract figures from the PDF files. Elasticsearch is then used to index the figures, captions, and document metadata. The website is based on the ETDUI website, which was given to our group by our clients. ETDUI accepts keywords and returns PDF files that have the keyword in the title or in the summary portion of the file. To focus on our aims, we removed some features of ETDUI, including login, registration, advanced search, and voice search. We then added a file select button and an upload button so that users can easily upload PDF files. Currently, the website runs on localhost; the code can be cloned from the GitHub repository (https://github.com/JRReynosa/CS4624_Figure_Extraction_Website). Uploaded PDF files are stored locally, with path information shown on the website. Testing consisted of uploading a few PDF files and running keyword queries against them. Some problems remain, such as finding words or mathematical equations within the figures, as opposed to those within the captions.
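The indexing and search steps described above can be illustrated with a minimal Python sketch. It is not the project's code. It assumes each extracted figure is a small record with hypothetical fields (`figure_id`, `document_title`, `caption`, `figure_text`) and an index named `figures`, none of which are confirmed by the abstract. The sketch builds a newline-delimited JSON (ndjson) payload in the format accepted by the Elasticsearch bulk API, plus a query body that matches keywords against both the caption and the words inside the figure.

```python
import json


def to_bulk_ndjson(figures, index_name="figures"):
    """Serialize figure records as an ndjson body for the Elasticsearch bulk API.

    Each record becomes two lines: an action line naming the target index
    and document id, followed by the document itself. The index name and
    the record fields are assumptions made for this example.
    """
    lines = []
    for fig in figures:
        action = {"index": {"_index": index_name, "_id": fig["figure_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(fig))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def figure_query(keywords):
    """Build a query body that searches captions and in-figure text."""
    return {
        "query": {
            "multi_match": {
                "query": keywords,
                "fields": ["caption", "figure_text"],  # hypothetical field names
            }
        }
    }


# Hypothetical records standing in for the output of the extraction tools.
sample_figures = [
    {
        "figure_id": "etd1-fig1",
        "document_title": "Sample Thesis",
        "caption": "Figure 1: System architecture overview",
        "figure_text": "Upload Extract Index Search",
    },
    {
        "figure_id": "etd1-fig2",
        "document_title": "Sample Thesis",
        "caption": "Figure 2: Search results page",
        "figure_text": "Query Results",
    },
]

bulk_body = to_bulk_ndjson(sample_figures)
query_body = figure_query("architecture")
```

In this sketch, `bulk_body` would be sent to the `_bulk` endpoint of a running Elasticsearch instance and `query_body` to the index's `_search` endpoint; both requests are left out so the example runs without a server.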

Keywords

ETD, electronic theses and dissertations, thesis, dissertation, figure, extraction, Elasticsearch, PDFFigures2, PDFPlumber, PDF, Metadata, Figure Extraction, Python, ndjson, json, Haystack, sbt, scala, linux, kibana

Persistent link

http://hdl.handle.net/10919/109993

Collections

CS4624: Multimedia, Hypertext, and Information Access
