Figure Extraction Website

Reynosa, Jonathan; Min, Sa Hyun

dc.contributor.author	Reynosa, Jonathan	en
dc.contributor.author	Min, Sa Hyun	en
dc.date.accessioned	2022-05-11T13:36:17Z	en
dc.date.available	2022-05-11T13:36:17Z	en
dc.date.issued	2022-05-11	en
dc.description.abstract	This project aimed to extract figures from theses and dissertations, index them, and support searching over those figures. Figure Extraction Website addresses the problem of users having to go through each PDF file to find figures that match their interests. Instead, curators or users upload PDF files from their computer and then search by keyword. Figures can be searched by their caption text or by the words within the figures. Two open source tools, PDFFigures2 and PDFPlumber, are used to extract figures from the PDF files. Elasticsearch is then used to index the figures, captions, and document metadata. The website is built on the ETDUI website, which our clients provided to our group. ETDUI accepts keywords and returns the PDF files that have the keyword in the title or in the summary portion. To focus on our aims, we removed some features of ETDUI, including login, registration, advanced search, and voice search. We then added a file select button and an upload button so that users can easily upload PDF files. Currently, the website runs on localhost and can be cloned from the GitHub repository (https://github.com/JRReynosa/CS4624_Figure_Extraction_Website). Uploaded PDF files are stored locally, with path information shown in the website. Testing consisted of uploading a few PDF files and running keyword queries. Some problems remain, such as finding words or mathematical equations within the figures, as opposed to those within the captions.	en
dc.description.notes	Figure Extraction Website Presentation: the CS 4624 final presentation, detailing the project and the work accomplished over the semester. It is provided in both .pdf and .pptx formats. Figure Extraction Website Report: the CS 4624 final report, describing the project, its pipelines, and its complete implementation in detail. It is provided in both .pdf and .docx formats.	en
dc.description.sponsorship	IMLS LG-37-19-0078-19	en
dc.identifier.uri	http://hdl.handle.net/10919/109993	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	ETD	en
dc.subject	electronic theses and dissertations	en
dc.subject	thesis	en
dc.subject	dissertation	en
dc.subject	figure	en
dc.subject	extraction	en
dc.subject	Elasticsearch	en
dc.subject	PDFFigures2	en
dc.subject	PDFPlumber	en
dc.subject	PDF	en
dc.subject	Metadata	en
dc.subject	Figure Extraction	en
dc.subject	Python	en
dc.subject	ndjson	en
dc.subject	json	en
dc.subject	Haystack	en
dc.subject	sbt	en
dc.subject	scala	en
dc.subject	linux	en
dc.subject	kibana	en
dc.title	Figure Extraction Website	en
dc.type	Presentation	en
dc.type	Report	en