Figure Extraction Website

dc.contributor.author: Reynosa, Jonathan [en]
dc.contributor.author: Min, Sa Hyun [en]
dc.date.accessioned: 2022-05-11T13:36:17Z [en]
dc.date.available: 2022-05-11T13:36:17Z [en]
dc.date.issued: 2022-05-11 [en]
dc.description.abstract: This project aimed to extract figures from theses and dissertations, index them, and support searching of those figures. Figure Extraction Website addresses the problem of users having to go through each PDF file to find figures that match their interests. Instead, it allows curators or users to upload PDF files from their computer and then search them with keyword queries. Figures can be searched by their caption text or by the words within the figures. Two open source tools, PDFFigures2 and PDFPlumber, are used to extract figures from the PDF files. Elasticsearch is then used to index the figures, captions, and document metadata. The website is built on the ETDUI website, which was given to our group by our clients. ETDUI accepts keywords and returns PDF files that have the keyword in the title or in the summary portion of the file. To focus on our aims, we removed some features of ETDUI, including login, register, advanced search, and voice search. Our group then added features, including a file select button and an upload button, so that users can easily upload PDF files. Currently, the website runs on localhost from a clone of the GitHub repository (https://github.com/JRReynosa/CS4624_Figure_Extraction_Website). Uploaded PDF files are stored locally, with path information given in the website. Testing consisted of uploading a few PDF files and searching with keyword queries. Problems remain, such as finding words or mathematical equations within the figures, as opposed to those within the captions. [en]
dc.description.notes: Figure Extraction Website Presentation: CS 4624 Final Figure Extraction Website Presentation detailing the project and work accomplished over the semester. This has been provided in both a .pdf and .pptx format. Figure Extraction Website Report: CS 4624 Final Figure Extraction Website Report shows the project, the pipelines in the project, and its complete implementation in detail. This has been provided in both a .pdf and .docx format. [en]
dc.description.sponsorship: IMLS LG-37-19-0078-19 [en]
dc.identifier.uri: http://hdl.handle.net/10919/109993 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Attribution 4.0 International [en]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [en]
dc.subject: ETD [en]
dc.subject: electronic theses and dissertations [en]
dc.subject: thesis [en]
dc.subject: dissertation [en]
dc.subject: figure [en]
dc.subject: extraction [en]
dc.subject: Elasticsearch [en]
dc.subject: PDFFigures2 [en]
dc.subject: PDFPlumber [en]
dc.subject: PDF [en]
dc.subject: Metadata [en]
dc.subject: Figure Extraction [en]
dc.subject: Python [en]
dc.subject: ndjson [en]
dc.subject: json [en]
dc.subject: Haystack [en]
dc.subject: sbt [en]
dc.subject: scala [en]
dc.subject: linux [en]
dc.subject: kibana [en]
dc.title: Figure Extraction Website [en]
dc.type: Presentation [en]
dc.type: Report [en]

Files

Original bundle
Name:
FigureExtractionWebsiteReport.docx
Size:
996.98 KB
Format:
Microsoft Word XML
Name:
FigureExtractionWebsiteReport.pdf
Size:
1017.77 KB
Format:
Adobe Portable Document Format
Name:
FigureExtractionWebsitePresentation.pptx
Size:
11.47 MB
Format:
Microsoft Powerpoint XML
Name:
FigureExtractionWebsitePresentation.pdf
Size:
1.77 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission
Description: