Figure Extraction Website
dc.contributor.author | Reynosa, Jonathan | en |
dc.contributor.author | Min, Sa Hyun | en |
dc.date.accessioned | 2022-05-11T13:36:17Z | en |
dc.date.available | 2022-05-11T13:36:17Z | en |
dc.date.issued | 2022-05-11 | en |
dc.description.abstract | This project aimed to extract figures from theses and dissertations, index them, and support searching of those figures. Figure Extraction Website addresses the problem of users having to go through each PDF file to find figures that match their interests. Instead, it allows curators or users to upload PDF files from their computer and then search the extracted figures by keyword. Figures can be searched by their caption text or by the words within the figures. Two open source tools, PDFFigures2 and PDFPlumber, are used to extract figures from the PDF files. Elasticsearch is then used to index the figures, captions, and document metadata. The website is based on the ETDUI website, which was given to our group by our clients. ETDUI accepts keywords and returns PDF files that have the keyword in the title or in the summary portion of the file. To focus on our aims, we removed some features of ETDUI, including login, register, advanced search, and voice search. Our group then added features, including a file select button and an upload button, so that users can easily upload PDF files. Currently, the website runs on localhost, and its code can be cloned from the GitHub repository (https://github.com/JRReynosa/CS4624_Figure_Extraction_Website). Uploaded PDF files are stored locally, with path information shown in the website. Testing consisted of uploading a few PDF files and running keyword queries against them. Some problems remain, such as finding words or mathematical equations within the figures, as opposed to those within the captions. | en |
dc.description.notes | Figure Extraction Website Presentation: the CS 4624 final presentation, detailing the project and the work accomplished over the semester. Provided in both .pdf and .pptx formats. Figure Extraction Website Report: the CS 4624 final report, describing the project, the pipelines in the project, and its complete implementation in detail. Provided in both .pdf and .docx formats. | en |
dc.description.sponsorship | IMLS LG-37-19-0078-19 | en |
dc.identifier.uri | http://hdl.handle.net/10919/109993 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Attribution 4.0 International | en |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en |
dc.subject | ETD | en |
dc.subject | electronic theses and dissertations | en |
dc.subject | thesis | en |
dc.subject | dissertation | en |
dc.subject | figure | en |
dc.subject | extraction | en |
dc.subject | Elasticsearch | en |
dc.subject | PDFFigures2 | en |
dc.subject | PDFPlumber | en |
dc.subject | Metadata | en |
dc.subject | Figure Extraction | en |
dc.subject | Python | en |
dc.subject | ndjson | en |
dc.subject | json | en |
dc.subject | Haystack | en |
dc.subject | sbt | en |
dc.subject | scala | en |
dc.subject | linux | en |
dc.subject | kibana | en |
dc.title | Figure Extraction Website | en |
dc.type | Presentation | en |
dc.type | Report | en |
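
The indexing step described in the abstract (figures, captions, and document metadata going into Elasticsearch) can be illustrated with a minimal sketch. It assumes each extracted figure has already been reduced to a small Python dictionary; the field names (`etd_id`, `name`, `page`, `caption`, `image_text`), the index name `figures`, and the id scheme are hypothetical and are not taken from the project's code. Only the two-line action/source layout of the Elasticsearch bulk format is standard.

```python
import json


def to_bulk_ndjson(figures, index_name="figures"):
    """Serialize figure records as an Elasticsearch bulk-API NDJSON payload.

    Each record becomes two lines: an action line naming the target index
    and document id, then the document source itself.
    """
    lines = []
    for fig in figures:
        # Hypothetical id scheme: ETD identifier plus the figure label.
        doc_id = f"{fig['etd_id']}-{fig['name']}"
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({
            "etd_id": fig["etd_id"],
            "page": fig.get("page"),
            "caption": fig.get("caption", ""),
            # Words found inside the figure, joined so they are searchable as text.
            "image_text": " ".join(fig.get("image_text", [])),
        }))
    # The bulk API expects every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up sample records, for illustration only.
sample = [
    {"etd_id": "etd-001", "name": "1", "page": 12,
     "caption": "Figure 1: Training accuracy per epoch.",
     "image_text": ["accuracy", "epoch"]},
    {"etd_id": "etd-001", "name": "2", "page": 15,
     "caption": "Figure 2: System architecture."},
]

payload = to_bulk_ndjson(sample)
print(payload, end="")
```

A payload of this shape could then be posted to an Elasticsearch `_bulk` endpoint with the `application/x-ndjson` content type. Keeping the caption and the in-figure words in separate fields would let a search target either one, matching the two search modes the abstract mentions.
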
Files
Original bundle
1 - 4 of 4
- Name:
- FigureExtractionWebsiteReport.pdf
- Size:
- 1017.77 KB
- Format:
- Adobe Portable Document Format
- Name:
- FigureExtractionWebsitePresentation.pdf
- Size:
- 1.77 MB
- Format:
- Adobe Portable Document Format
License bundle
1 - 1 of 1
- Name:
- license.txt
- Size:
- 1.5 KB
- Format:
- Item-specific license agreed upon to submission