Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019

dc.contributor.author: Kaushal, Kulendra Kumar [en]
dc.contributor.author: Kulkarni, Rutwik [en]
dc.contributor.author: Sumant, Aarohi [en]
dc.contributor.author: Wang, Chaoran [en]
dc.contributor.author: Yuan, Chenhan [en]
dc.contributor.author: Yuan, Liling [en]
dc.date.accessioned: 2020-01-17T18:03:10Z [en]
dc.date.available: 2020-01-17T18:03:10Z [en]
dc.date.issued: 2019-12-23 [en]
dc.description.abstract: The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations (ETDs) maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries' VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work should help future researchers study the processed natural-language data using deep learning technologies and address challenges such as extracting information from ETDs. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used deep learning and other tools such as GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; segmenting documents; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. Finally, we developed a system that automates all of these tasks. The system also validates the output metadata, helping to ensure the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch. [en]
dc.description.notes: CMEreport.pdf - primary report document; CMEreport.zip - LaTeX project used to generate the report; CMEpresentation.pdf - final presentation PDF; CMEpresentation.pptx - final presentation as PowerPoint; CMEsourceCode.zip - source code archive [en]
dc.description.sponsorship: IMLS: LG-37-19-0078-19. [en]
dc.identifier.uri: http://hdl.handle.net/10919/96484 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/ [en]
dc.subject: Collection Management [en]
dc.subject: Electronic Theses and Dissertations [en]
dc.subject: Metadata Extraction [en]
dc.subject: Text Preprocessing [en]
dc.subject: Automation Suite [en]
dc.title: Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019 [en]
dc.type: Presentation [en]
dc.type: Report [en]

Files

Original bundle
Name:
CMEsourceCode.zip
Size:
8.64 MB
Format:
Name:
CMEpresentation.pdf
Size:
2.2 MB
Format:
Adobe Portable Document Format
Name:
CMEpresentation.pptx
Size:
3.61 MB
Format:
Microsoft Powerpoint XML
Name:
CMEreport.pdf
Size:
1.87 MB
Format:
Adobe Portable Document Format
Name:
CMEreport.zip
Size:
2.2 MB
Format:
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission