Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019

Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling

dc.contributor.author	Kaushal, Kulendra Kumar	en
dc.contributor.author	Kulkarni, Rutwik	en
dc.contributor.author	Sumant, Aarohi	en
dc.contributor.author	Wang, Chaoran	en
dc.contributor.author	Yuan, Chenhan	en
dc.contributor.author	Yuan, Liling	en
dc.date.accessioned	2020-01-17T18:03:10Z	en
dc.date.available	2020-01-17T18:03:10Z	en
dc.date.issued	2019-12-23	en
dc.description.abstract	The class "CS 5604: Information Storage and Retrieval" in the fall of 2019 was divided into six teams to enhance the usability of the corpus of electronic theses and dissertations (ETDs) maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 master's theses from Virginia Tech University Libraries’ VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing of a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The results of our work should help future researchers study the processed natural-language data using deep learning technologies and address the challenges of extracting information from ETDs. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools, such as GROBID, to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; segmenting documents; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. Finally, we developed a system that automates all of the above tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.	en
dc.description.notes	CMEreport.pdf - primary report document; CMEreport.zip - LaTeX project used to generate the report; CMEpresentation.pdf - final presentation PDF; CMEpresentation.pptx - final presentation as PowerPoint; CMEsourceCode.zip - source code archive	en
dc.description.sponsorship	IMLS: LG-37-19-0078-19.	en
dc.identifier.uri	http://hdl.handle.net/10919/96484	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	en
dc.subject	Collection Management	en
dc.subject	Electronic Theses and Dissertations	en
dc.subject	Metadata Extraction	en
dc.subject	Text Preprocessing	en
dc.subject	Automation Suite	en
dc.title	Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019	en
dc.type	Presentation	en
dc.type	Report	en