Browsing by Author "Kaushal, Kulendra Kumar"
Now showing 1 - 2 of 2
Results Per Page
Sort Options
- Collection Management of Electronic Theses and Dissertations (CME) CS5604 Fall 2019Kaushal, Kulendra Kumar; Kulkarni, Rutwik; Sumant, Aarohi; Wang, Chaoran; Yuan, Chenhan; Yuan, Liling (Virginia Tech, 2019-12-23)The class ``CS 5604: Information Storage and Retrieval'' in the fall of 2019 is divided into six teams to enhance the usability of the corpus of electronic theses and dissertations maintained by Virginia Tech University Libraries. The ETD corpus consists of 14,055 doctoral dissertations and 19,246 masters theses from Virginia Tech University Libraries’ VTechWorks system. Our study explored document collection and processing, application of Elasticsearch to the collection to facilitate searching, testing a custom front-end, Kibana, integration, implementation, text analytics, and machine learning. The result of our work would help future researchers study the natural language processed data using deep learning technologies, address the challenges of extracting information from ETDs, etc. The Collection Management of Electronic Theses and Dissertations (CME) team was responsible for processing all PDF files from the ETD corpus and extracting well-formatted text files from them. We also used advanced deep learning and other tools like GROBID to process metadata, obtain text documents, and generate chapter-wise data. In this project, the CME team completed the following steps: comparing different parsers; doing document segmentation; preprocessing the data; and specifying, extracting, and preparing metadata and auxiliary information for indexing. We finally developed a system that automates all the above-mentioned tasks. The system also validates the output metadata, thereby ensuring the correctness of the data that flows through the entire system developed by the class. This system, in turn, helps to ingest new documents into Elasticsearch.
- Information Extraction of Technical Details From Scholarly ArticlesKaushal, Kulendra Kumar (Virginia Tech, 2021-06-16)Researchers have made significant progress in information extraction from short documents in the last few years, including social media interaction, news articles, and email excerpts. This research aims to extract technical entities like hardware resources, computing platforms, compute time, programming language, and libraries from scholarly research articles. Research articles are generally long documents having both salient as well as non-salient entities. Analyzing the cross-sectional relation, filtering the relevant information, measuring the saliency of mentioned entities, and extracting novel entities are some of the technical challenges involved in this research. This work presents a detailed study about the performance, effectiveness, and scalability of rule-based weakly supervised algorithms. We also develop an automated end-to-end Research Entity and Relationship Extractor (E2R Extractor). Additionally, we perform a comprehensive study about the effectiveness of existing deep learning-based information extraction tools like Dygie, Dygie++, SciREX. The research also contributes a dataset containing novel entities annotated in BILUO format and represents the baseline results using the E2R extractor on the proposed dataset. The results indicate that the E2R extractor successfully extracts salient entities from research articles.