Team 4: Segmentation, Summarization, and Classification
Abstract
Under the guidance of Dr. Edward A. Fox, the CS5604 class at Virginia Tech in the Fall 2022 semester was assigned the task of building an Information Retrieval and Analysis System that can support a collection of at least 5000 Electronic Theses and Dissertations (ETDs). The system was to act as a search engine, supporting features such as searching, providing recommendations, ranking search results, and browsing. To achieve this, the class was divided into five teams, each assigned separate tasks, with the intent of collaborating through CI/CD. The team roles were as follows: Content and Representation, End-user Recommendation and Search, Object Detection and Topic Models, Classification and Summarization with Language Models, and Integration and Coordination. This report outlines the contributions of Team 4, which focused on language models, classification, summarization, and segmentation. In this project, Team 4 successfully reproduced Akbar Javaid Manzoor’s pipeline to segment ETDs into chapters, summarize the segmented chapters using extractive and abstractive summarization techniques, and classify the chapters using deep learning and language models. Using the APIs developed by Team 1, Team 4 was also tasked with storing the outcomes for 5000 ETDs in the file system and database. Team 4 containerized the services and assisted Team 5 with workflow automation. The project’s main lessons were effective team collaboration, efficient code maintenance, containerization of services, upkeep of a CI/CD workflow, and effective information storage and retrieval at scale. The report describes the goals, tasks, and achievements of Team 4, along with its coordination with the other teams in completing the higher-level tasks of the overall project.