Team 3: Object Detection and Topic Modeling (Fall 2023)

Abstract

Under the guidance of Dr. Edward A. Fox, the CS 5604: Information Storage and Retrieval class (Fall 2023) was tasked with developing a state-of-the-art information retrieval system for Electronic Theses and Dissertations (ETDs). We applied learning algorithms to a large ETD collection to classify closely related documents. The project's overarching objective is to enhance the existing service, which enables users to upload, search, and retrieve ETDs along with their associated digital objects in a human-readable format. Our team's specific assignment was to use object detection and topic modeling to analyze documents, and thereby help build a system that supports searching and retrieving documents by topic and by user-defined digital objects, and that enables experimenters to conduct further research into objects and topics. To this end, we implemented object detection on 200 segmented ETDs, and topic modeling using BERTopic (BERT embeddings) and LDA (Latent Dirichlet Allocation) on nearly 334k ETDs. The object detection and topic modeling pipelines were modified to use APIs (Application Programming Interfaces) to populate the ETD-related database tables. Each ETD page is converted into an image and stored in the file system, with a corresponding entry in the database. All detected objects are likewise stored in both the database and the file system. The generated XML files now include an object ID for each detected object, which facilitates capturing structural relationships with knowledge graphs (Team 1). We also worked on improving chapter segmentation in the XML files by experimenting with the LLaMA 2 model, the ResNet model, and clustering approaches to accurately identify the start and end pages of chapters. The topic modeling results from BERTopic were not satisfactory, which led us to explore LDA; switching to LDA produced promising outputs. The topics generated with LDA were refined using various pre-processing techniques and given to Team 6 for use in the sign-up page, and to Team 2 for indexing.

Description

CS5604-team3-Final Presentation.pdf - PDF version of the presentation of the work done by Team 3 in the CS5604 Fall 2023 semester.
CS5604-team3-Final Presentation.pptx - Presentation of the work done by Team 3 in the CS5604 Fall 2023 semester.
Final_report__Team3.pdf - PDF version of the final report of the work done by Team 3 in the CS5604 Fall 2023 semester.
Final report- Team3.zip - ZIP version of the final report of Team 3 on Overleaf.

Keywords

Topic Modeling, Object Detection, Information Storage and Retrieval, CS5604

Citation