Browsing by Author "Sahana Bhaskar"
- Team 3: Object Detection and Topic Modeling (Fall 2023)
  Amr Ahmed Aboelnaga; Anushka Sivakumar; Jayanth Narla; Pradyumna Upendra Dasu; Ragul Seetharaman; Sahana Bhaskar; Shankar Srinidhi Srinivas (2024-01-08)
  Under the guidance of Dr. Edward A. Fox, the CS 5604: Information Storage and Retrieval class (Fall 2023) was tasked with developing a cutting-edge information retrieval system for Electronic Theses and Dissertations (ETDs). We used learning algorithms on a large ETD collection to classify closely related documents. The project's overarching objective was to enhance the existing service, which enables users to upload, search, and retrieve ETDs along with their associated digital objects in a human-readable format. Our team's specific assignment was to use object detection and topic modeling to analyze documents, and thereby to help build a system that supports searching and retrieving documents by topic and by user-defined digital objects, and that enables experimenters to conduct further research into objects and topics. To this end, we implemented object detection on 200 segmented ETDs, and topic modeling using BERTopic (BERT embeddings) and LDA (Latent Dirichlet Allocation) on nearly 334,000 ETDs. The object detection and topic modeling pipelines were modified to use APIs (Application Programming Interfaces) to populate the database tables related to ETDs. Each ETD page is converted into an image and stored in the file system, with a corresponding entry in the database. All detected objects are likewise stored in both the database and the file system. The generated XML files now include an object ID for each detected object, which facilitates the capture of structural relationships using knowledge graphs (Team 1). We also worked on improving chapter segmentation in the XML files. This involved exploring and experimenting with the LLaMA 2 model, the ResNet model, and clustering approaches to accurately identify the start and end pages of chapters. The topic modeling results from BERTopic were not satisfactory, which led us to explore the LDA model. Switching to LDA produced promising outputs. The topics generated using LDA were refined using various pre-processing techniques and given to Team 6 for use in the sign-up page, and to Team 2 for indexing.