Team 3: Object Detection and Topic Modeling (Fall 2023)

dc.contributor.author: Amr Ahmed Aboelnaga (en)
dc.contributor.author: Anushka Sivakumar (en)
dc.contributor.author: Jayanth Narla (en)
dc.contributor.author: Pradyumna Upendra Dasu (en)
dc.contributor.author: Ragul Seetharaman (en)
dc.contributor.author: Sahana Bhaskar (en)
dc.contributor.author: Shankar Srinidhi Srinivas (en)
dc.date.accessioned: 2024-04-25T02:40:28Z (en)
dc.date.available: 2024-04-25T02:40:28Z (en)
dc.date.issued: 2024-01-08 (en)
dc.description: CS5604-team3-Final Presentation.pdf - PDF version of the presentation of the work done by Team 3 in the CS5604 Fall 2023 semester. CS5604-team3-Final Presentation.pptx - Presentation of the work done by Team 3 in the CS5604 Fall 2023 semester. Final_report__Team3.pdf - PDF version of the final report of the work done by Team 3 in the CS5604 Fall 2023 semester. Final report- Team3.zip - ZIP version of the Team 3 final report from Overleaf. (en)
dc.description.abstract: Under the guidance of Dr. Edward A. Fox, the CS 5604: Information Storage and Retrieval class (Fall 2023) was tasked with developing an information retrieval system for Electronic Theses and Dissertations (ETDs). We applied machine learning algorithms to a large ETD collection to classify closely related documents. The project's overarching objective is to enhance the existing service, which enables users to upload, search, and retrieve ETDs along with their associated digital objects in a human-readable format. Our team's specific assignment was to use object detection and topic modeling to analyze documents, and thereby help build a system that supports searching and retrieving documents by topic and by user-defined digital objects, and that enables experimenters to conduct further research into objects and topics. To this end, we implemented object detection on 200 segmented ETDs, and topic modeling using BERTopic (BERT embeddings) and LDA (Latent Dirichlet Allocation) on nearly 334,000 ETDs. The object detection and topic modeling pipelines were modified to use APIs (Application Programming Interfaces) to populate the database tables related to ETDs. Each ETD page is converted into an image and stored in the file system, with a corresponding entry in the database. All detected objects are likewise stored in both the database and the file system. The generated XML files now include an object ID for each detected object, which facilitates capturing structural relationships using knowledge graphs (Team 1). We also worked on improving chapter segmentation in the XML files, experimenting with the LLaMA 2 model, the ResNet model, and clustering approaches to identify the start and end pages of chapters accurately. The topic modeling results from BERTopic were not satisfactory, which led us to explore the LDA model. Switching to LDA produced promising outputs. The topics generated using LDA were refined using various pre-processing techniques and given to Team 6 for use in the sign-up page, and to Team 2 for indexing. (en)
dc.identifier.uri: https://hdl.handle.net/10919/118665 (en)
dc.language.iso: en_US (en)
dc.rights: Attribution-NonCommercial 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/ (en)
dc.subject: Topic Modeling (en)
dc.subject: Object Detection (en)
dc.subject: Information Storage Retrieval (en)
dc.subject: CS5604 (en)
dc.title: Team 3: Object Detection and Topic Modeling (Fall 2023) (en)
dc.type: Technical Report (en)
dc.type: Presentation (en)

Files

Original bundle

Name: CS5604-team3-Final Presentation.pdf
Size: 4.7 MB
Format: Adobe Portable Document Format

Name: CS5604-team3-Final Presentation.pptx
Size: 4.88 MB
Format: Microsoft Powerpoint XML

Name: Final report- Team3.zip
Size: 7.38 MB

Name: Final_report__Team3.pdf
Size: 7.76 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission