Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models

Dasu, Pradyumna Upendra

dc.contributor.author	Dasu, Pradyumna Upendra	en
dc.contributor.committeechair	Fox, Edward A.	en
dc.contributor.committeemember	Wang, Xuan	en
dc.contributor.committeemember	Chen, Yinlin	en
dc.contributor.department	Computer Science & Applications	en
dc.date.accessioned	2025-01-11T09:00:37Z	en
dc.date.available	2025-01-11T09:00:37Z	en
dc.date.issued	2025-01-10	en
dc.description.abstract	Digital libraries hold vast and diverse content, with electronic theses and dissertations (ETDs) being among the most diverse. ETDs span multiple disciplines and include unique terminology, which makes it challenging to achieve clear and coherent topic representations. Existing topic modeling techniques often struggle with such heterogeneous collections, leaving a gap in providing interpretable and meaningful topic labels. This thesis addresses these challenges through a three-step framework designed to improve topic modeling outcomes for ETD metadata. First, we developed a custom preprocessing pipeline to enhance data quality and ensure consistency in text analysis. Second, we applied and optimized multiple topic modeling techniques to uncover latent themes, including LDA, ProdLDA, NeuralLDA, Contextualized Topic Models, and BERTopic. Finally, we integrated Large Language Models (LLMs), such as GPT-4, using prompt engineering to augment traditional topic models, refining and interpreting their outputs without replacing them. The framework was tested on a large corpus of ETD metadata, following preliminary testing on a small subset. Quantitative metrics and user studies were used to evaluate performance, focusing on the clarity, accuracy, and relevance of the generated topics. The results demonstrated significant improvements in topic coherence and interpretability, with user study participants highlighting the value of the enhanced representations. These findings underscore the potential of combining customized preprocessing, advanced topic modeling, and LLM-driven refinements to better represent themes in complex collections like ETDs, providing a foundation for downstream tasks such as searching, browsing, and recommendation.	en
dc.description.abstractgeneral	Digital libraries store vast information, including books, research papers, and electronic theses and dissertations (ETDs). ETDs are incredibly diverse, covering most academic fields and using highly specialized language. This diversity makes it challenging to create clear and meaningful summaries of the main themes within these collections. Our study addresses this challenge by developing a three-step framework and applying it to ETDs. First, we cleaned and standardized the data to make it easier to analyze. Second, we used advanced techniques to uncover patterns and group similar topics together. Finally, we improved these topics using powerful tools like GPT-4, which helped make the themes more precise, more accurate, and easier to interpret. We tested this framework on both a small and a large collection of ETDs. Combining quantitative evaluations and user feedback showed that our methods significantly improved how the topics represented the content. This work lays the foundation for more effective future tools to help people search, explore, and navigate large collections of academic works.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:42363	en
dc.identifier.uri	https://hdl.handle.net/10919/124154	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Topic Modeling	en
dc.subject	Natural Language Processing	en
dc.subject	Large Language Models	en
dc.subject	Electronic Theses and Dissertations	en
dc.subject	Digital Libraries	en
dc.subject	Information Storage and Retrieval	en
dc.subject	Artificial Intelligence	en
dc.subject	Search and Recommendation	en
dc.title	Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en
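
The three-step framework summarized in the abstract (preprocessing, topic modeling, LLM-based label refinement) can be illustrated with a minimal sketch. This is not the thesis's implementation: the stopword list, the function names, and the prompt wording are all hypothetical, a simple word-frequency count stands in for the topic models named in the abstract, and no LLM is actually called.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a fuller, domain-tuned one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on",
             "with", "is", "are", "this", "that", "we"}


def preprocess(text):
    """Step 1 (assumed form): lowercase, keep alphabetic tokens,
    drop stopwords and tokens shorter than three characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def top_words(docs, k=5):
    """Step 2 stand-in: the k most frequent tokens across a group of
    documents. This is NOT a topic model such as LDA or BERTopic; it only
    produces the kind of keyword list such models output."""
    counts = Counter(tok for doc in docs for tok in preprocess(doc))
    return [word for word, _ in counts.most_common(k)]


def build_label_prompt(words):
    """Step 3 (assumed form): compose a prompt asking an LLM for a readable
    label. The prompt text is illustrative; no API call is made here."""
    return ("These keywords describe one topic found in a collection of "
            "theses: " + ", ".join(words)
            + ". Suggest a short, human-readable topic label.")


if __name__ == "__main__":
    abstracts = [
        "Neural topic models for digital libraries",
        "Topic coherence in heterogeneous collections",
    ]
    print(build_label_prompt(top_words(abstracts, k=3)))
```

In a fuller pipeline, the keyword list would come from a trained topic model and the prompt would be sent to an LLM, whose answer would supplement the keyword-based topic representation without replacing it.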

Files

Original bundle

Name: Dasu_P_T_2025.pdf
Size: 1.85 MB
Format: Adobe Portable Document Format

Collections

Masters Theses