Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models
dc.contributor.author | Dasu, Pradyumna Upendra | en |
dc.contributor.committeechair | Fox, Edward A. | en |
dc.contributor.committeemember | Wang, Xuan | en |
dc.contributor.committeemember | Chen, Yinlin | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-01-11T09:00:37Z | en |
dc.date.available | 2025-01-11T09:00:37Z | en |
dc.date.issued | 2025-01-10 | en |
dc.description.abstract | Digital libraries hold vast and diverse content, with electronic theses and dissertations (ETDs) being among the most diverse. ETDs span multiple disciplines and include unique terminology, making it challenging to achieve clear and coherent topic representations. Existing topic modeling techniques often struggle with such heterogeneous collections, leaving a gap in providing interpretable and meaningful topic labels. This thesis addresses these challenges through a three-step framework designed to improve topic modeling outcomes for ETD metadata. First, we developed a custom preprocessing pipeline to enhance data quality and ensure consistency in text analysis. Second, we applied and optimized multiple topic modeling techniques to uncover latent themes, including LDA, ProdLDA, NeuralLDA, Contextualized Topic Models, and BERTopic. Finally, we integrated Large Language Models (LLMs), such as GPT-4, using prompt engineering to augment traditional topic models, refining and interpreting their outputs without replacing them. The framework was tested on a large corpus of ETD metadata, following preliminary testing on a small subset. Quantitative metrics and user studies were used to evaluate performance, focusing on the clarity, accuracy, and relevance of the generated topics. The results demonstrated significant improvements in topic coherence and interpretability, with user study participants highlighting the value of the enhanced representations. These findings underscore the potential of combining customized preprocessing, advanced topic modeling, and LLM-driven refinements to better represent themes in complex collections like ETDs, providing a foundation for downstream tasks such as searching, browsing, and recommendation. | en |
dc.description.abstractgeneral | Digital libraries store vast information, including books, research papers, and electronic theses and dissertations (ETDs). ETDs are incredibly diverse, covering most academic fields and using highly specialized language. This diversity makes it challenging to create clear and meaningful summaries of the main themes within these collections. Our study addresses this challenge by developing a three-step framework and applying it to ETDs. First, we cleaned and standardized the data to make it easier to analyze. Second, we used advanced techniques to uncover patterns and group similar topics together. Finally, we improved these topics using powerful tools like GPT-4, which helped make the themes more precise, more accurate, and easier to interpret. We tested this framework on both a small and a large collection of ETDs. Combining quantitative evaluations and user feedback showed that our methods significantly improved how the topics represented the content. This work lays the foundation for more effective future tools to help people search, explore, and navigate large collections of academic works. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:42363 | en |
dc.identifier.uri | https://hdl.handle.net/10919/124154 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution 4.0 International | en |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en |
dc.subject | Topic Modeling | en |
dc.subject | Natural Language Processing | en |
dc.subject | Large Language Models | en |
dc.subject | Electronic Theses and Dissertations | en |
dc.subject | Digital Libraries | en |
dc.subject | Information Storage and Retrieval | en |
dc.subject | Artificial Intelligence | en |
dc.subject | Search and Recommendation | en |
dc.title | Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |