Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models

dc.contributor.author: Dasu, Pradyumna Upendra (en)
dc.contributor.committeechair: Fox, Edward A. (en)
dc.contributor.committeemember: Wang, Xuan (en)
dc.contributor.committeemember: Chen, Yinlin (en)
dc.contributor.department: Computer Science & Applications (en)
dc.date.accessioned: 2025-01-11T09:00:37Z (en)
dc.date.available: 2025-01-11T09:00:37Z (en)
dc.date.issued: 2025-01-10 (en)
dc.description.abstract: Digital libraries hold vast and diverse content, with electronic theses and dissertations (ETDs) being among the most diverse. ETDs span multiple disciplines and include unique terminology, making achieving clear and coherent topic representations challenging. Existing topic modeling techniques often struggle with such heterogeneous collections, leaving a gap in providing interpretable and meaningful topic labels. This thesis addresses these challenges through a three-step framework designed to improve topic modeling outcomes for ETD metadata. First, we developed a custom preprocessing pipeline to enhance data quality and ensure consistency in text analysis. Second, we applied and optimized multiple topic modeling techniques to uncover latent themes, including LDA, ProdLDA, NeuralLDA, Contextualized Topic Models, and BERTopic. Finally, we integrated Large Language Models (LLMs), such as GPT-4, using prompt engineering to augment traditional topic models, refining and interpreting their outputs without replacing them. The framework was tested on a large corpus of ETD metadata, including through preliminary testing on a small subset. Quantitative metrics and user studies were used to evaluate performance, focusing on the clarity, accuracy, and relevance of the generated topics. The results demonstrated significant improvements in topic coherence and interpretability, with user study participants highlighting the value of the enhanced representations. These findings underscore the potential of combining customized preprocessing, advanced topic modeling, and LLM-driven refinements to better represent themes in complex collections like ETDs, providing a foundation for downstream tasks such as searching, browsing, and recommendation. (en)
dc.description.abstractgeneral: Digital libraries store vast information, including books, research papers, and electronic theses and dissertations (ETDs). ETDs are incredibly diverse, covering most academic fields and using highly specialized language. This diversity makes it challenging to create clear and meaningful summaries of the main themes within these collections. Our study addresses this challenge by developing a three-step framework and applying it to ETDs. First, we cleaned and standardized the data to make it easier to analyze. Second, we used advanced techniques to uncover patterns and group similar topics together. Finally, we improved these topics using powerful tools like GPT-4, which helped make the themes more precise, more accurate, and easier to interpret. We tested this framework on both a small and a large collection of ETDs. Combining quantitative evaluations and user feedback showed that our methods significantly improved how the topics represented the content. This work lays the foundation for more effective future tools to help people search, explore, and navigate large collections of academic works. (en)
dc.description.degree: Master of Science (en)
dc.format.medium: ETD (en)
dc.identifier.other: vt_gsexam:42363 (en)
dc.identifier.uri: https://hdl.handle.net/10919/124154 (en)
dc.language.iso: en (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ (en)
dc.subject: Topic Modeling (en)
dc.subject: Natural Language Processing (en)
dc.subject: Large Language Models (en)
dc.subject: Electronic Theses and Dissertations (en)
dc.subject: Digital Libraries (en)
dc.subject: Information Storage and Retrieval (en)
dc.subject: Artificial Intelligence (en)
dc.subject: Search and Recommendation (en)
dc.title: Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models (en)
dc.type: Thesis (en)
thesis.degree.discipline: Computer Science & Applications (en)
thesis.degree.grantor: Virginia Polytechnic Institute and State University (en)
thesis.degree.level: masters (en)
thesis.degree.name: Master of Science (en)

Files

Original bundle
Name:
Dasu_P_T_2025.pdf
Size:
1.85 MB
Format:
Adobe Portable Document Format