Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models
Abstract
Digital libraries hold vast and diverse content, and electronic theses and dissertations (ETDs) are among the most heterogeneous. ETDs span multiple disciplines and use specialized terminology, making it challenging to achieve clear and coherent topic representations. Existing topic modeling techniques often struggle with such heterogeneous collections, leaving a gap in interpretable and meaningful topic labels. This thesis addresses these challenges through a three-step framework designed to improve topic modeling outcomes for ETD metadata. First, we developed a custom preprocessing pipeline to enhance data quality and ensure consistency in text analysis. Second, we applied and optimized multiple topic modeling techniques to uncover latent themes, including LDA, ProdLDA, NeuralLDA, Contextualized Topic Models, and BERTopic. Third, we integrated Large Language Models (LLMs), such as GPT-4, using prompt engineering to refine and interpret the outputs of the traditional models rather than replace them. The framework was evaluated on a large corpus of ETD metadata, following preliminary testing on a small subset. Performance was assessed with quantitative metrics and user studies, focusing on the clarity, accuracy, and relevance of the generated topics. The results showed significant improvements in topic coherence and interpretability, with user study participants highlighting the value of the enhanced representations. These findings underscore the potential of combining customized preprocessing, advanced topic modeling, and LLM-driven refinement to better represent themes in complex collections such as ETDs, providing a foundation for downstream tasks such as searching, browsing, and recommendation.
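The abstract describes the third step, LLM-based refinement via prompt engineering, only at a high level. As a minimal illustrative sketch (the function name and prompt wording are assumptions for illustration, not taken from the thesis), the prompt for labeling one topic from a traditional model's top words might be constructed like this:

```python
# Sketch of step three: asking an LLM (e.g., GPT-4) to produce a
# human-readable label for a topic found by LDA/BERTopic.
# Prompt wording and naming are illustrative assumptions.

def build_label_prompt(top_words, n_candidates=1):
    """Build a prompt that asks an LLM to label a topic, given the
    topic's highest-probability words from a traditional model."""
    words = ", ".join(top_words)
    return (
        "You are labeling topics discovered in a corpus of electronic "
        "theses and dissertations (ETDs).\n"
        f"Top words for this topic: {words}\n"
        f"Propose {n_candidates} short, human-readable label(s) that "
        "capture the theme. Reply with the label(s) only."
    )

# Example: top words a topic model might emit for one topic.
prompt = build_label_prompt(["network", "protocol", "latency", "routing"])
print(prompt)
```

The returned string would then be sent to the LLM through its chat API, and the reply used as the refined topic label alongside, not in place of, the original topic-word distribution.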