Improving Access to ETD Elements Through Chapter Categorization and Summarization
dc.contributor.author | Banerjee, Bipasha | en |
dc.contributor.committeechair | Fox, Edward A. | en |
dc.contributor.committeemember | Zhou, Dawei | en |
dc.contributor.committeemember | Bhattacharya, Debswapna | en |
dc.contributor.committeemember | Lourentzou, Ismini | en |
dc.contributor.committeemember | Wu, Jian | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2024-08-08T08:00:10Z | en |
dc.date.available | 2024-08-08T08:00:10Z | en |
dc.date.issued | 2024-08-07 | en |
dc.description.abstract | The fields of natural language processing and information retrieval have made remarkable progress since the 1980s. However, most theoretical investigation and applied experimentation has focused on short documents such as web pages, journal articles, or papers in conference proceedings. Electronic Theses and Dissertations (ETDs) contain a wealth of information. These book-length documents describe research conducted in a variety of academic disciplines. While current digital library systems can be used to find a document of interest, they do not help users discover which specific parts or segments are of particular interest. This research aims to improve access to ETD components by providing users with chapter-level classification labels and summaries that help them easily find portions of interest. We explore the challenges such documents pose, especially when dealing with a highly specialized academic vocabulary. We use large language models (LLMs) and fine-tune pre-trained models for these downstream tasks. We also develop a method to connect the ETD discipline and department information to an ETD-centric classification system. To guide the summarization model toward better chapter summaries, we try to identify, for each chapter, relevant sentences from the document abstract, plus the titles of cited references from the bibliography. In addition to traditional metrics, we use human feedback to evaluate models qualitatively. We provide users with chapter classification labels and summaries to improve access to ETD chapters. We generate the top three classification labels for each chapter, which reflect the interdisciplinarity of the work in ETDs. Our evaluation shows that users prefer the summaries produced by our ensemble methods. These summaries also score higher than summaries generated by a single method when evaluated on several metrics with an LLM-based evaluation methodology. | en |
dc.description.abstractgeneral | Natural language processing (NLP) is a field in computer science that focuses on creating artificially intelligent models capable of processing text and audio similarly to humans. We use various NLP techniques, ranging from machine learning to language models, to give users much more granular access to the information stored in Electronic Theses and Dissertations (ETDs). ETDs are documents that students conducting research submit at the culmination of their degree. Such documents comprise research work in various academic disciplines and thus contain a wealth of information. This work aims to make the information stored in ETD chapters more accessible to readers through the addition of chapter-level classification labels and summaries. We provide users with chapter classification labels and summaries to improve access to ETD chapters. We generate the top three classification labels for each chapter, which reflect the interdisciplinarity of the work in ETDs. Alongside human evaluation of automatically generated summaries, we use an LLM-based approach that scores summaries on several metrics. Our evaluation shows that users prefer the summaries produced by our methods to summaries generated by a single method. | en |
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:41212 | en |
dc.identifier.uri | https://hdl.handle.net/10919/120890 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution 4.0 International | en |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en |
dc.subject | Summarization | en |
dc.subject | Classification | en |
dc.subject | Natural Language Processing | en |
dc.subject | Machine Learning | en |
dc.subject | Language Models | en |
dc.title | Improving Access to ETD Elements Through Chapter Categorization and Summarization | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |