Improving Access to ETD Elements Through Chapter Categorization and Summarization

Date

2024-08-07

Publisher

Virginia Tech

Abstract

The fields of natural language processing and information retrieval have made remarkable progress since the 1980s. However, most theoretical investigation and applied experimentation have focused on short documents such as web pages, journal articles, or papers in conference proceedings. Electronic Theses and Dissertations (ETDs) contain a wealth of information. These book-length documents describe research conducted in a variety of academic disciplines. While current digital library systems can be used directly to find a document of interest, they do not help users discover which specific parts or segments of that document are of particular interest. This research aims to improve access to ETD components by providing users with chapter-level classification labels and summaries that help them easily find portions of interest. We explore the challenges such documents pose, especially their highly specialized academic vocabulary. We use large language models (LLMs) and fine-tune pre-trained models for these downstream tasks. We also develop a method to map ETD discipline and department information to an ETD-centric classification system. To guide the summarization model toward better chapter summaries, we identify, for each chapter, relevant sentences from the document abstract along with the titles of references cited in the bibliography. We leverage human feedback to evaluate models qualitatively, in addition to traditional metrics. To improve access to ETD chapters, we provide users with chapter classification labels and summaries: for each chapter, we generate the top three classification labels, reflecting the interdisciplinarity of the work in ETDs. Our evaluation shows that users prefer the summaries produced by our ensemble methods. These summaries also outperform summaries generated by a single method on several metrics under an LLM-based evaluation methodology.
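
To illustrate the kind of guidance step described in the abstract (selecting abstract sentences relevant to a chapter), the sketch below ranks abstract sentences by embedding similarity to the chapter text. This is an assumption-based example, not the implementation evaluated in this work; the embedding model, function name, and top-k value are hypothetical choices.

# Illustrative sketch only: rank abstract sentences by similarity to a chapter.
# Model choice and top_k are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

def relevant_abstract_sentences(chapter_text, abstract_sentences, top_k=3):
    # Encode the chapter and each abstract sentence as dense vectors.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    chapter_emb = model.encode(chapter_text, convert_to_tensor=True)
    sentence_embs = model.encode(abstract_sentences, convert_to_tensor=True)
    # Score each abstract sentence by cosine similarity to the chapter.
    scores = util.cos_sim(chapter_emb, sentence_embs)[0]
    k = min(top_k, len(abstract_sentences))
    top = scores.topk(k=k)
    return [abstract_sentences[i] for i in top.indices.tolist()]

The selected sentences (together with titles of cited references) could then be supplied as additional context to the summarization model.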

Keywords

Summarization, Classification, Natural Language Processing, Machine Learning, Language Models
