Chapter Summarization

Abstract

A thesis is the culminating research document of a master's program, presenting the knowledge a graduate student has gained over the course of their graduate research. A dissertation presents the original research a student has conducted during a doctoral program, contributing new theories, practices, or knowledge to their field. Because theses and dissertations represent the culmination of a student's research, they can be extremely long. Electronic theses and dissertations (ETDs) are the digital versions of these documents, intended to make the research they contain accessible to the world. ETDs typically include an abstract describing the work, but these abstracts are often too general to help readers: there is no middle ground between a generic abstract that conveys little and the dense document itself. This is a problem on a global scale.

We created chapter summaries of ETDs that aim to help readers decompose and understand these documents faster. We make use of existing machine learning summarization tools, specifically Python packages and pre-trained language models. Part of this project was to create a dataset for developing and testing our summarization approach; we built it by annotating the chapters of 100 ETDs after chapter segmentation. To keep the dataset as diverse as possible while still allowing us to pick up on patterns, we drew these ETDs from a wide range of fields.

We implemented a data extraction pipeline that builds on the Object Retrieval Code from Aman Ahuja et al. On top of this, we created a summarization framework that accepts chapter text as input and generates chapter summaries, which are integrated into the base front-end website application we were given. We completed four summarization scripts that use pre-trained models from Hugging Face, each taking the extracted chapter text as input and producing a summary. The four models were BART, BigBirdPegasus, T5, and Longformer Encoder-Decoder (LED). We ran these scripts on all of the manually segmented chapters to generate a summary of every chapter.
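To illustrate the shape of these scripts, the following minimal sketch wraps one Hugging Face model; it is not our exact code, and the facebook/bart-large-cnn checkpoint, chunk size, and generation parameters are illustrative assumptions.

    from transformers import pipeline

    # Load a pre-trained abstractive summarizer from Hugging Face.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize_chapter(chapter_text, chunk_words=700):
        # BART attends to at most ~1024 tokens, so long chapters are split
        # into word-count chunks; each chunk is summarized and the partial
        # summaries are joined. The chunk size and length limits here are
        # illustrative, not tuned values.
        words = chapter_text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        parts = [summarizer(chunk, max_length=150, min_length=40,
                            do_sample=False)[0]["summary_text"]
                 for chunk in chunks]
        return " ".join(parts)

Long-input models such as LED and BigBirdPegasus relax the need for this chunking, since they accept far longer inputs than BART or T5.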
We organized the resulting summaries by model in our GitLab repository and used them to populate a database intended to drive the search functionality of our front-end application. More detail on the back end and front end appears in Section 6.0, Implementation.

Through this project we gained a holistic understanding of full-stack development. On the back end, we learned how to use existing libraries and resources such as pandas, PDFPlumber, and WordNinja to extract and format data from an input source. We also learned how to use resources like Hugging Face to understand natural language processing models and the trade-offs among different types of models. By writing scripts around these pre-trained models, we learned the nuances of working with them and how those nuances affect our product; for example, a model pre-trained on a massive text corpus had a better chance of recognizing the uncommon vocabulary found in ETDs. On the front end, we gained experience using React and JavaScript to build a functioning website. We also learned how to understand, dissect, and update a codebase inherited from another team, and how to create and populate a database in PostgreSQL (commonly referred to as Postgres).
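As a sketch of that extraction step, the snippet below combines PDFPlumber and WordNinja, assuming each chapter is identified by a page range; the helper name, page-range convention, and 15-character threshold are hypothetical.

    import pdfplumber
    import wordninja

    def extract_chapter_text(pdf_path, first_page, last_page):
        # Pull raw text from the chapter's pages with PDFPlumber
        # (first_page/last_page are 1-indexed; pdf.pages is 0-indexed).
        with pdfplumber.open(pdf_path) as pdf:
            pages = pdf.pages[first_page - 1:last_page]
            raw = " ".join(page.extract_text() or "" for page in pages)
        # PDF extraction often fuses words together ("machinelearning");
        # WordNinja probabilistically re-splits long alphabetic runs.
        return " ".join(
            " ".join(wordninja.split(tok)) if tok.isalpha() and len(tok) > 15
            else tok
            for tok in raw.split()
        )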

Keywords

ETDs, Chapter Summarization
