Chapter Summarization

dc.contributor.author	Peta, Manasi	en
dc.contributor.author	Simms, Aidan	en
dc.contributor.author	Grilli, Joe	en
dc.contributor.author	Chokkan, Nandha	en
dc.date.accessioned	2023-07-05T17:43:11Z	en
dc.date.available	2023-07-05T17:43:11Z	en
dc.date.issued	2023-05-12	en
dc.description.abstract	A thesis is the final product of a graduate student’s research and reflects what they learned over the course of that research. A dissertation is a doctoral student’s opportunity to present original research that contributes new theories, practices, or knowledge to their field. Because theses and dissertations represent the culmination of a student’s research, they can be extremely long. Electronic theses and dissertations (ETDs) are the digital versions of these documents, which make the research more accessible to the world. ETDs typically contain an abstract describing the work done in the document. However, these abstracts are often too general to help readers. There is no happy medium between a generic abstract that conveys little and a dense document that must be read in full, and this problem affects readers worldwide. We created chapter summaries of ETDs, which aim to help readers break down and understand the documents faster. We use existing machine learning summarization tools, specifically Python packages and language models, to generate the summaries. Part of this project was to create a dataset on which to build and test our summarization models. This dataset was created by annotating the chapters of 100 ETDs (after chapter segmentation). We wanted the dataset to be as diverse as possible while still allowing us to pick up on patterns, which is why our ETDs come from a wide range of fields. We implemented a data extraction pipeline that builds on the Object Retrieval Code from Aman Ahuja et al. On top of this pipeline, we created a summarization framework that accepts chapter text as input and generates chapter summaries, which are integrated into the provided base front-end web application.
We completed four summarization scripts that use pre-trained models from Hugging Face; each takes the text extracted from a chapter as input and outputs a summary. The four models we used were BART, BigBirdPegasus, T5, and Longformer Encoder Decoder (LED). We ran these scripts on all of the chapters that we manually segmented to obtain a summary of each chapter. In our GitLab repository, the summaries are organized by the model that produced them. We used these summaries to populate a database intended to support the search functionality of our frontend application. The specifics of the backend and frontend are covered in Section 6.0, Implementation, of the final report. We gained a holistic understanding of working on a full-stack project. On the backend, we learned how to use existing libraries and resources such as pandas, PDFPlumber, and WordNinja to extract and format data from an input source. We also learned how to use resources like Hugging Face to understand natural language processing models and the pros and cons of various types of models. By writing scripts that apply such models to text summarization, we learned the nuances of working with pre-trained models and how they can affect our product. For example, a model pre-trained on a massive text corpus had a better chance of recognizing the uncommon words found in ETDs. On the frontend, we gained experience using React and JavaScript to create a functioning website. We also learned how to understand, dissect, and update a codebase inherited from another team, and how to create and populate a database in PostgreSQL (commonly referred to as Postgres).	en
dc.description.notes	ETDsBatch1.xlsx - Spreadsheet giving status for first batch of ETDs; ETDsBatch2.docx - Word version of document giving status for second batch of ETDs; ChapterSummarizationReport.pdf - PDF version of final report; ChapterSummarizationReport.docx - Word version of final report from a Google Doc; ChapterSummarizationPresentation.pdf - PDF version of final presentation; ChapterSummarizationPresentation.pptx - PowerPoint version of final presentation	en
dc.identifier.uri	http://hdl.handle.net/10919/115646	en
dc.identifier.url	https://docs.google.com/document/d/1nmr9euZZgmgNX187H5iSC33MWlNkEWzQ9FLKm5EqIAI/edit?usp=sharing	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	en
dc.subject	ETDs	en
dc.subject	Chapter Summarization	en
dc.title	Chapter Summarization	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Other	en