Chapter Summarization

dc.contributor.author: Peta, Manasi [en]
dc.contributor.author: Simms, Aidan [en]
dc.contributor.author: Grilli, Joe [en]
dc.contributor.author: Chokkan, Nandha [en]
dc.date.accessioned: 2023-07-05T17:43:11Z [en]
dc.date.available: 2023-07-05T17:43:11Z [en]
dc.date.issued: 2023-05-12 [en]
dc.description.abstract: A thesis is the culmination of a graduate student’s research and the final product of the knowledge they gained during it. A dissertation is a graduate student’s opportunity to present the original research they carried out during a doctoral program and to contribute new theories, practices, or knowledge to their field. Because theses and dissertations represent the culmination of students’ research, they can be extremely long. Electronic theses and dissertations (ETDs) are the digital versions of these documents, which make the research more accessible to the world. ETDs typically contain an abstract describing the work done in the document. However, these abstracts are often too general to help readers. There is no happy medium between getting essentially no information from a generic abstract and reading through a dense paper. This is an issue on a global scale. We created chapter summaries of ETDs, which aim to help readers break down and understand the documents faster. We use existing machine learning summarization tools, specifically Python packages and language models, to produce the summaries. Part of this project was to create a dataset on which to build and test our summarization model. This summarization dataset was created by annotating the chapters of 100 ETDs (after chapter segmentation). We wanted the dataset to be as diverse as possible while still allowing us to pick up on patterns, which is why our ETDs come from a wide range of fields. We implemented a data extraction pipeline that builds on the Object Retrieval Code from Aman Ahuja et al. Based on this, we created a summarization framework that accepts chapter text as input and generates chapter summaries, which are integrated into the given base frontend website application.
We completed four summarization scripts that use pre-trained models from Hugging Face; each takes the data extracted from a chapter and outputs a summary of it. The four models we used were BART, BigBirdPegasus, T5, and Longformer Encoder Decoder (LED). We ran these scripts on all the chapters that we manually segmented to obtain a summary of each one. In our GitLab repository, we organized the summaries by the model that produced them. We used these summaries to populate a database intended to support the search functionality of our frontend application. The specifics of the backend and frontend are covered in section 6.0 Implementation. We gained a holistic understanding of working on a full-stack project. On the backend, we learned how to use existing libraries and resources such as pandas, PDFPlumber, and WordNinja to extract and format data from an input source. We also learned how to use resources such as Hugging Face to understand natural language processing models and the pros and cons of various types of models. By writing scripts that apply such models to text summarization, we learned the nuances of working with pre-trained models and how those nuances can affect our product. For example, a model pre-trained on a massive text repository had a better chance of recognizing uncommon words in ETDs. On the frontend, we gained experience using React and JavaScript to create a functioning website. We also learned how to understand, dissect, and update a codebase inherited from another team. Finally, we learned how to create and populate a database in PostgreSQL (commonly referred to as Postgres). [en]
dc.description.notes: ETDsBatch1.xlsx - Spreadsheet giving status for first batch of ETDs; ETDsBatch2.docx - Word version of document giving status for second batch of ETDs; ChapterSummarizationReport.pdf - PDF version of final report; ChapterSummarizationReport.docx - Word version of final report from a Google Doc; ChapterSummarizationPresentation.pdf - PDF version of final presentation; ChapterSummarizationPresentation.pptx - PowerPoint version of final presentation [en]
dc.identifier.uri: http://hdl.handle.net/10919/115646 [en]
dc.identifier.url: https://docs.google.com/document/d/1nmr9euZZgmgNX187H5iSC33MWlNkEWzQ9FLKm5EqIAI/edit?usp=sharing [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
dc.subject: ETDs [en]
dc.subject: Chapter Summarization [en]
dc.title: Chapter Summarization [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Other [en]

Files

Original bundle
Now showing 1 - 5 of 6
Name: ETDsBatch1.xlsx; Size: 12.33 KB; Format: Microsoft Excel XML
Name: ETDsBatch2.docx; Size: 122.15 KB; Format: Microsoft Word XML
Name: ChapterSummarizationPresentation.pdf; Size: 1.52 MB; Format: Adobe Portable Document Format
Name: ChapterSummarizationPresentation.pptx; Size: 3.66 MB; Format: Microsoft Powerpoint XML
Name: ChapterSummarizationReport.docx; Size: 2.3 MB; Format: Microsoft Word XML
License bundle
Name: license.txt; Size: 1.5 KB; Format: Item-specific license agreed upon to submission