Chapter Classification and Summarization

dc.contributor.author: Jackson, Miles (en)
dc.contributor.author: Zhao, Yinhjie (en)
dc.date.accessioned: 2024-05-09T17:52:44Z (en)
dc.date.available: 2024-05-09T17:52:44Z (en)
dc.date.issued: 2024-05-07 (en)
dc.description: ChapterClassSummReport.pdf - PDF version of the final report. ChapterClassSummReport.docx - Word version of the final report. ChapterClassSummPresentation.pptx - PowerPoint version of the final presentation. ChapterClassSummPresentation.pdf - PDF version of the final presentation. ChapterClassSummCode.zip - All of the files created and used in the running of this project. (en)
dc.description.abstract: The US corpus of Electronic Theses and Dissertations (ETDs), partly captured in our research collection of over 500,000 documents, is a valuable resource for education and research. Unfortunately, because these documents average around 100 pages, finding specific research information in them is not a simple task. Our project aims to tackle this issue by segmenting our sample of 500,000 ETDs into chapters and providing a web interface that summarizes individual chapters from the segmented sample. The first step of the project was to verify that the automatic segmentation process, performed in advance by our client, could be relied upon. This required each team member to analyze 50 segmented documents and verify their integrity by confirming that each chapter was correctly identified and separated into its own PDF. During this process, we noted any peculiarities in order to identify recurring issues and improve the segmentation process. The rest of our time and effort went into creating an efficient web interface that would allow users to upload ETD chapters and view each chapter’s summary and classification results. We completed a web interface that allows a user to upload an ETD chapter PDF from the sampled ETD database and view the summary of the PDF along with all of the metadata (author, title, publication date, etc.) of the associated ETD. Additionally, the group verified approximately 60 of the automatically segmented documents and documented any errors or peculiarities in detail. Our group delivered both the web interface, as a GitHub repository, and an Excel spreadsheet detailing the complete results of our segmentation verification process. The interface was designed to aid research on ETDs. Although this application won’t be publicly available, researchers may use it privately to assist with any ETD research projects they participate in.
The web interface uses Streamlit, a Python framework for web development. This was the first time anyone in the group had used Streamlit, so we had to learn each feature as we used it, which caused quite a few issues. However, the biggest threat to the usability of our interface was quickly searching and accessing the metadata database, which was originally an Excel sheet with 500,000 entries. Luckily, we were able to solve all issues with the help of API documentation, our client, Bipasha Banerjee, and our extremely helpful instructor, Professor Edward A. Fox. In terms of technical skills, we learned how to operate a Streamlit web interface as well as how to use MySQL. We also learned a few life lessons. First, do not use the first tool available when attempting to solve a problem. It is wise to take extra time to search for the best tool for a given situation instead of wasting time compensating for the wrong one. Second, life happens without regard and without warning, but the best move is to reanalyze the situation and push forward to complete the work that must be done. (en)
dc.description.sponsorship: Ms. Bipasha Banerjee; Dr. Edward A. Fox (en)
dc.identifier.uri: https://hdl.handle.net/10919/118935 (en)
dc.language.iso: en (en)
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ (en)
dc.subject: Theses (en)
dc.subject: Electronic Theses and Dissertations (en)
dc.subject: ETD (en)
dc.subject: ETDs (en)
dc.subject: Summarization (en)
dc.subject: Classification (en)
dc.subject: Dissertations (en)
dc.subject: ChapterClassSumm (en)
dc.title: Chapter Classification and Summarization (en)
dc.type: Report (en)
dc.type: Presentation (en)
dc.type: Software (en)
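
The abstract identifies fast lookup in the 500,000-entry metadata spreadsheet as the main usability risk and mentions MySQL among the skills learned. The record does not describe the actual schema or queries, so the following is only a minimal sketch of the general idea: load the spreadsheet rows into a keyed SQL table once, then fetch one ETD's metadata by key instead of scanning the sheet. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table name, column names, identifiers, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical rows standing in for the metadata spreadsheet.
# The real column names and identifiers are not given in this record.
ROWS = [
    ("etd-0001", "A Study of Widgets", "Doe, Jane", "2019-05-01"),
    ("etd-0002", "Notes on Gadgets", "Roe, Richard", "2021-12-15"),
]


def build_db(rows):
    """Load metadata rows into an in-memory table keyed by ETD id."""
    conn = sqlite3.connect(":memory:")
    # Declaring a PRIMARY KEY makes SQLite maintain an index on etd_id,
    # so lookups by id do not have to scan every row.
    conn.execute(
        "CREATE TABLE etd_metadata ("
        "etd_id TEXT PRIMARY KEY, title TEXT, author TEXT, issued TEXT)"
    )
    conn.executemany("INSERT INTO etd_metadata VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, etd_id):
    """Return the metadata for one ETD as a dict, or None if it is absent."""
    cur = conn.execute(
        "SELECT title, author, issued FROM etd_metadata WHERE etd_id = ?",
        (etd_id,),
    )
    row = cur.fetchone()
    if row is None:
        return None
    return dict(zip(("title", "author", "issued"), row))


conn = build_db(ROWS)
print(lookup(conn, "etd-0002"))
```

With a keyed table, the cost of a lookup depends on the index rather than on the total number of rows, which is the property that matters at 500,000 entries. A MySQL version would follow the same shape with a different connection library.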

Files

Original bundle

Name: ChapterClassSummCode.zip
Size: 16.78 KB
Description: All of the files created and used in the running of this project.

Name: ChapterClassSummReport.pdf
Size: 1.21 MB
Format: Adobe Portable Document Format

Name: ChapterClassSummReport.docx
Size: 1.46 MB
Format: Microsoft Word XML

Name: ChapterClassSummPresentation.pdf
Size: 1.84 MB
Format: Adobe Portable Document Format

Name: ChapterClassSummPresentation.pptx
Size: 3.37 MB
Format: Microsoft Powerpoint XML

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: