Chapter Classification and Summarization


The US corpus of Electronic Theses and Dissertations (ETDs), partly captured in our research collection numbering over 500,000, is a valuable resource for education and research. Unfortunately, as the average length of these documents is around 100 pages, finding specific research information is not a simple task. Our project aims to tackle this issue by segmenting our sample of 500,000 ETDs into chapters and providing a web interface that summarizes individual chapters from the segmented sample.

The first step of the project was to verify that the automatic segmentation process, performed in advance by our client, could be relied upon. This required each team member to analyze 50 segmented documents and verify their integrity by confirming that each chapter was correctly identified and separated into its own PDF. During this process, we noted any peculiarities in order to identify recurring issues and improve the segmentation process. The rest of our time and effort went into creating an efficient web interface that allows users to upload an ETD chapter and view that chapter’s summary and classification results.

We were able to complete a web interface that allows a user to upload an ETD chapter PDF from the sampled ETD database and view the summary of the PDF along with all of the metadata (author, title, publication date, etc.) of the associated ETD. Additionally, the group verified approximately 60 of the automatically segmented documents and detailed any errors or peculiarities thoroughly. Our group delivered both the web interface as a GitHub repository and an Excel spreadsheet detailing the complete results of our segmentation verification process.

The interface was designed to be used in aiding research on ETDs. Although this application won’t be available publicly, researchers may use it privately to assist with any ETD research projects they participate in.

The web interface is built with Streamlit, a Python framework for building web applications. No one in the group had used Streamlit before, so we had to learn each feature as we needed it, which caused quite a few issues. The biggest threat to the usability of our interface, however, was quickly searching and accessing the metadata, which was originally stored in an Excel sheet with 500,000 entries. Luckily, we were able to solve all of these issues with the help of API documentation, our client, Bipasha Banerjee, and our extremely helpful instructor, Professor Edward A. Fox.
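To illustrate why moving the metadata out of a spreadsheet helps, the following is a minimal sketch of a keyed lookup against a database table. It is not the project's actual code: the table name, column names, and sample rows are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3


def build_metadata_db(rows):
    """Load (etd_id, title, author, pub_year) rows into an in-memory database.

    The schema here is an assumption made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE etd_metadata ("
        "etd_id INTEGER PRIMARY KEY, "  # primary key gives an indexed lookup
        "title TEXT NOT NULL, "
        "author TEXT NOT NULL, "
        "pub_year INTEGER)"
    )
    conn.executemany("INSERT INTO etd_metadata VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_metadata(conn, etd_id):
    """Return the metadata for one ETD as a dict, or None if it is unknown."""
    cur = conn.execute(
        "SELECT title, author, pub_year FROM etd_metadata WHERE etd_id = ?",
        (etd_id,),  # parameterized query, never string formatting
    )
    row = cur.fetchone()
    if row is None:
        return None
    return {"title": row[0], "author": row[1], "pub_year": row[2]}


if __name__ == "__main__":
    # Placeholder rows, not real ETD records.
    sample = [
        (1, "Example Thesis A", "Author One", 2015),
        (2, "Example Dissertation B", "Author Two", 2019),
    ]
    conn = build_metadata_db(sample)
    print(lookup_metadata(conn, 2))
```

Because the lookup goes through the primary key, the database can find one record without reading every row, whereas a spreadsheet generally has to be loaded in full before it can be searched.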

In terms of technical skills, we learned how to build and operate a Streamlit web interface as well as how to use MySQL. We also learned a few life lessons. Firstly, do not use the first tool available when attempting to solve a problem. It is wise to take extra time to search for the best tool for a given situation instead of wasting time compensating for the wrong one. Secondly, life happens without regard and without warning, but the best move is to reanalyze the situation and push forward to complete the work that must be done.


ChapterClassSummReport.pdf - PDF version of the final report.
ChapterClassSummReport.docx - Word version of the final report.
ChapterClassSummPresentation.pptx - PowerPoint version of the final presentation.
ChapterClassSummPresentation.pdf - PDF version of the final presentation.
- All of the files created and used in the running of this project.


Theses, Electronic Theses and Dissertations, ETD, ETDs, Summarization, Classification, Dissertations, ChapterClassSumm