Classifying ETDs


Electronic Theses and Dissertations (ETDs) are academic documents that give an in-depth account of a graduate student's research and are designed to be stored in electronic archives and retrieved globally. These documents contain abundant information that can be used in machine learning tasks such as classification, summarization, and question answering. However, ETDs often have incomplete, incorrect, or inconsistent metadata, and because there is no single uniform format for creating that metadata, it is difficult to categorize them accurately without manual intervention. Through the Classifying ETDs capstone project, we therefore aim to create a gold standard classification dataset, apply machine learning and deep learning algorithms to automatically classify ETDs with missing metadata, and develop a website where a user can classify an ETD with missing metadata and view ETDs that have already been classified. The expected impact of this project is to make the information in long documents more readily available and, eventually, to help make it more accessible through regular search engines.



Gold Standard ETD Classification Dataset, Deep Learning, Text Classification Models, Interactive User Interface, Data Cleaning