Otrouha: Automatic Classification of Arabic ETDs

TR Number

Date

2020-01-23

Journal Title

Journal ISSN

Volume Title

Publisher

Virginia Tech

Abstract

ETDs are becoming a new genre of documents that is highly precious and worth preserving. This has resulted in a sustainable need to build an effective tool to facilitate retrieving ETD collections. While Arabic ETDs have gained increasing attention, many challenges ensued due to lack of resources and complexity of information retrieval in the Arabic language. Therefore, this project focuses on making Arabic ETDs more accessible by facilitating browsing and searching. The aim is to build an automated classifier that categorizes an Arabic ETD based on its abstract. Our raw dataset was obtained by crawling the AskZad digital library website. Then, we conducted some pre-processing techniques on the dataset to make it suitable for our classification process. We developed automatic classification methods using various classifiers: Support Vector Machines and SVC, Random Forest, and Decision Trees. We then used an ensemble classifier of the two classifiers that generated the highest accuracy. Then, we applied evaluation techniques commonly used such as including 10-fold cross-validation. The results show better performance for the binary classification with average accuracy 68%per category, where multiclass classification performed poorly with average accuracy 24%.

Description

Keywords

Arabic ETDs, Arabic Text Classification, Machine learning, NLP

Citation