Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification

Date

2021-03-01

Abstract

Although Arabic is spoken by more than 300 million people and is one of the six official languages of the United Nations (UN), Arabic text has received less attention than English text in machine learning research, especially in text classification. In the past decade, Arabic data such as news articles and tweets have begun to receive some attention. In contrast, Arabic Electronic Theses and Dissertations (ETDs) have received little attention, despite the many benefits they provide to students, universities, and future generations of scholars. Two main roadblocks stand in the way of automatic subject classification of Arabic ETDs, which could aid discovery and browsing. The first is the lack of a public corpus of Arabic ETDs. The second is the linguistic complexity of Arabic, which is particularly evident in academic documents such as ETDs. To address these roadblocks, this paper presents Otrouha, a framework for automatic subject classification of Arabic ETDs, which has two main goals. The first is to build a corpus of Arabic ETDs and their key metadata, such as abstracts, keywords, and titles, to pave the way for more exploratory research on this valuable genre. The second is to provide a framework for automatic subject classification of Arabic ETDs through different classification models that use classical machine learning as well as deep learning techniques. The first goal is pursued by searching the AskZad Digital Library, which is part of the Saudi Digital Library (SDL). AskZad provides key metadata of Arabic ETDs, such as abstract, title, and keywords; the current search results consist of abstracts. This raw data then undergoes a pre-processing phase that includes stop word removal using the Natural Language Toolkit (NLTK) and word lemmatization using the Farasa API. To date, abstracts of 518 ETDs across 12 subjects have been collected.
For the second goal, preliminary results show that, among the machine learning models, binary classification (one-vs.-all) performed better than multiclass classification. The maximum per-subject accuracy is 95%, with an average accuracy of 68% across all subjects. Notably, the binary classification model performed better for some categories than for others: Applied Science and Technology shows 95% accuracy, while Administration shows 36%. Deep learning models achieved higher accuracy but lower F-measure; overall, they performed worse than the machine learning models. This may be due to the small size of the dataset as well as the imbalance in the number of documents per category. Work to collect additional ETDs will be aided by collaborative contributions of data from additional sources.
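
The gap between accuracy and F-measure mentioned above is a familiar effect of class imbalance in one-vs.-all evaluation. The following is a minimal sketch of that effect, not the paper's evaluation code: the function name `one_vs_all_metrics`, the placeholder label "Other", and the ten-document toy data are all assumptions made for illustration, and no figure here comes from the Otrouha corpus.

```python
from collections import Counter


def one_vs_all_metrics(gold, predicted, subject):
    """Score one subject as a binary task: `subject` vs. every other label.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Turn each (gold, predicted) label pair into a pair of booleans.
    pairs = [(g == subject, p == subject) for g, p in zip(gold, predicted)]
    counts = Counter(pairs)
    tp = counts[(True, True)]    # subject documents correctly found
    tn = counts[(False, False)]  # other documents correctly rejected
    fp = counts[(False, True)]   # other documents wrongly assigned
    fn = counts[(True, False)]   # subject documents missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed toy data: 10 abstracts, only 2 of them in the rare subject.
gold = ["Administration"] * 2 + ["Other"] * 8
# A degenerate classifier that never predicts the rare subject.
predicted = ["Other"] * 10

metrics = one_vs_all_metrics(gold, predicted, "Administration")
print(metrics)
```

In this toy case the classifier is right on 8 of 10 documents, so accuracy looks respectable, yet it never finds the rare subject, so recall and F-measure are both zero. With few documents in a category, a model can score well on accuracy while contributing little for that category.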