Browsing by Author "Alotaibi, Fatimah"
- Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification
  Abdelrahman, Eman; Alotaibi, Fatimah; Fox, Edward A.; Balci, Osman (2021-03-01)
  Although the Arabic language is spoken by more than 300 million people and is one of the six official languages of the United Nations (UN), less machine learning research has been done on Arabic text data than on English, especially in text classification. In the past decade, Arabic data such as news and tweets have begun to receive some attention. In contrast, Arabic Electronic Theses and Dissertations (ETDs) have received little attention, in spite of the many benefits they provide to students, universities, and future generations of scholars. There are two main roadblocks to performing automatic subject classification on Arabic ETDs, which could be helpful for discovery and browsing. The first is the unavailability of a public corpus of Arabic ETDs. The second is the linguistic complexity of the Arabic language; that complexity is particularly evident in academic documents such as ETDs. To address these roadblocks, this paper presents Otrouha, a framework for automatic subject classification of Arabic ETDs, which has two main goals. The first is to build a corpus of Arabic ETDs and their key metadata, such as abstracts, keywords, and titles, to pave the way for more exploratory research on this valuable genre. The second is to provide a framework for automatic subject classification of Arabic ETDs through different classification models that use classical machine learning as well as deep learning techniques. The first goal is aided by searching the AskZad Digital Library, which is part of the Saudi Digital Library (SDL). AskZad provides other key metadata of Arabic ETDs, such as abstract, title, and keywords. The current search results consist of abstracts of Arabic ETDs. This raw data then undergoes a pre-processing phase that includes stop word removal using the Natural Language Toolkit (NLTK) and word lemmatization using the Farasa API. To date, abstracts of 518 ETDs across 12 subjects have been collected. For the second goal, the preliminary results show that, among the machine learning models, binary classification (one-vs.-all) performed better than multiclass classification. The maximum per-subject accuracy is 95%, with an average accuracy of 68% across all subjects. It is noteworthy that the binary classification model performed better for some categories than for others. For example, Applied Science and Technology shows 95% accuracy, while the category of Administration shows 36%. Deep learning models resulted in higher accuracy but lower F-measure; their overall performance is lower than that of the machine learning models. This may be due to the small size of the dataset as well as the imbalance in the number of documents per category. Work to collect additional ETDs will be aided by collaborative contributions of data from additional sources.
- Otrouha: Automatic Classification of Arabic ETDs
  Alotaibi, Fatimah; Abdelrahman, Eman (Virginia Tech, 2020-01-23)
  ETDs are becoming a new genre of documents that is highly valuable and worth preserving. This has resulted in a sustained need for an effective tool to facilitate retrieval from ETD collections. While Arabic ETDs have gained increasing attention, many challenges remain due to the lack of resources and the complexity of information retrieval in the Arabic language. Therefore, this project focuses on making Arabic ETDs more accessible by facilitating browsing and searching. The aim is to build an automated classifier that categorizes an Arabic ETD based on its abstract. Our raw dataset was obtained by crawling the AskZad digital library website. We then applied pre-processing techniques to the dataset to make it suitable for our classification process. We developed automatic classification methods using various classifiers: Support Vector Machines and SVC, Random Forest, and Decision Trees. We then used an ensemble classifier built from the two classifiers that achieved the highest accuracy. We evaluated the models using commonly used techniques, including 10-fold cross-validation. The results show better performance for binary classification, with an average accuracy of 68% per category, whereas multiclass classification performed poorly, with an average accuracy of 24%.
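
Both abstracts report per-category results for one-vs.-all (binary) classification, and the first notes that accuracy and F-measure can disagree when categories are imbalanced. The following is a minimal, illustrative sketch of how per-subject accuracy, F-measure, and the average accuracy across subjects could be computed. It is not the authors' code: the function names, the toy labels, and the always-predict-the-majority "classifier" are assumptions made purely for illustration.

```python
def one_vs_all_labels(labels, subject):
    """Turn multiclass subject labels into binary labels for one subject."""
    return [1 if label == subject else 0 for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of documents whose binary label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_measure(y_true, y_pred):
    """F1 score for the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_subject_report(true_labels, predicted_labels):
    """Map each subject to its (accuracy, F-measure) under one-vs.-all."""
    report = {}
    for subject in sorted(set(true_labels)):
        y_true = one_vs_all_labels(true_labels, subject)
        y_pred = one_vs_all_labels(predicted_labels, subject)
        report[subject] = (accuracy(y_true, y_pred), f_measure(y_true, y_pred))
    return report


# Hypothetical, imbalanced toy data: 8 documents in one subject, 2 in another.
true_labels = ["Technology"] * 8 + ["Administration"] * 2
# Assumed degenerate model that always predicts the majority subject.
predicted_labels = ["Technology"] * 10

report = per_subject_report(true_labels, predicted_labels)
average_accuracy = sum(acc for acc, _ in report.values()) / len(report)

for subject, (acc, f1) in report.items():
    print(f"{subject}: accuracy={acc:.2f}, F-measure={f1:.2f}")
print(f"average accuracy across subjects: {average_accuracy:.2f}")
```

In this toy setup the minority subject still gets a high one-vs.-all accuracy even though none of its documents are found, which is why its F-measure drops to zero. This is one way a small, imbalanced dataset can produce the kind of gap between accuracy and F-measure that the first abstract mentions.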