Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification

Abdelrahman, Eman; Alotaibi, Fatimah; Fox, Edward A.; Balci, Osman

dc.contributor.author	Abdelrahman, Eman	en
dc.contributor.author	Alotaibi, Fatimah	en
dc.contributor.author	Fox, Edward A.	en
dc.contributor.author	Balci, Osman	en
dc.date.accessioned	2022-02-24T18:27:15Z	en
dc.date.available	2022-02-24T18:27:15Z	en
dc.date.issued	2021-03-01	en
dc.date.updated	2022-02-24T18:27:13Z	en
dc.description.abstract	Although Arabic is spoken by more than 300 million people and is one of the six official languages of the United Nations (UN), Arabic text data has received less attention than English in machine learning research, especially in text classification. In the past decade, Arabic data such as news and tweets have begun to receive some attention. In contrast, Arabic Electronic Theses and Dissertations (ETDs) have received little attention, in spite of the many benefits they provide to students, universities, and future generations of scholars. There are two main roadblocks to performing automatic subject classification on Arabic ETDs, which could be helpful for discovery and browsing. The first is the unavailability of a public corpus of Arabic ETDs. The second is the linguistic complexity of the Arabic language; that complexity is particularly evident in academic documents such as ETDs. To address these roadblocks, this paper presents Otrouha, a framework for automatic subject classification of Arabic ETDs, which has two main goals. The first is to build a corpus of Arabic ETDs and their key metadata, such as abstracts, keywords, and titles, to pave the way for more exploratory research on this valuable genre. The second is to provide a framework for automatic subject classification of Arabic ETDs through different classification models that use classical machine learning as well as deep learning techniques. The first goal is aided by searching the AskZad Digital Library, which is part of the Saudi Digital Library (SDL). AskZad provides key metadata of Arabic ETDs, such as abstract, title, and keywords; the current search results consist of abstracts of Arabic ETDs. This raw data then undergoes a pre-processing phase that includes stop word removal using the Natural Language Toolkit (NLTK) and word lemmatization using the Farasa API. To date, abstracts of 518 ETDs across 12 subjects have been collected. For the second goal, the preliminary results show that among the machine learning models, binary classification (one-vs.-all) performed better than multiclass classification. The maximum per-subject accuracy is 95%, with an average accuracy of 68% across all subjects. It is noteworthy that the binary classification model performed better for some categories than others. For example, Applied Science and Technology shows 95% accuracy, while Administration shows 36%. Deep learning models resulted in higher accuracy but lower F-measure; their overall performance is lower than that of the machine learning models. This may be due to the small size of the dataset as well as the imbalance in the number of documents per category. Work to collect additional ETDs will be aided by collaborative contributions of data from additional sources.	en
dc.description.version	Published version	en
dc.format.mimetype	application/pdf	en
dc.identifier	6 (Article number)	en
dc.identifier.doi	https://doi.org/10.52407/YNZB1163	en
dc.identifier.orcid	Balci, Osman [0000-0002-2965-3035]	en
dc.identifier.uri	http://hdl.handle.net/10919/108851	en
dc.identifier.volume	1	en
dc.language.iso	en	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.title	Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification	en
dc.title.serial	The Journal of Electronic Theses and Dissertations	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en
dc.type.other	Article	en
pubs.organisational-group	/Virginia Tech	en
pubs.organisational-group	/Virginia Tech/Engineering	en
pubs.organisational-group	/Virginia Tech/Engineering/Computer Science	en
pubs.organisational-group	/Virginia Tech/All T&R Faculty	en
pubs.organisational-group	/Virginia Tech/Engineering/COE T&R Faculty	en
pubs.organisational-group	/Virginia Tech/Report test	en
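
The abstract reports that the deep learning models reached higher accuracy but lower F-measure, and suggests class imbalance as one possible cause. The Python below is a minimal sketch of how these two measures can diverge when one subject is scored as a binary (one-vs.-all) task. The function name `one_vs_all_scores` and the toy labels are hypothetical and chosen only for illustration; they are not taken from the Otrouha corpus or its code, and the sketch does not reproduce the paper's models or results.

```python
from typing import Dict, Sequence


def one_vs_all_scores(y_true: Sequence[str], y_pred: Sequence[str], subject: str) -> Dict[str, float]:
    """Score one subject as a binary (one-vs.-all) task: subject vs. everything else."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        is_pos = truth == subject      # document really belongs to the subject
        said_pos = guess == subject    # model assigned the subject
        if is_pos and said_pos:
            tp += 1
        elif said_pos:
            fp += 1
        elif is_pos:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical toy labels (assumption, not corpus data):
# only 2 of 10 documents belong to the minority subject.
y_true = ["Administration"] * 2 + ["Other"] * 8
# A model that never predicts the minority subject.
y_pred = ["Other"] * 10

scores = one_vs_all_scores(y_true, y_pred, "Administration")
print(scores)
```

In this toy case the model is right on every majority-class document, so accuracy stays high, yet it never finds a single document of the minority subject, so recall and F-measure are zero. This is why accuracy alone can be misleading on an imbalanced collection.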

Files

Original bundle

Name: Otrouha_ A Corpus of Arabic ETDs and a Framework for Automatic Su.pdf
Size: 447.8 KB
Format: Adobe Portable Document Format
Description: Published version

Collections

All Faculty Deposits
Scholarly Works, Computer Science