Improving the Accessibility of Arabic Electronic Theses and Dissertations (ETDs) with Metadata and Classification

dc.contributor.author | Abdelrahman, Eman | en
dc.contributor.committeechair | Balci, Osman | en
dc.contributor.committeecochair | Fox, Edward A. | en
dc.contributor.committeemember | Barkhi, Reza | en
dc.contributor.department | Computer Science and Applications | en
dc.date.accessioned | 2022-01-19T19:34:28Z | en
dc.date.available | 2022-01-19T19:34:28Z | en
dc.date.issued | 2021 | en
dc.description.abstract | Much research has been done on extracting data from scientific papers, journals, and articles. However, Electronic Theses and Dissertations (ETDs) remain an unexplored genre of data in the research fields of natural language processing and machine learning. Moreover, much of the related research has involved English-language data. Arabic data such as news and tweets have begun to receive some attention in the past decade. However, Arabic ETDs remain an untapped source of data despite their many benefits to students and future generations of scholars. Some ways of improving the browsability and accessibility of data include data annotation, indexing, parsing, translation, and classification. Classification, which can be manual or automated, is essential for the searchability and management of data. Automated classification is beneficial when handling growing volumes of data. There are two main roadblocks to performing automatic subject classification on Arabic ETDs. The first is the unavailability of a public corpus of Arabic ETDs. The second is the Arabic language’s linguistic complexity, especially in academic documents. This research presents the Otrouha project, which aims at building a corpus of key metadata of Arabic ETDs as well as providing a methodology for their automatic subject classification. The first goal is aided by collecting data from the AskZad Digital Library. The second goal is achieved by exploring different machine learning and deep learning techniques. The experiments’ results show that deep learning using pretrained language models gave the highest classification performance, indicating that language models significantly contribute to natural language understanding. | en
dc.description.abstractgeneral | An Electronic Thesis or Dissertation (ETD) is an openly accessible electronic version of a graduate student’s research thesis or dissertation. It documents the student’s main research effort and is made available through the university library in place of a paper copy. Over time, collections of ETDs have been gathered and made available online through different digital libraries. ETDs are a valuable source of information for scholars and researchers, as well as librarians. As most Middle Eastern universities move toward digitalization, the number of Arabic ETDs grows, and with it the need to make them more accessible. One way to improve their accessibility and searchability is to provide automatic rather than manual classification. This thesis project focuses on building a corpus of metadata of Arabic ETDs and building a framework for their automatic subject classification. This is expected to pave the way for more exploratory research on this valuable genre of data. | en
dc.description.degree | M.S. | en
dc.format.medium | ETD | en
dc.format.mimetype | application/pdf | en
dc.identifier.uri | http://hdl.handle.net/10919/107790 | en
dc.language.iso | en | en
dc.publisher | Virginia Tech | en
dc.rights | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International | en
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | en
dc.subject | Machine learning | en
dc.subject | NLP | en
dc.subject | Arabic Electronic Theses and Dissertations (ETDs) | en
dc.subject | Automatic Classification | en
dc.subject | Deep learning (Machine learning) | en
dc.subject | Pretrained Language Models | en
dc.subject | Digital Libraries | en
dc.title | Improving the Accessibility of Arabic Electronic Theses and Dissertations (ETDs) with Metadata and Classification | en
dc.type | Thesis | en
thesis.degree.discipline | Computer Science | en
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en
thesis.degree.level | masters | en
thesis.degree.name | M.S. | en

Files

Original bundle
Name: Abdelrahman_E_T_2021.pdf
Size: 1.72 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission

Collections