Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ)

Kan'an, Tarek Ghaze

dc.contributor.author	Kan'an, Tarek Ghaze	en
dc.contributor.committeechair	Fox, Edward A.	en
dc.contributor.committeemember	Al-Shalabi, Riyad	en
dc.contributor.committeemember	Fan, Weiguo	en
dc.contributor.committeemember	Shaffer, Clifford A.	en
dc.contributor.committeemember	Ehrich, Roger W.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2017-01-12T07:00:33Z	en
dc.date.available	2017-01-12T07:00:33Z	en
dc.date.issued	2015-07-21	en
dc.description.abstract	Arabic news articles in heterogeneous electronic collections are difficult for users to work with. Two problems stand out: the articles are not categorized in a way that would aid browsing, and there are no summaries or detailed metadata records that would be easier to work with than the full articles. To address the first problem, schema mapping techniques were adapted to construct a simple taxonomy for Arabic news stories that is compatible with the subject codes of the International Press Telecommunications Council. So that each article would be labeled with the proper taxonomy category, automatic classification methods were studied to identify the most appropriate one. Experiments showed that the best classification features resulted from a new tailored stemming approach (i.e., a new Arabic light stemmer called P-Stemmer). When coupled with binary classification using SVM, the newly developed approach proved superior to state-of-the-art techniques. To address the second problem, i.e., summarization, preliminary work was done with English corpora, in the context of a new Problem Based Learning (PBL) course wherein students produced template summaries of big text collections. The techniques used in the course were extended to work with Arabic news. Due to the lack of high quality tools for Named Entity Recognition (NER) and topic identification for Arabic, two new tools were constructed: RenA for Arabic NER, and ALDA for Arabic topic extraction (using Latent Dirichlet Allocation). Controlled experiments with each of RenA and ALDA, involving Arabic speakers and a randomly selected corpus of 1000 Qatari news articles, showed that the tools produced very good results (i.e., names, organizations, locations, and topics).
Then the categorization, NER, topic identification, and additional information extraction techniques were combined to produce approximately 120,000 summaries for Qatari news articles, which are searchable, along with the articles, using LucidWorks Fusion, which builds upon Solr software. Evaluation of the summaries showed high ratings based on the 1000-article test corpus. Contributions of this research with Arabic news articles thus include a new test corpus, taxonomy, light stemmer, classification approach, NER tool, topic identification tool, and template-based summarizer, all shown through experimentation to be highly effective.	en
dc.description.degree	Ph. D.	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:5397	en
dc.identifier.uri	http://hdl.handle.net/10919/74272	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Classification	en
dc.subject	Summarization	en
dc.subject	Arabic Language	en
dc.subject	Natural Language Processing	en
dc.subject	Digital Libraries	en
dc.title	Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ)	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: Kan_an_TG_D_2015.pdf
Size: 11.25 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations