Qatar content classification

Abstract

This report describes a term project for the CS660 Digital Libraries course (Spring 2014), conducted under the supervision of Prof. Edward Fox and Mr. Tarek Kanan. The goal is to develop an Arabic newspaper article classifier. We built a collection of 700 Arabic newspaper articles and 1700 Arabic full-newspaper PDF files. We propose a stemmer named “P-Stemmer”. Our evaluation shows that P-Stemmer outperforms Larkey’s widely used light stemmer. We tested several classification techniques on Arabic data, including SVM, Naïve Bayes, and Random Forest. We built and tested 21 multiclass classifiers, 15 binary classifiers, and 5 compound classifiers that combine predictions by voting. Finally, we uploaded the classified instances to Apache Solr for indexing and search.
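As an illustration of the voting technique mentioned above, the following is a minimal sketch of how a compound classifier might combine the per-article predictions of several base classifiers. The abstract does not specify the voting rule, so simple majority voting is assumed here, with ties broken in favor of the earliest classifier in the list. The function names (`majority_vote`, `vote_all`) and the category labels are hypothetical and are not taken from the project.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers for one article.

    Assumption: ties go to the tied label that appears first in
    `predictions`, i.e. the earliest classifier in the list wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_all(per_classifier_predictions):
    """Combine aligned prediction lists (one list per classifier)."""
    # zip(*...) regroups the predictions so each tuple holds every
    # classifier's label for the same article.
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]


# Hypothetical predictions from three base classifiers on three articles.
svm_preds = ["sports", "politics", "economy"]
nb_preds = ["sports", "economy", "economy"]
rf_preds = ["art", "politics", "sports"]

combined = vote_all([svm_preds, nb_preds, rf_preds])
print(combined)  # ['sports', 'politics', 'economy']
```

Under this assumption, each article receives the label that at least two of the three base classifiers agree on; other schemes, such as weighting votes by classifier confidence, would need a different combination rule.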

Description

Short title: Qatar content classification.

Long title: Develop methods and software for classifying Arabic texts into a taxonomy using machine learning.

Contact person and their contact information: Tarek Kanan, tarekk@vt.edu.

Project description: A project to advance digital libraries in the country of Qatar runs from 4/1/2012 through 12/31/2015. It is led by VT and also involves Penn State, Texas A&M, and Qatar University. Tarek is a GRA on this effort. His dissertation focuses on classifying Arabic texts into a taxonomy using machine learning. This will be done first for news, and then for other content areas.

Project deliverables: Arabic collections, taxonomies, classifiers, and results from experiments to find the best methods.

Support: Qatar National Research Fund Project No. NPRP 4-029-1-007.

Keywords

Qatar, Classification, SOLR, Weka, Arabic, Machine learning

Citation