Qatar content classification

dc.contributor.author: Handosa, Mohamed [en]
dc.date.accessioned: 2014-05-09T13:57:58Z [en]
dc.date.available: 2014-05-09T13:57:58Z [en]
dc.date.issued: 2014-05-09 [en]
dc.description: Short title: Qatar content classification. Long title: Develop methods and software for classifying Arabic texts into a taxonomy using machine learning. Contact person and their contact information: Tarek Kanan, tarekk@vt.edu. Project description: This project, which started on 4/1/2012 and runs through 12/31/2015, aims to advance digital libraries in Qatar. It is led by VT and also involves Penn State, Texas A&M, and Qatar University. Tarek is a GRA on this effort. His dissertation focuses on classifying Arabic texts into a taxonomy using machine learning, first for news and then for other content areas. Project deliverables: Arabic collections, taxonomies, classifiers, and results from experiments to find the best methods. Support: Qatar National Research Fund Project No. NPRP 4-029-1-007 [en]
dc.description.abstract: This report describes a term project for the CS660 Digital libraries course (Spring 2014), conducted under the supervision of Prof. Edward Fox and Mr. Tarek Kanan. The goal is to develop an Arabic newspaper article classifier. We built a collection of 700 Arabic newspaper articles and 1700 Arabic full-newspaper PDF files. We propose a stemmer named “P-Stemmer”. Evaluation has shown that P-Stemmer outperforms Larkey’s widely used light stemmer. Several classification techniques were tested on Arabic data, including SVM, Naïve Bayes, and Random Forest. We built and tested 21 multiclass classifiers, 15 binary classifiers, and 5 compound classifiers using the voting technique. Finally, we uploaded the classified instances to Apache Solr for indexing and searching. [en]
dc.description.sponsorship: Tarek Kanan [en]
dc.description.sponsorship: Support has been provided through Qatar National Research Fund Project No. NPRP 4-029-1-007 [en]
dc.identifier.uri: http://hdl.handle.net/10919/47934 [en]
dc.language.iso: en_US [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Qatar [en]
dc.subject: Classification [en]
dc.subject: SOLR [en]
dc.subject: Weka [en]
dc.subject: Arabic [en]
dc.subject: Machine learning [en]
dc.title: Qatar content classification [en]
dc.type: Presentation [en]
dc.type: Technical report [en]

Files

Original bundle
Now showing 1 - 5 of 6
Name: Final Presentation.pdf
Size: 747.87 KB
Format: Adobe Portable Document Format
Description: Final presentation of the project in PDF format

Name: Final Presentation.pptx
Size: 548.46 KB
Format: Microsoft Powerpoint XML
Description: Final presentation of the project in PPTX format

Name: Final Report.docx
Size: 1.42 MB
Format: Microsoft Word XML
Description: Final report of the project in DOCX format

Name: Final Report.pdf
Size: 2.36 MB
Format: Adobe Portable Document Format
Description: Final report of the project in PDF format

Name: Midterm Presentation.pdf
Size: 414.61 KB
Format: Adobe Portable Document Format
Description: Midterm presentation of the project in PDF format

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission