Qatar content classification
dc.contributor.author | Handosa, Mohamed | en |
dc.date.accessioned | 2014-05-09T13:57:58Z | en |
dc.date.available | 2014-05-09T13:57:58Z | en |
dc.date.issued | 2014-05-09 | en |
dc.description | Short title: Qatar content classification. Long title: Develop methods and software for classifying Arabic texts into a taxonomy using machine learning. Contact person and contact information: Tarek Kanan, tarekk@vt.edu. Project description: A project to advance digital libraries in Qatar runs from 4/1/2012 through 12/31/2015. It is led by VT and also involves Penn State, Texas A&M, and Qatar University. Tarek is a GRA on this effort. His dissertation focuses on classifying Arabic texts into a taxonomy using machine learning, first for news and then for other content areas. Project deliverables: Arabic collections, taxonomies, classifiers, and results from experiments to find the best methods. Support: Qatar National Research Fund Project No. NPRP 4-029-1-007 | en |
dc.description.abstract | This report describes a term project for the CS660 Digital libraries course (Spring 2014). The project was conducted under the supervision of Prof. Edward Fox and Mr. Tarek Kanan. The goal is to develop an Arabic newspaper article classifier. We built a collection of 700 Arabic newspaper articles and 1700 Arabic full-newspaper PDF files. A stemmer, named “P-Stemmer”, is proposed. Evaluation has shown that P-Stemmer outperforms Larkey’s widely used light stemmer. Several classification techniques were tested on Arabic data, including SVM, Naïve Bayes, and Random Forest. We built and tested 21 multiclass classifiers, 15 binary classifiers, and 5 compound classifiers using the voting technique. Finally, we uploaded the classified instances to Apache Solr for indexing and searching. | en |
dc.description.sponsorship | Tarek Kanan | en |
dc.description.sponsorship | Support has been provided through Qatar National Research Fund Project No. NPRP 4-029-1-007 | en |
dc.identifier.uri | http://hdl.handle.net/10919/47934 | en |
dc.language.iso | en_US | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Qatar | en |
dc.subject | Classification | en |
dc.subject | SOLR | en |
dc.subject | Weka | en |
dc.subject | Arabic | en |
dc.subject | Machine learning | en |
dc.title | Qatar content classification | en |
dc.type | Presentation | en |
dc.type | Technical report | en |
Files
Original bundle
1 - 5 of 6
- Name:
- Final Presentation.pdf
- Size:
- 747.87 KB
- Format:
- Adobe Portable Document Format
- Description:
- Final presentation of the project in PDF format
- Name:
- Final Presentation.pptx
- Size:
- 548.46 KB
- Format:
- Microsoft Powerpoint XML
- Description:
- Final presentation of the project in PPTX format
- Name:
- Final Report.docx
- Size:
- 1.42 MB
- Format:
- Microsoft Word XML
- Description:
- Final report of the project in DOCX format
- Name:
- Final Report.pdf
- Size:
- 2.36 MB
- Format:
- Adobe Portable Document Format
- Description:
- Final report of the project in PDF format
- Name:
- Midterm Presentation.pdf
- Size:
- 414.61 KB
- Format:
- Adobe Portable Document Format
- Description:
- Midterm presentation of the project in PDF format
License bundle
1 - 1 of 1
- Name:
- license.txt
- Size:
- 1.5 KB
- Format:
- Item-specific license agreed upon to submission
- Description: