Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy

dc.contributor.author: Kanan, Tarek (en)
dc.contributor.author: Fox, Edward A. (en)
dc.contributor.department: Computer Science (en)
dc.date.accessioned: 2015-02-06T15:27:52Z (en)
dc.date.available: 2015-02-06T15:27:52Z (en)
dc.date.issued: 2015-01-22 (en)
dc.description.abstract: Arabic news articles in electronic collections are difficult to work with. Browsing by category is rarely supported. While helpful machine learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a QNRF-funded project to build a digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237K Arabic news articles, which should be applicable to other Arabic news collections as well. We designed a simple taxonomy for Arabic news stories that is suitable for the needs in Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic-speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer) and automatic classification methods (the best being binary SVM classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques. (en)
dc.format.mimetype: application/pdf (en)
dc.identifier.trnumber: TR-15-01 (en)
dc.identifier.uri: http://hdl.handle.net/10919/51269 (en)
dc.language.iso: en (en)
dc.publisher: Department of Computer Science, Virginia Polytechnic Institute & State University (en)
dc.relation.ispartof: Computer Science Technical Reports (en)
dc.rights: In Copyright (en)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ (en)
dc.subject: Data and text mining (en)
dc.subject: Digital libraries (en)
dc.subject: Information retrieval (en)
dc.subject: Machine learning (en)
dc.title: Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy (en)
dc.type: Technical report (en)
dc.type.dcmitype: Text (en)

Files

Original bundle
Name: KananFoxArabicTextClassification.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: