Browsing by Author "Kanan, Tarek"
- Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy
  Kanan, Tarek; Fox, Edward A. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-01-22)
  Arabic news articles in electronic collections are difficult to work with. Browsing by category is rarely supported. While helpful machine learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a QNRF funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237K Arabic news articles, which should be applicable to other Arabic news collections as well. We designed a simple taxonomy for Arabic news stories that is suitable for the needs in Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic-speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer) and automatic classification methods (the best being binary SVM classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques.
- Crawling
  Fox, Edward A.; Khandeparker, Ashwin S. (2012-11-28)
  This module covers the basic concepts, policies, and techniques of Web crawling, and how these can be applied to digital libraries.
- Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles
  Kanan, Tarek; Ayoub, Souleiman; Saif, Eyad; Kanaan, Ghassan; Chandrasekarar, Prashant; Fox, Edward A. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015)
  This paper explains how to extract named entities and topics from Arabic news articles. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, we have built an Arabic NER (RenA) and an Arabic topic extraction tool using the popular LDA algorithm (ALDA). NER involves extracting information and identifying types, such as name, organization, and location. LDA works by applying statistical methods to vector representations of collections of documents. Though there are effective tools for NER and LDA for English, these are not directly applicable to Arabic. Accordingly, we developed new methods and tools (i.e., RenA and ALDA). To allow assessment of these, and comparison with other methods and tools, we built a baseline corpus to be used in NER evaluation, with help from volunteer graduate students who understand Arabic. RenA produces good results, with accurate Name, Organization, and Location extraction from news articles collected from online resources. We compared the RenA results with those of a popular Arabic NER and achieved an improvement. We also carried out an experiment to evaluate ALDA, again involving volunteer graduate students who understand Arabic. ALDA showed very good results in terms of topic extraction from Arabic news articles, achieving high accuracy, based on an experimental evaluation with participants using a Likert scale.
- Image Retrieval
  Kuppuswami, Nagarajan; Fox, Edward A. (2009-12-09)
  This module gives a basic explanation of image retrieval, the various techniques used, and how it works in existing systems.
- Information Retrieval System Evaluation
  Wei, Shiyi; Suwardiman, Victoria; Swaminathan, Anand (2012-10-03)
  This module introduces evaluation in information retrieval. It focuses on the standard measurement of system effectiveness through relevance judgments.
- LucidWorks: Advanced Searching cURL
  Makkapati, Hemanth; Subbiah, Rajesh; Kaw, Rushi (2012-10-07)
  This module focuses on advanced search techniques using Apache Solr through cURL. Successful completion of this module will enable students to employ advanced search techniques based on multi-values, multi-fields, phrase queries, query term proximity, boosting, etc. Also, students will be able to sort and display returned results in various ways.
- Preservation
  Fox, Edward A.; Kanan, Tarek (2009-12-13)
  This module covers the general ideas, strategies and challenges for the long-term preservation of digital information.
- Relevance Feedback and Query Expansion
  Wu, Sichao; Zhang, Yao (2012-10-17)
  This module introduces methods to improve the recall of information retrieval systems, focusing mainly on relevance feedback and query expansion.
- Text Classification Using Mahout
  Alam, Maksudul; Arifuzzaman, S. M.; Bhuiyan, Md Hasanuzzaman (2012-11-06)
  This module focuses on classification of text using Apache Mahout. After successful completion of this module, students will be able to explain and apply methods of classification, correctly classify a set of documents using Apache Mahout, and construct and apply workflows for text classification using Apache Mahout.
- Text Clustering Using LucidWorks and Apache Mahout
  Chen, Liangzhe; Lin, Xiao; Wood, Andrew (2012-11-17)
  This module introduces algorithms and evaluation metrics for flat clustering. We focus on using LucidWorks big data analysis software and Apache Mahout, an open source machine learning library, to cluster document collections with the k-means algorithm.
- Web Archiving
  Lee, Spencer; Kanan, Tarek; Jiao, Jian (2009-10-09)
  This module covers the ideas, approaches, problems and needs of web archiving to build a static, long-term collection of web pages.