Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles

TR Number
TR-15-02
Date
2015
Publisher
Department of Computer Science, Virginia Polytechnic Institute & State University
Abstract

This paper explains how to extract named entities and topics from Arabic news articles. Because high-quality tools for Named Entity Recognition (NER) and topic identification are lacking for Arabic, we built an Arabic NER tool (RenA) and an Arabic topic extraction tool (ALDA) based on the popular Latent Dirichlet Allocation (LDA) algorithm. NER involves extracting information from text and identifying entity types, such as name, organization, and location. LDA works by applying statistical methods to vector representations of collections of documents. Although effective NER and LDA tools exist for English, they are not directly applicable to Arabic, so we developed new methods and tools (i.e., RenA and ALDA). To allow assessment of these tools and comparison with other methods and tools, we built a baseline corpus for NER evaluation, with help from volunteer graduate students who understand Arabic. RenA produces good results, accurately extracting Name, Organization, and Location entities from news articles collected from online resources. We compared RenA's results with those of a popular Arabic NER tool, and RenA performed better. We also carried out an experiment to evaluate ALDA, again involving volunteer graduate students who understand Arabic. ALDA showed very good results in extracting topics from Arabic news articles, achieving high accuracy in an experimental evaluation with participants using a Likert scale.
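
As a rough illustration of the "vector representations of collections of documents" that LDA operates on, the following minimal sketch builds a document-term count matrix in Python. It is not the ALDA implementation described in the report. The function name `build_doc_term_matrix`, the whitespace tokenization, and the three toy documents are all assumptions made for illustration; a real Arabic pipeline would typically also need normalization, stop-word removal, and stemming.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    # Naive whitespace tokenization: an assumption for illustration only.
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical toy documents (economy, market, sports, team, match).
docs = [
    "اقتصاد سوق اقتصاد",
    "رياضة فريق مباراة",
    "سوق فريق",
]
vocabulary, matrix = build_doc_term_matrix(docs)
```

Each row of `matrix` is the bag-of-words vector for one document. A topic model such as LDA would take a matrix of this shape as input and estimate per-document topic mixtures and per-topic word distributions from it.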

Keywords
Arabic language, Named entity recognizer, Topic extraction, Latent Dirichlet allocation, Natural language processing (Computer science)