Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles

dc.contributor.author: Kanan, Tarek [en]
dc.contributor.author: Ayoub, Souleiman [en]
dc.contributor.author: Saif, Eyad [en]
dc.contributor.author: Kanaan, Ghassan [en]
dc.contributor.author: Chandrasekarar, Prashant [en]
dc.contributor.author: Fox, Edward A. [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2015-04-27T11:25:59Z [en]
dc.date.available: 2015-04-27T11:25:59Z [en]
dc.date.issued: 2015 [en]
dc.description.abstract: This paper explains how to extract named entities and topics from Arabic news articles. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, we have built an Arabic NER (RenA) and an Arabic topic extraction tool using the popular LDA algorithm (ALDA). NER involves extracting information and identifying types, such as name, organization, and location. LDA works by applying statistical methods to vector representations of collections of documents. Though there are effective tools for NER and LDA for English, these are not directly applicable to Arabic. Accordingly, we developed new methods and tools (i.e., RenA and ALDA). To allow assessment of these, and comparison with other methods and tools, we built a baseline corpus to be used in NER evaluation, with help from volunteer graduate students who understand Arabic. RenA produces good results, with accurate Name, Organization, and Location extraction from news articles collected from online resources. We compared the RenA results with those of a popular Arabic NER, and RenA achieved improved results. We also carried out an experiment to evaluate ALDA, again involving volunteer graduate students who understand Arabic. ALDA showed very good results in terms of topic extraction from Arabic news articles, achieving high accuracy, based on an experimental evaluation with participants using a Likert scale. [en]
dc.description.sponsorship: US National Science Foundation grants DUE-1141209 and IIS-1319578 [en]
dc.description.sponsorship: Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation) [en]
dc.format.mimetype: application/pdf [en]
dc.identifier.trnumber: TR-15-02 [en]
dc.identifier.uri: http://hdl.handle.net/10919/51822 [en]
dc.language.iso: en [en]
dc.publisher: Department of Computer Science, Virginia Polytechnic Institute & State University [en]
dc.relation.ispartof: Computer Science Technical Reports [en]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/ [en]
dc.subject: Arabic language [en]
dc.subject: Named entity recognizer [en]
dc.subject: Topic extraction [en]
dc.subject: Latent dirichlet allocation [en]
dc.subject: Natural language processing (Computer science) [en]
dc.title: Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles [en]
dc.type: Technical report [en]
dc.type.dcmitype: Text [en]

Files

Original bundle
Name: ArabicNewsKanan201504.pdf
Size: 1.64 MB
Format: Adobe Portable Document Format
Description: PDF of the technical report
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission