Browsing by Author "Ayoub, Souleiman"
- Arabic News Article Summarization
  Ayoub, Souleiman; Freeman, Julia (2015-05-14)
  This project takes Arabic PDF news articles and runs them through our new program, which indexes, categorizes, and summarizes them. Each article is summarized by filling out a template of predetermined attributes. The attribute values are extracted in three ways: a named entity recognizer (NER) identifies organizations and people, an LDA algorithm generates topics, and the articles’ authors and dates are extracted directly. We use Fusion LucidWorks (a Solr-based system) to index our data and to provide an interface for the user to search and browse the articles with their summaries. Solr is used for information retrieval. The final program should enable end users to sift through news articles quickly.
- Exploring the Blacksburg Community Events Collection
  Antol, Stanislaw; Ayoub, Souleiman; Folgar, Carlos; Smith, Steve (2014-12)
  With the advent of new technology, especially the combination of smart phones and widespread Internet access, people are increasingly absorbed in digital worlds that are not bounded by geography. As such, some people worry about what this means for local communities. The Virtual Town Square project is an effort to harness people's use of these kinds of social networks, but with a focus on local communities. As part of the Fall 2014 CS4984 Computational Linguistics course, we explored a collection of documents, the Blacksburg Events Collection, that were mined from the Virtual Town Square for the town of Blacksburg, Virginia. We describe our activities to summarize this collection to inform newcomers about the local community. We begin by describing our approach, which consisted of first cleaning our dataset and then applying Hierarchical Clustering to our collection. The core idea is to cluster the documents of our collection into sub-clusters, then cluster those sub-clusters, and finally cluster the sentences of the final sub-clusters. We choose the sentences closest to the final sentence sub-cluster centroids as our summaries. Some of the summary sentences capture very relevant information about specific events in the community, but our final results still have a fair bit of noise and are not very concise. We discuss some of the lessons that we learned over the course of the project, such as the importance of good project planning and of quickly iterating on actual solutions instead of just discussing the multitude of approaches that can be taken. We also provide suggestions to improve upon our approach, especially ways to clean up the final sentence summaries. The appendix contains a Developer’s Manual that describes the included files and the final code in detail.
- Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles
  Kanan, Tarek; Ayoub, Souleiman; Saif, Eyad; Kanaan, Ghassan; Chandrasekarar, Prashant; Fox, Edward A. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015)
  This paper explains how to extract named entities and topics from Arabic news articles. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, we have built an Arabic NER (RenA) and an Arabic topic extraction tool using the popular LDA algorithm (ALDA). NER extracts entity mentions and identifies their types, such as name, organization, and location. LDA works by applying statistical methods to vector representations of collections of documents. Though there are effective tools for NER and LDA for English, these are not directly applicable to Arabic. Accordingly, we developed new methods and tools (i.e., RenA and ALDA). To allow assessment of these, and comparison with other methods and tools, we built a baseline corpus to be used in NER evaluation, with help from volunteer graduate students who understand Arabic. RenA produces good results, with accurate Name, Organization, and Location extraction from news articles collected from online resources. We compared RenA's results with those of a popular Arabic NER tool and observed an improvement. We also carried out an experiment to evaluate ALDA, again involving volunteer graduate students who understand Arabic. ALDA showed very good results in terms of topic extraction from Arabic news articles, achieving high accuracy, based on an experimental evaluation with participants using a Likert scale.
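
The final selection step described in the Blacksburg Community Events Collection entry above (choosing the sentences closest to the sentence sub-cluster centroids) can be illustrated with a minimal sketch. This is not the project's code: the abstract does not say how sentences were vectorized or which distance was used, so the bag-of-words vectors, the cosine similarity, and the function names below are all assumptions made for illustration. The sketch also assumes the sentences have already been grouped into one final sub-cluster.

```python
import math
from collections import Counter


def vectorize(sentence):
    """Bag-of-words term counts (assumed representation, not from the source)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term weight across all vectors in one cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def closest_to_centroid(sentences):
    """Return the sentence whose vector is most similar to the cluster centroid."""
    vectors = [vectorize(s) for s in sentences]
    center = centroid(vectors)
    best = max(range(len(sentences)), key=lambda i: cosine(vectors[i], center))
    return sentences[best]


# Hypothetical sentence sub-cluster, invented for this example.
cluster = [
    "the farmers market opens saturday",
    "farmers market opens downtown saturday morning",
    "parking is free",
]
summary_sentence = closest_to_centroid(cluster)
```

Running `closest_to_centroid` once per final sentence sub-cluster yields one summary sentence per cluster. A sentence that shares many terms with the rest of its cluster scores highest, which is also why off-topic sentences grouped into a cluster can surface as noise in the summaries.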