LDA Team Project in CS5604, Spring 2015: Extracting Topics from Tweets and Webpages for IDEAL
Abstract
IDEAL (Integrated Digital Event Archiving and Library) is a Virginia Tech project to implement a state-of-the-art event-based information retrieval system. The practice project of CS 5604 Information Retrieval is part of the IDEAL project. Its main objective is to build a robust search engine on top of Solr, a general-purpose open-source search engine, and Hadoop, a big data processing platform. The search engine returns documents, which are tweets and webpages, that are relevant to a user's query. To enhance the performance of the search engine, the documents in the archive have been indexed using various approaches, including LDA (Latent Dirichlet Allocation), NER (Named-Entity Recognition), clustering, classification, and social network analysis. Because CS 5604 is a problem-based learning class, teams are responsible for developing and implementing solutions for each technique. This report presents the implementation of the LDA component. LDA helps extract collections of topics from the documents. A topic in this context is a set of words that can be used to represent a document. The report describes how LDA worked with both small and large collections. Once the LDA component is integrated with the other processing components and Solr, we are confident that the performance of the IDEAL information retrieval system will be enhanced.
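To make the idea of a topic as "a set of words" concrete, the following is a minimal sketch of LDA on a toy corpus. It is not the team's implementation, which the abstract describes only as running on Hadoop alongside Solr. It assumes a small collapsed Gibbs sampler written in plain Python, and the function names (`lda_gibbs`, `top_words`), the hyperparameters (`alpha`, `beta`), and the sample documents are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the report's code).

    docs is a list of token lists. Returns (topic_word, doc_topic), where
    topic_word[k] counts the words assigned to topic k and doc_topic[d][k]
    counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much the document uses it and
                # how much the topic uses this word (smoothed by alpha, beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return up to n most frequent words with a nonzero count per topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical, already-tokenized documents standing in for tweets.
docs = [
    ["flood", "rain", "storm", "flood"],
    ["storm", "rain", "wind", "flood"],
    ["election", "vote", "ballot", "vote"],
    ["ballot", "election", "vote", "poll"],
]

topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word))
print(doc_topic)
```

Each inner list printed by `top_words` is one topic, and each row of `doc_topic` shows how a document's tokens are spread across topics, which is the kind of per-document signal that could be stored as an index field. A sampler like this is only practical for small collections; large collections call for a distributed implementation.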