LDA Team Project in CS5604, Spring 2015: Extracting Topics from Tweets and Webpages for IDEAL

dc.contributor.author: Pumma, Sarunya [en]
dc.contributor.author: Liu, Xiaoyang [en]
dc.date.accessioned: 2015-05-15T04:06:51Z [en]
dc.date.available: 2015-05-15T04:06:51Z [en]
dc.date.issued: 2015-05-10 [en]
dc.description.abstract: IDEAL (Integrated Digital Event Archiving and Library) is a Virginia Tech project to implement a state-of-the-art event-based information retrieval system. The practice project of CS 5604 Information Retrieval is part of the IDEAL project. The main objective of this project is to build a robust search engine on top of Solr, a general-purpose open-source search engine, and Hadoop, a big data processing platform. The search engine returns documents, which are tweets and webpages, that are relevant to a user's query. To enhance the performance of the search engine, the documents in the archive have been indexed by various approaches, including LDA (Latent Dirichlet Allocation), NER (Named-Entity Recognition), clustering, classification, and social network analysis. As CS 5604 is a problem-based learning class, teams are responsible for implementing and developing solutions for each technique. This report presents the implementation of the LDA component. LDA helps extract collections of topics from the documents. A topic in this context is a set of words that can be used to represent a document. Details of how LDA worked with both small and large collections are described. Once the implementation of the LDA component is integrated with other processing and Solr, we are confident that the performance of the information retrieval system of the IDEAL project will be enhanced. [en]
dc.description.sponsorship: NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.identifier.uri: http://hdl.handle.net/10919/52343 [en]
dc.language.iso: en_US [en]
dc.rights: Creative Commons Attribution-ShareAlike 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/ [en]
dc.subject: LDA [en]
dc.subject: IDEAL Project [en]
dc.subject: Topic Extraction [en]
dc.subject: Tweets [en]
dc.subject: Webpages [en]
dc.title: LDA Team Project in CS5604, Spring 2015: Extracting Topics from Tweets and Webpages for IDEAL [en]
dc.type: Presentation [en]
dc.type: Software [en]
dc.type: Technical report [en]

Files

Original bundle
Name:
source_code.zip
Size:
5.13 KB
Format:
Unknown data format
Name:
LDAReport.pdf
Size:
920.28 KB
Format:
Adobe Portable Document Format
Description:
LDA Report PDF
Name:
LDAReport.pages
Size:
1.81 MB
Format:
Unknown data format
Name:
LDAPresentation.pdf
Size:
530.23 KB
Format:
Adobe Portable Document Format
Description:
LDA Presentation
Name:
LDAReport.docx
Size:
2.01 MB
Format:
Microsoft Word XML
Description:
LDA Report
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission
Description: