CS5604 Fall 2017 Clustering and Topic Analysis

dc.contributor.author: Baghudana, Ashish (en)
dc.contributor.author: Ahuja, Aman (en)
dc.contributor.author: Bellam, Pavan (en)
dc.contributor.author: Chintha, Rammohan (en)
dc.contributor.author: Sambaturu, Pratyush (en)
dc.contributor.author: Malpani, Ashish (en)
dc.contributor.author: Shetty, Shruti (en)
dc.contributor.author: Yang, Mo (en)
dc.date.accessioned: 2018-01-13T17:45:04Z (en)
dc.date.available: 2018-01-13T17:45:04Z (en)
dc.date.issued: 2018-01-13 (en)
dc.description.abstract: One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and to develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally outline a user manual describing the fields that we populate in HBase. (en)
dc.description.notes: Files provided include: cta-fall2017.pdf - PDF version of final report; cta-fall2017.zip - archive of LaTeX files used in Overleaf for the report; Clustering and Topic Analysis - Final Presentation.pdf - PDF version of final presentation; Clustering and Topic Analysis - Final Presentation.pptx - PowerPoint version of final presentation. (en)
dc.description.sponsorship: Global Event and Trend Archive Research (GETAR) project, supported by NSF IIS-1619028 (en)
dc.identifier.uri: http://hdl.handle.net/10919/81761 (en)
dc.language.iso: en_US (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication (en)
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ (en)
dc.subject: Clustering (en)
dc.subject: Topic Analysis (en)
dc.subject: Information Retrieval (en)
dc.subject: LDA (en)
dc.title: CS5604 Fall 2017 Clustering and Topic Analysis (en)
dc.type: Presentation (en)
dc.type: Report (en)

Files

Original bundle
Name: Clustering and Topic Analysis - Final Presentation.pdf
Size: 1.15 MB
Format: Adobe Portable Document Format

Name: Clustering and Topic Analysis - Final Presentation.pptx
Size: 935.63 KB
Format: Microsoft Powerpoint XML

Name: cta-fall2017.pdf
Size: 1.72 MB
Format: Adobe Portable Document Format

Name: cta-fall2017.zip
Size: 1.88 MB
Format:

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: