CS5604 Fall 2017 Clustering and Topic Analysis


One of the key objectives of the CS5604 course, Information Storage and Retrieval, is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to events and trends around the world, and to develop a retrieval system that extracts information from that archive.

Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that can group documents and summarize the meaningful information in them. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project.

Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. We then provide a developer manual to help set up our framework and, finally, a user manual describing the fields that we populate in HBase.



Clustering, Topic Analysis, Information Retrieval, LDA