CS5604 Fall 2017 Clustering and Topic Analysis
dc.contributor.author | Baghudana, Ashish | en |
dc.contributor.author | Ahuja, Aman | en |
dc.contributor.author | Bellam, Pavan | en |
dc.contributor.author | Chintha, Rammohan | en |
dc.contributor.author | Sambaturu, Pratyush | en |
dc.contributor.author | Malpani, Ashish | en |
dc.contributor.author | Shetty, Shruti | en |
dc.contributor.author | Yang, Mo | en |
dc.date.accessioned | 2018-01-13T17:45:04Z | en |
dc.date.available | 2018-01-13T17:45:04Z | en |
dc.date.issued | 2018-01-13 | en |
dc.description.abstract | One of the key objectives of the CS-5604 course titled Information Storage and Retrieval is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and develop a retrieval system to extract information from that archive. Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project. Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework, and finally, outline a user manual describing the fields that we populate in HBase. | en |
dc.description.notes | Files provided include: cta-fall2017.pdf - PDF version of final report; cta-fall2017.zip - archive of LaTeX files used in Overleaf for the report; Clustering and Topic Analysis - Final Presentation.pdf - PDF version of final presentation; Clustering and Topic Analysis - Final Presentation.pptx - PowerPoint version of final presentation. | en |
dc.description.sponsorship | Global Event and Trend Archive Research (GETAR) project, supported by NSF IIS-1619028 | en |
dc.identifier.uri | http://hdl.handle.net/10919/81761 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en |
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en |
dc.subject | Clustering | en |
dc.subject | Topic Analysis | en |
dc.subject | Information Retrieval | en |
dc.subject | LDA | en |
dc.title | CS5604 Fall 2017 Clustering and Topic Analysis | en |
dc.type | Presentation | en |
dc.type | Report | en |
Files
Original bundle
1 - 4 of 4
- Name: Clustering and Topic Analysis - Final Presentation.pdf
- Size: 1.15 MB
- Format: Adobe Portable Document Format
- Name: Clustering and Topic Analysis - Final Presentation.pptx
- Size: 935.63 KB
- Format: Microsoft Powerpoint XML
License bundle
1 - 1 of 1
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission
- Description: