Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016
dc.contributor.author | Bartolome, Abigail | en |
dc.contributor.author | Islam, M. D. | en |
dc.contributor.author | Vundekode, Soumya | en |
dc.date.accessioned | 2016-12-18T02:22:28Z | en |
dc.date.available | 2016-12-18T02:22:28Z | en |
dc.date.issued | 2016-12-08 | en |
dc.description.abstract | The Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects aim to build a robust Information Retrieval (IR) system by collecting tweets and webpages from social media and the World Wide Web and indexing them so that they can be easily retrieved and analyzed. The project is divided into segments: Classification (CLA), Collection Management (CMT for tweets and CMW for webpages), Clustering and Topic Analysis (CTA), SOLR, and Front-End (FE). In building IR systems, documents are scored for relevance. To assist in determining a document’s relevance to a query, it is useful to know what topics are associated with the document and what other documents relate to it. We, as the CTA team, used topic analysis and clustering techniques to aid in building this IR system. Our contributions were useful in scoring which documents are most relevant to a user’s query. We ran clustering and topic analysis algorithms on collections of tweets and webpages to identify the most discussed topics and grouped them into clusters along with their respective probabilities. We also labeled the topics and clusters, aiming for intuitive labels. The report and presentation cover the background, requirements, design, and implementation of our contributions to this project. We evaluated the quality of our methodologies and describe improvements and future work that could extend our project. Furthermore, we include a user manual and a developer manual to assist in any future work that may come from our efforts. | en |
dc.description.notes | CTA_presentation.pdf: final presentation in PDF; CTA_presentation.pptx: final presentation in PPT; CTA_FinalReport.pdf: final report in PDF; CTA_FinalReport.docx: final report in DOCX; CTA_source_code.zip: source code required for the completion of the project; CTA_results.zip: final clusters and final topics for real-world events | en |
dc.description.sponsorship | NSF: IIS-1319578 | en |
dc.description.sponsorship | NSF: IIS-1619028 | en |
dc.identifier.uri | http://hdl.handle.net/10919/73712 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/us/ | en |
dc.subject | information retrieval | en |
dc.subject | clustering | en |
dc.subject | topic analysis | en |
dc.title | Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016 | en |
dc.type | Dataset | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
Files
Original bundle
License bundle
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission
- Description: