Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016

Bartolome, Abigail; Islam, M. D.; Vundekode, Soumya

dc.contributor.author	Bartolome, Abigail	en
dc.contributor.author	Islam, M. D.	en
dc.contributor.author	Vundekode, Soumya	en
dc.date.accessioned	2016-12-18T02:22:28Z	en
dc.date.available	2016-12-18T02:22:28Z	en
dc.date.issued	2016-12-08	en
dc.description.abstract	The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects aim to build a robust Information Retrieval (IR) system by collecting tweets and webpages from social media and the World Wide Web, and indexing them so that they can be easily retrieved and analyzed. The project has been divided into several segments: Classification (CLA), Collection Management (CMT for tweets and CMW for webpages), Clustering and Topic Analysis (CTA), SOLR, and Front-End (FE). In building IR systems, documents are scored for relevance. To assist in determining a document’s relevance to a query, it is useful to know what topics are associated with the document and what other documents relate to it. We, as the CTA team, used topic analysis and clustering techniques to aid in building this IR system. Our contributions were useful in scoring which documents are most relevant to a user’s query. We ran clustering and topic analysis algorithms on collections of tweets and webpages to identify the most discussed topics, and grouped them into clusters along with their respective probabilities. We also labeled the topics and clusters, aiming for intuitive labels. The report and presentation cover the background, requirements, design, and implementation of our contributions to this project. We evaluated the quality of our methodologies and describe improvements and future work that could extend our project. Furthermore, we include a user manual and a developer manual to assist in any future work that may come from our efforts.	en
dc.description.notes	CTA_presentation.pdf: final presentation in PDF; CTA_presentation.pptx: final presentation in PPT; CTA_FinalReport.pdf: final report in PDF; CTA_FinalReport.docx: final report in docx format; CTA_source_code.zip: source code required for the completion of the project; CTA_results.zip: final clusters and final topics for real world events	en
dc.description.sponsorship	NSF: IIS-1319578	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/73712	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	en
dc.subject	information retrieval	en
dc.subject	clustering	en
dc.subject	topic analysis	en
dc.title	Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en