CS5604: Clustering and Social Networks for IDEAL

Vishwasrao, Saket; Thorve, Swapna; Tang, Lijie

dc.contributor.author	Vishwasrao, Saket	en
dc.contributor.author	Thorve, Swapna	en
dc.contributor.author	Tang, Lijie	en
dc.date.accessioned	2016-05-10T01:12:12Z	en
dc.date.available	2016-05-10T01:12:12Z	en
dc.date.issued	2016-05-03	en
dc.description	This repository presents the work done as part of the course CS5604 by the Clustering and Social Networks team. The project report explains the techniques used in detail and provides a manual for reproducing our work. All the source code is provided as part of this repository. A separate Excel document contains the clustering evaluation statistics.	en
dc.description.abstract	The Integrated Digital Event Archiving and Library (IDEAL) project of Virginia Tech provides services for searching, browsing, analysis, and visualization of over 1 billion tweets and over 65 million webpages. The project development involved a problem-based learning approach that aims to build a state-of-the-art information retrieval system in support of IDEAL. With the primary objective of building a robust search engine on top of Solr, the entire project is divided into segments such as classification, clustering, and topic modeling, each aimed at improving search results. Our team focuses on two tasks: clustering and social networks. Both tasks are treated as independent for now. The clustering task aims to group documents such that documents within a cluster are as similar as possible. Documents are tweets and webpages, and we present results for different collections. The k-means algorithm is employed for clustering the documents. Two methods were employed for feature extraction: TF-IDF scores and word2vec. Clusters are evaluated by two methods: Within Set Sum of Squares (WSSE), and analysis of the output of the topic analysis team to extract cluster labels and find probability scores for a document. The latter strategy is a novel approach to evaluation. It can be used to assess cluster labeling, the likelihood of a document belonging to a cluster, and the hierarchical distribution of topics and clusters. The social networking task will extract information from Twitter data by building graphs. Graph theory concepts will be applied to accomplish this task. Future work could improve on our existing results by using dimensionality reduction techniques and probabilistic clustering algorithms, and by improving cluster labeling and evaluation.
Also, the clusters that we have generated can be used as an input source in classification, topic analysis, and collaborative filtering for more accurate results.	en
dc.description.sponsorship	NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)	en
dc.identifier.uri	http://hdl.handle.net/10919/70947	en
dc.language.iso	en_US	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Clustering	en
dc.subject	Social Networks	en
dc.subject	IDEAL	en
dc.title	CS5604: Clustering and Social Networks for IDEAL	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en