dc.contributor.author: Vishwasrao, Saket
dc.contributor.author: Thorve, Swapna
dc.contributor.author: Tang, Lijie
dc.date.accessioned: 2016-05-10T01:12:12Z
dc.date.available: 2016-05-10T01:12:12Z
dc.date.issued: 2016-05-03
dc.identifier.uri: http://hdl.handle.net/10919/70947
dc.description: This repository presents the work done as part of the CS5604 course by the Clustering and Social Networks team. The project report explains the techniques used in detail and provides a manual for implementing our work. All source code is provided as part of this repository. A separate Excel document contains the clustering evaluation statistics. [en_US]
dc.description.abstract: The Integrated Digital Event Archiving and Library (IDEAL) project of Virginia Tech provides services for searching, browsing, analysis, and visualization of over 1 billion tweets and over 65 million webpages. Project development followed a problem-based learning approach, with the aim of building a state-of-the-art information retrieval system in support of IDEAL. With the primary objective of building a robust search engine on top of Solr, the project is divided into segments such as classification, clustering, and topic modeling, each intended to improve search results. Our team focuses on two tasks: clustering and social networks. The two tasks are treated as independent for now. The clustering task aims to group documents so that documents within a cluster are as similar as possible. The documents are tweets and webpages, and we present results for different collections. The k-means algorithm is used to cluster the documents. Two methods are used for feature extraction: TF-IDF scores and word2vec. Clusters are evaluated in two ways: by the Within Set Sum of Squared Errors (WSSE), and by analyzing the output of the topic analysis team to extract cluster labels and find probability scores for a document. The latter strategy is a novel approach to evaluation. It can be used to assess cluster labeling, the likelihood of a document belonging to a cluster, and the hierarchical distribution of topics and clusters. The social networking task will extract information from Twitter data by building graphs and applying graph theory concepts. Future work could apply dimensionality reduction techniques and probabilistic clustering algorithms, and improve cluster labeling and evaluation. The clusters we have generated can also serve as input to Classification, Topic Analysis, and Collaborative Filtering for more accurate results. [en_US]
dc.description.sponsorship: NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en_US]
dc.language.iso: en_US [en_US]
dc.rights: CC0 1.0 Universal [*]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [*]
dc.subject: Clustering [en_US]
dc.subject: Social Networks [en_US]
dc.subject: IDEAL [en_US]
dc.title: CS5604: Clustering and Social Networks for IDEAL [en_US]
dc.type: Presentation [en_US]
dc.type: Software [en_US]
dc.type: Technical report [en_US]


License: CC0 1.0 Universal