CS5604: Clustering and Social Networks for IDEAL

dc.contributor.author: Vishwasrao, Saket [en]
dc.contributor.author: Thorve, Swapna [en]
dc.contributor.author: Tang, Lijie [en]
dc.date.accessioned: 2016-05-10T01:12:12Z [en]
dc.date.available: 2016-05-10T01:12:12Z [en]
dc.date.issued: 2016-05-03 [en]
dc.description: This repository presents the work done as part of the course CS5604 by the Clustering and Social Networks team. The project report explains the techniques used in detail and provides a manual for implementing our work. All the source code is provided as part of this repository. A separate Excel document contains the clustering evaluation statistics. [en]
dc.description.abstract: The Integrated Digital Event Archiving and Library (IDEAL) project of Virginia Tech provides services for searching, browsing, analysis, and visualization of over 1 billion tweets and over 65 million webpages. The project development involved a problem-based learning approach that aims to build a state-of-the-art information retrieval system in support of IDEAL. With the primary objective of building a robust search engine on top of Solr, the project is divided into segments such as classification, clustering, and topic modeling, each intended to improve search results. Our team focuses on two tasks: clustering and social networks. The two tasks are treated as independent for now. The clustering task aims to group documents so that documents within a cluster are as similar as possible. Documents are tweets and webpages, and we present results for different collections. The k-means algorithm is employed for clustering the documents. Two methods were employed for feature extraction: TF-IDF scores and word2vec. Clusters are evaluated by two methods: Within Set Sum of Squared Errors (WSSE), and analysis of the output of the topic analysis team to extract cluster labels and find probability scores for a document. The latter strategy is a novel approach to evaluation. It can be used for assessing cluster labeling, the likelihood of a document belonging to a cluster, and the hierarchical distribution of topics and clusters. The social networking task will extract information from Twitter data by building graphs and applying graph theory concepts. Possible future improvements to our existing work include using dimensionality reduction techniques and probabilistic clustering algorithms, and improving cluster labeling and evaluation. Also, the clusters that we have generated can be used as an input source in classification, topic analysis, and collaborative filtering for more accurate results. [en]
dc.description.sponsorship: NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.identifier.uri: http://hdl.handle.net/10919/70947 [en]
dc.language.iso: en_US [en]
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication [en]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [en]
dc.subject: Clustering [en]
dc.subject: Social Networks [en]
dc.subject: IDEAL [en]
dc.title: CS5604: Clustering and Social Networks for IDEAL [en]
dc.type: Presentation [en]
dc.type: Software [en]
dc.type: Technical report [en]
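
The abstract describes a pipeline of TF-IDF feature extraction, k-means clustering, and evaluation by Within Set Sum of Squared Errors. The following is a minimal sketch of that pipeline in plain Python, offered only as an illustration. It is not the team's implementation (their source code is distributed in ClusteringSN_Code.zip). The function names, the smoothed IDF formula, the toy documents, and the choice of k=2 are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; other IDF variants are equally valid).
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def wsse(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical toy collection: two weather-like and two election-like documents.
docs = [
    "storm flood rain storm".split(),
    "flood rain river flood".split(),
    "election vote ballot".split(),
    "vote ballot election poll".split(),
]
vocab, vectors = tfidf_vectors(docs)
centroids, assignments = kmeans(vectors, k=2)
cost = wsse(vectors, centroids)
print("assignments:", assignments)
print("WSSE:", round(cost, 4))
```

A lower WSSE means tighter clusters, but the value also falls as k grows, so it is usually compared across several values of k rather than read in isolation.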

Files

Original bundle
Now showing 1 - 5 of 6
Name: ClusteringSN_Code.zip
Size: 39.58 MB
Format:
Description: Source Code for clustering, evaluation of results and social networks

Name: ClusteringPresentation.pdf
Size: 1.64 MB
Format: Adobe Portable Document Format
Description: Presentation (PDF)

Name: ClusteringReport.pdf
Size: 3.4 MB
Format: Adobe Portable Document Format
Description: Project Report (PDF)

Name: ClusteringReport.docx
Size: 4.19 MB
Format: Microsoft Word XML
Description: Project Report (WORD)

Name: ClusteringPresentation.pptx
Size: 1.38 MB
Format: Microsoft Powerpoint XML
Description: Presentation (PowerPoint)

License bundle
Now showing 1 - 1 of 1
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: