Document Clustering for IDEAL

Thumma, Sujit Reddy; Kalidas, Rubasri; Torkey, Hanaa


dc.contributor.author	Thumma, Sujit Reddy	en
dc.contributor.author	Kalidas, Rubasri	en
dc.contributor.author	Torkey, Hanaa	en
dc.date.accessioned	2015-05-15T04:06:25Z	en
dc.date.available	2015-05-15T04:06:25Z	en
dc.date.issued	2015-05-13	en
dc.description.abstract	Document clustering is the unsupervised classification of text documents into groups (clusters). Documents with similar properties are grouped into the same cluster, and documents with dissimilar patterns are placed in different clusters. Clustering deals with finding structure in a collection of unlabeled data. The main goal of this project is to enhance Solr search results with the help of offline data clustering. In our project, we propose to iterate on and optimize clustering results using various clustering algorithms and techniques. Specifically, we evaluate the K-Means, Streaming K-Means, and Fuzzy K-Means algorithms available in the Apache Mahout software package. Our data consists of tweet archives and web page archives related to tweets. Document clustering involves data pre-processing, data clustering using clustering algorithms, and data post-processing. The final output, which includes document ID, cluster ID, and cluster label, is stored in HBase for further indexing into the Solr search engine. Solr search recall is enhanced by boosting document relevance scores based on the clustered sets of documents. We propose three metrics to evaluate the cluster results: Silhouette scores, a confusion matrix with homogeneous labelled data, and human judgement. To optimize the clustering results, we identify various tunable parameters that are input to the clustering algorithms and demonstrate the effectiveness of those tuning parameters. Finally, we have automated the entire clustering pipeline using several scripts and deployed them on a Hadoop cluster for large-scale data clustering of tweet and webpage collections.	en
dc.description.sponsorship	US National Science Foundation, grant IIS-1319578.	en
dc.identifier.uri	http://hdl.handle.net/10919/52341	en
dc.language.iso	en_US	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	document clustering	en
dc.subject	clustering	en
dc.subject	mahout	en
dc.subject	k-means	en
dc.subject	IDEAL	en
dc.title	Document Clustering for IDEAL	en
dc.title.alternative	Document Clustering for IDEAL Project	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en