Document Clustering for IDEAL

Date
2015-05-13
Author
Thumma, Sujit Reddy
Kalidas, Rubasri
Torkey, Hanaa
Abstract
Document clustering is the unsupervised classification of text documents into groups
(clusters). Documents with similar properties are grouped together into one cluster, and
documents with dissimilar patterns are placed in different clusters. Clustering
deals with finding structure in a collection of unlabeled data. The main goal of this
project is to enhance Solr search results with the help of offline data clustering. In our
project, we iterate on and optimize clustering results using various clustering
algorithms and techniques. Specifically, we evaluate the K-Means, Streaming K-Means, and
Fuzzy K-Means algorithms available in the Apache Mahout software package. Our data
consists of tweet archives and web page archives related to tweets. Document clustering
involves three stages: data pre-processing, data clustering using clustering algorithms,
and data post-processing. The final output, which includes the document ID, cluster ID,
and cluster label, is stored in HBase for further indexing into the Solr search engine.
Solr search recall is enhanced by boosting document relevance scores based on the
clustered sets of documents. We propose three metrics to evaluate the cluster results:
Silhouette scores, a confusion matrix on homogeneous labelled data, and human judgement.
To optimize the clustering results, we identify various tunable parameters that are input
to the clustering algorithms and demonstrate the effectiveness of those tuning
parameters. Finally, we have automated the entire clustering pipeline using several
scripts and deployed them on a Hadoop cluster for large-scale data clustering of tweet
and webpage collections.
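
As an illustration of the first evaluation metric named in the abstract, the following is a minimal sketch of a Silhouette score computation. It is not the project's code: the project runs on Apache Mahout and Hadoop, whereas this sketch is plain Python, assumes documents have already been turned into small numeric vectors, and assumes Euclidean distance. The function names `euclidean` and `silhouette_score` are chosen here for illustration only.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to the members of any other
    cluster. The point's coefficient is (b - a) / max(a, b).
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a coefficient of 0.
            scores.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up for this sketch): two well-separated groups of vectors.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 suggest that documents sit closer to a neighbouring cluster than to their own. This is the kind of signal that can be compared across runs when tuning clustering parameters.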