Document Clustering for IDEAL

dc.contributor.author: Thumma, Sujit Reddy
dc.contributor.author: Kalidas, Rubasri
dc.contributor.author: Torkey, Hanaa
dc.date.accessioned: 2015-05-15T04:06:25Z
dc.date.available: 2015-05-15T04:06:25Z
dc.date.issued: 2015-05-13
dc.description.abstract: Document clustering is an unsupervised classification of text documents into groups (clusters). Documents with similar properties are grouped together into one cluster, and documents with dissimilar patterns are grouped into different clusters. Clustering deals with finding structure in a collection of unlabeled data. The main goal of this project is to enhance Solr search results with the help of offline data clustering. In our project, we propose to iterate on and optimize clustering results using various clustering algorithms and techniques. Specifically, we evaluate the K-Means, Streaming K-Means, and Fuzzy K-Means algorithms available in the Apache Mahout software package. Our data consists of tweet archives and web page archives related to tweets. Document clustering involves data pre-processing, data clustering using clustering algorithms, and data post-processing. The final output, which includes document ID, cluster ID, and cluster label, is stored in HBase for further indexing into the Solr search engine. Solr search recall is enhanced by boosting document relevance scores based on the clustered sets of documents. We propose three metrics to evaluate the cluster results: Silhouette scores, a confusion matrix with homogeneous labelled data, and human judgement. To optimize the clustering results, we identify various tunable parameters that are input to the clustering algorithms and demonstrate the effectiveness of those tuning parameters. Finally, we have automated the entire clustering pipeline using several scripts and deployed them on a Hadoop cluster for large-scale data clustering of tweet and webpage collections.
dc.description.sponsorship: US National Science Foundation, grant IIS-1319578.
dc.identifier.uri: http://hdl.handle.net/10919/52341
dc.language.iso: en_US
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/
dc.subject: document clustering
dc.subject: clustering
dc.subject: mahout
dc.subject: k-means
dc.subject: IDEAL
dc.title: Document Clustering for IDEAL
dc.title.alternative: Document Clustering for IDEAL Project
dc.type: Presentation
dc.type: Software
dc.type: Technical report

Files

Original bundle
Name:
ClusteringCodeFiles.zip
Size:
97.7 MB
Format:
Adobe Portable Document Format
Description:
Clustering project source code and binary files
Name:
ClusteringPresentation.pdf
Size:
346.45 KB
Format:
Adobe Portable Document Format
Description:
Clustering project presentation in pdf format
Name:
ClusteringPresentation.pptx
Size:
167.58 KB
Format:
Microsoft Powerpoint XML
Description:
Clustering project presentation in pptx format
Name:
ClusteringReport.pdf
Size:
1.16 MB
Format:
Adobe Portable Document Format
Description:
Clustering project technical report in pdf format
Name:
ClusteringReportLatex.zip
Size:
1.4 MB
Format:
Adobe Portable Document Format
Description:
Clustering project technical report Latex source files
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission