Document Clustering for IDEAL
dc.contributor.author | Thumma, Sujit Reddy | en |
dc.contributor.author | Kalidas, Rubasri | en |
dc.contributor.author | Torkey, Hanaa | en |
dc.date.accessioned | 2015-05-15T04:06:25Z | en |
dc.date.available | 2015-05-15T04:06:25Z | en |
dc.date.issued | 2015-05-13 | en |
dc.description.abstract | Document clustering is the unsupervised classification of text documents into groups (clusters). Documents with similar properties are grouped into the same cluster, and documents with dissimilar patterns are placed in different clusters. Clustering thus deals with finding structure in a collection of unlabeled data. The main goal of this project is to enhance Solr search results with the help of offline data clustering. We iterate on and optimize clustering results using various clustering algorithms and techniques. Specifically, we evaluate the K-Means, Streaming K-Means, and Fuzzy K-Means algorithms available in the Apache Mahout software package. Our data consists of tweet archives and web page archives related to tweets. Document clustering involves data pre-processing, data clustering using clustering algorithms, and data post-processing. The final output, which includes document ID, cluster ID, and cluster label, is stored in HBase for further indexing into the Solr search engine. Solr search recall is enhanced by boosting document relevance scores based on the clustered sets of documents. We propose three metrics to evaluate the cluster results: Silhouette scores, a confusion matrix with homogeneous labelled data, and human judgement. To optimize the clustering results, we identify various tunable parameters that are input to the clustering algorithms and demonstrate the effectiveness of those tuning parameters. Finally, we have automated the entire clustering pipeline using several scripts and deployed them on a Hadoop cluster for large-scale data clustering of tweet and webpage collections. | en |
dc.description.sponsorship | US National Science Foundation, grant IIS-1319578. | en |
dc.identifier.uri | http://hdl.handle.net/10919/52341 | en |
dc.language.iso | en_US | en |
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en |
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en |
dc.subject | document clustering | en |
dc.subject | clustering | en |
dc.subject | mahout | en |
dc.subject | k-means | en |
dc.subject | IDEAL | en |
dc.title | Document Clustering for IDEAL | en |
dc.title.alternative | Document Clustering for IDEAL Project | en |
dc.type | Presentation | en |
dc.type | Software | en |
dc.type | Technical report | en |
Files
Original bundle
- Name:
- ClusteringCodeFiles.zip
- Size:
- 97.7 MB
- Format:
- Adobe Portable Document Format
- Description:
- Clustering project source code and binary files
- Name:
- ClusteringPresentation.pdf
- Size:
- 346.45 KB
- Format:
- Adobe Portable Document Format
- Description:
- Clustering project presentation in pdf format
- Name:
- ClusteringPresentation.pptx
- Size:
- 167.58 KB
- Format:
- Microsoft Powerpoint XML
- Description:
- Clustering project presentation in pptx format
- Name:
- ClusteringReport.pdf
- Size:
- 1.16 MB
- Format:
- Adobe Portable Document Format
- Description:
- Clustering project technical report in pdf format
- Name:
- ClusteringReportLatex.zip
- Size:
- 1.4 MB
- Format:
- Adobe Portable Document Format
- Description:
- Clustering project technical report Latex source files
License bundle
- Name:
- license.txt
- Size:
- 1.5 KB
- Format:
- Item-specific license agreed upon to submission
- Description: