Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion

Yan, Mingjin

dc.contributor.author	Yan, Mingjin	en
dc.contributor.committeechair	Ye, Keying	en
dc.contributor.committeemember	Prins, Samantha C. Bates	en
dc.contributor.committeemember	Spitzner, Dan J.	en
dc.contributor.committeemember	Smith, Eric P.	en
dc.contributor.department	Statistics	en
dc.date.accessioned	2014-03-14T20:19:52Z	en
dc.date.adate	2005-12-29	en
dc.date.available	2014-03-14T20:19:52Z	en
dc.date.issued	2005-11-28	en
dc.date.rdate	2006-12-29	en
dc.date.sdate	2005-12-06	en
dc.description.abstract	In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters, which has a decisive effect on the clustering results. However, no convincingly acceptable solution to the best-number-of-clusters problem is currently available, owing to the high complexity of real data sets. In this dissertation, we tackle the problem of estimating the number of clusters, with particular attention to very complicated data that may contain multiple types of cluster structure. Two new methods of choosing the number of clusters are proposed, which have been shown empirically to be highly effective when a data set has clear and distinct cluster structure. In addition, we propose a sequential clustering approach, called multi-layer clustering, that combines these two methods. Multi-layer clustering not only serves as an efficient method of estimating the number of clusters but also, by superimposing a sequential idea, improves the flexibility and effectiveness of any existing one-layer clustering method. Empirical studies have shown that multi-layer clustering is more efficient than one-layer clustering approaches, especially in detecting clusters in complicated data sets. The multi-layer clustering approach has been successfully applied to clustering the WTCHP microarray data, and the results can be interpreted well in light of existing biological knowledge. Choosing an appropriate clustering method is another critical step in clustering. K-means clustering is one of the most popular clustering techniques used in practice. However, the k-means method tends to generate clusters containing a nearly equal number of objects, which is referred to as the "equal-size" problem. We propose a clustering method that competes with the k-means method. Our newly defined method aims to overcome the "equal-size" problem associated with the k-means method while maintaining its advantage of computational simplicity. Advantages of the proposed method over k-means clustering have been demonstrated empirically using simulated data with low dimensionality.	en
dc.description.degree	Ph. D.	en
dc.identifier.other	etd-12062005-153906	en
dc.identifier.sourceurl	http://scholar.lib.vt.edu/theses/available/etd-12062005-153906/	en
dc.identifier.uri	http://hdl.handle.net/10919/29957	en
dc.publisher	Virginia Tech	en
dc.relation.haspart	Proposal-Face.pdf	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Gap statistic	en
dc.subject	Multi-layer clustering	en
dc.subject	DD-weighted gap statistic	en
dc.subject	Cluster analysis	en
dc.subject	Weighted gap statistic	en
dc.subject	Number of clusters	en
dc.subject	K-means clustering	en
dc.title	Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion	en
dc.type	Dissertation	en
thesis.degree.discipline	Statistics	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: Proposal-Face.pdf
Size: 938.17 KB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations