Technical Reports, Statistics
Browsing Technical Reports, Statistics by Author "Farrar, David"
- Approaches to the Label-Switching Problem of Classification, Based on Partition-Space Relabeling and Label-Invariant Visualization
  Farrar, David (Virginia Tech, 2006-07-15)
  In the context of interest, a method of cluster analysis is used to classify a set of units into a fixed number of classes. Simulation procedures with various conceptual foundations may be used to evaluate uncertainty, stability, or sampling error of such a classification. However, simulation approaches may be subject to a label-switching problem when a likelihood function, posterior density, or some other objective function is invariant under permutation of class labels. We suggest a relabeling algorithm that maximizes a simple measure of agreement among classifications. However, it is known that effective summaries and visualization tools can be based on sample concurrence fractions, which we define as sample fractions with given pairs of units falling in the same cluster, and which are invariant under permutation of class labels. We expand the study of concurrence fractions by presenting a matrix theory, which is employed in relabeling as well as in elaboration of visualization tools. We explore an ordination approach treating concurrence fractions as similarities between pairs of units. A matrix result supports straightforward application of the method of principal coordinates, leading to ordination plots in which Euclidean distances between pairs of units have a simple relationship to concurrence fractions. The use of concurrence fractions complements relabeling by providing an efficient initial labeling.
- Clustering Monitoring Stations Based on Two Rank-Based Criteria of Similarity of Temporal Profiles
  Farrar, David; Smith, Eric (Virginia Tech, 2006-09)
  To support evaluation of water quality trends, a water quality variable may be measured at a series of points in time, at multiple stations. Summarization of such data and detection of spatiotemporal patterns may benefit from the application of multivariate methods. We propose hierarchical cluster analysis methods that group stations according to similarities among temporal profiles, relying on standard clustering algorithms combined with two proposed, rank-based criteria of similarity. An approach complementary to standard environmental trend evaluation relies on the incremental sum of squares clustering algorithm and a criterion of similarity related to a standard test for trend heterogeneity. Relevance to the context of trend evaluation is enhanced by transforming dendrogram edge lengths to reflect cluster homogeneity according to a standard test. However, the standard homogeneity criterion may not be sensitive to patterns with possible practical significance, such as region-specific reversal in the sign of a trend. We introduce a second criterion, which is based on concordance of changes in the water quality variable between pairs of stations from one measurement time to the next, that may be sensitive to a wider range of patterns. Our suggested criteria are illustrated and compared based on application to measurements of dissolved oxygen in the James River of Virginia, USA. Results have limited similarity between the two methods, but agree in identifying a cluster associated with a locality that is characterized by pronounced negative trends at multiple stations.
- A Finite Mixture Approach for Identification of Geographic Regions with Distinctive Ecological Stressor-Response Relationships
  Farrar, David; Prins, Samantha C. Bates; Smith, Eric P. (Virginia Tech, 2006)
  We study a model-based clustering procedure that aims to identify geographic regions with distinctive relationships among ecological and environmental variables. We use a finite mixture model with a distinct linear regression model for each mixture component, relating a measure of environmental quality to multiple regressors. Component-specific values of regression coefficients are allowed, for a common set of regressors. We implement Bayesian inference jointly for the true partition and component regression parameters. We assume a known, prior classification of measurement locations into “clustering units,” where measurement locations belong to the same mixture component if they belong to the same clustering unit. A Metropolis algorithm, derived from a well-known Gibbs sampler, is used to sample the posterior distribution. Our approach to the label switching problem relies on constraints on cluster membership, selected based on statistics and graphical displays that do not depend upon cluster indexing. Our approach is applied to data representing streams and rivers in the state of Ohio, equating clustering units to river basins. The results appear to be interpretable given geographic features of possible ecological significance.
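
The sample concurrence fractions defined in the first report above (the fraction of sampled classifications in which a given pair of units falls in the same cluster) can be illustrated with a minimal sketch. This is not the author's implementation: the function name `concurrence_fractions` and the toy label array are assumptions made only for illustration, and the sketch covers just the definition and its invariance under relabeling, not the report's matrix theory or relabeling algorithm.

```python
import numpy as np

def concurrence_fractions(labels):
    """Pairwise concurrence fractions (hypothetical helper, for illustration).

    labels: integer array of shape (n_samples, n_units); row s holds the
            class label assigned to each unit in simulated classification s.
    Returns an (n_units, n_units) matrix whose (i, j) entry is the fraction
    of samples in which units i and j share a class label.
    """
    labels = np.asarray(labels)
    # same[s, i, j] is True when units i and j share a label in sample s.
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Toy example: the second row is the first row with labels 0 and 1 swapped,
# which is exactly the label-switching situation.
labels = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])
C = concurrence_fractions(labels)

# Swapping the labels within every sample leaves the matrix unchanged.
C_swapped = concurrence_fractions(1 - labels)
```

Because each entry depends only on whether two units share a label, not on which label they share, `C` and `C_swapped` are identical. Under this definition, units 0 and 1 fall together in two of the three toy samples, so `C[0, 1]` is 2/3.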