Technical Reports, Statistics
Browsing Technical Reports, Statistics by Title
Now showing 1 - 20 of 39
- Approaches to the Label-Switching Problem of Classification, Based on Partition-Space Relabeling and Label-Invariant Visualization. Farrar, David (Virginia Tech, 2006-07-15). In the context of interest, a method of cluster analysis is used to classify a set of units into a fixed number of classes. Simulation procedures with various conceptual foundations may be used to evaluate uncertainty, stability, or sampling error of such a classification. However, simulation approaches may be subject to a label-switching problem when a likelihood function, posterior density, or some objective function is invariant under permutation of class labels. We suggest a relabeling algorithm that maximizes a simple measure of agreement among classifications. However, it is known that effective summaries and visualization tools can be based on sample concurrence fractions, which we define as the sample fractions with which given pairs of units fall in the same cluster, and which are invariant under permutation of class labels. We expand the study of concurrence fractions by presenting a matrix theory, which is employed in relabeling as well as in the elaboration of visualization tools. We explore an ordination approach that treats concurrence fractions as similarities between pairs of units. A matrix result supports straightforward application of the method of principal coordinates, leading to ordination plots in which Euclidean distances between pairs of units have a simple relationship to concurrence fractions. The use of concurrence fractions complements relabeling by providing an efficient initial labeling.
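The concurrence fractions described in the entry above are straightforward to compute from a set of simulated labelings. The sketch below (hypothetical variable names, not the report's code) tallies, for each pair of units, the fraction of simulation draws in which the pair falls in the same cluster; the result is invariant to how cluster labels are permuted between draws.

```python
import numpy as np

def concurrence_fractions(labelings):
    """Pairwise fractions of draws in which two units share a cluster.

    labelings: array of shape (n_draws, n_units); entry [s, i] is the
    cluster label of unit i in simulation draw s. Labels may be permuted
    arbitrarily between draws, since only co-membership is used.
    """
    labelings = np.asarray(labelings)
    n_draws, n_units = labelings.shape
    frac = np.zeros((n_units, n_units))
    for row in labelings:
        frac += (row[:, None] == row[None, :])
    return frac / n_draws

# Toy usage: three draws over four units, with labels permuted between draws.
draws = np.array([[0, 0, 1, 1],
                  [1, 1, 0, 0],   # same partition as the first draw, labels swapped
                  [0, 1, 1, 0]])
print(concurrence_fractions(draws))
```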
- A Bayesian Hierarchical Approach to Dual Response Surface Modeling. Chen, Younan; Ye, Keying (Virginia Tech, 2005). In modern quality engineering, dual response surface methodology is a powerful tool to monitor an industrial process by using both the mean and the standard deviation of the measurements as the responses. The least squares method in regression is often used to estimate the coefficients in the mean and standard deviation models, and various decision criteria are proposed by researchers to find the optimal conditions. Based on the inherent hierarchical structure of the dual response problems, we propose a hierarchical Bayesian approach to model dual response surfaces. Such an approach is compared with two frequentist least squares methods by using two real data sets and simulated data.
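As a rough illustration of the dual response setup (the frequentist least squares baseline, not the hierarchical Bayesian model itself), the sketch below fits separate least squares surfaces to the replicate means and standard deviations at each design point; the data and model form are placeholders.

```python
import numpy as np

# Hypothetical replicated experiment: rows are design points, columns are replicates.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
reps = np.array([[9.8, 10.1, 10.0],
                 [10.9, 11.2, 11.0],
                 [12.1, 12.4, 11.8],
                 [13.2, 12.8, 13.1],
                 [14.0, 14.5, 13.7]])

y_mean = reps.mean(axis=1)          # mean response at each design point
y_sd = reps.std(axis=1, ddof=1)     # standard-deviation response

# Quadratic design matrix in the single factor x.
X = np.column_stack([np.ones_like(x), x, x**2])

beta_mean, *_ = np.linalg.lstsq(X, y_mean, rcond=None)  # mean model coefficients
beta_sd, *_ = np.linalg.lstsq(X, y_sd, rcond=None)      # SD model coefficients
print("mean model:", beta_mean)
print("sd model:  ", beta_sd)
```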
- Cluster-Based Bounded Influence Regression. Lawrence, David E.; Birch, Jeffrey B.; Chen, Yajuan (Virginia Tech, 2012). A regression methodology is introduced that obtains competitive, robust, efficient, high breakdown regression parameter estimates while also providing an informative summary of possible multiple-outlier structure. The proposed method blends a cluster analysis phase with a controlled bounded influence regression phase, and is hence referred to as cluster-based bounded influence regression, or CBI. Representing the data space via a special set of anchor points, a collection of point-addition OLS regression estimators forms the basis of a metric used in defining the similarity between any two observations. Cluster analysis then yields a main cluster “half-set” of observations, with the remaining observations comprising one or more minor clusters. An initial regression estimator arises from the main cluster, with a group-additive DFFITS argument used to carefully activate the minor clusters through a bounded influence regression framework. CBI achieves a 50% breakdown point, is regression, scale, and affine equivariant, and is asymptotically normal. Case studies and Monte Carlo results demonstrate the performance advantage of CBI over other popular robust regression procedures with respect to coefficient stability, scale estimation, and standard errors. The dendrogram of the clustering process and the weight plot are graphical displays available for multivariate outlier detection. Overall, the proposed methodology represents an advance in the field of robust regression, offering a distinct philosophical viewpoint toward data analysis and the marriage of estimation with diagnostic summary.
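The DFFITS influence diagnostic that CBI builds on is available in standard software. The sketch below is a generic illustration of per-observation DFFITS from an ordinary least squares fit (it is not the CBI procedure itself); the data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)
y[0] += 8.0                      # plant one gross outlier

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# DFFITS measures how much each fitted value changes when that observation
# is deleted; large absolute values flag influential points.
dffits, threshold = fit.get_influence().dffits
print("suggested cutoff:", threshold)
print("flagged observations:", np.where(np.abs(dffits) > threshold)[0])
```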
- Cluster-Based Profile Monitoring in Phase I Analysis. Chen, Yajuan; Birch, Jeffrey B. (Virginia Tech, 2012). An innovative profile monitoring methodology is introduced for Phase I analysis. The proposed technique, referred to as the cluster-based profile monitoring method, incorporates a cluster analysis phase to aid in determining whether nonconforming profiles are present in the historical data set (HDS). To cluster the profiles, the proposed method first replaces the data for each profile with an estimated profile curve, using an appropriate regression method, and then clusters the profiles based on their estimated parameter vectors. This clustering phase yields a main cluster containing more than half of the profiles. The initial estimated population average (PA) parameters are obtained by fitting a linear mixed model to the profiles in the main cluster. In-control profiles, identified using Hotelling’s T² statistic, that are not contained in the initial main cluster are iteratively added to the main cluster, and the mixed model is used to update the estimated PA parameters. A simulated example and Monte Carlo results demonstrate the performance advantage of the proposed method over a current non-cluster-based method, with more accurate estimates of the PA parameters and better performance in classifying profiles from an in-control process versus those from an out-of-control process in Phase I.
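A bare-bones version of the clustering step described above might fit a simple linear model to each profile and cluster the estimated coefficient vectors, keeping the largest ("main") cluster. The sketch below does exactly that under those assumptions; the data are simulated and the linear mixed model refinement is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])

# 15 in-control profiles plus 3 shifted (nonconforming) profiles.
profiles = [1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size) for _ in range(15)]
profiles += [3.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size) for _ in range(3)]

# Replace each profile by its estimated (intercept, slope) vector.
betas = np.array([np.linalg.lstsq(X, y, rcond=None)[0] for y in profiles])

# Cluster the coefficient vectors and keep the largest cluster as "main".
labels = fcluster(linkage(betas, method="ward"), t=2, criterion="maxclust")
main = np.bincount(labels)[1:].argmax() + 1
print("profiles in main cluster:", np.where(labels == main)[0])
```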
- Clustering Monitoring Stations Based on Two Rank-Based Criteria of Similarity of Temporal Profiles. Farrar, David; Smith, Eric (Virginia Tech, 2006-09). To support evaluation of water quality trends, a water quality variable may be measured at a series of points in time at multiple stations. Summarization of such data and detection of spatiotemporal patterns may benefit from the application of multivariate methods. We propose hierarchical cluster analysis methods that group stations according to similarities among temporal profiles, relying on standard clustering algorithms combined with two proposed, rank-based criteria of similarity. An approach complementary to standard environmental trend evaluation relies on the incremental sum of squares clustering algorithm and a criterion of similarity related to a standard test for trend heterogeneity. Relevance to the context of trend evaluation is enhanced by transforming dendrogram edge lengths to reflect cluster homogeneity according to a standard test. However, the standard homogeneity criterion may not be sensitive to patterns of practical significance, such as a region-specific reversal in the sign of a trend. We therefore introduce a second criterion, based on concordance of changes in the water quality variable between pairs of stations from one measurement time to the next, which may be sensitive to a wider range of patterns. Our suggested criteria are illustrated and compared through application to measurements of dissolved oxygen in the James River of Virginia, USA. The two methods show limited agreement, but both identify a cluster associated with a locality characterized by pronounced negative trends at multiple stations.
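One simple reading of a concordance-of-changes criterion is the fraction of time steps at which two stations move in the same direction. The sketch below builds such a similarity matrix and clusters stations hierarchically; it is an illustration under that assumed definition with simulated data, not the authors' exact criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
# Hypothetical data: 6 stations, 30 measurement times.
series = np.cumsum(rng.normal(size=(6, 30)), axis=1)
series[3:] -= 0.3 * np.arange(30)         # give three stations a downward trend

steps = np.sign(np.diff(series, axis=1))  # direction of change at each time step

n = series.shape[0]
concord = np.ones((n, n))
for i in range(n):
    for j in range(n):
        concord[i, j] = np.mean(steps[i] == steps[j])

# Convert similarity to distance and cluster (pass the result to dendrogram() to plot).
dist = squareform(1.0 - concord, checks=False)
tree = linkage(dist, method="average")
print(np.round(tree, 3))
```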
- Construction Concepts for Continuum Regression. Spitzner, Dan J. (Virginia Tech, 2004-08-28). Approaches for meaningful regressor construction in the linear prediction problem are investigated in a framework similar to partial least squares and continuum regression, but weighted to allow for intelligent specification of an evaluative scheme. A cross-validatory continuum regression procedure is proposed, and shown to compare well with ordinary continuum regression in empirical demonstrations. Similar procedures are formulated from model-based constructive criteria, but are shown to be severely limited in their potential to enhance predictive performance. By paying careful attention to the interpretability of the proposed methods, the paper addresses a long-standing criticism that the current methodology relies on arbitrary mechanisms.
- Cost Penalized Estimation and Prediction Evaluation for Split-Plot Designs. Liang, Li; Anderson-Cook, Christine M.; Robinson, Timothy J. (Virginia Tech, 2005-02-02). The use of response surface methods generally begins with a process or system involving a response y that depends on a set of k controllable input variables (factors) x₁, x₂, …, xₖ. To assess the effects of these factors on the response, an experiment is conducted in which the levels of the factors are varied and changes in the response are noted. The size of the experimental design (number of distinct level combinations of the factors as well as number of runs) depends on the complexity of the model the user wishes to fit. Limited resources due to time and/or cost constraints are inherent to most experiments, and hence, the user typically approaches experimentation with a desire to minimize the number of experimental trials while still being able to adequately estimate the underlying model.
- Dimension Reduction for Multinomial Models Via a Kolmogorov-Smirnov Measure (KSM). Loftus, Stephen C.; House, Leanna L.; Hughey, Myra C.; Walke, Jenifer B.; Becker, Matthew H.; Belden, Lisa K. (Virginia Tech, 2015). Due to advances in technology and data collection techniques, the number of measurements often exceeds the number of samples in ecological datasets. As such, standard models that attempt to assess the relationship between variables and a response are inapplicable and require a reduction in the number of dimensions to be estimable. Several filtering methods exist to accomplish this, including Indicator Species Analyses and Sure Independence Screening, but these techniques often have questionable asymptotic properties or are not readily applicable to data with multinomial responses. As such, we propose and validate a new metric, the Kolmogorov-Smirnov Measure (KSM), to be used for filtering variables. In the paper, we develop the KSM, investigate its asymptotic properties, and compare it to group-equalized Indicator Species Values through simulation studies and application to a well-known biological dataset.
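As a generic illustration of Kolmogorov-Smirnov-based filtering (not the KSM statistic itself, whose exact definition is given in the report), the sketch below ranks variables by the two-sample KS statistic between two response groups and keeps the top few; the data and cutoff are placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
n, p = 60, 200                      # more variables than samples
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)      # two response classes, for simplicity
X[y == 1, :5] += 1.0                # only the first 5 variables are informative

# Score each variable by the KS distance between its class-conditional samples.
scores = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic for j in range(p)])
keep = np.argsort(scores)[::-1][:10]
print("selected variables:", np.sort(keep))
```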
- Effect of Phase I Estimation on Phase II Control Chart Performance with Profile Data. Chen, Yajuan; Birch, Jeffrey B.; Woodall, William H. (Virginia Tech, 2014). This paper illustrates how Phase I estimators in statistical process control (SPC) can affect the performance of Phase II control charts. The deleterious impact of poor Phase I estimators on the performance of Phase II control charts is illustrated in the context of profile monitoring. Two types of Phase I estimators are discussed. One approach uses functional cluster analysis to initially distinguish between estimated profiles from an in-control process and those from an out-of-control process. The second approach does not use clustering to make the distinction. The Phase II control charts are established based on the two resulting types of estimates and compared across varying sizes of sustained shifts in Phase II. A simulated example and a Monte Carlo study show that the performance of the Phase II control charts can be severely distorted when the charts are constructed with poor Phase I estimators. The use of clustering leads to much better Phase II performance. We also illustrate that Phase II control charts based on the poor Phase I estimators not only produce more false alarms than expected but can also take much longer than expected to detect potential changes to the process.
- Error Models in Geographic Information Systems Vector Data Using Bayesian Methods. Love, Kimberly R.; Ye, Keying; Smith, Eric P.; Prisley, Stephen P. (Virginia Tech, 2007). Geographic Information Systems (GIS) has been an evolving science since its introduction. Recently, many users have become concerned with the incorporation of error analysis into GIS map products. In particular, there is concern over the error in the location of features in vector data, which relies heavily on geographic x- and y-coordinates. Current work in the field is based on bivariate normal distributions for these points, and their extension to line and polygon features. We propose to incorporate Bayesian methodology into this existing model, which presents multiple advantages over existing methods. Bayesian methods allow for the incorporation of expert and historical knowledge and reduce the number of observations required to perform an accurate analysis. This is essential to the field of GIS, where multiple observations are rare and outside knowledge is often very informative. Bayesian methods also provide results that are more easily understood by the average GIS user. We explore this addition and provide several examples based on our calculations. We conclude by discussing the advantages of Bayesian analysis for GIS vector data and describe our ongoing work, which is being conducted under a research grant from the National Geospatial-Intelligence Agency.
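A minimal example of the kind of Bayesian updating the entry describes, reduced to a single coordinate of a point feature with known measurement variance and a conjugate normal prior (all numbers are hypothetical; the report's model handles bivariate points and their extension to lines and polygons):

```python
import numpy as np

def posterior_normal(prior_mean, prior_var, obs, obs_var):
    """Conjugate normal update for one coordinate with known observation variance."""
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs.sum() / obs_var)
    return post_mean, post_var

# Prior from historical/expert knowledge; two GPS observations of the x-coordinate.
x_mean, x_var = posterior_normal(prior_mean=1250.0, prior_var=25.0,
                                 obs=[1253.1, 1251.7], obs_var=4.0)
print("posterior mean:", round(x_mean, 2), "posterior variance:", round(x_var, 3))
```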
- Evaluating and Interpreting Interactions. Hinkelmann, Klaus H. (Virginia Tech, 2004-12-13). The notion of interaction plays an important, and sometimes frightening, role in the analysis and interpretation of results from observational and experimental studies. In general, results are much easier to explain and to implement if interaction effects are not present. It is for this reason that they are often assumed to be negligible. This may, however, lead to erroneous conclusions and poor actions. One reason why interactions are sometimes feared is limited understanding of what the word “interaction” actually means in a practical sense and, in particular, in a statistical sense. As far as the latter is concerned, simply stating that an interaction is significant is generally not sufficient. Subsequent interpretation of that finding is needed, and that brings us back to the definition and meaning of interaction within the context of the experimental setting. In the following sections we define and discuss the various types of variables that affect the response and the types of interactions among them. These notions are illustrated for one particular experiment to which we return throughout the discussion. To help in the interpretation of interactions we take a closer look at the definitions of two-factor and three-factor interactions in terms of simple effects. This is followed by a discussion of the nature of interactions and the role they play in the context of the experiment, from the statistical point of view and with regard to the interpretation of the results. After a general overview of how to dissect interactions we return to our example and perform a detailed analysis and interpretation of the data using SAS® (SAS Institute, 2000), in particular PROC GLM and some of its options, such as SLICE. We also mention different methods for the analysis when interaction is actually present. We conclude the analytical part with a discussion of a useful graphical method for when no error term is available for testing for interactions. Finally, we summarize the results with some recommendations, reminding the reader that in all of this the experimental design is of fundamental importance.
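The entry works in SAS PROC GLM; for readers without SAS, an equivalent two-factor ANOVA with an interaction term, plus simple-effects tests in the spirit of the SLICE option, can be sketched in Python (hypothetical data, statsmodels formula interface).

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical 2x3 factorial with 2 replicates per cell.
df = pd.DataFrame({
    "A": ["low"] * 6 + ["high"] * 6,
    "B": ["b1", "b1", "b2", "b2", "b3", "b3"] * 2,
    "y": [10.1, 9.8, 12.0, 12.3, 11.1, 10.9,
          13.9, 14.2, 13.1, 12.8, 17.0, 17.4],
})

fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(anova_lm(fit, typ=2))   # the C(A):C(B) row tests the interaction

# Simple effects (analogous to PROC GLM's SLICE): effect of B within each level of A.
for level, sub in df.groupby("A"):
    print(level)
    print(anova_lm(smf.ols("y ~ C(B)", data=sub).fit(), typ=2))
```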
- A Finite Mixture Approach for Identification of Geographic Regions with Distinctive Ecological Stressor-Response Relationships. Farrar, David; Prins, Samantha C. Bates; Smith, Eric P. (Virginia Tech, 2006). We study a model-based clustering procedure that aims to identify geographic regions with distinctive relationships among ecological and environmental variables. We use a finite mixture model with a distinct linear regression model for each mixture component, relating a measure of environmental quality to multiple regressors. Component-specific values of the regression coefficients are allowed for a common set of regressors. We implement Bayesian inference jointly for the true partition and the component regression parameters. We assume a known prior classification of measurement locations into “clustering units,” where measurement locations belong to the same mixture component if they belong to the same clustering unit. A Metropolis algorithm, derived from a well-known Gibbs sampler, is used to sample the posterior distribution. Our approach to the label-switching problem relies on constraints on cluster membership, selected based on statistics and graphical displays that do not depend upon cluster indexing. The approach is applied to data representing streams and rivers in the state of Ohio, equating clustering units to river basins. The results appear to be interpretable given geographic features of possible ecological significance.
- High Breakdown Estimation Methods for Phase I Multivariate Control Charts. Jensen, Willis A.; Birch, Jeffrey B.; Woodall, William H. (Virginia Tech, 2005). The goal of Phase I monitoring of multivariate data is to identify multivariate outliers and step changes so that the estimated control limits are sufficiently accurate for Phase II monitoring. High breakdown estimation methods based on the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) are well suited to detecting multivariate outliers in data. However, they are difficult to implement in practice due to the extensive computation required to obtain the estimates. Based on previous studies, it is not clear which of these two estimation methods is best for control chart applications. The comprehensive simulation study here gives guidance on when to use which estimator, and control limits are provided. High breakdown estimation methods, such as MCD and MVE, can be applied to a wide variety of multivariate quality control data.
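High breakdown estimates of multivariate location and scatter are available in common libraries. The sketch below uses scikit-learn's MCD implementation to flag multivariate outliers by robust Mahalanobis distance; it is a generic illustration with simulated data and an approximate chi-square cutoff, not the report's control limit construction.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=100)
X[:5] += 4.0                     # five gross outliers

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)          # squared robust Mahalanobis distances

cutoff = chi2.ppf(0.999, df=X.shape[1])   # a conventional, approximate limit
print("flagged rows:", np.where(d2 > cutoff)[0])
```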
- An Improved Genetic Algorithm Using a Directional Search. Wan, Wen; Birch, Jeffrey B. (Virginia Tech, 2009). The genetic algorithm (GA), a very powerful tool used in optimization, has been applied in various fields including statistics. However, the general GA is usually computationally intensive, often having to perform a large number of evaluations of an objective function. This paper presents four different versions of computationally efficient genetic algorithms by incorporating several different local directional searches into the GA process. These local searches are based on using the method of steepest descent (SD), the Newton-Raphson method (NR), a derivative-free directional search method (denoted by “DFDS”), and a method that combines SD with DFDS. Some benchmark functions, such as a low-dimensional function versus a high-dimensional function, and a relatively bumpy function versus a very bumpy function, are employed to illustrate the improvement of these proposed methods through a Monte Carlo simulation study using a split-plot design. A real problem related to the multi-response optimization problem is also used to illustrate the improvement of these proposed methods over the traditional GA and over the method implemented in the Design-Expert statistical software package. Our results show that the GA can be improved both in accuracy and in computational efficiency in most cases by incorporating a local directional search into the GA process.
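A toy illustration of the hybridization idea: a bare-bones GA with a few steepest descent steps applied to the best offspring each generation. This is a sketch under simplified assumptions (mutation-only GA, known gradient, a simple sphere objective), not any of the paper's four variants.

```python
import numpy as np

def f(x):                               # toy objective: sphere function
    return np.sum(x**2, axis=-1)

def grad_f(x):
    return 2.0 * x

rng = np.random.default_rng(5)
pop = rng.uniform(-5, 5, size=(30, 4))   # 30 candidates in 4 dimensions

for gen in range(50):
    fitness = f(pop)
    parents = pop[np.argsort(fitness)[:10]]                  # truncation selection
    children = parents[rng.integers(0, 10, size=30)] \
        + rng.normal(scale=0.3, size=(30, 4))                # mutation only, for brevity
    # Local directional search: a few steepest descent steps on the best child.
    best = children[np.argmin(f(children))].copy()
    for _ in range(5):
        best -= 0.1 * grad_f(best)
    children[np.argmax(f(children))] = best                  # replace the worst child
    pop = children

print("best value found:", f(pop).min())
```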
- An Improved Hybrid Genetic Algorithm with a New Local Search Procedure. Wan, Wen; Birch, Jeffrey B. (Virginia Tech, 2012). A hybrid genetic algorithm (HGA) combines a genetic algorithm (GA) with an individual learning procedure. One such learning procedure is a local search technique (LS) used by the GA for refining global solutions. A HGA is also called a memetic algorithm (MA), one of the most successful and popular heuristic search methods. An important challenge for MAs is the trade-off between global and local searching, since the cost of a LS can be rather high. This paper proposes a novel, simplified, and efficient HGA with a new individual learning procedure that performs a LS only when the best offspring (solution) in the offspring population is also the best in the current parent population. Additionally, a new LS method is developed based on a three-directional search (TD), which is derivative-free and self-adaptive. The new HGA with two different LS methods (the TD and the Nelder-Mead simplex) is compared with a traditional HGA. Two benchmark functions are employed to illustrate the improvement of the proposed method with the new learning procedure. The results show that the new HGA greatly reduces the number of function evaluations and converges much faster to the global optimum than a traditional HGA. The TD local search method is a good choice in helping to locate a global “mountain” (or “valley”) but may not perform as well as the Nelder-Mead method in the final fine tuning toward the optimal solution.
- Interaction Analysis of Three Combination Drugs via a Modified Genetic Algorithm. Wan, Wen; Pei, Xin-Yan; Grant, Steven; Birch, Jeffrey B.; Felthousen, Jessica; Dai, Yun; Fang, Hong-Bin; Tan, Ming; Sun, Shumei (Virginia Tech, 2014). Few articles have been written on analyzing and visualizing three-way interactions between drugs. Although it may be quite straightforward to extend a statistical method from two drugs to three drugs, it is hard to visually illustrate which dose regions are synergistic, additive, or antagonistic, because plotting three-drug dose regions plus a response is a four-dimensional (4-D) problem. This problem can be converted and solved by showing the dose regions of interest in 3-D, as three-drug dose regions. We propose applying a modified genetic algorithm (MGA) to construct the dose regions of interest after fitting the response surface to the interaction index (II) with a semiparametric method, the model robust regression method (MRR). A case study with three anti-cancer drugs in an in vitro experiment is used to illustrate how to find the dose regions of interest. For example, suppose researchers are interested in visualizing where the synergistic areas with II ≤ 0.4 are in 3-D. After fitting a MRR model to the calculated II, the MGA procedure is used to collect the feasible points that satisfy the estimated values of II ≤ 0.4. All these feasible points are then used to construct the approximate dose regions of interest in 3-D.
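Stripped of the MGA and the MRR fit, the underlying idea is to collect dose combinations whose predicted interaction index falls below a threshold. The sketch below does this over a coarse grid using a made-up surrogate for a fitted II surface; the function and dose scales are purely illustrative.

```python
import numpy as np

def ii_hat(d1, d2, d3):
    """Stand-in for a fitted interaction-index surface (purely illustrative)."""
    return 0.3 + 0.5 * (d1 - 0.4) ** 2 + 0.4 * (d2 - 0.5) ** 2 + 0.6 * (d3 - 0.3) ** 2

grid = np.linspace(0, 1, 41)
D1, D2, D3 = np.meshgrid(grid, grid, grid, indexing="ij")
mask = ii_hat(D1, D2, D3) <= 0.4          # synergy region of interest

points = np.column_stack([D1[mask], D2[mask], D3[mask]])
print(points.shape[0], "dose combinations satisfy II <= 0.4")
# points can be passed to a 3-D scatter plot to visualize the approximate region.
```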
- An Investigation of Combinations of Multivariate Shewhart and MEWMA Control Charts for Monitoring the Mean Vector and Covariance Matrix. Reynolds, Marion R. Jr.; Stoumbos, Zachary G. (Virginia Tech, 2008-01-22). When monitoring a process which has multivariate normal variables, the Shewhart-type control chart (Hotelling (1947)) traditionally used for monitoring the process mean vector is effective for detecting large shifts, but for detecting small shifts it is more effective to use the multivariate exponentially weighted moving average (MEWMA) control chart proposed by Lowry et al. (1992). It has been proposed that better overall performance in detecting small and large shifts in the mean can be obtained by using the MEWMA chart in combination with the Shewhart chart. Here we investigate the performance of this combination in the context of the more general problem of detecting changes in the mean or increases in variability. Reynolds and Cho (2006) recently investigated combinations of the MEWMA chart for the mean and MEWMA-type charts based on squared deviations of the observations from the target, and found that these combinations have excellent performance in detecting sustained shifts in the mean or in variability. Here we consider both sustained and transient shifts, and show that a combination of two MEWMA charts has better overall performance than the combination of the MEWMA and Shewhart charts. We also consider a three-chart combination consisting of the MEWMA chart for the mean, an MEWMA-type chart of squared deviations from target, and the Shewhart chart. When the sample size is n = 1 this three-chart combination does not seem to have better overall performance than the combination of the two MEWMA charts. When n > 1 the three-chart combination has significantly better performance for some mean shifts, but somewhat worse performance for shifts in variability.
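For readers unfamiliar with the MEWMA chart referenced above, the sketch below computes the standard MEWMA statistic of Lowry et al. (1992), Zᵢ = λXᵢ + (1 − λ)Zᵢ₋₁ with chart statistic Zᵢ'Σ_Z⁻¹Zᵢ, using the usual asymptotic covariance λ/(2 − λ)Σ; the data, smoothing constant, and shift are placeholders and no combination with a Shewhart chart is shown.

```python
import numpy as np

def mewma_statistics(X, sigma, lam=0.1):
    """MEWMA chart statistics for in-control-centered (mean-zero) observations.

    Uses the asymptotic covariance lam/(2-lam)*sigma of the EWMA vector, the
    usual simplification; exact time-varying limits differ for early observations.
    """
    sigma_z_inv = np.linalg.inv(lam / (2.0 - lam) * sigma)
    z = np.zeros(X.shape[1])
    t2 = []
    for x in X:
        z = lam * x + (1.0 - lam) * z
        t2.append(z @ sigma_z_inv @ z)
    return np.array(t2)

rng = np.random.default_rng(6)
sigma = np.eye(2)
obs = rng.multivariate_normal([0, 0], sigma, size=40)
obs[25:] += [0.75, 0.0]                 # small sustained mean shift after observation 25
print(mewma_statistics(obs, sigma).round(2))
```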
- Linear Mixed Model Robust Regression. Waterman, Megan J.; Birch, Jeffrey B.; Schabenberger, Oliver (Virginia Tech, 2006-11-05). Mixed models are powerful tools for the analysis of clustered data, and many extensions of the classical linear mixed model with normally distributed response have been established. As with all parametric models, correctness of the assumed model is critical for the validity of the ensuing inference. An incorrectly specified parametric means model may be improved by using a local, or nonparametric, model. Two local models are proposed by a pointwise weighting of the marginal and conditional variance-covariance matrices. However, nonparametric models tend to fit to irregularities in the data and provide fits with high variance. Model robust regression techniques estimate the mean response as a convex combination of a parametric and a nonparametric model fit to the data. It is a semiparametric method by which incomplete or incorrectly specified parametric models can be improved by adding an appropriate amount of the nonparametric fit. We compare the approximate integrated mean square error of the parametric, nonparametric, and mixed model robust methods via a simulation study, and apply these methods to monthly wind speed data from counties in Ireland.
- Model Robust Calibration: Method and Application to Electronically-Scanned Pressure Transducers. Walker, Eric L.; Starnes, B. Alden; Birch, Jeffrey B.; Mays, James E. (American Institute of Aeronautics and Astronautics, 2010). This article presents the application of a recently developed statistical regression method to the controlled instrument calibration problem. The statistical method of Model Robust Regression (MRR), developed by Mays, Birch, and Starnes, is shown to improve instrument calibration by reducing the reliance of the calibration on a predetermined parametric (e.g. polynomial, exponential, logarithmic) model. This is accomplished by allowing fits from the predetermined parametric model to be augmented by a certain portion of a fit to the residuals from the initial regression using a nonparametric (locally parametric) regression technique. The method is demonstrated for the absolute scale calibration of silicon-based pressure transducers.
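A minimal sketch of the residual-augmentation idea described above: a parametric fit plus a fraction of a nonparametric fit to its residuals. The kernel smoother, fixed mixing weight, and data here are placeholders, not the calibrated MRR procedure (which selects the mixing proportion from the data).

```python
import numpy as np

def kernel_smooth(x_train, r_train, x_eval, bandwidth=0.2):
    """Nadaraya-Watson smoother of residuals r_train, evaluated at x_eval."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w * r_train).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 80))
y = 1.0 + 2.0 * x + 0.6 * np.sin(6 * x) + rng.normal(scale=0.1, size=x.size)

# Step 1: predetermined parametric model (here, a straight line).
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_param = X @ beta

# Step 2: nonparametric fit to the residuals, added back with weight lam in [0, 1].
lam = 0.7                                  # placeholder; MRR chooses this from the data
y_mrr = y_param + lam * kernel_smooth(x, y - y_param, x)
print("residual SS, parametric vs. augmented:",
      np.sum((y - y_param) ** 2).round(3), np.sum((y - y_mrr) ** 2).round(3))
```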
- A Model to Predict the Impact of Specification Changes on Chloride-Induced Corrosion Service Life of Virginia Bridge Decks. Kirkpatrick, Trevor J.; Weyers, Richard E.; Anderson-Cook, Christine M.; Sprinkel, Michael M.; Brown, Michael C. (Virginia Center for Transportation Innovation and Research, 2002-10). A model to determine the time to first repair and subsequent rehabilitation of concrete bridge decks exposed to chloride deicer salts that recognizes and incorporates the statistical nature of factors affecting the corrosion process is developed. The model expands on an existing deterministic model by using statistical computing techniques, including resampling techniques such as the parametric and simple bootstrap. Emphasis was placed on the diffusion portion of the diffusion-cracking model, but advances can be readily included for the time for corrosion deterioration after corrosion initiation. Data collected from ten bridge decks built in Virginia between 1981 and 1994 were used to model the surface chloride concentration, apparent diffusion coefficient, and clear cover depth. Several ranges of the chloride corrosion initiation concentration, as determined from the available literature, were investigated. The time to first repair and subsequent rehabilitation predicted by the stochastic model is shorter than the time to first repair and subsequent rehabilitation predicted by the deterministic model. The stochastic model is believed to more accurately reflect the true nature of bridge deck deterioration because it takes into account the fact that data for each of the parameters affecting chloride diffusion and corrosion initiation are not necessarily normally distributed. The model was validated by comparison of projected service lives of bridge decks built from 1981 to 1994 derived from the model to historical service life data for 129 bridge decks built in Virginia between 1968 and 1972. The time to rehabilitation predicted for the set of bridge decks built between 1981 and 1994 by the stochastic model was approximately 13 years longer than the normalized time to rehabilitation projected for the bridge decks built between 1968 and 1972 using historical data. The time to first repair and rehabilitation predicted by the probabilistic method more closely matches that of historical data than the time to first repair and rehabilitation predicted by the average value solution. The additional service life expected for the set of bridges built between 1981 and 1994 over those constructed from 1968 to 1972 can be attributed to the decrease in w/c ratio from 0.47 to 0.45 and slight increase in as-built cover depth from approximately 50 mm (2 in.) to 63.5 to 76 mm (2.5 to 3.0 in.).
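The stochastic flavor of this model can be illustrated with a small Monte Carlo over the diffusion inputs, using the standard error-function solution of Fick's second law for chloride ingress to get the time to corrosion initiation. The distributions and parameter values below are placeholders, not those fitted from the Virginia decks, and the post-initiation deterioration period is omitted.

```python
import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(8)
n = 10_000

# Placeholder input distributions (not the fitted Virginia data):
cover = rng.normal(63.5, 10.0, n) / 1000.0        # clear cover depth, m
D = rng.lognormal(np.log(1.0e-12), 0.4, n)        # apparent diffusion coefficient, m^2/s
Ccrit = 0.7                                       # initiation threshold, kg/m^3
Cs = np.maximum(rng.normal(5.0, 1.0, n),
                2 * Ccrit)                        # surface chloride, kg/m^3 (guarded)

# Fick's second law: C(x,t) = Cs*(1 - erf(x / (2*sqrt(D*t)))); solve for the time t
# at which the chloride at the bar depth first reaches Ccrit.
u = erfinv(1.0 - Ccrit / Cs)
t_init = cover**2 / (4.0 * D * u**2)              # seconds
years = t_init / (365.25 * 24 * 3600)
print("median time to corrosion initiation: %.1f years" % np.median(years))
print("10th percentile: %.1f years" % np.percentile(years, 10))
```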