Technical Reports, Statistics
Recent Submissions
- Statistical Monitoring of Heteroscedastic Dose-Response Profiles from High-throughput Screening. Williams, J.D.; Birch, Jeffrey B.; Woodall, William H.; Ferry, N.M. (Virginia Tech, 2006). In pharmaceutical drug discovery and agricultural crop product discovery, in vivo bioassay experiments are used to identify promising compounds for further research. The reproducibility and accuracy of the bioassay are crucial for correctly distinguishing between active and inactive compounds. In the case of agricultural product discovery, a replicated dose-response assay of commercial crop protection products is used to monitor test quality. The activity of these compounds on the test organisms (weeds, insects, or fungi) is characterized by a dose-response curve measured from the bioassay, and these curves are used to monitor the quality of the bioassays. If undesirable conditions arise in the bioassay, such as equipment failure or problems with the test organisms, a monitoring procedure is needed to detect such issues quickly. In this paper we illustrate a proposed nonlinear profile monitoring method to monitor the variability of multiple assays, the adequacy of the chosen dose-response model, and the estimated dose-response curves for aberrant cases in the presence of heteroscedasticity. We illustrate these methods with in vivo bioassay data collected over one year from DuPont Crop Protection.
- A Bayesian Hierarchical Approach to Dual Response Surface Modeling. Chen, Younan; Ye, Keying (Virginia Tech, 2005). In modern quality engineering, dual response surface methodology is a powerful tool to monitor an industrial process by using both the mean and the standard deviation of the measurements as the responses. The least squares method in regression is often used to estimate the coefficients in the mean and standard deviation models, and various decision criteria are proposed by researchers to find the optimal conditions. Based on the inherent hierarchical structure of the dual response problems, we propose a hierarchical Bayesian approach to model dual response surfaces. Such an approach is compared with two frequentist least squares methods by using two real data sets and simulated data.
- High Breakdown Estimation Methods for Phase I Multivariate Control Charts. Jensen, Willis A.; Birch, Jeffrey B.; Woodall, William H. (Virginia Tech, 2005). The goal of Phase I monitoring of multivariate data is to identify multivariate outliers and step changes so that the estimated control limits are sufficiently accurate for Phase II monitoring. High breakdown estimation methods based on the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) are well suited to detecting multivariate outliers in data. However, they are difficult to implement in practice due to the extensive computation required to obtain the estimates. Based on previous studies, it is not clear which of these two estimation methods is best for control chart applications. The comprehensive simulation study here gives guidance for when to use which estimator, and control limits are provided. High breakdown estimation methods such as the MCD and MVE can be applied to a wide variety of multivariate quality control data.
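For readers who want a concrete starting point, here is a minimal sketch of MCD-based multivariate outlier screening, assuming scikit-learn's MinCovDet estimator is available. The chi-squared cutoff below is a common heuristic stand-in, not the simulation-based control limits the report provides.

```python
# A minimal sketch of MCD-based Phase I outlier screening, not the report's
# simulation-calibrated procedure. The chi-squared cutoff is a common
# heuristic stand-in for the control limits the report provides.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=50)
X[:3] += 6.0  # plant a few multivariate outliers

mcd = MinCovDet(random_state=0).fit(X)   # high breakdown location/scatter
d2 = mcd.mahalanobis(X)                  # squared robust distances
limit = chi2.ppf(0.999, df=X.shape[1])   # heuristic control limit
print("flagged rows:", np.where(d2 > limit)[0])
```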
- Robust Parameter Design: A Semi-Parametric Approach. Pickle, Stephanie M.; Robinson, Timothy J.; Birch, Jeffrey B.; Anderson-Cook, Christine M. (Virginia Tech, 2005). Parameter design or robust parameter design (RPD) is an engineering methodology intended as a cost-effective approach for improving the quality of products and processes. The goal of parameter design is to choose the levels of the control variables that optimize a defined quality characteristic. An essential component of robust parameter design involves the assumption of well-estimated models for the process mean and variance. Traditionally, the modeling of the mean and variance has been done parametrically. It is often the case, particularly when modeling the variance, that nonparametric techniques are more appropriate due to the nature of the curvature in the underlying function. Most response surface experiments involve sparse data. In sparse data situations with unusual curvature in the underlying function, nonparametric techniques often result in estimates with problematic variation, whereas their parametric counterparts may result in estimates with problematic bias. We propose the use of semi-parametric modeling within the robust design setting, combining parametric and nonparametric functions to improve the quality of both mean and variance model estimation. The proposed method is illustrated with an example and simulations.
- Speculations Concerning the First Ultraintelligent Machine. Good, Irving John (Virginia Tech, 2005-03-05). The survival of man depends on the early construction of an ultraintelligent machine. In order to design an ultraintelligent machine we need to understand more about the human brain or human thought or both. In the following pages an attempt is made to take more of the magic out of the brain by means of a "subassembly" theory, which is a modification of Hebb's famous speculative cell-assembly theory. My belief is that the first ultraintelligent machine is most likely to incorporate vast artificial neural circuitry, and that its behavior will be partly explicable in terms of the subassembly theory. Later machines will all be designed by ultraintelligent machines, and who am I to guess what principles they will devise? But probably Man will construct the deus ex machina in his own image.
- Construction Concepts for Continuum Regression. Spitzner, Dan J. (Virginia Tech, 2004-08-28). Approaches for meaningful regressor construction in the linear prediction problem are investigated in a framework similar to partial least squares and continuum regression, but weighted to allow for intelligent specification of an evaluative scheme. A cross-validatory continuum regression procedure is proposed, and shown to compare well with ordinary continuum regression in empirical demonstrations. Similar procedures are formulated from model-based constructive criteria, but are shown to be severely limited in their potential to enhance predictive performance. By paying careful attention to the interpretability of the proposed methods, the paper addresses a long-standing criticism that the current methodology relies on arbitrary mechanisms.
- Dimension Reduction for Multinomial Models Via a Kolmogorov-Smirnov Measure (KSM). Loftus, Stephen C.; House, Leanna L.; Hughey, Myra C.; Walke, Jenifer B.; Becker, Matthew H.; Belden, Lisa K. (Virginia Tech, 2015). Due to advances in technology and data collection techniques, the number of measurements often exceeds the number of samples in ecological datasets. As such, standard models that attempt to assess the relationship between variables and a response are inapplicable and require a reduction in the number of dimensions to be estimable. Several filtering methods exist to accomplish this, including Indicator Species Analyses and Sure Independence Screening, but these techniques often have questionable asymptotic properties or are not readily applicable to data with multinomial responses. As such, we propose and validate a new metric called the Kolmogorov-Smirnov Measure (KSM) to be used for filtering variables. In the paper, we develop the KSM, investigate its asymptotic properties, and compare it to group-equalized Indicator Species Values through simulation studies and application to a well-known biological dataset.
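The KSM itself is defined in the report and is not reproduced here. As a rough analogue only, the sketch below ranks each variable by the largest classical two-sample Kolmogorov-Smirnov statistic across pairs of response classes, to illustrate KS-style filtering for a multinomial response; the data, scoring rule, and cutoff are all illustrative assumptions.

```python
# Illustrative only: the report's KSM is not reproduced. As a rough
# analogue, rank each variable by the largest two-sample
# Kolmogorov-Smirnov statistic across pairs of response classes.
from itertools import combinations
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n, p = 90, 200
y = rng.integers(0, 3, size=n)          # multinomial response, 3 classes
X = rng.normal(size=(n, p))             # p >> n, as in the motivating data
X[y == 2, 0] += 1.5                     # variable 0 actually matters

def ks_score(x, y):
    """Max pairwise two-sample KS statistic across response classes."""
    return max(ks_2samp(x[y == a], x[y == b]).statistic
               for a, b in combinations(np.unique(y), 2))

scores = np.array([ks_score(X[:, j], y) for j in range(p)])
keep = np.argsort(scores)[::-1][:10]    # retain the top-ranked variables
print("top variables:", keep)
```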
- Outlier Robust Nonlinear Mixed Model Estimation. Williams, James D.; Birch, Jeffrey B.; Abdel-Salam, Abdel-Salam Gomaa (Virginia Tech, 2014). In standard analyses of data well-modeled by a nonlinear mixed model (NLMM), an aberrant observation, either within a cluster or an entire cluster itself, can greatly distort parameter estimates and subsequent standard errors. Consequently, inferences about the parameters are misleading. This paper proposes an outlier robust method based on linearization to estimate fixed effects parameters and variance components in the NLMM. An example is given using the 4-parameter logistic model and bioassay data, comparing the robust parameter estimates to the nonrobust estimates given by SAS®.
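As context for the example, here is a minimal sketch of the 4-parameter logistic (4PL) dose-response curve fit by ordinary nonlinear least squares on simulated data; the report's robust, linearization-based mixed-model estimator is not reproduced.

```python
# A minimal sketch of the 4-parameter logistic (4PL) curve, fit by ordinary
# nonlinear least squares. The report's robust mixed-model estimator based
# on linearization is not shown; the data below are simulated.
import numpy as np
from scipy.optimize import curve_fit

def fourpl(x, lower, upper, ec50, slope):
    """4PL mean response at dose x."""
    return lower + (upper - lower) / (1.0 + (x / ec50) ** slope)

dose = np.array([0.1, 0.3, 1, 3, 10, 30, 100], dtype=float)
resp = fourpl(dose, 5, 95, 8.0, 1.2) + np.random.default_rng(2).normal(0, 3, 7)

popt, _ = curve_fit(fourpl, dose, resp, p0=[0, 100, 5, 1],
                    bounds=([-50, 0, 1e-3, 0.1], [50, 200, 1e3, 10]))
print("lower, upper, EC50, slope:", popt.round(2))
```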
- An Improved Hybrid Genetic Algorithm with a New Local Search Procedure. Wan, Wen; Birch, Jeffrey B. (Virginia Tech, 2012). A hybrid genetic algorithm (HGA) combines a genetic algorithm (GA) with an individual learning procedure. One such learning procedure is a local search technique (LS) used by the GA for refining global solutions. An HGA is also called a memetic algorithm (MA), one of the most successful and popular heuristic search methods. An important challenge for MAs is the trade-off between global and local searching, since the cost of an LS can be rather high. This paper proposes a novel, simplified, and efficient HGA with a new individual learning procedure that performs an LS only when the best offspring (solution) in the offspring population is also the best in the current parent population. Additionally, a new LS method is developed based on a three-directional search (TD), which is derivative-free and self-adaptive. The new HGA with two different LS methods (the TD and the Nelder-Mead simplex) is compared with a traditional HGA. Two benchmark functions are employed to illustrate the improvement of the proposed method with the new learning procedure. The results show that the new HGA greatly reduces the number of function evaluations and converges much faster to the global optimum than a traditional HGA. The TD local search method is a good choice for helping to locate a global "mountain" (or "valley") but may not perform as well as the Nelder-Mead method in the final fine-tuning toward the optimal solution.
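A schematic of the proposed trigger rule follows: run the local search only when the best offspring beats the best current parent. The GA operators here are deliberately minimal, and scipy's Nelder-Mead (one of the two LS methods compared in the report) stands in for the local search, since the TD search is not reproduced.

```python
# Schematic of the new learning rule: refine with a local search (LS) only
# when the best offspring beats the best current parent. Nelder-Mead stands
# in for the LS; the report's three-directional (TD) search is not shown.
import numpy as np
from scipy.optimize import minimize

def sphere(x):                      # benchmark objective (minimize)
    return float(np.sum(x ** 2))

rng = np.random.default_rng(3)
pop = rng.uniform(-5, 5, size=(20, 4))

for gen in range(50):
    fitness = np.array([sphere(x) for x in pop])
    best_parent = fitness.min()
    # tournament selection, then Gaussian mutation
    idx = rng.integers(0, len(pop), size=(len(pop), 2))
    winners = np.where((fitness[idx[:, 0]] < fitness[idx[:, 1]])[:, None],
                       pop[idx[:, 0]], pop[idx[:, 1]])
    children = winners + rng.normal(0, 0.3, size=winners.shape)
    child_fit = np.array([sphere(x) for x in children])
    if child_fit.min() < best_parent:           # the trigger: LS only now
        k = child_fit.argmin()
        children[k] = minimize(sphere, children[k], method="Nelder-Mead").x
    pop = children

print("best solution found:", pop[np.argmin([sphere(x) for x in pop])])
```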
- Interaction Analysis of Three Combination Drugs via a Modified Genetic Algorithm. Wan, Wen; Pei, Xin-Yan; Grant, Steven; Birch, Jeffrey B.; Felthousen, Jessica; Dai, Yun; Fang, Hong-Bin; Tan, Ming; Sun, Shumei (Virginia Tech, 2014). Few articles have been written on analyzing and visualizing three-way interactions between drugs. Although it may be quite straightforward to extend a statistical method from two drugs to three drugs, it is hard to visually illustrate which dose regions are synergistic, additive, or antagonistic, because plotting three-drug dose regions plus a response is a four-dimensional (4-D) problem. This problem can be converted and solved by showing the dose regions of interest in 3-D, as three-drug dose regions. We propose applying a modified genetic algorithm (MGA) to construct the dose regions of interest after fitting the response surface of the interaction index (II) by a semiparametric method, the model robust regression (MRR) method. A case study with three anti-cancer drugs in an in vitro experiment is employed to illustrate how to find the dose regions of interest. For example, suppose researchers are interested in visualizing in 3-D where the synergistic areas with II ≤ 0.4 lie. After fitting an MRR model to the calculated II, the MGA procedure is used to collect those feasible points that satisfy the estimated values of II ≤ 0.4. All these feasible points are then used to construct the approximate dose regions of interest in 3-D.
- A Phase I Cluster-Based Method for Analyzing Nonparametric Profiles. Chen, Yajuan; Birch, Jeffrey B.; Woodall, William H. (Virginia Tech, 2014). A cluster-based method was used by Chen et al.²⁴ to analyze parametric profiles in Phase I of the profile monitoring process. They showed performance advantages in using their cluster-based method of analyzing parametric profiles over a non-cluster-based method with respect to more accurate estimates of the parameters and improved classification performance criteria. However, it is known that, in many cases, profiles can be better represented using a nonparametric method. In this study, we use the cluster-based method to analyze profiles that cannot be easily represented by a parametric function. The similarity matrix used during the clustering phase is based on the fits of the individual profiles with p-spline regression. The clustering phase determines an initial main cluster set which contains greater than half of the total profiles in the historical data set. The profiles with in-control T² statistics are sequentially added to the initial main cluster set; upon completion of the algorithm, the profiles in the main cluster set are classified as the in-control profiles and the profiles not in the main cluster set are classified as out-of-control profiles. A Monte Carlo study demonstrates that the cluster-based method results in superior performance over a non-cluster-based method with respect to better classification and higher power in detecting out-of-control profiles. Our Monte Carlo study also shows that the cluster-based method outperforms a non-cluster-based method whether the model is correctly specified or not. We illustrate the use of our method with data from the automotive industry.
- Effect of Phase I Estimation on Phase II Control Chart Performance with Profile Data. Chen, Yajuan; Birch, Jeffrey B.; Woodall, William H. (Virginia Tech, 2014). This paper illustrates how Phase I estimators in statistical process control (SPC) can affect the performance of Phase II control charts. The deleterious impact of poor Phase I estimators on the performance of Phase II control charts is illustrated in the context of profile monitoring. Two types of Phase I estimators are discussed. One approach uses functional cluster analysis to initially distinguish between estimated profiles from an in-control process and those from an out-of-control process. The second approach does not use clustering to make the distinction. The Phase II control charts are established based on the two resulting types of estimates and compared across varying sizes of sustained shifts in Phase II. A simulated example and a Monte Carlo study show that the performance of the Phase II control charts can be severely distorted when constructed with poor Phase I estimators. The use of clustering leads to much better Phase II performance. We also illustrate that Phase II control charts based on poor Phase I estimators not only produce more false alarms than expected but can also take much longer than expected to detect potential changes to the process.
- Nonparametric and Semiparametric Linear Mixed Models. Waterman, Megan J.; Birch, Jeffrey B.; Abdel-Salam, Abdel-Salam Gomaa (Virginia Tech, 2012). Mixed models are powerful tools for the analysis of clustered data and many extensions of the classical linear mixed model with normally distributed response have been established. As with all parametric models, correctness of the assumed model is critical for the validity of the ensuing inference. An incorrectly specified parametric means model may be improved by using a local, or nonparametric, model. Two local models are proposed by a pointwise weighting of the marginal and conditional variance-covariance matrices. However, nonparametric models tend to fit to irregularities in the data and may provide fits with high variance. Model robust regression techniques estimate mean response as a convex combination of a parametric and a nonparametric model fit to the data. It is a semiparametric method by which incomplete or incorrectly specified parametric models can be improved by adding an appropriate amount of the nonparametric fit. We compare the approximate integrated mean square error of the parametric, nonparametric, and mixed model robust methods via a simulation study and apply these methods to two real data sets: the monthly wind speed data from counties in Ireland and the engine speed data.
- Cluster-Based Bounded Influence Regression. Lawrence, David E.; Birch, Jeffrey B.; Chen, Yajuan (Virginia Tech, 2012). A regression methodology is introduced that obtains competitive, robust, efficient, high breakdown regression parameter estimates as well as providing an informative summary regarding possible multiple outlier structure. The proposed method blends a cluster analysis phase with a controlled bounded influence regression phase, and is thereby referred to as cluster-based bounded influence regression, or CBI. Representing the data space via a special set of anchor points, a collection of point-addition OLS regression estimators forms the basis of a metric used in defining the similarity between any two observations. Cluster analysis then yields a main cluster "half-set" of observations, with the remaining observations comprising one or more minor clusters. An initial regression estimator arises from the main cluster, with a group-additive DFFITS argument used to carefully activate the minor clusters through a bounded influence regression framework. CBI achieves a 50% breakdown point, is regression equivariant, scale and affine equivariant, and is asymptotically normal in distribution. Case studies and Monte Carlo results demonstrate the performance advantage of CBI over other popular robust regression procedures regarding coefficient stability, scale estimation and standard errors. The dendrogram of the clustering process and the weight plot are graphical displays available for multivariate outlier detection. Overall, the proposed methodology represents an advancement in the field of robust regression, offering a distinct philosophical viewpoint toward data analysis and the marriage of estimation with diagnostic summary.
- Cost Penalized Estimation and Prediction Evaluation for Split-Plot Designs. Liang, Li; Anderson-Cook, Christine M.; Robinson, Timothy J. (Virginia Tech, 2005-02-02). The use of response surface methods generally begins with a process or system involving a response y that depends on a set of k controllable input variables (factors) x₁, x₂, …, xₖ. To assess the effects of these factors on the response, an experiment is conducted in which the levels of the factors are varied and changes in the response are noted. The size of the experimental design (number of distinct level combinations of the factors as well as number of runs) depends on the complexity of the model the user wishes to fit. Limited resources due to time and/or cost constraints are inherent to most experiments, and hence, the user typically approaches experimentation with a desire to minimize the number of experimental trials while still being able to adequately estimate the underlying model.
- Statistical Methods for Degradation Data with Dynamic Covariates Information and an Application to Outdoor Weathering Data. Hong, Yili; Duan, Yuanyuan; Meeker, William Q.; Stanley, Deborah L.; Gu, Xiaohong (Virginia Tech, 2012-10-09). Degradation data provide a useful resource for obtaining reliability information for some highly reliable products and systems. In addition to product/system degradation measurements, it is common nowadays to dynamically record product/system usage as well as other life-affecting environmental variables such as load, amount of use, temperature, and humidity. We refer to these variables as dynamic covariate information. In this paper, we introduce a class of models for analyzing degradation data with dynamic covariate information. We use a general path model with individual random effects to describe degradation paths and a vector time series model to describe the covariate process. Shape-restricted splines are used to estimate the effects of dynamic covariates on the degradation process. The unknown parameters in the degradation data model and the covariate process model are estimated by maximum likelihood. We also describe algorithms for computing an estimate of the lifetime distribution induced by the proposed degradation path model. The proposed methods are illustrated with an application for predicting the life of an organic coating in a complicated dynamic environment (i.e., changing UV spectrum and intensity, temperature, and humidity).
- Cluster-Based Profile Monitoring in Phase I Analysis. Chen, Yajuan; Birch, Jeffrey B. (Virginia Tech, 2012). An innovative profile monitoring methodology is introduced for Phase I analysis. The proposed technique, which is referred to as the cluster-based profile monitoring method, incorporates a cluster analysis phase to aid in determining if nonconforming profiles are present in the historical data set (HDS). To cluster the profiles, the proposed method first replaces the data for each profile with an estimated profile curve, using some appropriate regression method, and clusters the profiles based on their estimated parameter vectors. This cluster phase then yields a main cluster which contains more than half of the profiles. The initial estimated population average (PA) parameters are obtained by fitting a linear mixed model to those profiles in the main cluster. In-control profiles, determined using Hotelling's T² statistic, that are not contained in the initial main cluster are iteratively added to the main cluster, and the mixed model is used to update the estimated PA parameters. A simulated example and Monte Carlo results demonstrate the performance advantage of this proposed method over a current non-cluster-based method with respect to more accurate estimates of the PA parameters and better classification performance in distinguishing profiles from an in-control process from those from an out-of-control process in Phase I.
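The sketch below shows only the T²-screening idea: reduce each profile to an estimated coefficient vector and measure its Hotelling's T² distance from the main-cluster average. The iterative mixed-model updating of the PA parameters is not reproduced, and the simulated data and cluster split are illustrative stand-ins.

```python
# Sketch of the T²-screening step only: each profile is reduced to an
# estimated coefficient vector, and Hotelling's T² measures its distance
# from the main-cluster average. The mixed-model iteration is not shown.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
B = np.array([[2.0, 1.0]]) + rng.normal(0, 0.05, (30, 2))  # true coefs
B[-2:] += 1.0                               # two out-of-control profiles
X = np.column_stack([np.ones_like(x), x])
profiles = B @ X.T + rng.normal(0, 0.1, (30, 10))

coefs = np.linalg.lstsq(X, profiles.T, rcond=None)[0].T  # per-profile fits
main = coefs[:20]                           # stand-in for the main cluster
mu, S = main.mean(axis=0), np.cov(main.T)
Sinv = np.linalg.inv(S)
t2 = np.einsum("ij,jk,ik->i", coefs - mu, Sinv, coefs - mu)
print("largest T² values:", np.argsort(t2)[-3:])
```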
- On Computing the Distribution Function for the Sum of Independent and Non-identical Random Indicators. Hong, Yili (Virginia Tech, 2011-04-05). The Poisson binomial distribution is the distribution of the sum of independent and non-identical random indicators. Each indicator follows a Bernoulli distribution with an individual success probability. When all success probabilities are equal, the Poisson binomial distribution is a binomial distribution. The Poisson binomial distribution has many applications in different areas such as reliability, survival analysis, survey sampling, econometrics, etc. Computing the cumulative distribution function (cdf) of the Poisson binomial distribution, however, is not straightforward. Approximation methods such as the Poisson approximation and normal approximations have been used in the literature. Recursive formulae have also been used to compute the cdf in some areas. In this paper, we present a simple derivation of an exact formula with a closed-form expression for the cdf of the Poisson binomial distribution. The derivation uses the discrete Fourier transform of the characteristic function of the distribution. We develop an algorithm for efficient implementation of the exact formula. Numerical studies were conducted to study the accuracy of the developed algorithm and the accuracy of approximation methods. We also studied the computational efficiency of different methods. The paper concludes with a discussion on the use of different methods in practice and some suggestions for practitioners.
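Here is a direct implementation consistent with the construction the abstract describes: evaluate the characteristic function at the (n+1)st roots of unity and apply a discrete Fourier transform to recover the pmf, then cumulate for the cdf. The clipping safeguard is our own addition.

```python
# DFT-based computation of the Poisson binomial pmf/cdf, following the
# construction the abstract describes: the pmf is the DFT of the
# characteristic function evaluated at the (n+1)st roots of unity.
import numpy as np

def poisson_binomial_pmf(p):
    """pmf of the sum of independent Bernoulli(p_j) indicators."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    w = np.exp(2j * np.pi * np.arange(n + 1) / (n + 1))  # roots of unity
    # characteristic function: prod_j (1 + p_j (w_l - 1)) at each frequency
    chi = np.prod(1.0 + np.outer(w - 1.0, p), axis=1)
    pmf = np.real(np.fft.fft(chi)) / (n + 1)
    return np.clip(pmf, 0.0, 1.0)       # guard against tiny negative values

p = [0.1, 0.5, 0.9, 0.3]
pmf = poisson_binomial_pmf(p)
cdf = np.cumsum(pmf)
print(pmf.round(4), cdf.round(4))
```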
- Model Robust Calibration: Method and Application to Electronically-Scanned Pressure Transducers. Walker, Eric L.; Starnes, B. Alden; Birch, Jeffrey B.; Mays, James E. (American Institute of Aeronautics and Astronautics, 2010). This article presents the application of a recently developed statistical regression method to the controlled instrument calibration problem. The statistical method of Model Robust Regression (MRR), developed by Mays, Birch, and Starnes, is shown to improve instrument calibration by reducing the reliance of the calibration on a predetermined parametric (e.g. polynomial, exponential, logarithmic) model. This is accomplished by allowing fits from the predetermined parametric model to be augmented by a certain portion of a fit to the residuals from the initial regression using a nonparametric (locally parametric) regression technique. The method is demonstrated for the absolute scale calibration of silicon-based pressure transducers.
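A minimal sketch of the idea as stated in the abstract: augment a (possibly misspecified) parametric fit with a fraction of a nonparametric fit to its residuals. The mixing fraction is fixed here; MRR's data-driven choice of that fraction is not reproduced, and the data and kernel bandwidth are illustrative.

```python
# Minimal sketch of the model robust regression idea: augment a parametric
# fit with a fraction lam of a kernel fit to its residuals. The data-driven
# choice of lam in Mays, Birch, and Starnes is not reproduced.
import numpy as np

def kernel_smooth(x, y, x0, h=0.15):
    """Nadaraya-Watson estimate of y at points x0 with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + 0.3 * x + rng.normal(0, 0.1, 60)

beta = np.polyfit(x, y, 1)                  # misspecified linear model
parametric = np.polyval(beta, x)
resid_fit = kernel_smooth(x, y - parametric, x)
lam = 0.7                                   # mixing fraction, fixed here
mrr = parametric + lam * resid_fit
print("RMSE parametric:", np.sqrt(np.mean((y - parametric) ** 2)).round(3))
print("RMSE MRR       :", np.sqrt(np.mean((y - mrr) ** 2)).round(3))
```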
- Using a Modified Genetic Algorithm to Find Feasible Regions of a Desirability Function. Wan, Wen; Birch, Jeffrey B. (Virginia Tech, 2011). The multi-response optimization (MRO) problem in response surface methodology is quite common in applications. Most MRO techniques, such as the desirability function method of Derringer and Suich, are used to find one or several optimal solutions. In practice, however, practitioners usually prefer to identify all of the near-optimal solutions, or all feasible regions, because some feasible regions may be more desirable than others based on practical considerations. In this paper, benefiting from the stochastic nature of a genetic algorithm (GA), we present an innovative procedure using a modified GA (MGA), a computationally efficient GA with a local directional search incorporated into the GA process, to approximately generate all feasible regions for the desirability function without limiting the number of factors in the design space. The procedure is illustrated through a case study. The MGA is also compared to other commonly used methods for determining the set of feasible regions. Using Monte Carlo simulations with two benchmark functions and a case study, it is shown that the MGA can more efficiently determine the set of feasible regions than the GA, grid methods, and the Nelder-Mead simplex algorithm.
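To make the feasible-region idea concrete, the sketch below substitutes plain random sampling for the MGA and collects every sampled point whose overall Derringer-Suich desirability clears a cutoff, giving a rough picture of the feasible regions rather than a single optimum. The response surfaces, desirability bounds, and cutoff are made-up illustrations.

```python
# Random sampling stands in for the MGA: collect every sampled point whose
# overall Derringer-Suich desirability clears a cutoff, approximating the
# feasible regions. The toy response surfaces below are illustrative only.
import numpy as np

def desirability_max(y, lo, hi):
    """Larger-is-better desirability, linear between lo and hi."""
    return np.clip((y - lo) / (hi - lo), 0.0, 1.0)

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(20000, 2))                  # coded factor space
y1 = 80 - 10 * (X[:, 0] - 0.3) ** 2 - 8 * X[:, 1] ** 2   # toy response 1
y2 = 60 + 5 * X[:, 0] - 12 * (X[:, 1] + 0.2) ** 2        # toy response 2

# overall desirability: geometric mean of the individual desirabilities
D = np.sqrt(desirability_max(y1, 70, 80) * desirability_max(y2, 50, 60))
feasible = X[D >= 0.8]                                   # near-optimal points
print(f"{len(feasible)} feasible points; factor ranges:",
      feasible.min(axis=0).round(2), feasible.max(axis=0).round(2))
```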