Analysis of Zero-Heavy Data Using a Mixture Model Approach
The problem of high proportion of zeroes has long been an interest in data analysis and modeling, however, there are no unique solutions to this problem. The solution to the individual problem really depends on its particular situation and the design of the experiment. For example, different biological, chemical, or physical processes may follow different distributions and behave differently. Different mechanisms may generate the zeroes and require different modeling approaches. So it would be quite impossible and inflexible to come up with a unique or a general solution.
In this dissertation, I focus on cases where zeroes are produced by mechanisms that create distinct sub-populations of zeroes. The dissertation is motivated from problems of chronic toxicity testing which has a data set that contains a high proportion of zeroes. The analysis of chronic test data is complicated because there are two different sources of zeroes: mortality and non-reproduction in the data. So researchers have to separate zeroes from mortality and fecundity. The use of mixture model approach which combines the two mechanisms to model the data here is appropriate because it can incorporate the mortality kind of extra zeroes.
A zero inflated Poisson (ZIP) model is used for modeling the fecundity in Ceriodaphnia dubia toxicity test. A generalized estimating equation (GEE) based ZIP model is developed to handle longitudinal data with zeroes due to mortality. A joint estimate of inhibition concentration (ICx) is also developed as potency estimation based on the mixture model approach. It is found that the ZIP model would perform better than the regular Poisson model if the mortality is high. This kind of toxicity testing also involves longitudinal data where the same subject is measured for a period of seven days. The GEE model allows the flexibility to incorporate the extra zeroes and a correlation structure among the repeated measures. The problem of zero-heavy data also exists in environmental studies in which the growth or reproduction rates of multi-species are measured. This gives rise to multivariate data. Since the inter-relationships between different species are imbedded in the correlation structure, the study of the information in the correlation of the variables, which is often accessed through principal component analysis, is one of the major interests in multi-variate data. In the case where mortality influences the variables of interests, but mortality is not the subject of interests, the use of the mixture approach can be applied to recover the information of the correlation structure. In order to investigate the effect of zeroes on multi-variate data, simulation studies on principal component analysis are performed. A method that recovers the information of the correlation structure is also presented.