Some Advances in Classifying and Modeling Complex Data

TR Number
Journal Title
Journal ISSN
Volume Title
Virginia Tech

In statistical methodology of analyzing data, two of the most commonly used techniques are classification and regression modeling. As scientific technology progresses rapidly, complex data often occurs and requires novel classification and regression modeling methodologies according to the data structure. In this dissertation, I mainly focus on developing a few approaches for analyzing the data with complex structures.

Classification problems commonly occur in many areas such as biomedical, marketing, sociology and image recognition. Among various classification methods, linear classifiers have been widely used because of computational advantages, ease of implementation and interpretation compared with non-linear classifiers. Specifically, linear discriminant analysis (LDA) is one of the most important methods in the family of linear classifiers.

For high dimensional data with number of variables p larger than the number of observations n occurs more frequently, it calls for advanced classification techniques.

In Chapter 2, I proposed a novel sparse LDA method which generalizes LDA through a regularized approach for the two-class classification problem.

The proposed method can obtain an accurate classification accuracy with attractive computation, which is suitable for high dimensional data with p>n.

In Chapter 3, I deal with the classification when the data complexity lies in the non-random missing responses in the training data set. Appropriate classification method needs to be developed accordingly. Specifically, I considered the "reject inference problem'' for the application of fraud detection for online business. For online business, to prevent fraud transactions, suspicious transactions are rejected with unknown fraud status, yielding a training data with selective missing response. A two-stage modeling approach using logistic regression is proposed to enhance the efficiency and accuracy of fraud detection.

Besides the classification problem, data from designed experiments in scientific areas often have complex structures. Many experiments are conducted with multiple variance sources. To increase the accuracy of the statistical modeling, the model need to be able to accommodate more than one error terms. In Chapter 4, I propose a variance component mixed model for a nano material experiment data to address the between group, within group and within subject variance components into a single model. To adjust possible systematic error introduced during the experiment, adjustment terms can be added. Specifically a group adaptive forward and backward selection (GFoBa) procedure is designed to select the significant adjustment terms.

A/B testing, fraud detection, linear classifier, misclassification error, net profit value, reject inference, sparse linear discriminant analysis, two-class classification, variance component mixed model.