The problem of classifying members of a population into groups

TR Number
Date
1965
Journal Title
Journal ISSN
Volume Title
Publisher
Virginia Polytechnic Institute
Abstract

A model is assumed in which individuals are to be classified into groups as to their "potential” with respect to a given characteristic. For example, one may wish to classify college applicants into groups with respect to their ability to succeed in college. Although actual values for the “potential,” or underlying variable of classification, may be unobservable, it is assumed possible to divide the individuals into groups with respect to this characteristic. Division into groups may be accomplished either by fixing the boundaries of the underlying variable of classification or by fixing the proportion of the individuals which may belong to a given group.

For discriminating among the different groups, a set of measurements is obtained for each individual. In the example above, for instance, classification might be based on test scores achieved by the applicants on a set of tests administered to them.

Since the value of the underlying variable of classification is unobservable, we may assign, in place of this variable, a characteristic random variable to each individual. The characteristic variable will be the same for every member of a given group.

We then consider a choice of characteristic random variable and a linear combination of the observed measurements such that the correlation between the two is a maximum with respect to both the coefficients of the different measurements and the characteristic variable. If a significant correlation is found, one may then use a discriminant for a randomly selected individual the linear combination obtained by using the coefficients found by the above procedure.

In order to facilitate a test of validity for the proposed discriminant function, the distribution of a suitable function of the above correlation coefficient is found under the null hypothesis of no correlation between the underlying variable of classification and the observed measurements. A test procedure based on the statistic for which the null distribution is found is then described.

Special consideration is given in the study to the case of only two classification groups with the proportion of individuals to belong to each group fixed. For this case, in addition to obtaining the null distribution, the distribution of the test statistic is also considered under the alternative hypothesis. Low order momenta of the test criterion are obtained, and the approximate power of the proposed test is found for specific cases of the model by fitting an appropriate density to the moments derived. A general consideration of the power function and its behavior as the sample size increases and as the population multiple correlation between the underlying variable of classification and the observed measurements increases is also investigated.

Finally, the probability of misclassification, or “the problem of shrinkage" as it is often called, is considered. Possible approaches to the problem and aaae of the difficulties in investigating this problem are indicated.

Description
Keywords
Citation