Estimating the impact of third-party evaluator training and characteristics on the scoring of written organizational self-assessments

Virginia Tech


This study examined the process of third-party scoring of organizational self-assessments. An experiment was conducted to illustrate the magnitude of score consistency and accuracy among evaluators, estimate the impact of frame-of-reference (FOR) training on score consistency and accuracy, and explore the relationship between evaluator characteristics and score accuracy. The organizational self-assessment used was the 1995 Malcolm Baldrige National Quality Award Colony Fasteners Case Study. The subjects were 81 graduate students enrolled in two televised graduate engineering courses with considerable quality management content.

Subjects were randomly assigned to a control or treatment group and to four of the seven categories of the Baldrige Award. Each subject evaluated the case study against two categories prior to the treatment. Subjects in the control group then evaluated their two additional categories, after which a two-and-one-half-hour FOR training intervention was provided to all subjects. Next, subjects in the treatment group evaluated their two additional categories. Finally, a questionnaire was administered regarding evaluator characteristics related to previous experience and education.

Accuracy was assessed by comparing subjects’ scores to experts’ scores and calculating indices (elevation and dimensional accuracy) for each subject’s scores on each category. Prior to training, no statistically significant differences were found between groups, but a leniency effect was observed for all subjects. Category 6.0, Business Results, and Category 7.0, Customer Focus and Satisfaction, had significantly smaller score variances than the other five categories.
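
The abstract does not give formulas for its accuracy indices. As a rough illustration only, the sketch below assumes one common reading of elevation: the signed difference between an evaluator's mean score and the experts' mean score over the same items, where a positive value indicates leniency. The function name and the item scores are hypothetical and are not taken from the study.

```python
from statistics import mean


def signed_elevation(evaluator_scores, expert_scores):
    """Evaluator's mean score minus the experts' mean score.

    Assumed definition for illustration; the study's exact index
    may differ. Positive values indicate leniency, negative values
    indicate severity.
    """
    if len(evaluator_scores) != len(expert_scores):
        raise ValueError("score lists must cover the same items")
    return mean(evaluator_scores) - mean(expert_scores)


# Hypothetical percent scores for the three items of one category.
evaluator = [60, 70, 50]
expert = [40, 50, 45]

elevation = signed_elevation(evaluator, expert)
is_lenient = elevation > 0
```

Under this reading, a leniency effect across all subjects would show up as positive elevation values for most evaluators before training.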

After training, group × time ANOVAs found evidence of an interaction. Examination of simple effects found significant differences between the group mean scores for all three items from Category 6.0 and two of the four items from Category 5.0. Significant simple time effects were found for all three items from Category 6.0 for the treatment group. No meaningful differences were found between group score variances. A significant difference in category score variance was seen across categories for the untrained group. Training improved elevation accuracy, but there was no evidence of an effect on dimensional accuracy (DA).

Exploratory regression produced a prediction equation for DA with an adjusted R-squared of 0.538. Predictors included work experience, QA/QC experience, employer’s industry, and employer’s size.



organizational assessment, evaluator training, Baldrige Award