Comparison of the Item Response Theory with Covariates Model and Explanatory Cognitive Diagnostic Model for Detecting and Explaining Differential Item Functioning


Virginia Tech


In psychometrics, a central concern is that an assessment be fair for all students who take it. The fairness of an assessment can be evaluated in several ways, including the examination of differential item functioning (DIF). An item exhibits DIF if one subgroup has a lower probability of answering the item correctly than another subgroup after matching on academic achievement. Subgroups may be defined by race, spoken language, disability status, or sex. Under item response theory (IRT), a single score is given to each student because IRT assumes that an assessment measures only one construct. Under cognitive diagnostic modeling (CDM), by contrast, an assessment measures multiple specific constructs and classifies students as having mastered each construct or not. There are several methods to detect DIF under both types of models, but most cannot conduct explanatory modeling. Explanatory modeling consists of predicting item responses and latent traits using relevant observed or latent covariates. If an item exhibits DIF that disadvantages a subgroup, covariates can be modeled to explain the DIF and indicate whether the group difference is true or spurious. If an item exhibits statistically significant DIF that becomes nonsignificant after modeling explanatory variables, the DIF is explained and considered spurious. If the DIF remains significant after modeling explanatory variables, there is stronger evidence that the DIF is present and not spurious. When an item exhibits DIF, the validity of the inferences from the assessment is threatened and group comparisons become inappropriate. This study evaluated the presence of DIF on the Trends in International Mathematics and Science Study (TIMSS) between students who speak English as a first language (EFL) and students who do not speak English as a first language (multilingual learners [ML]) in the USA.
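The distinction between uniform and nonuniform DIF can be illustrated with a generic logistic item response function. The sketch below is illustrative only: the function name, parameter names, and parameter values are assumptions, and the parameterization is a common textbook form, not the IRT-C specification used in the study.

```python
import math


def item_prob(theta, group, a=1.0, b=0.0, uniform_dif=0.0, nonuniform_dif=0.0):
    """Probability of a correct response under a logistic item model.

    theta: latent achievement; group: 0 (reference) or 1 (focal).
    uniform_dif shifts the intercept for the focal group;
    nonuniform_dif changes the slope for the focal group.
    All names and values here are hypothetical.
    """
    logit = (a + nonuniform_dif * group) * theta + b + uniform_dif * group
    return 1.0 / (1.0 + math.exp(-logit))


# Two students matched on achievement (theta = 0.5) but in different groups.
p_ref = item_prob(0.5, group=0, uniform_dif=-0.6)
p_focal = item_prob(0.5, group=1, uniform_dif=-0.6)
print(f"reference: {p_ref:.3f}, focal: {p_focal:.3f}")
```

With a nonzero `uniform_dif`, the focal group is disadvantaged (or advantaged) by the same amount on the logit scale at every achievement level. With a nonzero `nonuniform_dif`, the size and possibly the direction of the gap depends on `theta`.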
The 8th grade science data from 2011 were analyzed because science achievement remains understudied, the 8th grade is a critical turning point for K-12 students, and 2011 was the most recent year for which item content is available from this assessment. The item response theory with covariates (IRT-C) model was used as the explanatory IRT model, while the reparameterized deterministic-input, noisy "and" gate (RDINA) model was used as the explanatory CDM (E-CDM). All released items were analyzed for DIF by both models with language status as the key grouping variable. Items that exhibited significant DIF were further analyzed by including relevant covariates. Then, if items still exhibited DIF, their content was evaluated to determine why a group was disadvantaged. Several items exhibited significant DIF under both the IRT-C and E-CDM, and most of these disadvantaged ML students. Under the IRT-C, the DIF in two items was explained by quantitative covariates. Two items that did not initially exhibit significant nonuniform DIF did so after explanatory modeling. Whether or not a student repeated elementary school was the strongest explanatory covariate, while confidence in science explained the most items. Under the E-CDM, five items initially exhibited significant uniform DIF, with one also exhibiting nonuniform DIF. After scale purification, two items exhibited significant uniform DIF, and one exhibited marginally significant DIF. After explanatory modeling, no items exhibited significant uniform DIF, and only one item exhibited marginally significant nonuniform DIF. Among the covariates, home educational resources explained the most items (ten) and was the strongest positive covariate, while having repeated elementary school had the strongest absolute effect. An examination of the content of 14 items found no causal explanation for the presence of DIF in most of them. In four items, a causal mechanism was identified, and these items were concluded to exhibit item bias.
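To make the "and" gate concrete, here is a minimal sketch of the standard DINA item response rule with guessing and slipping parameters. It does not reproduce the reparameterized (RDINA) form or the covariate structure used in the study. The function name, the example Q-matrix row, and the parameter values are hypothetical.

```python
def dina_prob(alpha, q, guess, slip):
    """P(correct) under the standard DINA model.

    alpha: 0/1 mastery indicators for each attribute (student profile).
    q: 0/1 Q-matrix row marking which attributes the item requires.
    guess, slip: item guessing and slipping probabilities.
    """
    if len(alpha) != len(q):
        raise ValueError("alpha and q must have the same length")
    # eta is True only if every required attribute is mastered (the "and" gate).
    eta = all(a == 1 for a, required in zip(alpha, q) if required == 1)
    return 1.0 - slip if eta else guess


q_row = [1, 0, 1]  # hypothetical item requiring attributes 1 and 3
print(dina_prob([1, 1, 1], q_row, guess=0.2, slip=0.1))
print(dina_prob([1, 1, 0], q_row, guess=0.2, slip=0.1))
```

A student missing any one required attribute is treated the same as a student missing all of them, which is what makes the model noncompensatory.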
An item's cognitive domain was related to DIF, with 79% of these items falling under the Knowing domain. Based on these results, DIF that disadvantaged ML students was present in several items on this science assessment. Both the IRT-C and E-CDM identified several items exhibiting DIF, quantitative covariates explained the DIF in several of them, and item bias was discovered in several items. Following up on this empirical study, a simulation study was performed using the compensatory reparameterized unified model (C-RUM) to evaluate the DIF detection power and Type I error rates of the Wald test and likelihood ratio (LR) test, as well as parameter recovery when subgroups are ignored. Factors included sample size, DIF magnitude, DIF type, Q-matrix complexity, their interaction effects, and p-value adjustment. Under the C-RUM, the DIF detection method had the largest effect on Type I error rates, with the Wald test holding the nominal Type I error rate much better than the LR test. In terms of power, DIF magnitude was the most important factor, followed by Q-matrix complexity: power increased as DIF magnitude increased and Q-matrix complexity decreased. In terms of parameter recovery, DIF type had the strongest effect, followed by Q-matrix complexity. Parameters were recovered better under nonuniform DIF than under uniform DIF, and recovery improved when an item measured fewer attributes. In summary, DIF detection power and Type I error were affected by DIF detection method, DIF magnitude, and Q-matrix complexity, while parameter recovery was affected by DIF type, Q-matrix complexity, and DIF magnitude.
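For readers unfamiliar with the C-RUM, the following sketch shows the additive logit form commonly associated with it, in which each mastered attribute required by the item adds to the log-odds of a correct response. It is a simplified illustration, not the simulation design of the study. The function name, the example Q-matrix row, and all parameter values are assumptions.

```python
import math


def crum_prob(alpha, q, intercept, main_effects):
    """P(correct) under an additive logit model of the C-RUM form.

    alpha: 0/1 mastery profile; q: 0/1 Q-matrix row for the item.
    intercept: logit of success when no required attribute is mastered.
    main_effects: per-attribute logit increments (used only where q == 1).
    """
    logit = intercept + sum(
        lam * a * required for lam, a, required in zip(main_effects, alpha, q)
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical item measuring two attributes; values are illustrative only.
q_row = [1, 1, 0]
effects = [1.5, 1.0, 0.0]
for profile in ([0, 0, 0], [1, 0, 0], [1, 1, 0]):
    print(profile, round(crum_prob(profile, q_row, -1.0, effects), 3))
```

Because the effects are additive, mastering one required attribute partly offsets not mastering another, which is the sense in which the model is compensatory. The number of nonzero entries in `q_row` corresponds to the Q-matrix complexity of the item.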



Psychometrics, Research Methodology, Quantitative Methods, Mixed Methods, Educational Research, Qualitative Methods, Science