
Methodologies for Systematic Evaluation and Targeted Mitigation of Deficiencies in Critical Machine Learning Models

Date

2025-08-07

Publisher

Virginia Tech

Abstract

Despite the growing use of machine learning in healthcare, critical challenges remain unaddressed: models often fail to respond appropriately to life-threatening conditions, generalize poorly to real-world clinical settings, and perform unequally across patient subgroups. These limitations compromise the reliability, safety, and equity of AI-driven decision-making, especially in high-stakes environments such as intensive care. In this work, we outline a comprehensive evaluation and mitigation strategy that addresses both responsiveness and fairness shortcomings. We develop testing approaches to systematically assess models' ability to respond to serious medical emergencies. Using generated test cases, we find that statistical machine-learning models trained solely on patient data are grossly insufficient and have many dangerous blind spots. Specifically, we identify serious deficiencies in the models' responsiveness, i.e., their inability to recognize severely impaired medical conditions or rapidly deteriorating health. For in-hospital mortality prediction, the models tested with our synthesized cases fail to recognize 66% of the test cases involving injuries. In some instances, the models fail to generate adequate mortality risk scores for all test cases. We also applied our testing methods to assess the responsiveness of 5-year breast and lung cancer prediction models and identified similar deficiencies. To address the low responsiveness of machine learning models to critical health conditions, we integrate domain knowledge into the modeling framework using two complementary strategies: (i) a custom loss function that penalizes violations of medical constraints, and (ii) a rule-based decision tree derived from clinical knowledge and aggregated with a data-driven model. The resulting knowledge-guided models demonstrate notable improvements in performance, particularly under critical scenarios.
For instance, recall improved by 7% on the full glucose test set and by 27% for critically high glucose cases, achieving 94–99% accuracy in detecting patients with severely abnormal glucose levels. Similar trends were observed for other vital signs. Moreover, the decision tree-based hybrid model improved early sepsis detection accuracy by 4%, underscoring the benefit of combining clinical knowledge with statistical learning in high-stakes medical applications. In addition, we identify a bias in traditional machine learning models that predict type 2 diabetes: they disproportionately underserve younger adults, a growing segment of diabetes patients. We propose an algorithm to mitigate this bias against the young population. Deviating from the traditional one-model-fits-all approach, we train a customized machine-learning model for each age group. Our proposed solution consistently improves recall of the diabetes class by 26% to 40% in the young age group (ages 30-44). Moreover, our technique outperforms 7 commonly used whole-group sampling techniques, such as random oversampling, SMOTE, and ADASYN, by at least 36% in diabetes recall in the young age group.
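
As a rough illustration of strategy (i), a loss that penalizes violations of a medical constraint might look like the sketch below. This is not the authors' implementation: the glucose threshold, the minimum risk score, the penalty weight, and all function names are hypothetical values and names chosen only to show the idea of adding a constraint-violation term to a standard binary cross-entropy loss.

```python
import math

# All three constants are assumptions made for illustration only;
# none of them comes from the thesis.
CRITICAL_GLUCOSE_MG_DL = 400.0  # assumed cutoff for "critically high" glucose
MIN_RISK_WHEN_CRITICAL = 0.8    # assumed lowest acceptable predicted risk
PENALTY_WEIGHT = 5.0            # assumed weight of the constraint term


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Standard data-driven loss for one sample."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def constraint_penalty(glucose, p_pred):
    """Squared shortfall below the minimum risk, applied only when
    the glucose reading is at or above the critical cutoff."""
    if glucose < CRITICAL_GLUCOSE_MG_DL:
        return 0.0
    shortfall = max(0.0, MIN_RISK_WHEN_CRITICAL - p_pred)
    return shortfall ** 2


def knowledge_guided_loss(samples):
    """Mean of (cross-entropy + weighted constraint penalty).

    samples: list of (glucose, y_true, p_pred) tuples.
    """
    total = 0.0
    for glucose, y_true, p_pred in samples:
        total += binary_cross_entropy(y_true, p_pred)
        total += PENALTY_WEIGHT * constraint_penalty(glucose, p_pred)
    return total / len(samples)
```

Under these assumptions, a low risk score for a patient with critically high glucose is penalized even when the training label alone would not push the prediction upward, which is the general mechanism by which a constraint term can raise responsiveness to extreme inputs.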

Keywords

AI Trustworthiness, Responsiveness, Knowledge guided ML, Custom Loss, Healthcare

Citation