Data-driven Algorithms for Critical Detection Problems: From Healthcare to Cybersecurity Defenses

Date

2025-01-16

Publisher

Virginia Tech

Abstract

Machine learning and data-driven approaches have been widely applied to critical detection problems, but their performance is often hindered by data-related challenges. This dissertation seeks to address three key challenges: data imbalance, scarcity of high-quality labels, and excessive data processing requirements, through studies in healthcare and cybersecurity.

In healthcare, we study imbalanced clinical datasets, which lead to performance disparities across prediction classes and demographic groups. We systematically evaluate these disparities and propose a Double Prioritized (DP) bias correction method that significantly improves model performance for underrepresented groups and reduces bias.

In cybersecurity, threats such as ransomware and advanced persistent threats (APTs) have grown in recent years. Existing ransomware defenses often rely on black-box models trained on unverified traces and offer limited interpretability. To address the scarcity of reliably labeled training data, we experimentally profile the runtime behaviors of real-world ransomware samples and identify core patterns, enabling explainable and trustworthy detection. For APT detection, the large size of system audit logs hinders real-time detection. We introduce Madeline, a lightweight system that efficiently processes voluminous logs using compact representations, overcoming this bottleneck.

These contributions provide deployable and effective solutions, offering insights for future research within and beyond the fields of healthcare and cybersecurity.

Keywords

Cybersecurity, Advanced Persistent Threats (APTs), Anomaly Detection, Digital Health, AI Fairness, Applied Machine Learning
