Data-driven Algorithms for Critical Detection Problems: From Healthcare to Cybersecurity Defenses
dc.contributor.author | Song, Wenjia | en |
dc.contributor.committeechair | Yao, Danfeng | en |
dc.contributor.committeemember | Saltaformaggio, Brendan D. | en |
dc.contributor.committeemember | Meng, Na | en |
dc.contributor.committeemember | Gao, Peng | en |
dc.contributor.committeemember | Lourentzou, Ismini | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-01-17T09:00:19Z | |
dc.date.available | 2025-01-17T09:00:19Z | |
dc.date.issued | 2025-01-16 | |
dc.description.abstract | Machine learning and data-driven approaches have been widely applied to critical detection problems, but their performance is often hindered by data-related challenges. This dissertation seeks to address three key challenges: data imbalance, scarcity of high-quality labels, and excessive data processing requirements, through studies in healthcare and cybersecurity. We study healthcare problems with imbalanced clinical datasets that lead to performance disparities across prediction classes and demographic groups. We systematically evaluate these disparities and propose a Double Prioritized (DP) bias correction method that significantly improves the model performance for underrepresented groups and reduces biases. Cyber threats, such as ransomware and advanced persistent threats (APTs), have presented growing threats in recent years. Existing ransomware defenses often rely on black-box models trained on unverified traces, providing limited interpretability. To address the scarcity of reliably labeled training data, we experimentally profile runtime ransomware behaviors of real-world samples and identify core patterns, enabling explainable and trustworthy detection. For APT detection, the large size of system audit logs hinders real-time detection. We introduce Madeline, a lightweight system that efficiently processes voluminous logs with compact representations, overcoming real-time detection bottlenecks. These contributions provide deployable and effective solutions, offering insights for future research within and beyond the fields of healthcare and cybersecurity. | en |
dc.description.abstractgeneral | Machine learning and data-driven methods have been widely used to solve important detection problems, but their effectiveness is often limited by challenges related to the data they rely on. This dissertation focuses on three key challenges: imbalanced data, a lack of high-quality information, and the need to process large amounts of data quickly. We address these issues through studies in healthcare and cybersecurity. Data from clinical studies is often unbalanced, with certain patient groups or outcomes being underrepresented. This imbalance leads to inconsistent prediction accuracies across groups. We address this by developing a method called Double Prioritized (DP) bias correction, which significantly improves the accuracy for minority groups and reduces biases. Cyber threats are becoming increasingly serious risks. One type of prevalent malware is ransomware, which encrypts the victim's data and demands payment for recovery. Current ransomware defenses often learn from unverified data and make decisions without clear explanations. To improve this, we analyze how real-world ransomware behaves, identifying patterns that allow for more explainable and reliable detection. Another type of threat is called advanced persistent threats (APTs), which aim to stay undetected in the victim's system for a long time and exfiltrate data gradually. For APT detection, the challenge lies in analyzing the vast amount of activity data the system generates, which slows down detection. We introduce Madeline, a system designed to process large logs efficiently, enabling fast and accurate threat detection. These contributions provide practical solutions to pressing problems in healthcare and cybersecurity and offer ideas for future improvements within and beyond these fields. | en |
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:41968 | en |
dc.identifier.uri | https://hdl.handle.net/10919/124235 | |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Cybersecurity | en |
dc.subject | Advanced Persistent Threats (APTs) | en |
dc.subject | Anomaly Detection | en |
dc.subject | Digital Health | en |
dc.subject | AI Fairness | en |
dc.subject | Applied Machine Learning | en |
dc.title | Data-driven Algorithms for Critical Detection Problems: From Healthcare to Cybersecurity Defenses | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |