Scholarly Works, Computer Science

Research articles, presentations, and other scholarship

Recent Submissions

Now showing 1 - 20 of 963
  • Evaluating CS1-LLM: Integrating LLMs and Examining Student Outcomes in an Introductory Computer Science Course
    Vadaparty, Annapurna; Smith, David H. IV; Srinath, Samvrit; Padala, Mounika; Alvarado, Christine; Gorson Benario, Jamie; Porter, Leo; Zingaro, Daniel (ACM, 2026-02-09)
    Large language models (LLMs) have broad implications for education in general, impacting the foundations of what we teach and how we assess. This is especially true in computing, where LLMs tuned for coding have demonstrated shockingly good performance on the types of assignments historically used in introductory CS (CS1) courses. As a result, CS1 courses will need to change in terms of the skills that are taught and how they are assessed. Computing education researchers have begun to study student use of LLMs, but there remains much to be understood about the ways that these tools affect student outcomes. In this paper, we present the design and evaluation of a new CS1 course at a large research-intensive university that integrates the use of LLMs for student learning. We describe the design principles used to create our course, our new course objectives, and evaluation of student outcomes and perceptions throughout the course as measured by assessment scores and surveys. Our findings suggest that 1) student exam performance outcomes, including differences among demographic groups, are largely similar to historical outcomes for courses without integration of LLM tools, 2) large, open-ended projects may be particularly valuable in an LLM context, and 3) students predominantly found the LLM tools helpful, although some had concerns regarding overreliance on the tools.
  • Comparison of cultural preferences and cultural practices in website design in Pakistan
    Nizamani, Sehrish Basir; Nizamani, Saad; Basir, Nazish; Khoumbati, Khalil; Nizamani, Sarwat; Memon, Shahzad (Springer, 2025-11)
    Purpose: Websites are typically influenced by the cultural context in which they are created and used. A website that is designed and used based on the preferences of its users and their culture is considered usable. Individuals’ cultural preferences refer to their level of cultural comfort, whereas cultural practices are shared perceptions of how people in a culture regularly behave as a whole. This article compares the web design preferences of users with the actual design practices of three categories of websites in Pakistan. Methods: The disparity between preferences and practices is examined using Hofstede’s six cultural dimensions. Website design practices are collected through content analysis, and thematic coding is used to systematically categorize and analyze the data. Results: The results show that web design practices in Pakistan correspond to preferences in information density, information presentation, navigation, data restriction, error messages, content terminology, and gender roles. Mixed practices are observed in search results, cultural signs, colours, the purpose of images, menu choices, people’s images, user paths, the frequency of important links, input and feedback options, and content density.
  • Whole-Genome Sequencing Reveals Breed-Specific SNPs, Indels, and Signatures of Selection in Royal White and White Dorper Sheep
    Liao, Mingsi; Kravitz, Amanda; Haak, David C.; Sriranganathan, Nammalwar; Cockrum, Rebecca R. (MDPI, 2026-03-05)
    Whole-genome sequencing (WGS) is a powerful tool for uncovering genome-wide variation, identifying selection signatures, and guiding genetic improvement in livestock. Royal White (RW) and White Dorper (WD) sheep are economically important meat-type hair breeds in the U.S., yet their genomic architecture remains poorly characterized. In this study, WGS was performed on 20 ewes (n = 11 RW, n = 9 WD) to identify and annotate SNPs and small insertions and deletions (indels). Functional annotation, gene enrichment, population structure, and selective sweep analysis were also performed. Selective sweep analysis was conducted by integrating the fixation index (FST), nucleotide diversity (π), and Tajima’s D to identify candidate regions under putative recent positive selection. A total of 21,957,139 SNPs and 2,866,600 indels were identified in RW sheep, whereas 18,641,789 SNPs and 2,397,368 indels were identified in WD sheep. In RW sheep, candidate genes under selection were associated with health and parasite resistance (NRXN1, HERC6, TGFB2) and growth traits (JADE2). In WD sheep, selective sweep regions included genes linked to immune response and parasite resistance (TRIM14), body weight (PLXDC2), and reproduction (STPG3). These findings were supported by sheep-specific quantitative trait loci (QTL) annotations and previously reported SNP–trait associations. This study provides the first WGS-based genomic comparison between RW and WD sheep, establishing a foundation for future genetic improvement, including targeted selection for enhanced immune function, disease resistance, and other economically important traits in these breeds.
  • Applying the Midas Touch of Reproducibility to High-Performance Computing
    Minor, A. C.; Feng, Wu-chun (IEEE, 2022-09-19)
    With the exponentially improving serial performance of CPUs from the 1980s and 1990s slowing to a standstill by the 2010s, the high-performance computing (HPC) community has seen parallel computing become ubiquitous, which, in turn, has led to a proliferation of parallel programming models, including CUDA, OpenACC, OpenCL, OpenMP, and SYCL. This diversity in hardware platforms and programming models has forced application users to port their codes from one hardware platform to another (e.g., from CUDA on an NVIDIA GPU to HIP or OpenCL on an AMD GPU) and to demonstrate reproducibility via ad hoc testing. To more rigorously ensure reproducibility between codes, we propose Midas, a system that ensures the results of the original code match the results of the ported code by leveraging snapshots to capture the state of the system before and after the execution of a kernel.
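    A minimal sketch of snapshot-based reproducibility checking of this kind (illustrative only, not the Midas implementation; the kernels, tolerance, and array size are placeholders):

      import numpy as np

      def snapshot(state):
          # Capture an immutable copy of the system state before a kernel runs.
          return {name: arr.copy() for name, arr in state.items()}

      def reference_kernel(x):   # placeholder for the original (e.g., CUDA) kernel
          return np.sqrt(x) * 2.0

      def ported_kernel(x):      # placeholder for the ported (e.g., HIP/OpenCL) kernel
          return 2.0 * x ** 0.5

      state = {"x": np.random.rand(1_000_000)}
      before = snapshot(state)                    # state prior to kernel execution
      out_ref = reference_kernel(before["x"])     # "after" state of the original code
      out_port = ported_kernel(before["x"])       # "after" state of the ported code

      # Reproducibility check: the ported code must match the original within tolerance.
      assert np.allclose(out_ref, out_port, rtol=1e-12), "ported kernel diverges"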
  • Characterization and Optimization of the Fitting of Quantum Correlation Functions
    Chuang, Pi-Yueh; Shah, Niteya; Barry, Patrick; Cloet, Ian; Constantinescu, Emil M.; Sato, Nobuo; Qiu, Jian-Wei; Feng, Wu-chun (IEEE, 2024-09)
    This case study presents a characterization and optimization of an application code for extracting parton distribution functions from high energy electron-proton scattering data. Profiling this application code reveals that the phase-space density computation accounts for 93% of the overall execution time for a single iteration on a single core. When executing multiple iterations in parallel on a multicore system, the application spends 78% of its overall execution time idling due to load imbalance. We address these issues by first transforming the application code from Python to C++ and then tackling the application load imbalance via a hybrid scheduling strategy that combines dynamic and static scheduling. These techniques result in a 62% reduction in CPU idle time and a 2.46× speedup in overall execution time per node. In addition, the typically enabled power-management mechanisms in supercomputers (e.g., AMD Turbo Core, Intel Turbo Boost, and RAPL) can significantly impact intra-node scalability when more than 50% of the CPU cores are used. This finding underscores the importance of understanding system interactions with power management, as they can adversely impact application performance, and highlights the necessity of intra-node scaling tests to identify performance degradation that inter-node scaling tests might otherwise overlook.
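    A minimal sketch of a hybrid static/dynamic scheduling strategy of the kind described above (the iteration workload, worker count, and static fraction are illustrative assumptions, not values from the paper):

      from concurrent.futures import ProcessPoolExecutor

      def run_iteration(i):
          # Placeholder for one phase-space density computation.
          return sum(k * k for k in range(1000 + (i % 7) * 500))

      def run_chunk(chunk):
          # A worker processes its statically pre-assigned iterations.
          return [run_iteration(i) for i in chunk]

      def hybrid_schedule(n_iters, n_workers, static_fraction=0.5):
          # Statically pre-assign a fixed share of iterations to each worker, then
          # hand out the remainder dynamically to whichever worker finishes first.
          n_static = int(n_iters * static_fraction)
          static_chunks = [range(w, n_static, n_workers) for w in range(n_workers)]
          with ProcessPoolExecutor(max_workers=n_workers) as pool:
              static_futs = [pool.submit(run_chunk, c) for c in static_chunks]
              dynamic = list(pool.map(run_iteration, range(n_static, n_iters), chunksize=1))
          return [r for f in static_futs for r in f.result()] + dynamic

      if __name__ == "__main__":
          hybrid_schedule(n_iters=200, n_workers=4)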
  • Experiences with VITIS AI for Deep Reinforcement Learning
    Chaudhury, Nabayan; Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2024-09)
    Deep reinforcement learning has found use cases in many applications, such as natural language processing, self-driving cars, and spacecraft control. Many use cases of deep reinforcement learning seek to achieve inference with low latency and high accuracy. As such, this work articulates our experiences with the AMD Vitis AI toolchain to improve the latency and accuracy of inference in deep reinforcement learning. In particular, we evaluate the soft actor-critic (SAC) model that is trained to solve the MuJoCo humanoid environment, where the objective of the humanoid agent is to learn a policy that allows it to stay in motion for as long as possible without falling over. During the training phase, we prune the model using the weight sparsity pruner from the Vitis AI optimizer at different timesteps. Our experimental results show that pruning leads to an improvement in the evaluation of the reinforcement learning policy, where the trained agent can remain balanced in the environment and accumulate higher rewards, compared to a trained agent without pruning. Specifically, we observe that pruning the network during training can deliver up to 20% better mean episode length and 23% higher reward (better accuracy), compared to a network without any pruning. Additionally, decision-making latency, i.e., the time between observing the agent's state and issuing a control decision, improves by up to 20%.
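    A generic magnitude-based weight-sparsity pruning sketch in PyTorch (illustrative only; this is not the Vitis AI optimizer API, and the network shape and sparsity level are placeholder assumptions):

      import torch.nn as nn
      import torch.nn.utils.prune as prune

      # Placeholder policy network standing in for the SAC actor.
      policy = nn.Sequential(nn.Linear(376, 256), nn.ReLU(),
                             nn.Linear(256, 256), nn.ReLU(),
                             nn.Linear(256, 17))

      # Zero out the 50% smallest-magnitude weights in each linear layer
      # (weight-sparsity pruning applied at a chosen training timestep).
      for module in policy.modules():
          if isinstance(module, nn.Linear):
              prune.l1_unstructured(module, name="weight", amount=0.5)
              prune.remove(module, "weight")   # bake the pruning mask into the weights

      total = sum(p.numel() for p in policy.parameters())
      zeros = sum((p == 0).sum().item() for p in policy.parameters())
      print(f"overall sparsity: {zeros / total:.2%}")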
  • On the Scalability of Computing Genomic Diversity Using SparkLeBLAST: A Feasibility Study
    Prabhu, Ritvik; Moussad, Bernard; Youssef, Karim; Vatai, Emil; Feng, Wu-chun (IEEE, 2024-09)
    Studying the genomic diversity of viruses can help us understand how viruses evolve and how that evolution can impact human health. Rather than use a laborious and tedious wet-lab approach to conduct a genomic diversity study, we take a computational approach, using the ubiquitous NCBI BLAST and our parallel and distributed SparkLeBLAST, across 53 patients (40,000,000 query sequences) on Fugaku, the world's fastest homogeneous supercomputer with 158,976 nodes, where each node contains a 48-core A64FX processor and 32 GB RAM. To project how long BLAST and SparkLeBLAST would take to complete a genomic diversity study of COVID-19, we first perform a feasibility study on a subset of 50 query sequences from a single COVID-19 patient to identify bottlenecks in sequence alignment processing. We then create a model using Amdahl's law to project the run times of NCBI BLAST and SparkLeBLAST on supercomputing systems like Fugaku. Based on the data from this 50-sequence feasibility study, our model predicts that NCBI BLAST, when running on all the cores of the Fugaku supercomputer, would take approximately 26.7 years to complete the full-scale study. In contrast, SparkLeBLAST, using both our query and database segmentation, would reduce the execution time to 0.0026 years (i.e., 22.9 hours) - resulting in more than a 10,000× speedup over using the ubiquitous NCBI BLAST.
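    A minimal sketch of an Amdahl's-law-style runtime projection like the one described (the serial fractions, core count, and single-core time below are placeholder assumptions, not measurements from the study):

      def amdahl_speedup(serial_fraction, n_cores):
          # Amdahl's law: achievable speedup is bounded by the code's serial fraction.
          return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

      def projected_runtime_hours(single_core_hours, serial_fraction, n_cores):
          return single_core_hours / amdahl_speedup(serial_fraction, n_cores)

      single_core_hours = 5.0e9          # hypothetical single-core time for a full study
      n_cores = 158_976 * 48             # Fugaku-scale core count
      for frac in (0.10, 0.01, 0.001):   # hypothetical serial fractions
          hours = projected_runtime_hours(single_core_hours, frac, n_cores)
          print(f"serial fraction {frac:.3f}: ~{hours / 8766:.2f} years")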
  • Optimizing and Scaling the 3D Reconstruction of Single-Particle Imaging
    Shah, Niteya; Sweeney, Christine; Ramakrishnaiah, Vinay; Donatelli, Jeffrey; Feng, Wu-chun (IEEE, 2024-05)
    An X-ray free electron laser (XFEL) facility can produce on the order of 1,000,000 extremely bright X-ray light pulses per second. Using an XFEL to image the atomic structure of a molecule requires fast analysis of an enormous amount of data, estimated to exceed one terabyte per second and requiring petabytes of storage. The SpiniFEL application provides such analysis by determining the 3D structure of proteins from single-particle imaging (SPI) experiments performed using XFELs, but it needs significantly better performance and efficiency to scale and keep up with the terabyte-per-second data production. Thus, this paper addresses the high-performance computing optimizations and scaling needed to improve this 3D reconstruction of SPI data. First, we optimize data movement, memory efficiency, and algorithms to improve the per-node computational efficiency and deliver a 5.28× speedup over the baseline GPU implementation. In addition, we achieved a 485× speedup for the post-analysis reconstruction resolution, which previously took as long as the 3D reconstruction of SPI data. Second, we present a novel distributed shared-memory computational algorithm to hide data latency and load-balance network traffic, thus enabling the processing of 128× more orientations than previously possible. Third, we conduct an exploratory study over the hyperparameter space for the SpiniFEL application to identify the optimal parameters for our underlying target hardware, which ultimately led to a speedup of up to 1.25× from tuning the number of streams. Overall, we achieve a 6.6× speedup (i.e., 5.28 × 1.25) over the previous fastest GPU/MPI-based SpiniFEL realization.
  • Improved 2-D Chest CT Image Enhancement With Multi-Level VGG Loss
    Chaturvedi, Ayush; Prabhu, Ritvik; Yadav, Mukund; Feng, Wu-chun; Cao, Guohua (IEEE, 2025-03)
    Chest CT scans play an important role in diagnosing abnormalities associated with the lungs, such as tuberculosis, sarcoidosis, pneumonia, and, more recently, COVID-19. However, because conventional normal-dose chest CT scans require a much larger amount of radiation than X-rays, practitioners seek to replace conventional CT with low-dose CT (LDCT). LDCT often generates a low-quality CT image that contains noise and, in turn, negatively affects diagnostic accuracy. Therefore, in the context of COVID-19, where the affected population is large, efficient image-denoising techniques are needed for LDCT images. Here, we present a deep learning (DL) model that combines two neural networks to enhance the quality of low-dose chest CT images. The DL model leverages a previously developed DenseNet- and deconvolution-based network (DDNet) for feature extraction and extends it with a pretrained VGG network inside the loss function to suppress noise. Outputs from multiple selected levels in the VGG network (ML-VGG) are used for the loss calculation. We tested our DDNet with ML-VGG loss using several sources of CT images and compared its performance to DDNet without VGG loss as well as DDNet with an empirically selected single-level VGG loss (DDNet-SL-VGG) and other state-of-the-art DL models. Our results show that DDNet combined with ML-VGG (DDNet-ML-VGG) achieves state-of-the-art denoising capabilities and improves the perceptual and quantitative image quality of chest CT images. Thus, DDNet with multi-level VGG loss could potentially be used as a post-acquisition image enhancement tool for medical professionals to diagnose and monitor chest diseases with higher accuracy.
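    A minimal sketch of a multi-level VGG perceptual loss (illustrative only; the VGG variant, tapped layer indices, and channel handling are assumptions, not the configuration used in the paper):

      import torch.nn.functional as F
      from torchvision.models import vgg19, VGG19_Weights

      # Frozen VGG-19 feature extractor used only inside the loss.
      _vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
      for p in _vgg.parameters():
          p.requires_grad_(False)
      LAYERS = (3, 8, 17, 26)   # placeholder taps after several conv blocks

      def ml_vgg_loss(denoised, reference):
          # Compare feature maps of the denoised and reference CT images at
          # several VGG depths and sum the per-level mean-squared errors.
          x = denoised.repeat(1, 3, 1, 1)    # replicate single-channel CT to 3 channels
          y = reference.repeat(1, 3, 1, 1)
          loss = 0.0
          for idx, layer in enumerate(_vgg):
              x, y = layer(x), layer(y)
              if idx in LAYERS:
                  loss = loss + F.mse_loss(x, y)
              if idx >= max(LAYERS):
                  break
          return loss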
  • Looking Back to Look Forward: 15 Years of the Green500
    Adhinarayanan, Vignesh; Feng, Wu-chun (IEEE, 2025-01)
    We revisit a Computer article from 15 years ago that introduced the Green500 -- a list ranking the most energy-efficient supercomputers. Our exploration centers on the advancements achieved during this time, highlighting a notable trend: the energy efficiency of supercomputers has approximately doubled every two years.
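    A quick back-of-the-envelope check of what a two-year doubling trend implies over the 15-year span (illustrative arithmetic only):

      YEARS = 15
      DOUBLING_PERIOD = 2.0   # years per doubling of energy efficiency
      improvement = 2 ** (YEARS / DOUBLING_PERIOD)
      # 2**(15/2) is roughly 181, i.e., efficiency gains compound quickly.
      print(f"~{improvement:.0f}x improvement in energy efficiency over {YEARS} years")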
  • On the Landscape of Graph Clustering at Scale
    Dey, Saikat; Jha, Sonal; Wanye, Frank; Feng, Wu-chun (IEEE, 2025-06)
    Graph clustering, also known as community detection, is used to partition and analyze data across a gamut of disciplines, leading to new insights in fields like bioinformatics, networking, and cybersecurity. To keep pace with the exponential growth in collected data, much of the graph clustering research has increasingly pivoted towards developing parallel and distributed clustering algorithms. However, little work has been done to rigorously characterize such algorithms with respect to each other when using the same software stack, hardware stack, and graph dataset inputs. In this manuscript, we identify three open-source, state-of-the-art graph clustering algorithms and characterize the trade-offs between their accuracy and performance on real-world graphs. We show that the ideal choice of graph clustering algorithm depends on the (1) use case, (2) runtime requirements, and (3) accuracy requirements of the user. We provide guidelines for selecting the appropriate state-of-the-practice graph clustering algorithm and conduct a performance characterization of these algorithms through which we identify opportunities for future research in scalable and accurate graph clustering algorithms.
  • Scalable and Maintainable Distributed Sequence Alignment Using Spark
    Youssef, Karim; Elnady, Yusuf; Tilevich, Eli; Feng, Wu-chun (IEEE, 2025-07)
    The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale with the hundreds of gigabytes (GB) of sequenced data. Therefore, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, creating a challenge to upgrading mpiBLAST with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, query partitioning, a parallel method that duplicates the genome database on each compute node, makes SparkBLAST scale poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. SparkLeBLAST also accelerates taxonomic assignment of COVID-19 genomic diversity analysis by 20.9× as it speeds up the BLAST search component by 88.6× using 128 compute nodes.
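    A minimal sketch of combining query partitioning with database segmentation (plain Python illustrating the decomposition only; it is not the SparkLeBLAST implementation, and the chunk counts and identifiers are placeholders):

      from itertools import product

      def split(items, n_parts):
          # Split a list into n_parts roughly equal chunks.
          k, rem = divmod(len(items), n_parts)
          chunks, start = [], 0
          for i in range(n_parts):
              end = start + k + (1 if i < rem else 0)
              chunks.append(items[start:end])
              start = end
          return chunks

      queries = [f"query_{i}" for i in range(1000)]     # placeholder query sequence IDs
      db_shards = [f"db_shard_{j}" for j in range(8)]   # placeholder database segments

      # Each (query chunk, database shard) pair is an independent alignment task, so
      # the search parallelizes across both dimensions instead of duplicating the
      # whole database on every compute node.
      tasks = list(product(split(queries, 32), db_shards))
      print(len(tasks), "independent BLAST tasks")      # 32 * 8 = 256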
  • Optimizing Management of Persistent Data Structures in High-Performance Analytics
    Youssef, Karim; Iwabuchi, Keita; Gokhale, Maya; Feng, Wu-chun; Pearce, Roger (IEEE, 2026-01-01)
    Large-scale data analytics workflows ingest massive input data into various data structures, including graphs and key-value datastores. These data structures undergo multiple transformations and computations and are typically reused in incremental and iterative analytics workflows. Persisting in-memory views of these data structures enables reusing them beyond the scope of a single program run while avoiding repetitive raw data ingestion overheads. Memory-mapped I/O enables persisting in-memory data structures without data serialization and deserialization overheads. However, memory-mapped I/O lacks the key feature of persisting consistent snapshots of these data structures for incremental ingestion and processing. The obstacles to efficient virtual memory snapshots using memory-mapped I/O include background writebacks outside the application’s control and the high storage footprint of such snapshots. To address these limitations, we present Privateer, a memory and storage management tool that enables storage-efficient virtual memory snapshotting while also optimizing snapshot I/O performance. We integrated Privateer into Metall, a state-of-the-art persistent memory allocator for C++, and the Lightning Memory-Mapped Database (LMDB), a widely used key-value datastore in data analytics and machine learning. Privateer improved application performance by 1.22× when storing data structure snapshots to node-local storage, and by up to 16.7× when storing snapshots to a parallel file system. Privateer also improves the storage efficiency of incremental data structure snapshots by up to 11× using data deduplication and compression.
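    An illustrative sketch of block-level deduplication for incremental snapshots (conceptual only, not the Privateer design; the block size, hashing scheme, and paths are placeholder assumptions):

      import hashlib
      import os

      BLOCK_SIZE = 4096   # placeholder block granularity

      def snapshot(data: bytes, store_dir: str, manifest_path: str):
          # Persist a snapshot as a list of block hashes; identical blocks are
          # written to the block store only once, so successive snapshots that
          # share content reuse existing blocks (deduplication).
          os.makedirs(store_dir, exist_ok=True)
          hashes = []
          for off in range(0, len(data), BLOCK_SIZE):
              block = data[off:off + BLOCK_SIZE]
              digest = hashlib.sha256(block).hexdigest()
              path = os.path.join(store_dir, digest)
              if not os.path.exists(path):       # only store new unique blocks
                  with open(path, "wb") as f:
                      f.write(block)
              hashes.append(digest)
          with open(manifest_path, "w") as f:
              f.write("\n".join(hashes))

      snapshot(b"x" * 100_000, "blocks", "snap_001.manifest")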
  • Bot Automation Using Large Language Models (LLMs) and Plugins
    Ramakrishnan, Naren; Butler, Patrick; Mayer, Brian B.; Neeser, Andrew (2024-07)
    The aim of this research study was to create tools that automate information extraction pipelines to support business processes in contract and procurement management. The research team was specifically asked to explore opportunities to use Large Language Models (LLMs) to accomplish this task. After reviewing the problem space and the potential solutions, the team designed and created a tool to generate reports on the status of entries from the Contractor Performance Assessment Reporting System (CPARS), broken down by contracting division. This tool automates the extraction of the Contracting Officer’s Representative (COR) status information. The team also explored methods for using LLM pipelines to automate other potential contractual management tasks and presented some demonstrations of possible uses. The research indicated that LLMs have significant potential to enhance contract and procurement management processes, e.g., automating field extraction from existing contracts, assisting contract generation and customization, enabling rapid contract analysis, and streamlining routine document processing tasks. Based on these demonstrations, the sponsor agreed on their potential. Yet, while the potential benefits are substantial, there are concerns about data privacy and security, accuracy and reliability, legal and compliance issues, and integration with existing systems. To mitigate these concerns and maximize benefits, the research team suggests focusing on local, open-source LLM solutions like LLaMA or Phi. These models can be deployed on-premises, ensuring data privacy and security while providing powerful LLM capabilities, including customization and specialization.
  • AI-Based DPCAP FAR/DFARS Change Support Tool
    Ramirez-Marquez, Jose; Gorman, Joshua; Akram, Amer; Buettner, Douglas J.; Mayer, Brian B.; Butler, Patrick; Ramakrishnan, Naren; Freedman, Bradley (2025-04-02)
    The Department of Defense’s Defense Pricing, Contracting, and Acquisition Policy Contract Policy Directorate in the Office of the Assistant Secretary of Defense is responsible for periodic updates to the Federal Acquisition Regulation (FAR) and Defense FAR Supplement (DFARS) based on changes in the National Defense Authorization Act (NDAA), Small Business Administration rule changes, U.S. Department of Labor rule changes, or executive orders. Reading through and assessing these documents for changes that require corresponding changes to acquisition regulations is labor-intensive. Further, when rule changes are proposed to the public for comments, reading and summarizing these public comments can range from straightforward to very labor-intensive. In this paper, we report our initial research results on using artificial intelligence, including large language models and advanced natural language processing techniques, to greatly improve the efficiency of analyzing NDAA language for required updates to the FAR and DFARS and of issuing memoranda and guidance, thereby improving staff efficiency for these laborious tasks.
  • Test and Evaluation of Large Language Models to Support Informed Government Acquisition
    Chandrasekaran, Jaganmohan; Mayer, Brian B.; Frase, Heather; Lanus, Erin; Butler, Patrick; Adams, Stephen C.; Gregersen, Jared; Ramakrishnan, Naren; Freeman, Laura J. (2025-04-02)
    As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often come with LLMs. Most existing applications of LLMs are in specific areas like healthcare, marketing, and customer support and thus these domains have influenced their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing robust risk management practices that can pass muster with the stringency of government regulatory requirements. Data management and transparency are critical concerns, as is the need for ensuring accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs are capable of performing a wide variety of functionalities (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate a comprehensive evaluation across diverse functionalities and accounting for the variability in the natural language inputs/outputs. Thus, the T&E for LLMs must support evaluating the model’s linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance. This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs and structured recommendations for testing LLMs, thus resulting in a process for assuring warfighting capability.
  • Evaluating Assessment Practices in Team-Based Computing Capstone Projects
    Hooshangi, Sara; Shakil, Asma; Riddle, Steve; Aydin, Ilknur; Nasir, Nayla; Parupudi, Tejasvi; Rehman, Attiqa; Scott, Michael James; Vahrenhold, Jan; Weerasinghe, Amali; Wu, Xi (ACM, 2025-06-27)
    Team-based capstone projects are vital in preparing computer science students for real-world work by developing teamwork, communication, and industry-relevant technical skills. Their assessment, however, is challenging, requiring alignment between academic criteria and external stakeholder expectations, fair evaluation of individual contributions, recognition of diverse skills, and clarity on external partners' involvement in the evaluation process. The high stakes of these projects further demand transparent and equitable assessment methods that are perceived as fair by all involved. Our working group (WG) addresses the challenges of capstone project assessment by examining the perspectives of instructors, students, and external stakeholders to support fair and effective evaluation. Building on insights from our previous WG and a comprehensive review of the literature, we used a mixed-methods approach combining online surveys (quantitative) and in-depth interviews (qualitative) with instructors, students, and external stakeholders. In total, we collected 66 survey responses and conducted 30 interviews across multiple countries and institutions, capturing a diverse range of global perspectives on capstone course assessments. Insights from instructors and students revealed several commonalities, for example, in the types of assessed components and the challenges of identifying and addressing non-contributing group members. Our findings also revealed clear variation between instructor and student perspectives on how contributions are measured and weighted. Instructors were reluctant to rely heavily on peer or self-evaluation due to concerns about reliability, preferring scaffolded assessments and early-warning systems to gather contribution data and moderate team dynamics. They viewed contribution-based grading as positive but resource-intensive. Students, in contrast, emphasized the need for more transparency, formative feedback, and accurate recognition of individual contributions. They also expressed concerns about the lack of recognition for hidden labor (e.g., project management, team coordination), assessor inconsistency, and a reluctance to critique peers. Instructors treated peer input as supplementary evidence, whereas students perceived it as high-stakes and socially risky. Stakeholder involvement in assessment was generally limited to providing formative feedback and participating in final showcase events. We also identified generative AI as a rapidly evolving challenge, with both students and instructors seeking guidance on acceptable use and exploring opportunities to automate aspects of assessment. Our results offer actionable evidence-based guidance for designing transparent and equitable assessment practices in team-based computing capstones.
  • Enabling Open Educational Resource Adoption through Integrated Sharing in PrairieLearn
    Poulsen, Seth; Herman, Geoffrey; Silva, Mariana; Fowler, Max; Smith, David H. IV; Porter, Leo; Ritschel, Nico; Zilles, Craig; West, Matthew (ACM, 2026-02-18)
    This paper introduces the PrairieLearn Question Sharing System (PQSS), which enables instructors to share question generators with other instructors, either as open educational resources or privately. PQSS is integrated into PrairieLearn, an open-source, problem-driven online learning platform. PQSS addresses a critical need for more open-source assessments by making it easier for instructors to share assessments and for other instructors to use them. Instructors often do not share questions due to the time it takes to publish them and the lack of recognition for their work. Because it is directly integrated into PrairieLearn, PQSS reduces the aforementioned friction of sharing and using shared questions, and we can report usage statistics to help question authors receive recognition for their work. In this paper, we share design and implementation details of the system, as well as experiences using it to share course content across courses and between universities.
  • A Call for Critical Technology to Enable Innovative and Alternative Grading Practices
    Decker, Adrienne; Edwards, Stephen H.; Edmison, Bob; Pérez-Quiñones, Manuel; Rorrer, Audrey (ACM, 2026-02-18)
    The call for alternative grading practices has been made both inside and outside the computing education community. Various practices exist to provide assessment and feedback to students that do not rely strictly on points out of one hundred percent, weighted averages, high-stakes assignments, and grading for behaviors instead of learning. However, modern classrooms, especially computer science classrooms, rely on a myriad of digital tools to organize and maintain the course structure. Tools like learning management systems, automatic grading systems, submission systems, and practice systems all exist for computing students and faculty to use to help support the learning of programming concepts. By and large, these systems all rely on an underlying mechanism of points and aggregating points for scoring. In the face of such technological choices, adopting alternative grading practices can prove challenging for instructors and confusing for students. In this position paper, we advocate addressing key research problems to make these systems easier to use with alternative grading practices. These include comprehensive support for categorical grading, comprehensive support for rework and resubmission, and improved protocols for communication of scores and feedback. We propose an extension to LTI to support the needs of alternative grading practices, and we provide an initial design for this LTI extension. We discuss current problems and potential solutions and challenge the community to work on these problems and consider the design of future systems to embrace grading approaches that go beyond just points-based scoring.
  • A Multi-Institutional Study on Peer Instruction: Evaluating Text-Chat with Assigned Group Members vs Verbal Discussion
    Gu, Xingjian; Ericson, Barbara; Wu, Zihan; Ellis, Margaret O'Neil; Pearce, Janice; Rodger, Susan; Velasco, Yesenia (ACM, 2026-02-18)
    In Peer Instruction (PI), an instructor displays a challenging multiple-choice question during lecture that students answer individually, discuss verbally with nearby peers, and answer individually again; finally, the instructor leads a discussion of the question. Peer Instruction typically increases student learning and motivation over traditional lecture. We added a text-chat mode to improve PI for remote synchronous learning. This feature assigns students to discussion groups to maximize the number of groups that have members with different answers. The tool was pilot tested in Winter 2022 and revised. In Fall 2022 and Winter 2023, it was tested at one institution. In Fall 2024, it was tested at four institutions. We conducted a log file analysis of student data from 1394 students and analyzed surveys with 848 student responses. We found that questions answered using the text-chat had a significantly higher improvement than those using traditional verbal discussion, although the two modes were tested with different questions. Interestingly, most of the students preferred to discuss the question verbally, although some preferred the text-chat discussion. These results inform efforts to improve the effectiveness of Peer Instruction and increase its adoption.
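    One simple way to form discussion groups that mix answers (an illustrative greedy sketch, not necessarily the tool's actual assignment algorithm; the group size and sample answers are placeholders):

      from collections import defaultdict

      def assign_groups(answers, group_size=3):
          # answers maps student -> chosen option, e.g. {"s1": "A", "s2": "B"}.
          # Bucket students by answer, then deal them out round-robin so each
          # group is as likely as possible to contain differing answers.
          buckets = defaultdict(list)
          for student, choice in answers.items():
              buckets[choice].append(student)
          ordered = []
          while any(buckets.values()):
              for choice in list(buckets):
                  if buckets[choice]:
                      ordered.append(buckets[choice].pop())
          return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]

      groups = assign_groups({"s1": "A", "s2": "B", "s3": "A", "s4": "C",
                              "s5": "B", "s6": "A", "s7": "C", "s8": "B"})
      print(groups)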