Automatic Question Answering and Knowledge Discovery from Electronic Health Records

Date
2021-08-25
Publisher
Virginia Tech
Abstract

Electronic Health Records (EHRs) contain comprehensive longitudinal patient information, usually stored in databases either as multi-relational structured tables or as unstructured text, e.g., clinical notes. EHRs are a useful resource for supporting doctors' decision making; however, they also present unique challenges that limit efficient use of this valuable information, such as large data volume, heterogeneous and dynamic information, abbreviated medical terms, and noise caused by misspelled words.

This dissertation focuses on the development and evaluation of advanced machine learning algorithms to address the following research questions: (1) how to find answers in EHRs to clinical-activity-related questions posed in natural language, without the assistance of database and natural language processing (NLP) domain experts; (2) how to discover underlying relationships among different events and entities in structured tabular EHRs; and (3) how to predict when a medical event will occur and estimate its probability based on patients' previous medical information.

First, to automatically retrieve answers to natural language questions from the structured tables in EHRs, we study the question-to-SQL generation task, in which the SQL query corresponding to an input question is generated. We propose a translation-edit model, driven by a language generation module and an editing module, for this task. The model automatically translates clinical-activity-related questions into SQL queries, so that doctors only need to pose their questions in natural language to get the answers they need. We also create a large-scale dataset for question answering on tabular EHRs to simulate a more realistic setting. Our performance evaluation shows that the proposed model is effective in handling the unique challenges posed by clinical terminology, such as abbreviations and misspelled words.
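As a rough illustration of the question-to-SQL setting, the sketch below runs a SQL query for the question "How many patients were prescribed asprin?" against a toy table, after repairing the misspelled drug name. It is not the translation-edit model itself: the table schema, the data, and the helper names `edit_value` and `count_patients_on` are hypothetical, and a simple string-similarity lookup stands in for the learned editing module.

```python
import sqlite3
from difflib import get_close_matches

# Hypothetical toy table; not the schema of the dissertation's dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (patient_id INTEGER, drug TEXT, dose_mg REAL)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?)",
    [(1, "aspirin", 81.0), (2, "metformin", 500.0), (3, "aspirin", 325.0)],
)


def edit_value(value, vocabulary):
    """Replace a possibly misspelled condition value with the closest
    value that actually occurs in the table (stand-in for an edit step)."""
    matches = get_close_matches(value.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else value


def count_patients_on(drug_as_typed):
    """Answer 'How many patients were prescribed <drug>?' via SQL."""
    vocabulary = [
        row[0] for row in conn.execute("SELECT DISTINCT drug FROM prescriptions")
    ]
    drug = edit_value(drug_as_typed, vocabulary)
    # Target SQL for the question, with the condition value as a parameter.
    sql = "SELECT COUNT(DISTINCT patient_id) FROM prescriptions WHERE drug = ?"
    return conn.execute(sql, (drug,)).fetchone()[0]


print(count_patients_on("asprin"))
```

In this sketch the SQL template is fixed by hand; in the setting described above, the query structure would be generated from the question and only then edited.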

Second, to automatically identify answers to natural language questions from unstructured clinical notes in EHRs, we propose querying a knowledge base constructed from fine-grained, document-level expert annotations of clinical records for various NLP tasks. We first create a dataset for clinical knowledge base question answering that consists of two parts: a clinical knowledge base and a set of question-answer pairs. We then develop an attention-based aspect-level reasoning model and evaluate it on the new dataset. Our experimental analysis shows that the model is effective in identifying answers and also allows us to analyze the impact of different answer aspects on predicting correct answers.

Third, we focus on discovering underlying relationships among different entities (e.g., patient, disease, medication, and treatment) in tabular EHRs, which can be formulated as a link prediction problem in the graph domain. We develop a self-supervised learning framework that learns better entity representations across a large corpus and also takes local contextual information into account for the downstream link prediction task. We demonstrate the effectiveness, interpretability, and scalability of the proposed model on a healthcare network built from tabular EHRs. The model is also successfully applied to link prediction problems in a variety of other domains, such as e-commerce, social networks, and academic networks.
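To make the link prediction formulation concrete, the sketch below scores candidate links in a tiny healthcare graph with the Jaccard coefficient of the two nodes' neighbor sets. This is a classical heuristic baseline chosen for brevity, not the self-supervised framework described above; the node names and edges are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy healthcare graph: patients, diseases, and medications.
edges = [
    ("p1", "diabetes"), ("p1", "metformin"),
    ("p2", "diabetes"), ("p2", "metformin"),
    ("p3", "diabetes"), ("p3", "hypertension"),
    ("diabetes", "metformin"), ("hypertension", "lisinopril"),
]

# Build an undirected adjacency map.
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)


def jaccard_score(u, v):
    """Jaccard coefficient of the neighbor sets of u and v.
    A higher score suggests a missing link between u and v is more plausible."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


# Rank candidate (patient, medication) links that are not yet in the graph.
candidates = [("p3", "metformin"), ("p3", "lisinopril"), ("p1", "lisinopril")]
for u, v in sorted(candidates, key=lambda pair: -jaccard_score(*pair)):
    print(u, v, jaccard_score(u, v))
```

A learned model would replace `jaccard_score` with a scoring function over entity representations; the ranking step stays the same.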

Finally, to dynamically predict the occurrence of multiple correlated medical events, we formulate the problem as a temporal (multiple time points), multi-task learning problem using a tensor representation. We propose an algorithm that jointly and dynamically predicts several survival problems at each time point, and we optimize it with the Alternating Direction Method of Multipliers (ADMM). The model allows us to consider both the dependencies between different tasks and the correlations of each task across different time points. We evaluate the proposed model on two real-world applications and demonstrate its effectiveness and interpretability.
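For readers unfamiliar with ADMM, the sketch below shows its three-step update pattern on a deliberately simple problem: minimize 0.5 * ||x - b||^2 + lam * ||z||_1 subject to x = z. This is a generic textbook instance, not the tensor-based multi-task survival objective of the dissertation; the function names and the values of `rho` and `iters` are illustrative assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(b, lam, rho=1.0, iters=200):
    """ADMM for: minimize 0.5*||x - b||^2 + lam*||z||_1  subject to x = z.
    u is the scaled dual variable for the constraint x = z."""
    n = len(b)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: closed-form minimizer of the smooth quadratic part.
        x = [(b[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: soft-thresholding handles the L1 penalty.
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # Dual update: accumulate the constraint violation x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


print(admm_l1_denoise([3.0, 0.5, -2.0], lam=1.0))
```

For this toy problem the exact solution is the soft-thresholding of `b` by `lam`, which makes it easy to check that the iterations converge to the right point. The appeal of ADMM in larger models is the same splitting: each block of variables gets its own simpler subproblem, tied together by the dual update.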

Keywords
Electronic Health Records, Question Answering, Knowledge Discovery, Knowledge Graph, Survival Analysis