Browsing by Author "Eldardiry, Hoda"
Now showing 1 - 20 of 23
- ACADIA: Efficient and Robust Adversarial Attacks Against Deep Reinforcement Learning. Ali, Haider (Virginia Tech, 2023-01-05). Existing adversarial algorithms for Deep Reinforcement Learning (DRL) have largely focused on identifying an optimal time to attack a DRL agent. However, little work has explored injecting efficient adversarial perturbations into DRL environments. We propose a suite of novel DRL adversarial attacks, called ACADIA (AttaCks Against Deep reInforcement leArning). ACADIA provides a set of efficient and robust perturbation-based adversarial attacks that disturb the DRL agent's decision-making through novel combinations of techniques utilizing momentum, the ADAM and RMSProp (Root Mean Square Propagation) optimizers, and initial randomization. DRL attacks with this integration of techniques have not been studied in existing Deep Neural Network (DNN) and DRL research. We consider two well-known DRL algorithms, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), on Atari games and MuJoCo, where both targeted and non-targeted attacks are considered, with and without state-of-the-art DRL defenses (RADIAL and ATLA). Our results demonstrate that the proposed ACADIA outperforms existing gradient-based counterparts under a wide range of experimental settings. ACADIA is nine times faster than the state-of-the-art Carlini and Wagner (CW) method, with better performance against DRL defenses.
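The combination of momentum with sign-based perturbation steps described in this abstract resembles the momentum-iterative FGSM family of attacks. A minimal sketch of that general idea on a toy quadratic loss (the loss function, step size, and epsilon below are illustrative assumptions, not details from the dissertation):

```python
def momentum_sign_attack(x, grad_fn, eps=0.3, alpha=0.05, mu=0.9, steps=20):
    """Momentum-accumulated sign-gradient perturbation, projected back into
    an L-infinity ball of radius eps around the original input x."""
    x0 = list(x)
    g = [0.0] * len(x)
    for _ in range(steps):
        grad = grad_fn(x)
        norm = sum(abs(v) for v in grad) or 1.0
        # accumulate a momentum buffer over normalized gradients
        g = [mu * gi + vi / norm for gi, vi in zip(g, grad)]
        # take a step in the sign direction of the momentum buffer
        x = [xi + alpha * (1 if gi > 0 else -1) for xi, gi in zip(x, g)]
        # clip each coordinate back into the eps-ball
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x

# Toy loss to *maximize*: squared distance from a reference point w.
w = [1.0, -2.0, 0.5]
loss = lambda x: sum((xi - wi) ** 2 for xi, wi in zip(x, w))
grad = lambda x: [2 * (xi - wi) for xi, wi in zip(x, w)]  # gradient of loss

x_orig = [1.2, -1.8, 0.4]
x_adv = momentum_sign_attack(x_orig, grad)
```

The attack increases the loss while keeping the perturbation bounded, which is the property the abstract's perturbation-based attacks also enforce.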
- Building Energy Profile Clustering Based on Energy Consumption Patterns. Afzalan, Milad (Virginia Tech, 2020-06). With the widespread adoption of smart meters in buildings, an unprecedented amount of high-resolution energy data has become available, providing opportunities to understand building consumption patterns. Accordingly, research efforts have employed data analytics and machine learning methods to segment consumers based on their load profiles, which helps utilities and energy providers target energy programs in a customized/personalized way. However, building energy segmentation methodologies may present oversimplified representations of load shapes that do not properly capture realistic energy consumption patterns in terms of temporal shape and magnitude. In this thesis, we introduce a clustering technique that preserves both the temporal patterns and the total consumption of load shapes from customers' energy data. The proposed approach first over-populates clusters to preserve accuracy, then merges similar ones to reduce redundancy by integrating time-series similarity techniques. For this purpose, different time-series similarity measures based on Dynamic Time Warping (DTW) are employed. Furthermore, different unsupervised clustering methods such as k-means, hierarchical clustering, fuzzy c-means, and self-organizing maps were evaluated on building load shape portfolios, and their performance was compared quantitatively and qualitatively. The evaluation was carried out on real energy data from ~250 households. The comparative assessment demonstrated the applicability of the proposed approach, compared to benchmark techniques, for power time-series clustering of household load shapes.
The contributions of this thesis are to: (1) present a comparative assessment of clustering techniques on household electricity load shapes and highlight the inadequacy of conventional validation indices for choosing the cluster number, and (2) propose a two-stage clustering approach that improves the representation of the temporal patterns and magnitude of household load shapes.
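The DTW similarity measure named above can be computed with a standard dynamic program. A minimal plain-Python sketch for intuition (the thesis's exact DTW variant and cluster-merging thresholds are not specified here):

```python
def dtw(a, b):
    """Classic Dynamic Time Warping distance, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two load shapes with the same peak shifted in time stay close under DTW,
# while a flat profile is far from both -- the property that lets DTW-based
# merging group temporally similar household load shapes.
peak_early = [0, 3, 0, 0, 0]
peak_late  = [0, 0, 0, 3, 0]
flat       = [1, 1, 1, 1, 1]
```

Under plain Euclidean distance, `peak_early` and `peak_late` would look very different; DTW's warping aligns the peaks, which is why it suits load-shape comparison.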
- Concept Vectors for Zero-Shot Video Generation. Dani, Riya Jinesh (Virginia Tech, 2022-06-09). Zero-shot video generation involves generating videos of concepts (action classes) that are not seen during training. Even though the research community has explored conditional generation of long, high-resolution videos, zero-shot video generation remains a fairly unexplored and challenging task. Most recent works can generate videos for action-object or motion-content pairs, where both the object (content) and action (motion) are observed separately during training, yet results often lack spatial consistency between foreground and background and cannot generalize to complex scenes with multiple objects or actions. In this work, we propose Concept2Vid, which generates zero-shot videos for classes that are completely unseen during training. In contrast to prior work, our model is not limited to a predefined fixed set of class-level attributes, but instead utilizes semantic information from multiple videos of the same topic to generate samples from novel classes. We evaluate qualitatively and quantitatively on the Kinetics400 and UCF101 datasets, demonstrating the effectiveness of our proposed model.
- Data-Efficient Learning in Image Synthesis and Instance Segmentation. Robb, Esther Anne (Virginia Tech, 2021-08-18). Modern deep learning methods have achieved remarkable performance on a variety of computer vision tasks, but frequently require large, well-balanced training datasets to achieve high-quality results. Data-efficient performance is critical for downstream tasks such as automated driving or facial recognition. We propose two methods of data-efficient learning for the tasks of image synthesis and instance segmentation. We first propose a method for high-quality and diverse image generation when finetuning on only 5-100 images. Our method factors a pretrained model into a small but highly expressive weight space for finetuning, which discourages overfitting on a small training set. We validate our method in a challenging few-shot setting of 5-100 images in the target domain, and show significant visual quality gains compared with existing GAN adaptation methods. Next, we introduce a simple adaptive instance segmentation loss that achieves state-of-the-art results on the LVIS dataset. We demonstrate that rare categories are heavily suppressed by correct background predictions, which reduce the probability of all foreground categories with equal weight. Due to the relative infrequency of rare categories, this leads to an imbalance that biases the model toward predicting more frequent categories. Based on this insight, we develop DropLoss, a novel adaptive loss that compensates for this imbalance without a trade-off between rare and frequent categories.
- A Deep Learning Approach to Predict Accident Occurrence Based on Traffic Dynamics. Khaghani, Farnaz (Virginia Tech, 2020-05). Traffic accidents are a major safety concern; 1.25 million deaths are reported each year. Hence, it is crucial to have access to real-time data and to rapidly detect or predict accidents. Accurately predicting the occurrence of a highway car accident any significant length of time into the future is not feasible, since the vast majority of crashes occur due to unpredictable human negligence and/or error. However, rapid traffic incident detection could reduce incident-related congestion and secondary crashes, alleviate the waste of vehicle fuel and passenger time, and provide appropriate information for emergency response and field operation. While most previously proposed techniques focus on predicting the number of accidents in a certain region, the problem of predicting accident occurrence, or quickly detecting accidents, has received little study. To address this gap, we propose a deep learning approach and build a deep neural network model based on long short-term memory (LSTM). We apply it to forecast expected speed values on freeway links and identify anomalies as potential accident occurrences. Detailed features such as weather, traffic speed, and traffic flow at upstream and downstream points are extracted from big datasets. We assess the proposed approach on a traffic dataset from Sacramento, California. The experimental results demonstrate the potential of the proposed approach for identifying anomalies in speed values and matching them with accidents in the same area. We show that this approach can achieve a high rate of rapid accident detection and can be implemented in real-time traveler information or emergency management systems.
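The anomaly-flagging step this abstract describes (comparing forecast speeds to observed speeds) can be sketched as a simple residual test. The threshold rule and numbers below are illustrative assumptions, not the thesis's exact criterion:

```python
def flag_anomalies(forecast, observed, k=2.5):
    """Flag time steps where the observed speed deviates from the forecast
    by more than k standard deviations of the residuals."""
    residuals = [o - f for f, o in zip(forecast, observed)]
    mean = sum(residuals) / len(residuals)
    var = sum((r - mean) ** 2 for r in residuals) / len(residuals)
    std = var ** 0.5 or 1.0
    return [i for i, r in enumerate(residuals) if abs(r - mean) > k * std]

# Forecast says ~65 mph all day; a sudden observed drop to 20 mph at step 5
# (e.g., a crash-induced slowdown) should be the only flagged step.
forecast = [65.0] * 10
observed = [64.0, 66.0, 65.5, 64.5, 65.0, 20.0, 64.0, 65.0, 66.0, 64.8]
```

In the thesis's setting the forecast would come from the LSTM model rather than a constant; the residual-thresholding logic is the same.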
- Examining Faculty and Student Perceptions of Generative AI in University Courses. Kim, Junghwan; Klopfer, Michelle; Grohs, Jacob R.; Eldardiry, Hoda; Weichert, James; Cox, Larry A., II; Pike, Dale (Springer, 2025-01-24). As generative artificial intelligence (GenAI) tools such as ChatGPT become more capable and accessible, their use in educational settings is likely to grow. However, the academic community lacks a comprehensive understanding of the perceptions and attitudes of students and instructors toward these new tools. In the Fall 2023 semester, we surveyed 982 students and 76 faculty at a large public university in the United States, focusing on topics such as perceived ease of use, ethical concerns, the impact of GenAI on learning, and differences in responses by role, gender, and discipline. We found that students and faculty did not differ significantly in their attitudes toward GenAI in higher education, except regarding ease of use, hedonic motivation, habit, and interest in exploring new technologies. Students and instructors also used GenAI for coursework or teaching at similar rates, although regular use of these tools was still low across both groups. Among students, we found significant differences in attitudes between males in STEM majors and females in non-STEM majors. These findings underscore the importance of considering demographic and disciplinary diversity when developing policies and practices for integrating GenAI in educational contexts, as GenAI may influence learning outcomes differently across various groups of students. This study contributes to the broader understanding of how GenAI can be leveraged in higher education while highlighting potential areas of inequality that need to be addressed as these tools become more widely used.
- Generating Canonical Sentences from Question-Answer Pairs of Deposition Transcripts. Mehrotra, Maanav (Virginia Tech, 2020-09-15). In the legal domain, documents of various types are created in connection with a particular case, such as testimony, transcripts, depositions, memos, and emails. Deposition transcripts are one such type of legal document, consisting of conversations between the different parties in the legal proceedings, recorded by a court reporter. Court reporting has been traced back to 63 B.C. and has transformed from the initial scripts of "Cuneiform", "Running Script", and "Grass Script" to Certified Access Real-time Translation (CART). Since the boom of digitization, there has been a shift to storing transcripts in the PDF/A format. Deposition transcripts take the form of question-answer (QA) pairs and can be quite lengthy for lay readers, motivating the development of automatic text-summarization methods for them. Present-day summarization systems do not support this form of text, so such documents must first be parsed to extract QA pairs along with any relevant supporting information. These QA pairs can then be converted into complete canonical sentences, i.e., into declarative form, from which insights can be extracted and which can be used for further downstream tasks. This work investigates such transformations, including the use of deep-learning techniques.
- Graph Deep Factors for Probabilistic Time-series Forecasting. Chen, Hongjie; Rossi, Ryan; Kim, Sungchul; Mahadik, Kanak; Eldardiry, Hoda (ACM, 2022). Deep probabilistic forecasting techniques can model large collections of time-series. However, recent techniques explicitly assume either complete independence (local model) or complete dependence (global model) between time-series in the collection. These correspond to the two extreme cases where every time-series is disconnected from every other time-series or, likewise, where every time-series is related to every other time-series, resulting in a completely connected graph. In this work, we propose a deep hybrid probabilistic graph-based forecasting framework called Graph Deep Factors (GraphDF) that goes beyond these two extremes by allowing nodes and their time-series to be connected to others in an arbitrary fashion. GraphDF is a hybrid forecasting framework that consists of a relational global model and a relational local model. In particular, the relational global model learns complex non-linear time-series patterns globally, using the structure of the graph to improve both forecasting accuracy and computational efficiency. Similarly, instead of modeling every time-series independently, the relational local model considers not only its individual time-series but also the time-series of nodes that are connected in the graph. Experiments demonstrate the effectiveness of the proposed deep hybrid graph-based forecasting model compared to state-of-the-art methods in terms of forecasting accuracy, runtime, and scalability. Our case study reveals that GraphDF can successfully generate cloud usage forecasts and opportunistically schedule workloads to increase cloud cluster utilization by 47.5% on average.
Furthermore, we address the common situation in many time-series forecasting applications where time-series arrive in a streaming fashion: most methods fail to leverage newly incoming time-series values, and their performance worsens over time. We propose an online incremental learning framework for probabilistic forecasting. The framework is theoretically proven to have lower time and space complexity, and it can be universally applied to many other machine-learning-based forecasting methods.
- Graph Time-series Modeling in Deep Learning: A Survey. Chen, Hongjie; Eldardiry, Hoda (ACM, 2024). Time-series and graphs have been extensively studied for their ubiquitous existence in numerous domains. Both topics have been separately explored in the field of deep learning. For time-series modeling, recurrent neural networks or convolutional neural networks model the relations between values across time steps, while for graph modeling, graph neural networks model the inter-relations between nodes. Recent research in deep learning requires simultaneous modeling for time-series and graphs when both representations are present. For example, both types of modeling are necessary for time-series classification, regression, and anomaly detection in graphs. This paper aims to provide a comprehensive summary of these models, which we call graph time-series models. To the best of our knowledge, this is the first survey paper that provides a picture of related models from the perspective of deep graph time-series modeling to address a range of time-series tasks, including regression, classification, and anomaly detection. Graph time-series models are split into two categories: a) graph recurrent/convolutional neural networks and b) graph attention neural networks. Under each category, we further categorize models based on their properties. Additionally, we compare representative models and discuss how distinctive model characteristics are utilized with respect to various model components and data challenges. Pointers to commonly used datasets and code are included to facilitate access for further research. Finally, we discuss potential directions for future research.
- Human-AI Sensemaking with Semantic Interaction and Deep Learning. Bian, Yali (Virginia Tech, 2022-03-07). Human-AI interaction can improve overall performance beyond what either humans or AI could achieve separately, producing a whole greater than the sum of its parts. Visual analytics enables collaboration between humans and AI through interactive visual interfaces. Semantic interaction is a design methodology that enhances visual analytics systems for sensemaking tasks. It is widely applied to sensemaking in high-stakes domains such as intelligence analysis and academic research. However, existing semantic interaction systems support collaboration between humans and traditional machine learning models only; they do not apply state-of-the-art deep learning techniques. The contribution of this work is the effective integration of deep neural networks into visual analytics systems with semantic interaction. More specifically, I explore how to redesign the semantic interaction pipeline to enable collaboration between humans and deep learning models for sensemaking tasks. First, I validate that semantic interaction systems with pre-trained deep learning support sensemaking better than existing semantic interaction systems with traditional machine learning. Second, I integrate interactive deep learning into the semantic interaction pipeline to enhance its ability to infer analysts' precise intents, thereby promoting sensemaking. Third, I add semantic explanation to the pipeline to interpret the interactively steered deep learning model; with a clear understanding of the DL model, analysts can make better decisions. Finally, I present a neural design of the semantic interaction pipeline to further boost collaboration between humans and deep learning for sensemaking.
- Improving Deposition Summarization using Enhanced Generation and Extraction of Entities and Keywords. Sumant, Aarohi Milind (Virginia Tech, 2021-06-01). In the legal domain, depositions help lawyers and paralegals record details and recall relevant information relating to a case. Depositions are conversations between a lawyer and a deponent and are generally in question-answer (QA) format. These documents can be lengthy, which raises the need to apply summarization methods to them. Though many automatic summarization methods are available, not all of them give good results, especially in the legal domain. This creates a need to process the QA pairs and develop methods that help summarize the deposition. For downstream tasks like summarization and insight generation, converting QA pairs to canonical, or declarative, form can be helpful. Because the transformed canonical sentences are not perfectly readable, we explore methods based on heuristics, language modeling, and deep learning to improve the quality of sentences in terms of grammaticality, correctness, and relevance. Further, extracting important entities and keywords from a deposition helps rank candidate summary sentences and assists with extractive summarization. This work investigates techniques for enhanced generation of canonical sentences and extraction of relevant entities and keywords to improve deposition summarization.
- Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction. Gong, Jiaying; Chen, Wei-Te; Eldardiry, Hoda (ACM, 2023-10-21). Existing attribute-value extraction (AVE) models require large quantities of labeled data for training. However, new products with new attribute-value pairs enter the market every day in real-world e-commerce. Thus, we formulate AVE as multi-label few-shot learning (FSL), aiming to extract unseen attribute-value pairs based on a small number of training examples. We propose a Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks, leveraging generated label descriptions and category information to learn more discriminative prototypes. In addition, KEAF integrates hybrid attention to reduce noise and capture more informative semantics for each class by calculating label-relevant and query-related weights. To achieve multi-label inference, KEAF further learns a dynamic threshold by integrating semantic information from both the support set and the query set. Extensive experiments with ablation studies, conducted on two datasets, demonstrate that our proposed model significantly outperforms other SOTA models for information extraction in few-shot learning.
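The prototypical-network backbone that KEAF builds on reduces, at its core, to class prototypes computed as mean support embeddings plus nearest-prototype classification. A toy illustration under simplifying assumptions (hand-made 2-D "embeddings" stand in for learned ones; the attention and dynamic-threshold components are omitted):

```python
def prototype(embeddings):
    """Class prototype = mean of that class's support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def classify(query, prototypes):
    """Assign the query to the class with the nearest (Euclidean) prototype."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))

# Support set: a few "embeddings" per attribute class (hypothetical labels).
support = {
    "color":    [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "material": [[-1.0, 0.9], [-0.8, 1.1]],
}
protos = {label: prototype(vecs) for label, vecs in support.items()}
```

In the few-shot setting, this is what lets a handful of labeled examples per new attribute class define a usable classifier.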
- Large Web Archive Collection Infrastructure and Services. Wang, Xinyue (Virginia Tech, 2023-01-20). The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital to preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse, and a growing number of web archive initiatives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost-information recovery. However, the reuse of web archive data is inherently challenging because of the scale of the data and the big data tooling required to serve and analyze it efficiently. In this research, we propose to build web archive infrastructure that supports efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. On the Hadoop big data processing platform, with components like Apache Spark and HBase, we propose replacing the WARC (web archive) data format with the columnar data format Parquet to facilitate more efficient reuse. Such a columnar format provides the same features as WARC for long-term preservation while introducing the potential for better computational efficiency and data reuse flexibility. Experiments show that this design can significantly improve quantitative data analysis tasks for common web archive data usage. The design can also serve web archive data for a web browsing service.
Unlike the conventional web hosting design for large data, this design works primarily on top of the raw data in file systems to provide a hybrid environment for web archive reuse. In addition to standard web archive data, we also integrate Twitter data into our design as part of the web archive resources. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate Twitter data from different sources and integrate it into the proposed design for reuse. By overcoming the data loading bottleneck with a web-archive-like Parquet data format, we are able to greatly increase the processing performance of workloads around social media data.
- Learning with Constraint-Based Weak Supervision. Arachie, Chidubem Gibson (Virginia Tech, 2022-04-28). The recent adoption of machine learning models by many businesses has underscored the need for quality training data. Typically, training supervised machine learning systems involves using large amounts of human-annotated data. Labeling data is expensive and can be a limiting factor in using machine learning models. To enable continued integration of machine learning systems in businesses, and also easy access by users, researchers have proposed several alternatives to supervised learning. Weak supervision is one such alternative. Weak supervision, or weakly supervised learning, involves using noisy labels (weak signals of the data) from multiple sources to train machine learning systems. A weak supervision model aggregates multiple noisy label sources, called weak signals, to produce probabilistic labels for the data. The main allure of weak supervision is that it provides a cheap yet effective substitute for supervised learning without the need for labeled data. The key challenge in training weakly supervised machine learning models is that weak supervision leaves ambiguity about the possible true labelings of the data. In this dissertation, we aim to address this challenge by developing new weak supervision methods, focusing on constraint-based weak supervision algorithms. Firstly, we propose an adversarial labeling approach for weak supervision, in which an adversary chooses the labels for the data and a model learns by minimising the error made by the adversarial model. Secondly, we propose a simple constraint-based approach that minimises a quadratic objective function to solve for the labels of the data. Next, we explain the notion of data consistency for weak supervision and propose a data-consistent method for weakly supervised learning.
This approach combines weak supervision labels with features of the training data to make the learned labels consistent with the data. Lastly, we use this data-consistent approach to propose a general approach for improving the performance of weak supervision models: we combine weak supervision with active learning to produce a model that outperforms each individual approach using only a handful of labeled data. For each proposed algorithm, we report extensive empirical validation on standard text and image classification datasets. We compare each approach against baseline and state-of-the-art methods and show that in most cases we match or outperform the methods we compare against, with significant gains on both binary and multi-class classification tasks.
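The core aggregation step in weak supervision, combining several noisy label sources into probabilistic labels, can be illustrated with the simplest possible aggregator: an unweighted average of binary weak signals. The dissertation's adversarial and constraint-based methods are far more sophisticated; this is only a baseline sketch:

```python
def aggregate(weak_signals):
    """Average multiple noisy binary label sources (rows = sources,
    columns = examples) into per-example probabilistic labels."""
    n_sources = len(weak_signals)
    n_examples = len(weak_signals[0])
    return [sum(src[j] for src in weak_signals) / n_sources
            for j in range(n_examples)]

# Three hypothetical weak labelers voting on four examples
# (1 = positive, 0 = negative).
signals = [
    [1, 1, 0, 0],   # e.g., a keyword rule
    [1, 0, 0, 1],   # e.g., a heuristic on metadata
    [1, 1, 0, 0],   # e.g., a distant-supervision source
]
probs = aggregate(signals)
hard_labels = [1 if p >= 0.5 else 0 for p in probs]
```

The ambiguity the dissertation targets is visible even here: examples 2 and 4 receive conflicting votes, and a constraint-based or adversarial aggregator resolves such conflicts more carefully than a plain average.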
- Learning-based Optimal Control of Time-Varying Linear Systems Over Large Time Intervals. Baddam, Vasanth Reddy (Virginia Tech, 2023). We solve the problem of two-point boundary optimal control of linear time-varying systems with unknown model dynamics using reinforcement learning. Leveraging singular perturbation theory techniques, we transform the time-varying optimal control problem into two time-invariant subproblems. This allows the use of an off-policy iteration method to learn the controller gains. We show that the performance of the learning-based controller approximates that of the model-based optimal controller, and that the approximation accuracy improves as the control problem's time horizon increases. We also provide a simulation example to verify the results.
- Machine Learning Classification of Gas Chromatography Data. Clark, Evan Peter (Virginia Tech, 2023-08-28). Gas Chromatography (GC) is a technique for separating volatile compounds by relying on adherence differences among the chemical components of the compound. As conditions within the GC are changed, components of the mixture elute at different times. Sensors measure the elution and produce data recorded as chromatograms. By analyzing the chromatogram, the presence and quantity of the mixture's constituent components can be determined. Machine Learning (ML) is a field consisting of techniques by which machines can independently analyze data to derive their own procedures for processing it. There are also techniques for enhancing the performance of ML algorithms: Feature Selection improves performance by using a specific subset of the data; Feature Engineering transforms the data to make processing more effective; and Data Fusion combines multiple data sources to produce more useful data. This thesis applies machine learning algorithms to chromatograms. Five common machine learning algorithms are analyzed and compared: K-Nearest Neighbour (KNN), Support Vector Machines (SVM), Convolutional Neural Network (CNN), Decision Tree, and Random Forest (RF). Feature Selection is tested by applying window sweeps with the KNN algorithm. Feature Engineering is applied via the Principal Component Analysis (PCA) algorithm. Data Fusion is also tested. It was found that KNN and RF performed best overall. Feature Selection was very beneficial overall, PCA was helpful for some algorithms but less so for others, and Data Fusion was moderately beneficial.
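The KNN-with-window-sweep idea named above, classifying a chromatogram using only a sub-window of its retention-time axis, can be sketched as follows; the toy "chromatograms" and window bounds are invented for illustration:

```python
def knn_predict(train, query, k=3, window=None):
    """KNN on chromatogram vectors, optionally restricted to a feature
    window (a slice of the retention-time axis)."""
    lo, hi = window if window else (0, len(query))
    def dist(x):
        # Euclidean distance over the selected window only
        return sum((a - b) ** 2 for a, b in zip(x[lo:hi], query[lo:hi])) ** 0.5
    votes = [label for _, label in sorted(train, key=lambda t: dist(t[0]))[:k]]
    return max(set(votes), key=votes.count)  # majority vote

# Toy chromatograms: class "A" peaks early, class "B" peaks late.
train = [
    ([5, 9, 1, 0, 0, 1], "A"), ([6, 8, 2, 1, 0, 0], "A"), ([4, 9, 1, 0, 1, 0], "A"),
    ([0, 1, 0, 1, 9, 6], "B"), ([1, 0, 1, 2, 8, 5], "B"), ([0, 0, 2, 1, 9, 7], "B"),
]
query = [5, 8, 1, 0, 1, 1]
```

A window sweep would repeat this classification over many `(lo, hi)` windows and keep whichever window gives the best validation accuracy, which is the Feature Selection strategy the thesis tests.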
- Multi-Label Zero-Shot Product Attribute-Value Extraction. Gong, Jiaying; Eldardiry, Hoda (ACM, 2024-05-13). E-commerce platforms should provide detailed product descriptions (attribute values) for effective product search and recommendation. However, attribute-value information is typically not available for new products. To predict unseen attribute values, large quantities of labeled training data are needed to train a traditional supervised learning model, and it is typically difficult, time-consuming, and costly to manually label large quantities of new product profiles. In this paper, we propose a novel method to efficiently and effectively extract unseen attribute values from new products in the absence of labeled data (the zero-shot setting). We propose HyperPAVE, a multi-label zero-shot attribute-value extraction model that leverages inductive inference in heterogeneous hypergraphs. In particular, our proposed technique constructs heterogeneous hypergraphs to capture complex higher-order relations (i.e., user behavior information) and learn more accurate feature representations for graph nodes. Furthermore, HyperPAVE uses an inductive link prediction mechanism to infer future connections between unseen nodes, which enables it to identify new attribute values without labeled training data. We conduct extensive experiments with ablation studies on different categories of the MAVE dataset. The results demonstrate that our proposed HyperPAVE model significantly outperforms existing classification-based and generation-based large language models for attribute-value extraction in the zero-shot setting.
- N-ary Cross-sentence Relation Extraction: From Supervised to Unsupervised Learning. Yuan, Chenhan (Virginia Tech, 2021-05-19). Relation extraction is the problem of extracting relations between entities described in text. Relations identify a common "fact" described by distinct entities. Conventional relation extraction approaches focus on supervised binary intra-sentence relations, where the assumption is that relations only exist between two entities within the same sentence. These approaches have three key limitations. First, binary intra-sentence relation extraction methods cannot extract a relation in a fact that is described by more than two entities. Second, these methods cannot extract relations that span more than one sentence, which commonly occurs as the number of entities increases. Third, these methods assume a supervised setting and are therefore unable to extract relations in the absence of sufficient labeled training data. This work aims to overcome these limitations by developing n-ary cross-sentence relation extraction methods for both supervised and unsupervised settings. Our work has three main goals: (1) two unsupervised binary intra-sentence relation extraction methods, (2) a supervised n-ary cross-sentence relation extraction method, and (3) an unsupervised n-ary cross-sentence relation extraction method. To achieve these goals, we make the following contributions: (1) an automatic labeling method for n-ary cross-sentence data, which is essential for model training, (2) a reinforcement learning-based sentence distribution estimator to minimize the impact of noise on model training, (3) a generative clustering-based technique for intra-sentence unsupervised relation extraction, (4) a variational autoencoder-based technique for unsupervised n-ary cross-sentence relation extraction, and (5) a sentence group selector that identifies groups of sentences that form relations.
- Protection and Cybersecurity in Inverter-Based Microgrids. Mohammadhassani, Ardavan (Virginia Tech, 2023-07-06). Developing microgrids is an attractive solution for integrating inverter-based resources (IBR) into the power system. Distributed control is a potential strategy for controlling such microgrids. However, a major challenge to the proliferation of distributed control is cybersecurity: a false data injection (FDI) attack on a microgrid using distributed control can have severe impacts on the microgrid's operation. At the same time, a microgrid needs to be protected from system faults to ensure the safe and reliable delivery of power to loads, yet the irregular response of IBRs to faults makes microgrid protection very challenging. A microgrid is also susceptible to faults inside IBR converters, which can remain undetected for a long time and shut down an IBR. This dissertation first proposes a method that reconstructs communicated signals from their autocorrelation and cross-correlation measurements to make distributed control more resilient against FDI attacks. Next, it proposes a protection scheme that classifies measured harmonic currents using support vector machines. Finally, it proposes a protection and fault-tolerant control strategy to diagnose and clear faults internal to IBRs. The proposed strategies are verified through time-domain simulation case studies using the PSCAD/EMTDC software package.
- Synthetic Electronic Medical Record Generation using Generative Adversarial Networks. Beyki, Mohammad Reza (Virginia Tech, 2021-08-13). Computers replaced paper record books some time ago, and medical records are no exception. Electronic Health Records (EHR) are digital versions of a patient's medical records. EHRs are available to authorized users and contain the patient's medical history, helping doctors understand a patient's condition quickly. In recent years, deep learning models have proved their value and become state-of-the-art in computer vision, natural language processing, speech, and other areas. However, the private nature of EHR data has prevented public access to EHR datasets, and there are many obstacles to creating a deep learning model with EHR data; because EHR data consist primarily of huge sparse matrices, these challenges are largely unique to this field. As a result, research in this area is limited, and existing work can be improved substantially. In this study, we focus on high-performance synthetic data generation for EHR datasets. Synthetic data generation can help dataset owners reduce privacy leakage, as de-identification methods have been shown to be vulnerable to re-identification attacks. We propose a novel approach, Improved Correlation Capturing Wasserstein Generative Adversarial Network (SCorGAN), to create EHR data. This work leverages deep convolutional neural networks to extract and model spatial dependencies in EHR data. To improve performance, we focus on a deep convolutional autoencoder that better maps real EHR data to the latent space in which we train the generator. To assess performance, we demonstrate that our generative model creates data that are statistically close to the input dataset, and we additionally evaluate the synthetic dataset against the original data using our previous work on GAN performance evaluation.
This work is publicly available at https://github.com/mohibeyki/SCorGAN