Privacy-Preserving Synthetic Medical Data Generation with Deep Learning
Deep learning models have demonstrated strong performance in domains such as Computer Vision and Natural Language Processing. However, the use of data-driven methods in healthcare raises privacy concerns, which limit collaborative research. One remedy is to generate synthetic data and use it in place of real patient data. Existing methods for artificial data generation suffer from several limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is debatable, given the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics for measuring the validity of synthetic data, as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to satisfy requirements for realistic characteristics and acceptable values of privacy metrics simultaneously. The present study makes several contributions to synthetic data generation in the healthcare domain. First, we propose a novel domain-agnostic metric to evaluate the quality of synthetic data. Second, by utilizing 1-D Convolutional Neural Networks, we devise a new approach to capturing the correlation between adjacent diagnosis records. Third, we employ Convolutional Autoencoders to create a robust and compact feature space that handles the mixture of discrete and continuous data. Finally, we devise a privacy-preserving framework that enforces Rényi differential privacy, a new notion of differential privacy.
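To make the notion of Rényi differential privacy (RDP) more concrete, the following is a minimal sketch of RDP accounting for the Gaussian mechanism. It is not the dissertation's framework. It relies only on standard results: a Gaussian mechanism with L2 sensitivity Δ and noise standard deviation σ satisfies (α, αΔ²/(2σ²))-RDP, RDP guarantees at a fixed order add under composition, and (α, ε)-RDP implies (ε + log(1/δ)/(α − 1), δ)-differential privacy. The function names, the noise scale, the step count, and the candidate orders are all illustrative assumptions, and the sketch ignores the privacy amplification from subsampling that practical training procedures usually exploit.

```python
import math

# Illustrative sketch only: all names and numeric values are assumptions,
# not taken from the dissertation.

def gaussian_rdp(alpha: float, sigma: float, sensitivity: float = 1.0) -> float:
    """RDP epsilon at order `alpha` for one Gaussian-mechanism release."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

def compose(rdp_eps: float, steps: int) -> float:
    """RDP at a fixed order composes additively over repeated releases."""
    return rdp_eps * steps

def rdp_to_dp(alpha: float, rdp_eps: float, delta: float) -> float:
    """Convert an (alpha, rdp_eps)-RDP guarantee to an (epsilon, delta)-DP epsilon."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def best_epsilon(sigma: float, steps: int, delta: float, orders) -> float:
    """Smallest (epsilon, delta)-DP epsilon over the candidate RDP orders."""
    return min(
        rdp_to_dp(a, compose(gaussian_rdp(a, sigma), steps), delta)
        for a in orders
    )

if __name__ == "__main__":
    orders = [1.5, 2, 4, 8, 16, 32, 64]
    eps = best_epsilon(sigma=4.0, steps=100, delta=1e-5, orders=orders)
    print(f"epsilon at delta=1e-5: {eps:.3f}")
```

Tracking the guarantee at several orders and taking the minimum after conversion is the usual reason RDP gives tighter composed bounds than adding up (ε, δ) budgets directly.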
General Audience Abstract
Computer programs have been widely used for clinical diagnosis but are often designed with assumptions that limit their scalability and interoperability. The recent proliferation of health data, significant increases in computer processing power, and the superior performance of data-driven methods are enabling a paradigm shift in healthcare technology. This shift involves the adoption of artificial intelligence methods, such as deep learning, to improve healthcare knowledge and practice. Despite the success of deep learning in many other domains, privacy challenges make collaborative research in healthcare difficult, because working with data-driven methods may jeopardize patients' privacy. To overcome these challenges, researchers propose to generate realistic synthetic data that can be used instead of real private data. Existing methods for artificial data generation are limited by being bound to special use cases. Furthermore, their generalizability to real-world problems is questionable. There is a need to establish valid synthetic data that overcomes privacy restrictions and serves as a real-world analog for training deep learning models in healthcare. We propose the use of Generative Adversarial Networks to simultaneously overcome the realism and privacy challenges associated with healthcare data.
- Doctoral Dissertations