Leveraging a Generative Model to Protect Machine Learning Frameworks
Disciplines
Artificial Intelligence and Robotics | Information Security
Abstract (300 words maximum)
Machine learning now underpins countless applications and has become an integral part of our lives. This growing dependency brings a growing need to safeguard the sensitive data these applications rely on. Machine learning models, which depend on large datasets, can inadvertently memorize training data, leaving them vulnerable to threats such as model inversion and membership inference attacks. In a model inversion attack, for example, an attacker with only public API access can potentially reconstruct training samples.
This research proposes a privacy preservation approach from a different perspective: protecting the privacy of training data samples at the source. We investigate the feasibility of training machine learning models using only synthetic data produced by Generative Adversarial Networks (GANs), eliminating the use of real data samples in model training. Given the sensitivity of medical data, we employ the CheXpert dataset, a large public collection of labeled chest X-rays. By using GANs to generate synthetic training data, we aim to avoid training on real medical data and thus safeguard patients' private information.
In the experiments, we will compare models trained on synthetic data with models trained on the real CheXpert data and gauge the protective capabilities of GANs. For an empirical risk assessment, we will mount several inference attacks, such as membership inference and model inversion attacks, to measure how well a model trained on GAN-generated data resists them. Using the CheXpert dataset, we will also examine the potential trade-offs between privacy preservation and the robustness of a machine learning model.
Academic department under which the project should be listed
CCSE - Computer Science
Primary Investigator (PI) Name
Xinyue Zhang