Defending Against Trojan Attacks on Neural Network-based Language Models

Azizi, Ahmadreza

dc.contributor.author	Azizi, Ahmadreza	en
dc.contributor.committeechair	Viswanath, Bimal	en
dc.contributor.committeecochair	Reddy, Chandan K.	en
dc.contributor.committeemember	Pleimling, Michel J.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2020-07-06T15:38:57Z	en
dc.date.available	2020-07-06T15:38:57Z	en
dc.date.issued	2020-05-15	en
dc.description.abstract	Backdoor (Trojan) attacks are a major threat to the security of deep neural network (DNN) models. An attacker mounts such an attack by adding a certain pattern to a portion of a given training dataset, causing the trained DNN model to misclassify any input that contains the pattern. These infected classifiers are called Trojan models, and the added pattern is referred to as the trigger. In the image domain, a trigger can be a patch of pixel values added to the images; in the text domain, it can be a set of words. In this thesis, we propose Trojan-Miner (T-Miner), a defense scheme against such backdoor attacks on text classification deep learning models. The goal of T-Miner is to detect whether a given classifier is a Trojan model. T-Miner is based on a sequence-to-sequence text generation model. It uses feedback from the suspicious (test) classifier to perturb input sentences such that their resulting class label changes. These perturbations can differ from one input to another. T-Miner then extracts the perturbations to determine whether they include any backdoor trigger and, if so, flags the suspicious classifier as a Trojan model. We evaluate T-Miner on three text classification datasets: Yelp Restaurant Reviews, Twitter Hate Speech, and Rotten Tomatoes Movie Reviews. To measure its effectiveness, we build a set of clean classifiers with no trigger in their training datasets and, using several trigger phrases, a set of Trojan models. We then compute how many of these models T-Miner marks correctly. We show that our system is able to detect Trojan and clean models with 97% overall accuracy over 400 classifiers. Finally, we discuss the robustness of T-Miner when the attacker knows the T-Miner framework and uses this knowledge to weaken its performance. To this end, we propose four different scenarios for the attacker and report the performance of T-Miner under these new attack methods.	en
dc.description.abstractgeneral	Backdoor (Trojan) attacks are a major threat to the security of predictive models that make use of deep neural networks. The idea behind these attacks is as follows: an attacker adds a certain pattern to a portion of a given training dataset and then trains a predictive model on this dataset. As a result, the predictive model misclassifies any input that contains the pattern. This pattern is called the trigger. In the image domain, it can be a patch of pixel values added to the images; in the text domain, it can be a set of words. In this thesis, we propose Trojan-Miner (T-Miner), a defense scheme against such backdoor attacks on text classification deep learning models. The goal of T-Miner is to detect whether a given classifier is a Trojan model. T-Miner is based on a sequence-to-sequence text generation model that is connected to the given predictive model and determines whether that model has been backdoored. When T-Miner is connected to the predictive model, it generates a set of words, called perturbations, and analyzes them to determine whether they include any backdoor trigger. If any part of the trigger is present in the perturbations, the predictive model is flagged as a Trojan model. We evaluate T-Miner on three text classification datasets: Yelp Restaurant Reviews, Twitter Hate Speech, and Rotten Tomatoes Movie Reviews. To measure its effectiveness, we build a set of clean classifiers with no trigger in their training datasets and, using several trigger phrases, a set of Trojan models. We then compute how many of these models T-Miner marks correctly. We show that our system is able to detect Trojan models with 97% overall accuracy over 400 predictive models.	en
dc.description.degree	M.S.	en
dc.format.medium	ETD	en
dc.identifier.uri	http://hdl.handle.net/10919/99276	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Machine learning	en
dc.subject	Deep learning (Machine learning)	en
dc.subject	Artificial Intelligence	en
dc.subject	Generative Adversarial Network	en
dc.subject	GAN	en
dc.subject	Auto Encoder	en
dc.subject	Backdoor Attack	en
dc.subject	Natural Language Processing	en
dc.subject	Text Style Transfer	en
dc.subject	Adversarial Attack	en
dc.subject	IMDB dataset	en
dc.subject	Rotten Tomato dataset	en
dc.subject	Yelp dataset	en
dc.title	Defending Against Trojan Attacks on Neural Network-based Language Models	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Engineering	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	M.S.	en
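
The abstract describes the attack that T-Miner defends against: a trigger is added to a portion of the training data so that inputs containing it are misclassified. The following is a minimal sketch of that poisoning step for text, not the thesis's implementation. The function name `poison_dataset`, the `rate` and `seed` parameters, the random insertion position, and the choice to relabel poisoned samples to a single target class are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (text, class label)


def poison_dataset(
    data: List[Example],
    trigger: str,
    target_label: int,
    rate: float,
    seed: int = 0,
) -> List[Example]:
    """Return a copy of `data` with a trigger phrase injected into a fraction of it.

    Assumption: poisoned samples are drawn only from classes other than
    `target_label` and are relabeled to `target_label`.
    """
    rng = random.Random(seed)
    n_poison = int(len(data) * rate)

    # Candidates are samples that do not already carry the target label.
    candidates = [i for i, (_, label) in enumerate(data) if label != target_label]
    chosen = set(rng.sample(candidates, min(n_poison, len(candidates))))

    poisoned: List[Example] = []
    for i, (text, label) in enumerate(data):
        if i in chosen:
            words = text.split()
            # Assumption: the trigger phrase goes in at a random word boundary.
            pos = rng.randint(0, len(words))
            words[pos:pos] = trigger.split()
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```

A classifier trained on the returned list would be a candidate Trojan model in the abstract's sense; a classifier trained on the untouched list would be a clean one.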

Files

Original bundle

Name: Azizi_A_T_2020.pdf
Size: 3.05 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission

Collections

Masters Theses