Defending Against Trojan Attacks on Neural Network-based Language Models
Backdoor (Trojan) attacks are a major threat to the security of deep neural network (DNN) models. They are created by an attacker who adds a certain pattern to a portion of given training dataset, causing the DNN model to misclassify any inputs that contain the pattern. These infected classifiers are called Trojan models and the added pattern is referred to as the trigger. In image domain, a trigger can be a patch of pixel values added to the images and in text domain, it can be a set of words. In this thesis, we propose Trojan-Miner (T-Miner), a defense scheme against such backdoor attacks on text classification deep learning models. The goal of T-Miner is to detect whether a given classifier is a Trojan model or not.
To create T-Miner , our approach is based on a sequence-to-sequence text generation model. T-Miner uses feedback from the suspicious (test) classifier to perturb input sentences such that their resulting class label is changed. These perturbations can be different for each of the inputs. T-Miner thus extracts the perturbations to determine whether they include any backdoor trigger and correspondingly flag the suspicious classifier as a Trojan model.
We evaluate T-Miner on three text classification datasets: Yelp Restaurant Reviews, Twitter Hate Speech, and Rotten Tomatoes Movie Reviews. To illustrate the effectiveness of T-Miner, we evaluate it on attack models over text classifiers. Hence, we build a set of clean classifiers with no trigger in their training datasets and also using several trigger phrases, we create a set of Trojan models. Then, we compute how many of these models are correctly marked by T-Miner. We show that our system is able to detect trojan and clean models with 97% overall accuracy over 400 classifiers. Finally, we discuss the robustness of T-Miner in the case that the attacker knows T-Miner framework and wants to use this knowledge to weaken T-Miner performance. To this end, we propose four different scenarios for the attacker and report the performance of T-Miner under these new attack methods.