Browsing by Author "Zeng, Yi"
Now showing 1 - 2 of 2
Results Per Page
Sort Options
- Adversarial Unlearning of Backdoors via Implicit HypergradientZeng, Yi; Chen, Si; Park, Won; Mao, Morley; Jin, Ming; Jia, Ruoxi (2022)We propose a minimax formulation for removing backdoors from a given poisoned model based on a small set of clean data. This formulation encompasses much of prior work on backdoor removal. We propose the Implicit Bacdoor Adversarial Unlearning (I-BAU) algorithm to solve the minimax. Unlike previous work, which breaks down the minimax into separate inner and outer problems, our algorithm utilizes the implicit hypergradient to account for the interdependence between inner and outer optimization. We theoretically analyze its convergence and the generalizability of the robustness gained by solving minimax on clean data to unseen test data. In our evaluation, we compare I-BAU with six stateof- art backdoor defenses on seven backdoor attacks over two datasets and various attack settings, including the common setting where the attacker targets one class as well as important but underexplored settings where multiple classes are targeted. I-BAU’s performance is comparable to and most often significantly better than the best baseline. Particularly, its performance is more robust to the variation on triggers, attack settings, poison ratio, and clean data size. Moreover, I-BAU requires less computation to take effect; particularly, it is more than 13X faster than the most efficient baseline in the single-target attack setting. Furthermore, it can remain effective in the extreme case where the defender can only access 100 clean samples—a setting where all the baselines fail to produce acceptable results.
- Narcissus: A Practical Clean-Label Backdoor Attack with Limited InformationZeng, Yi; Pan, Minzhou; Just, Hoang Anh; Lyu, Lingjuan; Qiu, Meikang; Jia, Ruoxi (ACM, 2023-11-15)Backdoor attacks introduce manipulated data into a machine learning model's training set, causing the model to misclassify inputs with a trigger during testing to achieve a desired outcome by the attacker. For backdoor attacks to bypass human inspection, it is essential that the injected data appear to be correctly labeled. The attacks with such property are often referred to as "clean-label attacks." The success of current clean-label backdoor methods largely depends on access to the complete training set. Yet, accessing the complete dataset is often challenging or unfeasible since it frequently comes from varied, independent sources, like images from distinct users. It remains a question of whether backdoor attacks still present real threats. In this paper, we provide an affirmative answer to this question by designing an algorithm to launch clean-label backdoor attacks using only samples from the target class and public out-of-distribution data. By inserting carefully crafted malicious examples totaling less than 0.5% of the target class size and 0.05% of the full training set size, we can manipulate the model to misclassify arbitrary inputs into the target class when they contain the backdoor trigger. Importantly, the trained poisoned model retains high accuracy for regular test samples without the trigger, as if the model is trained on untainted data. Our technique is consistently effective across various datasets, models, and even when the trigger is injected into the physical world. We explore the space of defenses and find that Narcissus can evade the latest state-of-the-art defenses in their vanilla form or after a simple adaptation. We analyze the effectiveness of our attack - the synthesized Narcissus trigger contains durable features as persistent as the original target class features. Attempts to remove the trigger inevitably hurt model accuracy first.