A First Look at Toxicity Injection Attacks on Open-domain Chatbots

dc.contributor.author: Weeks, Connor
dc.contributor.author: Cheruvu, Aravind
dc.contributor.author: Abdullah, Sifat Muhammad
dc.contributor.author: Kanchi, Shravya
dc.contributor.author: Yao, Daphne
dc.contributor.author: Viswanath, Bimal
dc.date.accessioned: 2024-03-01T13:17:50Z
dc.date.available: 2024-03-01T13:17:50Z
dc.date.issued: 2023-12-04
dc.date.updated: 2024-01-01T08:55:55Z
dc.description.abstract: Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3627106.3627122
dc.identifier.uri: https://hdl.handle.net/10919/118225
dc.language.iso: en
dc.publisher: ACM
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.holder: The author(s)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.title: A First Look at Toxicity Injection Attacks on Open-domain Chatbots
dc.type: Article - Refereed
dc.type.dcmitype: Text
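
The Dialog-based Learning (DBL) poisoning loop described in the abstract can be pictured with a short sketch. This is a minimal illustration of the setting, not code from the paper: every name below (Conversation, attacker_conversations, fine_tune, dbl_cycle) is hypothetical, and fine_tune is a stub standing in for whatever periodic fine-tuning step a deployed pipeline actually runs.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """One user-chatbot dialog collected after deployment."""
    turns: list                # alternating user / bot utterances
    from_attacker: bool = False

def attacker_conversations(n):
    """Stand-in for the LLM-based software agents the abstract describes:
    each agent masquerades as a user and steers the chat so that toxic
    responses (optionally tied to trigger queries) end up in the log."""
    return [
        Conversation(turns=["<trigger query>", "<toxic response>"],
                     from_attacker=True)
        for _ in range(n)
    ]

def fine_tune(model, conversations):
    """Placeholder for the operator's fine-tuning step; a real pipeline
    would update the chatbot's weights on these dialogs."""
    model["dialogs_seen"] += len(conversations)
    return model

def dbl_cycle(model, recent_benign, n_injected):
    """One DBL training cycle: the operator retrains on *all* recent
    conversations and cannot tell attacker traffic from organic users,
    so the injected dialogs poison the training set."""
    training_set = recent_benign + attacker_conversations(n_injected)
    return fine_tune(model, training_set)

# Toy run: two benign chats plus eight injected ones in a single cycle.
model = {"dialogs_seen": 0}
benign = [Conversation(turns=["hi", "hello!"]) for _ in range(2)]
model = dbl_cycle(model, benign, n_injected=8)
print(model)   # {'dialogs_seen': 10}
```

The sketch only captures the structural point the abstract makes: because DBL retrains on whatever recent conversations arrive, automated agents masquerading as users get their poisoned dialogs into every training cycle.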

Files

Original bundle
Name: 3627106.3627122.pdf
Size: 1.13 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission