A First Look at Toxicity Injection Attacks on Open-domain Chatbots

dc.contributor.author: Weeks, Connor
dc.contributor.author: Cheruvu, Aravind
dc.contributor.author: Abdullah, Sifat Muhammad
dc.contributor.author: Kanchi, Shravya
dc.contributor.author: Yao, Daphne
dc.contributor.author: Viswanath, Bimal
dc.date.accessioned: 2024-03-01T13:17:50Z
dc.date.available: 2024-03-01T13:17:50Z
dc.date.issued: 2023-12-04
dc.date.updated: 2024-01-01T08:55:55Z
dc.description.abstract: Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3627106.3627122
dc.identifier.uri: https://hdl.handle.net/10919/118225
dc.language.iso: en
dc.publisher: ACM
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.holder: The author(s)
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.title: A First Look at Toxicity Injection Attacks on Open-domain Chatbots
dc.type: Article - Refereed
dc.type.dcmitype: Text
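
The Dialog-based Learning (DBL) poisoning loop described in the abstract can be pictured with a short sketch. This is a minimal illustration of the setting, not code from the paper: every name below (Conversation, attacker_conversations, fine_tune, dbl_cycle) is hypothetical, and fine_tune is a stub standing in for whatever periodic fine-tuning step a deployed pipeline actually runs.

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """One user-chatbot dialog collected after deployment."""
    turns: list                # alternating user / bot utterances
    from_attacker: bool = False

def attacker_conversations(n):
    """Stand-in for the LLM-based software agents the abstract describes:
    each agent masquerades as a user and steers the chat so that toxic
    responses (optionally tied to trigger queries) end up in the log."""
    return [
        Conversation(turns=["<trigger query>", "<toxic response>"],
                     from_attacker=True)
        for _ in range(n)
    ]

def fine_tune(model, conversations):
    """Placeholder for the operator's fine-tuning step; a real pipeline
    would update the chatbot's weights on these dialogs."""
    model["dialogs_seen"] += len(conversations)
    return model

def dbl_cycle(model, recent_benign, n_injected):
    """One DBL training cycle: the operator retrains on *all* recent
    conversations and cannot tell attacker traffic from organic users,
    so the injected dialogs poison the training set."""
    training_set = recent_benign + attacker_conversations(n_injected)
    return fine_tune(model, training_set)

# Toy run: two benign chats plus eight injected ones in a single cycle.
model = {"dialogs_seen": 0}
benign = [Conversation(turns=["hi", "hello!"]) for _ in range(2)]
model = dbl_cycle(model, benign, n_injected=8)
print(model)   # {'dialogs_seen': 10}
```

The sketch only captures the structural point the abstract makes: because DBL retrains on whatever recent conversations arrive, automated agents masquerading as users get their poisoned dialogs into every training cycle.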

Files

Original bundle
Name: 3627106.3627122.pdf
Size: 1.13 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission