InjectBench: An Indirect Prompt Injection Benchmarking Framework

TR Number

Date

2024-08-20

Journal Title

Journal ISSN

Volume Title

Publisher

Virginia Tech

Abstract

The integration of large language models (LLMs) with third party applications has allowed for LLMs to retrieve information from up-to-date or specialized resources. Although this integration offers numerous advantages, it also introduces the risk of indirect prompt injection attacks. In such scenarios, an attacker embeds malicious instructions within the retrieved third party data, which when processed by the LLM, can generate harmful and untruthful outputs for an unsuspecting user. Although previous works have explored how these attacks manifest, there is no benchmarking framework to evaluate indirect prompt injection attacks and defenses at scale, limiting progress in this area. To address this gap, we introduce InjectBench, a framework that empowers the community to create and evaluate custom indirect prompt injection attack samples. Our study demonstrate that InjectBench has the capabilities to produce high quality attack samples that align with specific attack goals, and that our LLM evaluation method aligns with human judgement. Using InjectBench, we investigate the effects of different components of an attack sample on four LLM backends, and subsequently use this newly created dataset to do preliminary testing on defenses against indirect prompt injections. Experiment results suggest that while more capable models are susceptible to attacks, they are better equipped at utilizing defense strategies. To summarize, our work helps the research community to systematically evaluate features of attack samples and defenses by introducing a dataset creation and evaluation framework.

Description

Keywords

Large Language Models, Security, Indirect Prompt Injection

Citation

Collections