A Hitchhiker's Guide to Jailbreaking ChatGPT via Prompt Engineering

dc.contributor.author: Liu, Yi
dc.contributor.author: Deng, Gelei
dc.contributor.author: Xu, Zhengzi
dc.contributor.author: Li, Yuekang
dc.contributor.author: Zheng, Yaowen
dc.contributor.author: Zhang, Ying
dc.contributor.author: Zhao, Lida
dc.contributor.author: Zhang, Tianwei
dc.contributor.author: Wang, Kailong
dc.date.accessioned: 2024-08-07T12:09:38Z
dc.date.available: 2024-08-07T12:09:38Z
dc.date.issued: 2024-07-15
dc.date.updated: 2024-08-01T07:51:35Z
dc.description.abstract: Natural language prompts serve as an essential interface between users and Large Language Models (LLMs) like GPT-3.5 and GPT-4, which are employed by ChatGPT to produce outputs across various tasks. However, prompts crafted with malicious intent, known as jailbreak prompts, can circumvent the restrictions of LLMs, posing a significant threat to systems integrated with these models. Despite their critical importance, there is a lack of systematic analysis and comprehensive understanding of jailbreak prompts. Our paper aims to address this gap by exploring key research questions to enhance the robustness of LLM systems: 1) What common patterns are present in jailbreak prompts? 2) How effectively can these prompts bypass the restrictions of LLMs? 3) With the evolution of LLMs, how does the effectiveness of jailbreak prompts change? To address our research questions, we embarked on an empirical study targeting the LLMs underpinning ChatGPT, one of today's most advanced chatbots. Our methodology involved categorizing 78 jailbreak prompts into 10 distinct patterns, further organized into three jailbreak strategy types, and examining their distribution. We assessed the effectiveness of these prompts on GPT-3.5 and GPT-4, using a set of 3,120 questions across 8 scenarios deemed prohibited by OpenAI. Additionally, our study tracked the performance of these prompts over a 3-month period, observing the evolutionary response of ChatGPT to such inputs. Our findings offer a comprehensive view of jailbreak prompts, elucidating their taxonomy, effectiveness, and temporal dynamics. Notably, we discovered that GPT-3.5 and GPT-4 could still generate inappropriate content in response to malicious prompts without the need for jailbreaking. This underscores the critical need for effective prompt management within LLM systems and provides valuable insights and data to spur further research in LLM testing and jailbreak prevention.
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1145/3663530.3665021
dc.identifier.uri: https://hdl.handle.net/10919/120869
dc.language.iso: en
dc.publisher: ACM
dc.rights: In Copyright
dc.rights.holder: The author(s)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.title: A Hitchhiker's Guide to Jailbreaking ChatGPT via Prompt Engineering
dc.type: Article - Refereed
dc.type.dcmitype: Text
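
Note: The abstract above describes measuring how often jailbreak prompts let GPT-3.5 and GPT-4 answer questions from scenarios prohibited by OpenAI. The sketch below is a rough illustration of that kind of measurement, not the authors' actual evaluation harness: it pairs a jailbreak prompt with a prohibited question and applies a naive refusal heuristic. The SDK interface (openai Python package, v1.x), model names, and refusal markers are all assumptions for illustration.

# Hypothetical sketch (not from the paper): probe whether a jailbreak prompt
# lets a model answer a question from a prohibited scenario.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Crude, assumed heuristic: a reply counts as a bypass if it contains no refusal phrase.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't assist")

def is_bypassed(reply: str) -> bool:
    lowered = reply.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def probe(model: str, jailbreak_prompt: str, prohibited_question: str) -> bool:
    """Send the jailbreak prompt followed by the prohibited question;
    return True if the restriction appears to be bypassed."""
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-3.5-turbo" or "gpt-4"
        messages=[
            {"role": "user", "content": jailbreak_prompt},
            {"role": "user", "content": prohibited_question},
        ],
    )
    return is_bypassed(response.choices[0].message.content or "")

# Usage idea: iterate over a grid of jailbreak prompts and prohibited questions,
# then tally bypass rates per model and per prompt pattern.
# results = [probe("gpt-4", p, q) for p in jailbreak_prompts for q in questions]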

Files

Original bundle
Name: 3663530.3665021.pdf
Size: 1.38 MB
Format: Adobe Portable Document Format
Description: Published version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission