VTechWorks Repository :: Browsing by Author "Wang, Kailong"

Browsing by Author "Wang, Kailong"

Now showing 1 - 2 of 2

Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection
Li, Yuxi; Liu, Yi; Deng, Gelei; Zhang, Ying; Song, Wenjia; Shi, Ling; Wang, Kailong; Li, Yuekang; Liu, Yang; Wang, Haoyu (ACM, 2024-07-12)
With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of “glitch tokens”, which are anomalous tokens produced by established tokenizers and could potentially compromise the models’ quality of response. Specifically, we experiment on seven top popular LLMs utilizing three distinct tokenizers and involving a totally of 182,517 tokens. We present categorizations of the identified glitch tokens and symptoms exhibited by LLMs when interacting with glitch tokens. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique, for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection further provides valuable insights into mitigating tokenization-related errors in LLMs.
A Hitchhiker's Guide to Jailbreaking ChatGPT via Prompt Engineering
Liu, Yi; Deng, Gelei; Xu, Zhengzi; Li, Yuekang; Zheng, Yaowen; Zhang, Ying; Zhao, Lida; Zhang, Tianwei; Wang, Kailong (ACM, 2024-07-15)
Natural language prompts serve as an essential interface between users and Large Language Models (LLMs) like GPT-3.5 and GPT-4, which are employed by ChatGPT to produce outputs across various tasks. However, prompts crafted with malicious intent, known as jailbreak prompts, can circumvent the restrictions of LLMs, posing a significant threat to systems integrated with these models. Despite their critical importance, there is a lack of systematic analysis and comprehensive understanding of jailbreak prompts. Our paper aims to address this gap by exploring key research questions to enhance the robustness of LLM systems: 1) What common patterns are present in jailbreak prompts? 2) How effectively can these prompts bypass the restrictions of LLMs? 3) With the evolution of LLMs, how does the effectiveness of jailbreak prompts change? To address our research questions, we embarked on an empirical study targeting the LLMs underpinning ChatGPT, one of today’s most advanced chatbots. Our methodology involved categorizing 78 jailbreak prompts into 10 distinct patterns, further organized into three jailbreak strategy types, and examining their distribution.We assessed the effectiveness of these prompts on GPT-3.5 and GPT-4, using a set of 3,120 questions across 8 scenarios deemed prohibited by OpenAI. Additionally, our study tracked the performance of these prompts over a 3-month period, observing the evolutionary response of ChatGPT to such inputs. Our findings offer a comprehensive view of jailbreak prompts, elucidating their taxonomy, effectiveness, and temporal dynamics. Notably, we discovered that GPT-3.5 and GPT-4 could still generate inappropriate content in response to malicious prompts without the need for jailbreaking. This underscores the critical need for effective prompt management within LLM systems and provides valuable insights and data to spur further research in LLM testing and jailbreak prevention.

Browsing by Author "Wang, Kailong"

Results Per Page

Sort Options