Adversarial Risks and Stereotype Mitigation at Scale in Generative Models

dc.contributor.author: Jha, Akshita
dc.contributor.committeechair: Reddy, Chandan K.
dc.contributor.committeemember: Prabhakaran, Vinodkumar
dc.contributor.committeemember: Blodgett, Su Lin
dc.contributor.committeemember: Wang, Xuan
dc.contributor.committeemember: Huang, Lifu
dc.contributor.department: Computer Science & Applications
dc.date.accessioned: 2025-03-08T09:00:11Z
dc.date.available: 2025-03-08T09:00:11Z
dc.date.issued: 2025-03-07
dc.description.abstract: Generative models have rapidly evolved to produce coherent text, realistic images, and functional code. Yet these remarkable capabilities also expose critical vulnerabilities, ranging from subtle adversarial attacks to harmful stereotypes, that pose both technical and societal challenges. This research investigates these challenges across three modalities (code, text, and vision) before focusing on strategies to mitigate biases specifically in generative language models. First, we reveal how programming language (PL) models rely on a 'natural channel' of code, such as human-readable tokens and structure, that adversaries can exploit with minimal perturbations. These attacks expose the fragility of state-of-the-art PL models, highlighting how superficial patterns and hidden assumptions in training data can lead to unanticipated vulnerabilities. Extending this analysis to textual and visual domains, we show how over-reliance on patterns seen in training data manifests as ingrained biases and harmful stereotypes. To enable more inclusive and globally representative model evaluations, we introduce SeeGULL, a large-scale benchmark of thousands of stereotypes spanning diverse cultures and identity groups worldwide. We also develop ViSAGe, a benchmark for identifying visual stereotypes at scale in text-to-image (T2I) models, illustrating the persistence of stereotypes in generated images even when prompted otherwise. Building on these findings, we propose two complementary approaches to mitigate stereotypical outputs in language models. The first is an explicit method that uses fairness constraints for model pruning, ensuring essential bias-mitigating features remain intact. The second is an implicit bias mitigation framework that makes a crucial distinction between comprehension failures and inherently learned stereotypes. This approach uses instruction tuning on general-purpose datasets and mitigates stereotypes implicitly without relying on targeted debiasing techniques. Extensive evaluations on state-of-the-art models demonstrate that our methods substantially reduce harmful stereotypes across multiple identity dimensions while preserving downstream performance.
dc.description.abstractgeneral: AI systems, especially generative models that create text, images, and code, have advanced rapidly. They can write essays, generate realistic pictures, and assist with programming. However, these impressive capabilities also come with vulnerabilities that pose both technical and societal challenges. Some of these models can be subtly manipulated into making errors, while others unknowingly reinforce harmful stereotypes present in their training data. This research examines these challenges across three types of generative models: those that generate code, text, and images. First, we investigate how code-generation models rely on human-readable patterns that attackers can subtly manipulate, revealing hidden weaknesses in even the most advanced models. Extending this analysis to text and image generation, we show how these models often over-rely on patterns from their training data, leading to harmful stereotypes. To systematically study these issues, we introduce two large-scale benchmarks: SeeGULL, a dataset that identifies stereotypes across cultures and identity groups in AI-generated text, and ViSAGe, a dataset that uncovers hidden biases in AI-generated images. Building on these insights, we propose two complementary solutions for reducing biases in generative language models. The first method explicitly removes biased patterns from compressed AI models by introducing filtering techniques that ensure fairness while keeping the model's accuracy intact. The second takes an implicit approach by improving how generative models interpret instructions, making them less likely to generate biased responses in under-informative scenarios. By improving models' general-purpose understanding, this method helps reduce biases without relying on direct debiasing techniques. Our evaluations show that these strategies significantly reduce harmful stereotypes across multiple identity dimensions, making AI systems fairer and more reliable while ensuring they remain effective in real-world applications.
dc.description.degree: Doctor of Philosophy
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:42534
dc.identifier.uri: https://hdl.handle.net/10919/124828
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: generative models
dc.subject: adversarial attacks
dc.subject: bias mitigation
dc.subject: stereotype evaluation
dc.subject: llm-human collaboration
dc.subject: visual stereotypes
dc.subject: instruction-tuning
dc.subject: cross-cultural bias
dc.title: Adversarial Risks and Stereotype Mitigation at Scale in Generative Models
dc.type: Dissertation
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle
Name: Jha_A_D_2025.pdf
Size: 22.59 MB
Format: Adobe Portable Document Format