Adversarial Risks and Stereotype Mitigation at Scale in Generative Models

dc.contributor.author: Jha, Akshita
dc.contributor.committeechair: Reddy, Chandan K.
dc.contributor.committeemember: Prabhakaran, Vinodkumar
dc.contributor.committeemember: Blodgett, Su Lin
dc.contributor.committeemember: Wang, Xuan
dc.contributor.committeemember: Huang, Lifu
dc.contributor.department: Computer Science & Applications
dc.date.accessioned: 2025-03-08T09:00:11Z
dc.date.available: 2025-03-08T09:00:11Z
dc.date.issued: 2025-03-07
dc.description.abstract: Generative models have rapidly evolved to produce coherent text, realistic images, and functional code. Yet these remarkable capabilities also expose critical vulnerabilities, ranging from subtle adversarial attacks to harmful stereotypes, that pose both technical and societal challenges. This research investigates these challenges across three modalities (code, text, and vision) before focusing on strategies to mitigate biases specifically in generative language models. First, we reveal how programming language (PL) models rely on a 'natural channel' of code, such as human-readable tokens and structure, that adversaries can exploit with minimal perturbations. These attacks expose the fragility of state-of-the-art PL models, highlighting how superficial patterns and hidden assumptions in training data can lead to unanticipated vulnerabilities. Extending this analysis to textual and visual domains, we show how over-reliance on patterns seen in training data manifests as ingrained biases and harmful stereotypes. To enable more inclusive and globally representative model evaluations, we introduce SeeGULL, a large-scale benchmark of thousands of stereotypes spanning diverse cultures and identity groups worldwide. We also develop ViSAGe, a benchmark for identifying visual stereotypes at scale in text-to-image (T2I) models, illustrating the persistence of stereotypes in generated images even when prompted otherwise. Building on these findings, we propose two complementary approaches to mitigate stereotypical outputs in language models. The first is an explicit method that uses fairness constraints for model pruning, ensuring essential bias-mitigating features remain intact. The second is an implicit bias mitigation framework that makes a crucial distinction between comprehension failures and inherently learned stereotypes. This approach uses instruction tuning on general-purpose datasets and mitigates stereotypes implicitly without relying on targeted debiasing techniques. Extensive evaluations on state-of-the-art models demonstrate that our methods substantially reduce harmful stereotypes across multiple identity dimensions while preserving downstream performance.
dc.description.abstractgeneral: AI systems, especially generative models that create text, images, and code, have advanced rapidly. They can write essays, generate realistic pictures, and assist with programming. However, these impressive capabilities also come with vulnerabilities that pose both technical and societal challenges. Some of these models can be subtly manipulated into making errors, while others unknowingly reinforce harmful stereotypes present in their training data. This research examines these challenges across three types of generative models: those that generate code, text, and images. First, we investigate how code-generation models rely on human-readable patterns that attackers can subtly manipulate, revealing hidden weaknesses in even the most advanced models. Extending this analysis to text and image generation, we show how these models often over-rely on patterns from their training data, leading to harmful stereotypes. To systematically study these issues, we introduce two large-scale benchmarks: SeeGULL, a dataset that identifies stereotypes across cultures and identity groups in AI-generated text, and ViSAGe, a dataset that uncovers hidden biases in AI-generated images. Building on these insights, we propose two complementary solutions for reducing biases in generative language models. The first method explicitly removes biased patterns from compressed AI models by introducing filtering techniques that ensure fairness while keeping the model's accuracy intact. The second takes an implicit approach by improving how generative models interpret instructions, making them less likely to generate biased responses in under-informative scenarios. By improving models' general-purpose understanding, this method helps reduce biases without relying on direct debiasing techniques. Our evaluations show that these strategies significantly reduce harmful stereotypes across multiple identity dimensions, making AI systems fairer and more reliable while ensuring they remain effective in real-world applications.
dc.description.degree: Doctor of Philosophy
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:42534
dc.identifier.uri: https://hdl.handle.net/10919/124828
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: generative models
dc.subject: adversarial attacks
dc.subject: bias mitigation
dc.subject: stereotype evaluation
dc.subject: llm-human collaboration
dc.subject: visual stereotypes
dc.subject: instruction-tuning
dc.subject: cross-cultural bias
dc.title: Adversarial Risks and Stereotype Mitigation at Scale in Generative Models
dc.type: Dissertation
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle
Name: Jha_A_D_2025.pdf
Size: 22.59 MB
Format: Adobe Portable Document Format