Adversarial Risks and Stereotype Mitigation at Scale in Generative Models
dc.contributor.author | Jha, Akshita | en |
dc.contributor.committeechair | Reddy, Chandan K. | en |
dc.contributor.committeemember | Prabhakaran, Vinodkumar | en |
dc.contributor.committeemember | Blodgett, Su Lin | en |
dc.contributor.committeemember | Wang, Xuan | en |
dc.contributor.committeemember | Huang, Lifu | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-03-08T09:00:11Z | en |
dc.date.available | 2025-03-08T09:00:11Z | en |
dc.date.issued | 2025-03-07 | en |
dc.description.abstract | Generative models have rapidly evolved to produce coherent text, realistic images, and functional code. Yet these remarkable capabilities also expose critical vulnerabilities -- ranging from subtle adversarial attacks to harmful stereotypes -- that pose both technical and societal challenges. This research investigates these challenges across three modalities (code, text, and vision) before focusing on strategies to mitigate biases specifically in generative language models. First, we reveal how programming language (PL) models rely on a 'natural channel' of code, such as human-readable tokens and structure, that adversaries can exploit with minimal perturbations. These attacks expose the fragility of state-of-the-art PL models, highlighting how superficial patterns and hidden assumptions in training data can lead to unanticipated vulnerabilities. Extending this analysis to textual and visual domains, we show how over-reliance on patterns seen in training data manifests as ingrained biases and harmful stereotypes. To enable more inclusive and globally representative model evaluations, we introduce SeeGULL, a large-scale benchmark of thousands of stereotypes spanning diverse cultures and identity groups worldwide. We also develop ViSAGe, a benchmark for identifying visual stereotypes at scale in text-to-image (T2I) models, illustrating the persistence of stereotypes in generated images even when prompted otherwise. Building on these findings, we propose two complementary approaches to mitigate stereotypical outputs in language models. The first is an explicit method that uses fairness constraints for model pruning, ensuring essential bias-mitigating features remain intact. The second is an implicit bias mitigation framework that makes a crucial distinction between comprehension failures and inherently learned stereotypes. This approach uses instruction tuning on general-purpose datasets and mitigates stereotypes implicitly without relying on targeted debiasing techniques. Extensive evaluations on state-of-the-art models demonstrate that our methods substantially reduce harmful stereotypes across multiple identity dimensions, while preserving downstream performance. | en |
dc.description.abstractgeneral | AI systems, especially generative models that create text, images, and code, have advanced rapidly. They can write essays, generate realistic pictures, and assist with programming. However, these impressive capabilities also come with vulnerabilities that pose both technical and societal challenges. Some of these models can be subtly manipulated into making errors, while others unknowingly reinforce harmful stereotypes present in their training data. This research examines these challenges across three types of generative models: those that generate code, text, and images. First, we investigate how models that generate code rely on human-readable patterns that attackers can subtly manipulate, revealing hidden weaknesses in even the most advanced models. Extending this analysis to text and image generation, we show how these models often over-rely on patterns from their training data, leading to harmful stereotypes. To systematically study these issues, we introduce two large-scale benchmarks: SeeGULL, a dataset that identifies stereotypes across cultures and identity groups in AI-generated text, and ViSAGe, a dataset that uncovers hidden biases in AI-generated images. Building on these insights, we propose two complementary solutions to reduce biases in generative language models. The first method explicitly removes biased patterns from compressed AI models by introducing filtering techniques that ensure fairness while keeping the model's accuracy intact. The second takes an implicit approach by improving how generative models interpret instructions, making them less likely to generate biased responses in under-informative scenarios. By improving models' general-purpose understanding, this method helps reduce biases without relying on direct debiasing techniques. Our evaluations show that these strategies significantly reduce harmful stereotypes across multiple identity dimensions, making AI systems fairer and more reliable while ensuring they remain effective in real-world applications. | en |
dc.description.degree | Doctor of Philosophy | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:42534 | en |
dc.identifier.uri | https://hdl.handle.net/10919/124828 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | generative models | en |
dc.subject | adversarial attacks | en |
dc.subject | bias mitigation | en |
dc.subject | stereotype evaluation | en |
dc.subject | llm-human collaboration | en |
dc.subject | visual stereotypes | en |
dc.subject | instruction-tuning | en |
dc.subject | cross-cultural bias | en |
dc.title | Adversarial Risks and Stereotype Mitigation at Scale in Generative Models | en |
dc.type | Dissertation | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | doctoral | en |
thesis.degree.name | Doctor of Philosophy | en |