Adversarial Risks and Stereotype Mitigation at Scale in Generative Models
Abstract
Generative models have rapidly evolved to produce coherent text, realistic images, and functional code. Yet these remarkable capabilities also expose critical vulnerabilities, ranging from subtle adversarial attacks to harmful stereotypes, that pose both technical and societal challenges. This research investigates these challenges across three modalities (code, text, and vision) before focusing specifically on strategies to mitigate biases in generative language models. First, we reveal how programming language (PL) models rely on a "natural channel" of code, such as human-readable tokens and structure, that adversaries can exploit with minimal perturbations. These attacks expose the fragility of state-of-the-art PL models, highlighting how superficial patterns and hidden assumptions in training data can lead to unanticipated vulnerabilities. Extending this analysis to the textual and visual domains, we show how over-reliance on patterns seen in training data manifests as ingrained biases and harmful stereotypes. To enable more inclusive and globally representative model evaluations, we introduce SeeGULL, a large-scale benchmark of thousands of stereotypes spanning diverse cultures and identity groups worldwide. We also develop ViSAGe, a benchmark for identifying visual stereotypes at scale in text-to-image (T2I) models, and show that stereotypes persist in generated images even when the models are prompted otherwise. Building on these findings, we propose two complementary approaches to mitigating stereotypical outputs in language models. The first is an explicit method that imposes fairness constraints during model pruning, ensuring that essential bias-mitigating features remain intact. The second is an implicit bias-mitigation framework that draws a crucial distinction between comprehension failures and inherently learned stereotypes; it uses instruction tuning on general-purpose datasets to mitigate stereotypes implicitly, without relying on targeted debiasing techniques. Extensive evaluations on state-of-the-art models demonstrate that our methods substantially reduce harmful stereotypes across multiple identity dimensions while preserving downstream performance.