Author: Allada, Ritika
Dates: 2025-06-11; 2025-06-11; 2025-06-10
Identifier: vt_gsexam:44224
URI: https://hdl.handle.net/10919/135475
Title: A Submodular Approach to Find Interpretable Directions in Text-to-Image Models
Type: Thesis (ETD)
Language: en
Rights: Creative Commons Attribution 4.0 International
Keywords: Diffusion Models; Interpretability; Image Editing; Explainable AI; Recommendation Systems

Abstract: Text-to-image models have significantly improved the field of image editing. However, finding attributes that a model can actually edit remains a challenge. This thesis addresses that problem by leveraging a multimodal vision-language model (MMVLM) to propose a list of candidate attributes for editing an image, generating edits from those keywords with Flux and ControlNet, and then applying a submodular ranking method to identify which edits actually work. The experiments in this thesis demonstrate the robustness of the approach and its ability to produce high-quality edits across various domains, such as dresses and living rooms.
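As a rough illustration of the ranking step only, the sketch below shows a generic greedy algorithm for maximizing a monotone submodular coverage objective (facility location); it is not the thesis's actual formulation, and the similarity matrix and edit-candidate names are hypothetical placeholders.

```python
# Minimal, generic greedy selector for a monotone submodular objective.
# All data here (similarity matrix, candidate edits) is a hypothetical stand-in,
# not the thesis's implementation.
import numpy as np

def greedy_submodular_rank(similarity: np.ndarray, k: int) -> list[int]:
    """Rank candidate edits by greedy facility-location coverage.

    similarity[i, j] is an assumed score for how well candidate edit j
    covers reference i. The objective f(S) = sum_i max_{j in S} similarity[i, j]
    is monotone submodular, so greedy selection achieves a (1 - 1/e) guarantee.
    """
    n_refs, n_cands = similarity.shape
    selected: list[int] = []
    covered = np.zeros(n_refs)  # best coverage achieved so far per reference
    for _ in range(min(k, n_cands)):
        # Marginal gain of adding each candidate to the current selection
        gains = np.maximum(similarity, covered[:, None]).sum(axis=0) - covered.sum()
        gains[selected] = -np.inf  # never re-select an already chosen edit
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # no remaining candidate improves coverage
        selected.append(best)
        covered = np.maximum(covered, similarity[:, best])
    return selected

if __name__ == "__main__":
    # Example usage with a random placeholder similarity matrix:
    # 20 reference images x 8 candidate edits, pick the top 3.
    rng = np.random.default_rng(0)
    sim = rng.random((20, 8))
    print(greedy_submodular_rank(sim, k=3))
```

The greedy loop stops early once no candidate adds positive marginal coverage, which is how a submodular ranking can also decide that some proposed edits do not "actually work" under the chosen scoring function.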