A Submodular Approach to Find Interpretable Directions in Text-to-Image Models
Date
2025-06-10
Publisher
Virginia Tech
Abstract
Text-to-image models have significantly advanced the field of image editing. However, identifying which attributes a model can actually edit remains a challenge. This thesis addresses that problem with a three-stage pipeline: a multimodal vision-language model (MMVLM) proposes a list of candidate attributes for editing an image, Flux and ControlNet generate edits from those keywords, and a submodular ranking method determines which edits actually succeed. Experiments demonstrate the robustness of this approach and its ability to produce high-quality edits across domains such as dresses and living rooms.
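The submodular ranking step is the algorithmic core of the pipeline described above. The sketch below is a minimal, hypothetical illustration of greedy submodular selection, assuming a monotone submodular objective; the function names, attribute keywords, and toy coverage objective are illustrative assumptions, not the thesis's actual method.

```python
from typing import Callable

def greedy_submodular_rank(
    candidates: list[str],
    gain: Callable[[str, set[str]], float],
    k: int,
) -> list[str]:
    """Greedily select k candidates by marginal gain.

    For a monotone submodular objective f, the marginal gain of adding
    candidate c to the selected set S is gain(c, S) = f(S ∪ {c}) - f(S).
    The classic greedy algorithm achieves a (1 - 1/e) approximation.
    """
    selected: list[str] = []
    chosen: set[str] = set()
    for _ in range(min(k, len(candidates))):
        # Pick the remaining candidate with the largest marginal gain.
        best = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: gain(c, chosen),
        )
        chosen.add(best)
        selected.append(best)
    return selected

# Toy usage: each candidate attribute edit "covers" a set of visual
# concepts; coverage size is monotone submodular. All names are made up.
coverage = {
    "sleeve length": {"sleeves", "silhouette"},
    "neckline": {"neckline", "silhouette"},
    "fabric color": {"color"},
}

def coverage_gain(c: str, s: set[str]) -> float:
    covered = set().union(*(coverage[x] for x in s))
    # Marginal gain: number of concepts newly covered by adding c.
    return len(coverage[c] - covered)

print(greedy_submodular_rank(list(coverage), coverage_gain, k=2))
# -> ['sleeve length', 'neckline']
```

In this toy form, the greedy loop favors edits that add diverse, previously uncovered attributes rather than near-duplicates, which is the usual motivation for a submodular (diminishing-returns) ranking objective.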
Keywords
Diffusion Models, Interpretability, Image Editing, Explainable AI, Recommendation Systems