A Submodular Approach to Find Interpretable Directions in Text-to-Image Models

Date

2025-06-10

Publisher

Virginia Tech

Abstract

Text-to-image models have significantly advanced the field of image editing. However, identifying which attributes a model can actually edit remains a challenge. This thesis addresses the problem with a three-stage pipeline: a multimodal vision-language model (MMVLM) proposes a list of candidate attributes for editing an image, Flux and ControlNet generate edits from those keywords, and a submodular ranking method identifies the edits that actually succeed. The experiments in this thesis demonstrate the robustness of this approach and its ability to produce high-quality edits across domains such as dresses and living rooms.
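
To make the ranking step concrete, below is a minimal sketch of greedy submodular selection over candidate edits. The thesis's exact objective is not reproduced here; a facility-location function over pairwise edit similarities is assumed as a stand-in monotone submodular objective, and all names (facility_location_gain, select_edits, the similarity matrix) are hypothetical.

```python
# Minimal sketch: greedy submodular ranking of candidate edits.
# Assumption: a facility-location objective f(S) = sum_i max_{j in S} sim(i, j),
# which is monotone submodular; the thesis may use a different objective.
import numpy as np


def facility_location_gain(similarity: np.ndarray, selected: list[int],
                           candidate: int) -> float:
    """Marginal gain of adding `candidate` to `selected` under
    f(S) = sum_i max_{j in S} sim(i, j)."""
    if not selected:
        return float(similarity[:, candidate].sum())
    current_best = similarity[:, selected].max(axis=1)
    with_candidate = np.maximum(current_best, similarity[:, candidate])
    return float((with_candidate - current_best).sum())


def select_edits(similarity: np.ndarray, k: int) -> list[int]:
    """Greedily pick k edits; for monotone submodular objectives the
    greedy solution is within (1 - 1/e) of optimal."""
    selected: list[int] = []
    candidates = set(range(similarity.shape[1]))
    for _ in range(k):
        best = max(candidates,
                   key=lambda c: facility_location_gain(similarity, selected, c))
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy usage: similarity[i, j] = how well candidate edit j works on image i
# (hypothetical data; in practice this could come from MMVLM or CLIP scores).
rng = np.random.default_rng(0)
sim = rng.random((8, 5))  # 8 images x 5 candidate edits
print(select_edits(sim, k=3))
```

Greedy selection is the standard choice here because exact maximization of a submodular set function is NP-hard, while the greedy algorithm is cheap and carries a provable approximation guarantee.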

Keywords

Diffusion Models, Interpretability, Image Editing, Explainable AI, Recommendation Systems
