Entity-Aware Optimal Transport and Residual Attention for Multimodal Content Moderation
Abstract
The increasing prevalence of memes on social media platforms has amplified both the positive and negative impact of these highly shareable, multimodal artifacts. While memes can be humorous and engaging, they can also serve as vehicles for hateful or harmful content that targets specific social, ethnic, or political groups. In this paper, we propose ImTOTMeme, a novel framework for harmful meme detection that combines an optimal transport–based alignment mechanism with global residual interactions to capture both local and contextual cues. We use CLIP embeddings as the initial image and text representations and employ Sinkhorn iterations to learn a minimal-cost matching between fine-grained visual tokens and OCR-extracted text tokens. We further incorporate facial embeddings and entity information, allowing for more nuanced analysis of memes involving human subjects or contextual references. In experiments on four publicly available datasets (Harm-C, Harm-P, FHM, and MultiOFF), we demonstrate that ImTOTMeme achieves competitive accuracy in both binary and multi-class settings. We also conduct an ablation study to verify the contribution of each component of the framework, and use LIME-based visualizations to interpret the model’s classification decisions. Our findings indicate that balancing local token-level alignment with broader contextual modeling can effectively detect harmful memes across diverse topical domains, paving the way for more robust and transparent content moderation on social media.
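The token-level alignment step mentioned in the abstract can be illustrated with a minimal sketch of entropy-regularized optimal transport solved by Sinkhorn iterations. This is not the ImTOTMeme implementation: the cosine cost, the uniform marginals, the regularization strength `epsilon`, the iteration count, and the random arrays standing in for CLIP visual tokens and OCR text tokens are all assumptions made for illustration.

```python
import numpy as np

def cosine_cost(visual, text):
    """Cost matrix: 1 - cosine similarity for every (visual, text) token pair."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return 1.0 - v @ t.T

def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # mass on each visual token
    b = np.full(m, 1.0 / m)          # mass on each text token
    K = np.exp(-cost / epsilon)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]

# Hypothetical stand-ins for token embeddings (6 visual tokens, 4 text tokens).
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(6, 16))
text_tokens = rng.normal(size=(4, 16))

cost = cosine_cost(visual_tokens, text_tokens)
plan = sinkhorn(cost)                      # soft matching, shape (6, 4)
ot_distance = float((plan * cost).sum())   # total transport cost of the matching
```

Each entry `plan[i, j]` is the share of mass moved from visual token `i` to text token `j`, so large entries mark token pairs that the matching treats as aligned. A smaller `epsilon` gives a sharper, closer-to-minimal-cost matching but needs more iterations to converge.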