DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images

Dalal, Vashishtha, Rani et al.

The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced Stable Diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images and generating detailed hate attention maps, which are then used to blur those regions and so remove the hateful sections of the image. We release this dataset as part of the DeHate shared task, and this paper also describes the details of that task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.

Basic Information

Paper ID: 2509.21787
Title: DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Authors: Dwip Dalal, Gautam Vashishtha, Anku Rani, Aishwarya Reganti, Parth Patwa, Mohd Sarique, Chandan Gupta, Keshav Nath, Viswanatha Reddy, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Classification: cs.CV cs.CL
Conference: Defactify 3: Third Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2024
Paper Link: https://arxiv.org/abs/2509.21787

Abstract

The proliferation of harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. To address this, the paper introduces a specialized multimodal dataset designed for identifying hate speech in digital content. The core methodology innovatively applies watermarked, stability-enhanced Stable Diffusion technology combined with a Digital Attention Analysis Module (DAAM). This combination enables precise localization of hateful elements within images, generating detailed hate attention maps for blurring these regions and removing hateful components from images. The authors release this dataset as part of the DeHate shared task and propose DeHater, a vision-language model specifically designed for multimodal de-hatification tasks.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is the detection and mitigation of hate speech in multimodal settings, particularly in image-text combinations. As AI applications develop rapidly, hateful content in the training data of large language models (LLMs) not only compromises model utility but also raises serious ethical concerns.

Significance

Digital Environment Health: The surge in online hate content severely impacts the quality of public discourse
AI Ethics: Hateful content in training data directly affects the trustworthiness and ethical integrity of AI systems
Social Responsibility: There is a need to develop responsible AI systems to address hate speech on social media

Limitations of Existing Approaches

Lack of high-quality multimodal hate speech detection datasets
Existing methods primarily focus on single modalities (text or image), lacking effective multimodal fusion
Absence of targeted techniques for hate content localization and removal

Research Motivation

Based on the need for high-quality datasets and the technical challenges of multimodal hate speech detection, this work aims to construct an innovative dataset and methodological framework to advance responsible AI development.

Core Contributions

Innovative Dataset Construction Method: Proposes a multimodal hate speech dataset generation method based on Stable Diffusion and DAAM
Multimodal De-hatification Model: Designs the DeHater model capable of unsupervised masking of hateful image content guided by text prompts
Shared Task Organization: Releases the DeHate dataset containing 2,411 instances and organizes related shared tasks
Technical Method Innovation: Integrates a CLIP encoder, U-Net-style skip connections, and FiLM modulation in a single architecture

Methodology Details

Task Definition

The paper defines the task as multimodal image de-hatification: given an image containing hateful content and corresponding text prompts, the model must identify and mask hateful regions in the image, generating a de-hatified image version.

Dataset Construction Method

Base Data Sources

Hatenorm Dataset: Employs manually annotated parallel corpora of hateful text and their normalized versions
Stable Diffusion Generation: Utilizes the stable-diffusion-2-base model to convert hateful text into visual representations

Core Technical Pipeline

Image Generation: Extracts keywords from hateful text to construct prompts, using Stable Diffusion to generate corresponding images
Attention Map Generation: Applies DAAM technology to generate heatmaps highlighting the relevance of specific pixels to prompt components
Selective Blurring:
- Computes global heatmap values and establishes thresholds to generate binary masks
- Sets pixels with high heatmap values to black (0,0,0)
- Computes local neighborhood average colors for marked pixels and applies them
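
The three selective-blurring steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `selective_blur`, the `threshold` and `radius` values, and the choice to average over the blacked-out image (rather than the original) are all hypothetical, since the summary does not specify them.

```python
import numpy as np

def selective_blur(image, heatmap, threshold=0.5, radius=2):
    """Blur pixels whose heatmap value exceeds `threshold`.

    image:   (H, W, 3) uint8 array
    heatmap: (H, W) float array, larger = more relevant to the prompt
    `threshold` and `radius` are illustrative values, not from the paper.
    """
    # Step 1: threshold the heatmap to get a binary mask.
    mask = heatmap > threshold

    # Step 2: set the marked pixels to black (0, 0, 0).
    blacked = image.copy()
    blacked[mask] = 0

    # Step 3: replace each marked pixel with the mean colour of its
    # local neighbourhood (assumed here to be read from the blacked image).
    out = blacked.copy()
    h, w = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        patch = blacked[y0:y1, x0:x1].reshape(-1, 3)
        out[y, x] = patch.mean(axis=0).astype(np.uint8)
    return out, mask
```

The returned `mask` is the binary map of blurred pixels, which is the kind of ground truth an overlap metric can be computed against.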

DeHater Model Architecture

Overall Design Philosophy

DeHater employs an unsupervised image masking approach, using text prompts to guide the identification and occlusion of harmful regions in images.

Core Components

CLIP Encoder:
- Utilizes a frozen CLIP model as the encoder
- Leverages its advantages from pretraining on diverse image-text pairs
- Extracts rich multimodal feature representations
U-Net-Inspired Connections:
- Adopts skip connection design from U-Net architecture
- Transmits local information from CLIP encoder to decoder
- Preserves critical details while maintaining decoder compactness
Feature Integration Mechanism:
- Integrates encoder activations (including CLS token) into each transformer block of the decoder
- Enriches decoder understanding of context
FiLM Modulation:
- Employs Feature-wise Linear Modulation technique
- Modulates decoder input activations through conditional vectors
- Enhances decoder focus and accurate segmentation of hateful content
Learnable Projection Network:
- Combines multiple hate fragment embeddings into a single projection
- Achieves nuanced and effective compression of diverse hateful elements
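
To make the FiLM component concrete, the sketch below shows the general Feature-wise Linear Modulation operation: a conditioning vector is projected to a per-feature scale and shift, which are then applied to the activations. The function name, argument names, and shapes are assumptions for illustration; the summary does not give DeHater's actual layer sizes or how the conditioning vector is produced.

```python
import numpy as np

def film(activations, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out = gamma(cond) * activations + beta(cond).

    activations: (tokens, d) decoder input activations
    cond:        (c,) conditioning vector (e.g. a text-prompt embedding)
    w_gamma, w_beta: (c, d) projection matrices (learned in a real model)
    b_gamma, b_beta: (d,) biases
    """
    gamma = cond @ w_gamma + b_gamma  # per-feature scale, shape (d,)
    beta = cond @ w_beta + b_beta     # per-feature shift, shape (d,)
    # Broadcasting applies the same scale and shift to every token.
    return gamma * activations + beta
```

With zero projection weights, unit `b_gamma`, and zero `b_beta`, the operation reduces to the identity, which is a convenient sanity check.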

Output Mechanism

The model outputs a binarized image, clearly identifying regions deemed hateful in the original content and masking them.

Technical Innovations

Multimodal Fusion: First combination of Stable Diffusion with DAAM for hate speech detection
Attention Mechanism: Innovative use of cross-attention maps for hate content localization
Architecture Design: Combined CLIP+U-Net+FiLM architectural design
Unsupervised Learning: Achieves unsupervised image masking based on text prompts

Experimental Setup

Dataset

DeHate Dataset: Total of 2,411 instances
- Training set: 1,687 instances
- Test set: 724 instances
Data Composition: Each instance includes the original generated image and the image with blurred hateful components

Evaluation Metrics

Uses Intersection over Union (IoU) as the primary evaluation metric, calculating the overlap between predicted blurred components and ground truth blurred components.
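
For reference, IoU between two binary masks is the number of pixels marked in both divided by the number marked in either. The helper below is a minimal sketch; the function name `iou` and the convention of returning 1.0 when both masks are empty are assumptions, as the summary does not describe the shared task's scoring script.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return np.logical_and(pred, true).sum() / union
```

For example, a prediction covering two pixels and a ground truth covering two pixels that share exactly one pixel give an IoU of 1/3.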

Shared Task Setup

Participating Teams: 20+ registrations, 5 valid submissions
Evaluation Method: Ranking based on IoU scores on the test set

Experimental Results

Main Results

Rank	Team Name	IoU Score
1	UniteToModerate	0.55
2	PaulJane	0.51
3	Baseline (This Work)	0.49
4	Markans	0.48
5	Sanskarfc	0.47
6	rachitmodi	0.44

Results Analysis

Baseline Performance: The proposed baseline method achieves an IoU score of 0.49
Task Difficulty: The best performance of only 0.55 indicates this task presents considerable challenges
Performance Gap: Scores span a narrow range (0.44 to 0.55) and all remain far below a perfect overlap of 1.0, leaving substantial room for improvement

Winning Method Analysis

The UniteToModerate team employed a combination of NExT-Chat and UniFusion models:

NExT-Chat: Provides initial mask generation through the pix2emb method
UniFusion: Enhances accuracy through hierarchical fusion of visual and reference features

Related Work

Hate Speech Detection Research

Unimodal Research: Encompasses text hate speech detection in English and other languages
Multimodal Research: Recent expansion to cross-modal hate detection
Dataset Contributions: Datasets including memotion, Multioff, OLID, and MMHS150K

Deep Learning Interpretability

Attention Mechanisms: Application of cross-attention maps in visual models
Diffusion Models: Interpretability research on Latent Diffusion Models
DAAM Technology: Methods for aggregating cross-attention maps in denoising modules
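
The idea of aggregating cross-attention maps can be illustrated with a small sketch: maps for one prompt token, collected at different spatial resolutions, are upsampled to a common size and summed into a single heatmap. This is a simplified illustration and not the DAAM implementation. The function name is hypothetical, square maps whose side divides the target size are assumed, and nearest-neighbour upsampling is used only for brevity.

```python
import numpy as np

def aggregate_attention(maps, size):
    """Sum per-token attention maps after upsampling them to (size, size).

    maps: list of square (h, h) arrays for one token, e.g. gathered across
          layers, heads, and denoising steps; h must divide `size`.
    """
    total = np.zeros((size, size))
    for m in maps:
        factor = size // m.shape[0]
        # Nearest-neighbour upsampling: repeat each cell factor x factor times.
        total += np.kron(m, np.ones((factor, factor)))
    peak = total.max()
    # Normalise to [0, 1] so a single threshold can be applied afterwards.
    return total / peak if peak > 0 else total
```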

Technical Foundations

Stable Diffusion: Efficient image generation model
CLIP: Contrastive Language-Image Pretraining technique
U-Net: Successful application in image segmentation tasks

Conclusions and Discussion

Main Conclusions

Successfully constructs the first multimodal hate speech dataset based on Stable Diffusion
The proposed DeHater model provides an effective baseline method for multimodal de-hatification tasks
Organizing the shared task advances research in this field

Limitations

Performance Constraints: Best IoU score of only 0.55 indicates methods require further improvement
Dataset Scale: Relatively small dataset size (2,411 instances)
Language Limitations: Primarily focuses on English content, lacking multilingual support
Single Evaluation Metric: Uses only IoU as evaluation metric, potentially insufficient for comprehensive assessment

Future Directions

LLM Integration: Utilize large language models to interpret outputs of hate speech mitigation pipelines
Multilingual Extension: Extend work to other languages and modalities
Method Improvement: Develop more precise hate content localization and removal techniques

In-Depth Evaluation

Strengths

Problem Importance: Addresses critical issues in AI ethics and social responsibility
Method Innovation: First combination of Stable Diffusion with DAAM for hate speech processing
Data Contribution: Provides valuable multimodal hate speech dataset
Openness: Promotes field development through shared tasks
Technical Integration: Skillfully combines multiple cutting-edge technologies (CLIP, U-Net, FiLM)

Weaknesses

Limited Performance: Overall performance level is modest, with best method IoU of only 0.55
Insufficient Evaluation: Lacks human evaluation and qualitative analysis
Interpretability: Insufficient explanation of model decision-making processes
Generalization Ability: Insufficient validation of method generalization across different hate content types
Ethical Considerations: Insufficient discussion of potential negative impacts of generating hateful images

Impact

Field Contribution: Provides new research direction for multimodal hate speech detection
Practical Value: Provides technical foundation for social media content moderation
Reproducibility: Offers detailed method descriptions and datasets
Social Significance: Advances responsible AI development

Applicable Scenarios

Social Media: Automatic platform content moderation and filtering
Online Education: Content safety assurance for educational platforms
AI Training: Cleaning harmful content from AI model training data
Research Tools: Provides benchmark datasets and methods for related research

References

The paper cites extensive related work, including:

Classical datasets and methods for hate speech detection
Foundational technologies such as Stable Diffusion and CLIP
Deep learning interpretability research
Multimodal learning and attention mechanism research

Overall Assessment: This is a paper of significant social importance and technical innovation. While there is room for performance improvement, it provides valuable data resources and methodological foundations for the multimodal hate speech detection field, contributing positively to advancing responsible AI development.