The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response, we introduce a multimodal dataset crafted for identifying hate in digital content. Central to our methodology is the application of watermarked, stability-enhanced Stable Diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination pinpoints the hateful elements within images and generates detailed hate attention maps, which are then used to blur those regions and so remove the hateful sections of the image. We release this dataset as part of the DeHate shared task, whose details this paper also describes. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
- Paper ID: 2509.21787
- Title: DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
- Authors: Dwip Dalal, Gautam Vashishtha, Anku Rani, Aishwarya Reganti, Parth Patwa, Mohd Sarique, Chandan Gupta, Keshav Nath, Viswanatha Reddy, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
- Classification: cs.CV cs.CL
- Conference: Defactify 3: Third Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2024
- Paper Link: https://arxiv.org/abs/2509.21787
The proliferation of harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. To address this, the paper introduces a specialized multimodal dataset designed for identifying hate speech in digital content. The core methodology innovatively applies watermarked, stability-enhanced Stable Diffusion technology combined with a Digital Attention Analysis Module (DAAM). This combination enables precise localization of hateful elements within images, generating detailed hate attention maps for blurring these regions and removing hateful components from images. The authors release this dataset as part of the DeHate shared task and propose DeHater, a vision-language model specifically designed for multimodal de-hatification tasks.
The core problem addressed in this research is the detection and mitigation of hate speech in multimodal settings, particularly image-text combinations. As AI applications develop rapidly, hateful content in the training data of large language models (LLMs) not only compromises model utility but also raises serious ethical concerns.
- Digital Environment Health: The surge in online hate content severely impacts the quality of public discourse
- AI Ethics: Hateful content in training data directly affects the trustworthiness and ethical integrity of AI systems
- Social Responsibility: There is a need to develop responsible AI systems to address hate speech on social media
- Lack of high-quality multimodal hate speech detection datasets
- Existing methods primarily focus on single modalities (text or image), lacking effective multimodal fusion
- Absence of targeted techniques for hate content localization and removal
Motivated by the need for high-quality datasets and by the technical challenges of multimodal hate speech detection, this work aims to construct a dataset and a methodological framework that advance responsible AI development.
- Innovative Dataset Construction Method: Proposes a multimodal hate speech dataset generation method based on Stable Diffusion and DAAM
- Multimodal De-hatification Model: Designs the DeHater model capable of unsupervised masking of hateful image content guided by text prompts
- Shared Task Organization: Releases the DeHate dataset of 2,411 instances and organizes the associated shared task
- Technical Method Innovation: Combines a CLIP encoder, U-Net-style skip connections, and FiLM modulation in a single architecture
The paper defines the task as multimodal image de-hatification: given an image containing hateful content and corresponding text prompts, the model must identify and mask hateful regions in the image, generating a de-hatified image version.
- Hatenorm Dataset: Employs manually annotated parallel corpora of hateful text and their normalized versions
- Stable Diffusion Generation: Utilizes the stable-diffusion-2-base model to convert hateful text into visual representations
- Image Generation: Extracts keywords from hateful text to construct prompts, using Stable Diffusion to generate corresponding images
- Attention Map Generation: Applies DAAM technology to generate heatmaps highlighting the relevance of specific pixels to prompt components
- Selective Blurring:
  - Computes global heatmap values and establishes thresholds to generate binary masks
  - Sets pixels with high heatmap values to black (0,0,0)
  - Computes local neighborhood average colors for marked pixels and applies them
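
The selective blurring steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `selective_blur`, the min-max normalization, the `threshold` and `radius` values, and the choice to average only the unmarked neighbors of each marked pixel are all hypothetical, since the summary does not specify them.

```python
import numpy as np


def selective_blur(image, heatmap, threshold=0.5, radius=2):
    """Blur the regions of `image` that a heatmap marks as relevant.

    image:   (H, W, 3) uint8 array.
    heatmap: (H, W) array of attention scores (any numeric range).
    Returns the blurred image and the boolean mask of marked pixels.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    lo, hi = heatmap.min(), heatmap.max()
    # Assumption: min-max normalize so one global threshold applies.
    norm = (heatmap - lo) / (hi - lo) if hi > lo else np.zeros_like(heatmap)
    mask = norm >= threshold  # binary mask of high-heatmap pixels

    out = image.copy()
    out[mask] = 0  # marked pixels are first set to black (0, 0, 0)

    height, width = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        keep = ~mask[y0:y1, x0:x1]  # unmarked pixels inside the window
        if keep.any():
            # Assumption: fill with the mean color of unmarked neighbors.
            neighbours = image[y0:y1, x0:x1][keep]  # shape (k, 3)
            out[y, x] = np.rint(neighbours.mean(axis=0)).astype(image.dtype)
        # Pixels with no unmarked neighbor in the window stay black.
    return out, mask
```

Calling `selective_blur(image, heatmap)` returns both the blurred image and the mask, so the mask can be kept as the ground-truth region for evaluation. A larger `radius` draws the fill color from a wider neighborhood.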
DeHater employs an unsupervised image masking approach, using text prompts to guide the identification and occlusion of harmful regions in images.
- CLIP Encoder:
  - Utilizes a frozen CLIP model as the encoder
  - Leverages its advantages from pretraining on diverse image-text pairs
  - Extracts rich multimodal feature representations
- U-Net-Inspired Connections:
  - Adopts skip connection design from U-Net architecture
  - Transmits local information from CLIP encoder to decoder
  - Preserves critical details while maintaining decoder compactness
- Feature Integration Mechanism:
  - Integrates encoder activations (including CLS token) into each transformer block of the decoder
  - Enriches decoder understanding of context
- FiLM Modulation:
  - Employs Feature-wise Linear Modulation technique
  - Modulates decoder input activations through conditional vectors
  - Enhances decoder focus and accurate segmentation of hateful content
- Learnable Projection Network:
  - Combines multiple hate fragment embeddings into a single projection
  - Achieves nuanced and effective compression of diverse hateful elements
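
To make the FiLM step concrete, the sketch below shows the generic Feature-wise Linear Modulation operation: a conditioning vector is mapped to a per-channel scale `gamma` and shift `beta`, which are applied to the activations. The function name `film`, the tensor shapes, and the use of plain linear maps to produce `gamma` and `beta` are assumptions for illustration. The summary does not describe DeHater's exact layer, so this should not be read as its implementation.

```python
import numpy as np


def film(activations, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: gamma(condition) * activations + beta(condition).

    activations: (tokens, channels) decoder input activations.
    condition:   (cond_dim,) conditioning vector, e.g. a text embedding.
    w_gamma, w_beta: (cond_dim, channels) weights (hypothetical).
    b_gamma, b_beta: (channels,) biases (hypothetical).
    """
    gamma = condition @ w_gamma + b_gamma  # per-channel scale
    beta = condition @ w_beta + b_beta     # per-channel shift
    # Broadcasting applies the same scale and shift to every token.
    return gamma * activations + beta


# Usage with random stand-in values (no trained weights involved).
rng = np.random.default_rng(0)
tokens, channels, cond_dim = 4, 8, 16
x = rng.normal(size=(tokens, channels))
c = rng.normal(size=cond_dim)
y = film(
    x, c,
    rng.normal(size=(cond_dim, channels)), np.ones(channels),
    rng.normal(size=(cond_dim, channels)), np.zeros(channels),
)
```

Because `gamma` and `beta` depend only on the conditioning vector, the same text prompt rescales every spatial token in the same way, which is how the condition can steer the decoder toward prompt-relevant features.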
The model outputs a binarized image, clearly identifying regions deemed hateful in the original content and masking them.
- Multimodal Fusion: First combination of Stable Diffusion with DAAM for hate speech detection
- Attention Mechanism: Innovative use of cross-attention maps for hate content localization
- Architecture Design: Combined CLIP+U-Net+FiLM architectural design
- Unsupervised Learning: Achieves unsupervised image masking based on text prompts
- DeHate Dataset: Total of 2,411 instances
  - Training set: 1,687 instances
  - Test set: 724 instances
- Data Composition: Each instance includes the original generated image and the image with blurred hateful components
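
As a quick consistency check on the figures above, the short snippet below confirms that the two splits sum to the reported total and computes the resulting proportions. The variable and function names are illustrative only.

```python
def split_fractions(train, test):
    """Return the total size and the train/test fractions of a split."""
    total = train + test
    return total, train / total, test / total


total, train_frac, test_frac = split_fractions(1687, 724)
print(total)  # 2411
print(f"train={train_frac:.1%} test={test_frac:.1%}")
```

The split works out to roughly 70% training and 30% test data.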
Uses Intersection over Union (IoU) as the primary evaluation metric, calculating the overlap between predicted blurred components and ground truth blurred components.
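
For two binary masks, IoU is the number of pixels marked in both masks divided by the number marked in either. The sketch below is a minimal NumPy version. The function name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions, as the summary does not say how the shared task handles that case.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over Union between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


pred = np.array([[1, 1, 0],
                 [0, 0, 0]])
target = np.array([[0, 1, 1],
                   [0, 0, 0]])
score = mask_iou(pred, target)  # 1 shared pixel out of 3 marked pixels
```

A test-set score would then be an aggregate of per-image IoU values, for example their mean, though the aggregation used by the organizers is not stated here.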
- Participating Teams: 20+ registrations, 5 valid submissions
- Evaluation Method: Ranking based on IoU scores on the test set
| Rank | Team Name | IoU Score |
|---|---|---|
| 1 | UniteToModerate | 0.55 |
| 2 | PaulJane | 0.51 |
| 3 | Baseline (This Work) | 0.49 |
| 4 | Markans | 0.48 |
| 5 | Sanskarfc | 0.47 |
| 6 | rachitmodi | 0.44 |
- Baseline Performance: The proposed baseline method achieves an IoU score of 0.49
- Task Difficulty: The best performance of only 0.55 indicates this task presents considerable challenges
- Performance Gap: Scores are tightly clustered (0.44 to 0.55) and only two submissions beat the baseline, leaving substantial room for improvement
The UniteToModerate team employed a combination of NExT-Chat and UniFusion models:
- NExT-Chat: Provides initial mask generation through the pix2emb method
- UniFusion: Enhances accuracy through hierarchical fusion of visual and reference features
- Unimodal Research: Encompasses text hate speech detection in English and other languages
- Multimodal Research: Recent expansion to cross-modal hate detection
- Dataset Contributions: Datasets including Memotion, MultiOFF, OLID, and MMHS150K
- Attention Mechanisms: Application of cross-attention maps in visual models
- Diffusion Models: Interpretability research on Latent Diffusion Models
- DAAM Technology: Methods for aggregating cross-attention maps in denoising modules
- Stable Diffusion: Efficient image generation model
- CLIP: Contrastive Language-Image Pretraining technique
- U-Net: Successful application in image segmentation tasks
- Successfully constructs the first multimodal hate speech dataset based on Stable Diffusion
- The proposed DeHater model provides an effective baseline method for multimodal de-hatification tasks
- The shared task helps advance research in this field
- Performance Constraints: Best IoU score of only 0.55 indicates methods require further improvement
- Dataset Scale: Relatively small dataset size (2,411 instances)
- Language Limitations: Primarily focuses on English content, lacking multilingual support
- Single Evaluation Metric: Uses IoU as the only evaluation metric, which may be insufficient for comprehensive assessment
- LLM Integration: Utilize large language models to interpret outputs of hate speech mitigation pipelines
- Multilingual Extension: Extend work to other languages and modalities
- Method Improvement: Develop more precise hate content localization and removal techniques
- Problem Importance: Addresses critical issues in AI ethics and social responsibility
- Method Innovation: First combination of Stable Diffusion with DAAM for hate speech processing
- Data Contribution: Provides valuable multimodal hate speech dataset
- Openness: Promotes field development through shared tasks
- Technical Integration: Skillfully combines multiple cutting-edge technologies (CLIP, U-Net, FiLM)
- Limited Performance: Overall performance is modest, with a best IoU of only 0.55
- Insufficient Evaluation: Lacks human evaluation and qualitative analysis
- Interpretability: Insufficient explanation of model decision-making processes
- Generalization Ability: Insufficient validation of method generalization across different hate content types
- Ethical Considerations: Insufficient discussion of potential negative impacts of generating hateful images
- Field Contribution: Provides new research direction for multimodal hate speech detection
- Practical Value: Provides technical foundation for social media content moderation
- Reproducibility: Offers detailed method descriptions and datasets
- Social Significance: Advances responsible AI development
- Social Media: Automatic platform content moderation and filtering
- Online Education: Content safety assurance for educational platforms
- AI Training: Cleaning harmful content from AI model training data
- Research Tools: Provides benchmark datasets and methods for related research
The paper cites extensive related work, including:
- Classical datasets and methods for hate speech detection
- Foundational technologies such as Stable Diffusion and CLIP
- Deep learning interpretability research
- Multimodal learning and attention mechanism research
Overall Assessment: This is a paper of significant social importance and technical innovation. While there is room for performance improvement, it provides valuable data resources and methodological foundations for the multimodal hate speech detection field, contributing positively to advancing responsible AI development.