Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification

Malik, Sharma, Bhatt et al.
Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs particularly when explicit and implicit markers of the speaker's ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker's linguistic identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs and 1 LM and 5 linguistic identities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.

Basic Information

  • Paper ID: 2410.20490
  • Title: Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification
  • Authors: Ananya Malik (Northeastern University), Kartik Sharma (Georgia Institute of Technology), Shaily Bhatt (Carnegie Mellon University), Lynnette Hui Xian Ng (Carnegie Mellon University)
  • Classification: cs.CL cs.AI
  • Publication Date: October 12, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2410.20490

Abstract

Large Language Models (LLMs) demonstrate significant potential for content moderation and hate speech detection applications. However, these models exhibit vulnerabilities and biases toward marginalized communities and dialects. This study investigates the robustness of LLMs in hate speech classification by injecting explicit and implicit markers of speaker ethnicity into inputs. The research reveals that implicit dialect markers are more likely to cause model output flipping than explicit markers, with flip percentages varying by ethnicity, and larger models demonstrating greater robustness.

Research Background and Motivation

Core Research Question

How robust are large language models in hate speech detection tasks when input text contains speaker ethnicity information?

Significance

  1. Practical Application Needs: Language technologies are increasingly deployed for content moderation tasks, including hate speech detection, due to their capacity to process large volumes of data
  2. High-Risk Task: Hate speech detection is a high-stakes task requiring careful LLM deployment
  3. Global Challenges: With LLM adoption expanding globally, maintaining inclusivity across all nationalities is essential

Limitations of Existing Approaches

  1. Bias Issues: LLMs are known to exhibit biases against marginalized communities and dialects, resulting in unfair treatment and representational harm
  2. Vulnerability: LLMs demonstrate fragility, bias, and uncertainty when presented with extraneous information unrelated to the task itself
  3. Dialect Preference: Existing research shows these models favor American English despite different geographic regions using distinct English dialects

Research Motivation

Based on these issues, this paper aims to systematically analyze the impact of speaker identity on LLM hate speech classification, addressing gaps in existing research regarding user identity effects.

Core Contributions

  1. First Systematic Study: Novel investigation of speaker identity effects on LLM hate speech detection
  2. Dual Marking Methodology: Proposes a systematic approach using explicit and implicit markers to inform models of speaker identity
  3. Comprehensive Experimental Evaluation: Conducts extensive experiments across 4 language models and 2 datasets, revealing model vulnerabilities across different settings
  4. Important Findings: Discovers that implicit dialect markers are more likely to cause output flipping than explicit markers, with flip rates varying by ethnicity

Methodology

Task Definition

Input: English sentence + speaker ethnicity identity marker (explicit or implicit)
Output: Hate speech classification (Hateful/Non-Hateful)
Objective: Analyze the degree to which identity markers influence classification results

Experimental Design

1. Language Identity Selection

Five nationalities/groups with distinct English dialects:

  • Indian
  • Singaporean
  • British
  • Jamaican
  • African-American

2. Marker Injection Methods

Explicit Markers: Direct mention of linguistic identity in prompts

Example: The [ethnicity] person said, "[input]"
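
The explicit-marker template above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the template string follows the example given here, while the function and constant names are hypothetical.

```python
# Template copied from the example above; names below are illustrative only.
EXPLICIT_TEMPLATE = 'The {ethnicity} person said, "{text}"'

# The five linguistic identities listed in the summary.
IDENTITIES = ["Indian", "Singaporean", "British", "Jamaican", "African-American"]


def inject_explicit_marker(text: str, ethnicity: str) -> str:
    """Wrap the original input in a sentence that names the speaker's identity."""
    return EXPLICIT_TEMPLATE.format(ethnicity=ethnicity, text=text)


def explicit_variants(text: str) -> dict[str, str]:
    """Produce one explicitly marked variant of the input per identity."""
    return {e: inject_explicit_marker(text, e) for e in IDENTITIES}
```

Each original sentence yields five marked variants, which can then be classified and compared against the prediction for the unmarked sentence.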

Implicit Markers: The speaker's identity is signaled indirectly by injecting dialect features into the input, including:

  • Dialect-specific colloquial vocabulary (e.g., Singapore's "mah," British "mate")
  • Cultural themes and phrases
  • Code-mixing
  • Region-specific spelling

3. Dialect Data Generation

Dialect data is generated by few-shot prompting Llama-3-70B:

  • Temperature set to 0 for deterministic output
  • Instructions included to avoid content filtering
  • Manual verification conducted to ensure quality
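
A few-shot rewriting prompt of this kind might be assembled as follows. This is a sketch under stated assumptions: the instruction wording, the `Shot` type, and the example pair are all hypothetical, since the summary does not reproduce the prompt the authors used. Only the temperature of 0 comes from the text above, and no model API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    source: str   # original sentence
    rewrite: str  # the same sentence rewritten in the target dialect


def build_dialect_prompt(sentence: str, dialect: str, shots: list[Shot]) -> str:
    """Assemble a few-shot prompt asking for a dialect rewrite (wording is assumed)."""
    header = (
        f"Rewrite the sentence in {dialect} English. "
        "Keep the meaning unchanged and use Latin script only."
    )
    examples = "\n\n".join(
        f"Sentence: {s.source}\nRewrite: {s.rewrite}" for s in shots
    )
    return f"{header}\n\n{examples}\n\nSentence: {sentence}\nRewrite:"


# Temperature 0 as stated in the summary; other decoding settings are not given.
GENERATION_CONFIG = {"temperature": 0.0}

prompt = build_dialect_prompt(
    "How are you?",
    "British",
    [Shot("Hello, my friend.", "Hello, mate.")],  # illustrative pair, not from the paper
)
```

The resulting string would be passed to the generation model together with `GENERATION_CONFIG`, and the outputs would still need the manual verification described above.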

Quality Verification

Multi-dimensional evaluation of generated dialect data:

  1. Dialect Accuracy: Whether vocabulary accurately reflects the given ethnicity's dialect
  2. Context Preservation: Whether original semantics and dialect are maintained
  3. Fluency and Grammar: Whether generated text is fluent and grammatically correct
  4. Latin Script Usage: Whether sentences are written in Latin-script (English alphabet) characters

Manual evaluation results show average dialect accuracy of 4/5, with low variance, indicating high generation quality.
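
As a rough illustration of how a mean score and its variance could be obtained from annotator ratings, consider the sketch below. It assumes a 1 to 5 rating scale and uses made-up ratings; neither the scale bounds nor the data come from the paper.

```python
from statistics import mean, pvariance


def summarize_ratings(ratings: list[int]) -> tuple[float, float]:
    """Return (mean, population variance) for ratings on an assumed 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return float(mean(ratings)), float(pvariance(ratings))


# Hypothetical ratings for one dialect, for illustration only.
avg, var = summarize_ratings([4, 4, 5, 3, 4])
```

A mean near 4 with a small variance would correspond to the "high quality, low variance" pattern the summary reports.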

Experimental Setup

Datasets

  1. MPBHSD: From Twitter, 4Chan, and Reddit, containing 600 hate speech and 2,400 non-hate speech instances
  2. HateXplain: From Twitter and Gab, sampling 3,000 sentences, including 2,094 hate speech and 906 non-hate speech instances
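
The two samples have the same size but very different class balance, which is worth keeping in mind when reading the accuracy and flip figures. A short calculation using only the counts listed above (the dictionary and function names are illustrative):

```python
# Instance counts as reported in the summary above.
DATASETS = {
    "MPBHSD": {"hate": 600, "non_hate": 2400},
    "HateXplain": {"hate": 2094, "non_hate": 906},
}


def hate_share(counts: dict[str, int]) -> float:
    """Fraction of instances labeled as hate speech."""
    total = counts["hate"] + counts["non_hate"]
    return counts["hate"] / total


for name, counts in DATASETS.items():
    print(f"{name}: {hate_share(counts):.1%} hate")
```

Roughly one in five MPBHSD instances is hateful, compared with about seven in ten for the HateXplain sample.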

Models

  • LLMs: Llama-3-8B, Llama-3-70B, GPT-4o
  • Traditional Models: BERT fine-tuned on HateXplain dataset
  • Prompting Strategies: Zero-shot classification and in-context learning (ICL)
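
The difference between the two prompting strategies can be made concrete with a small sketch. The prompt wording and function names are assumptions for illustration; the paper's actual prompts are not reproduced in this summary. Only the two label names come from the task definition above.

```python
LABELS = ("Hateful", "Non-Hateful")
INSTRUCTION = "Classify the following text as Hateful or Non-Hateful."


def zero_shot_prompt(text: str) -> str:
    """Instruction plus the input, with no labeled demonstrations."""
    return f"{INSTRUCTION}\n\nText: {text}\nLabel:"


def icl_prompt(text: str, demos: list[tuple[str, str]]) -> str:
    """Instruction, labeled demonstrations, then the input (in-context learning)."""
    for _, label in demos:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
    demo_block = "".join(f"Text: {t}\nLabel: {l}\n\n" for t, l in demos)
    return f"{INSTRUCTION}\n\n{demo_block}Text: {text}\nLabel:"
```

In both cases the marked input (explicit or implicit) simply takes the place of `text`, so any change in the predicted label can be attributed to the marker rather than to the prompt format.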

Evaluation Metrics

  • Primary Metric: Model output flip percentage
  • Flip Types:
    • NH→H: Prediction flips from non-hate to hate when a marker is added (adds false positives)
    • H→NH: Prediction flips from hate to non-hate when a marker is added (adds false negatives)
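
A minimal sketch of the flip metric follows. It assumes that each input has one prediction without a marker and one with a marker, and that percentages are taken over all paired inputs; the paper may normalize differently (for example, per original label), so the denominator is an assumption.

```python
def flip_percentages(baseline: list[str], marked: list[str]) -> dict[str, float]:
    """Percentage of inputs whose predicted label changes once a marker is added.

    Labels are "H" (hate) and "NH" (non-hate). The denominator is the number
    of paired inputs, which is an assumption rather than the paper's definition.
    """
    if len(baseline) != len(marked) or not baseline:
        raise ValueError("need two non-empty prediction lists of equal length")
    n = len(baseline)
    nh_to_h = sum(b == "NH" and m == "H" for b, m in zip(baseline, marked))
    h_to_nh = sum(b == "H" and m == "NH" for b, m in zip(baseline, marked))
    return {
        "NH->H": 100 * nh_to_h / n,
        "H->NH": 100 * h_to_nh / n,
        "total": 100 * (nh_to_h + h_to_nh) / n,
    }


# Toy predictions for illustration only.
result = flip_percentages(["H", "H", "NH", "NH"], ["NH", "H", "NH", "H"])
```

Reporting the two directions separately matters because an H→NH flip lets hateful content through, while an NH→H flip wrongly flags benign content.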

Experimental Results

Baseline Performance

Without identity markers, models perform well:

  • MPBHSD dataset: Accuracy up to 90%
  • HateXplain dataset: Accuracy reaching 80%

Key Findings

1. Marker Type Impact

  • Implicit markers are more likely to cause model output flipping than explicit markers
  • For all models except Llama-3-8B, flip rates under implicit markers are significantly higher (p < 0.05)

2. Model Scale Effects

  • Larger, more recent models (e.g., Llama-3-70B and GPT-4o) demonstrate greater robustness
  • Lower flip percentages and more stable performance

3. Prompting Technique Impact

  • In-context learning (ICL) typically produces lower flip rates than zero-shot classification
  • Providing examples yields more stable and consistent model outputs

4. Ethnic Variation

Significant differences in flip rates across ethnic identities:

  • In larger models, British and African-American dialect data show higher H→NH flip rates
  • McNemar tests show speaker identity significantly impacts classification across all models (p < 0.05)
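
McNemar's test suits this setting because the same inputs are classified twice (with and without a marker), and only the discordant pairs, meaning the flips, carry information. The sketch below implements the exact two-sided (binomial) variant using the standard library. Whether the authors used the exact or the chi-square form is not stated in this summary, so the choice here is an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts.

    b: inputs predicted H without the marker and NH with it
    c: inputs predicted NH without the marker and H with it
    Under the null hypothesis, each discordant pair is equally likely to
    fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no flips at all, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

For example, ten flips in one direction and none in the other give p = 2/1024, well below 0.05, whereas five flips in each direction give p = 1.0.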

5. Original Label Impact

  • Non-hate (NH) predictions typically remain non-hate across different models and speaker identities
  • Hate (H) predictions more easily flip to non-hate, increasing false negative rates

6. Target Group Analysis

  • HateXplain-BERT shows more flipping on certain dialects targeting religious groups
  • GPT-4o shows flipping across all dialects on sexual orientation-related targets

Special Case: Llama-3-8B

This model exhibits anomalously high flip rates:

  • Approximately 40% flip rate for the ICL variant on the MPBHSD dataset
  • Frequently fails to detect ironic explicit and implicit cues
  • Over-reacts to negative framing
  • More frequent misclassification on shorter inputs

Ablation Studies

Dialect Identification Accuracy

GPT-4o is used as an evaluator to test whether the dialect of each generated sentence can be identified:

  • African-American: 96.3%
  • British: 99.8%
  • Indian: 100%
  • Singaporean: 99.8%
  • Jamaican: 100%

High identification accuracy confirms dialect feature effectiveness.

Synthetic Modification Comparison

Testing other synthetic modifications (paraphrasing, voice changes, length constraints) on flip rates:

  • Paraphrasing: H→NH 0.17%, NH→H 0.0%
  • Voice changes: H→NH 0.08%, NH→H 0.02%
  • Length constraints: H→NH 0.16%, NH→H 0.01%

These modifications produce far lower flip rates than dialect injection, indicating that the flips stem from the identity markers themselves rather than from synthetic rewriting in general.

Major Research Directions

  1. LLM Bias Research: Extensive literature documenting biases against marginalized communities and dialects
  2. Hate Speech Detection: Traditional approaches primarily focus on content itself, with limited consideration of speaker identity
  3. Cross-Cultural NLP: Research on language processing differences across cultural backgrounds
  4. Dialect Processing: Focus on different English dialects' performance in NLP tasks

Novel Contributions of This Work

  • First systematic study of speaker identity effects on hate speech classification
  • Dual approach using explicit and implicit markers
  • Comprehensive evaluation across multiple models and datasets

Conclusions and Discussion

Main Conclusions

  1. Widespread Vulnerability: All tested LLMs exhibit varying degrees of vulnerability after injecting speaker ethnicity markers
  2. Larger Implicit Impact: Dialect features have greater model impact than explicit identity mentions
  3. Scale Improves Robustness: Larger models demonstrate greater robustness, though biases persist
  4. Significant Ethnic Variation: Different ethnic identities produce significantly different flip rates
  5. False Negative Risk: Models tend to misclassify hate speech as non-hate, potentially allowing harmful content to go undetected

Limitations

  1. Dialect Data Constraints: Lack of manually annotated hate speech data in different dialects
  2. Limited Model Range: Unable to test more "safety-focused" models like Claude due to computational constraints
  3. Dataset Limitations: Restricted to English mixed-dialect datasets
  4. Synthetic Data Bias: Generated dialect data may contain unknown author biases

Future Directions

  1. Multilingual Extension: Expand to multilingual datasets and other hate speech datasets
  2. Interpretability Research: Conduct deeper interpretability studies assessing precise impacts of specific phrases on model prediction patterns
  3. Mitigation Strategies: Develop methods and techniques to reduce identity bias
  4. Larger-Scale Evaluation: Evaluate across more models and larger datasets

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses critical issues in AI ethics and fairness
  2. Methodological Innovation: Proposes systematic approach using explicit and implicit markers
  3. Comprehensive Experiments: Extensive evaluation across multiple models, datasets, and ethnic identities
  4. Result Credibility: Validates result significance through statistical testing
  5. Practical Value: Provides important warnings for LLM deployment in high-stakes tasks

Weaknesses

  1. Causal Mechanisms: Flips are observed, but the underlying mechanisms are not analyzed in depth
  2. Mitigation Solutions: Primarily identifies problems without proposing concrete solutions
  3. Evaluation Limitations: Relatively small manual evaluation samples (50 samples per dialect)
  4. Dialect Representativeness: Selected dialects may not fully represent regional micro-dialects and communities

Impact

  1. Academic Contribution: Provides new perspectives and methodologies for LLM fairness research
  2. Practical Significance: Offers important guidance for content moderation system design and deployment
  3. Policy Impact: May influence AI system regulation and standard-setting
  4. Foundation for Future Research: Establishes foundation for subsequent research in related fields

Applicable Scenarios

  1. Content Moderation Systems: Hate speech detection systems for social media platforms
  2. AI Ethics Evaluation: LLM fairness and bias assessment
  3. Multicultural AI Systems: AI applications serving global users
  4. Regulatory Compliance: AI system fairness audits and compliance checks

References

The paper cites multiple important studies, including:

  • Sap et al. (2019): Racial bias risks in hate speech detection
  • Field et al. (2021, 2023): Racism investigations in NLP
  • Harris et al. (2022): African American English bias in hate speech classification
  • Ribeiro et al. (2020): Behavioral testing framework CheckList for NLP models

Overall Assessment: This is an important research paper in the AI ethics and fairness domain. Through systematic experimental design and comprehensive evaluation, it reveals identity bias problems in LLMs for hate speech detection tasks. While solutions require further development, the paper provides valuable insights and warnings for research and practice in this field.