Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification
Malik, Sharma, Bhatt et al.
Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. Their application to high-stakes tasks like hate speech detection therefore requires critical scrutiny. In this work, we investigate the robustness of LLM-based hate speech classification when explicit and implicit markers of the speaker's ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker's linguistic identity. For implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs, 1 LM, and 5 linguistic identities. We find that implicit dialect markers in the input cause model outputs to flip more often than explicit markers do. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for caution in deploying LLMs for high-stakes tasks like hate speech detection.
Large Language Models (LLMs) show significant potential for content moderation and hate speech detection. However, these models are brittle and biased against marginalized communities and dialects. This study investigates the robustness of LLMs in hate speech classification by injecting explicit and implicit markers of speaker ethnicity into the input. It finds that implicit dialect markers flip model outputs more often than explicit markers do, that the percentage of flips varies by ethnicity, and that larger models are more robust.
Practical Application Needs: Language technologies are increasingly deployed for content moderation tasks, including hate speech detection, due to their capacity to process large volumes of data
High-Risk Task: Hate speech detection is a high-stakes task requiring careful LLM deployment
Global Challenges: With LLM adoption expanding globally, maintaining inclusivity across all nationalities is essential
Based on these issues, this paper aims to systematically analyze the impact of speaker identity on LLM hate speech classification, addressing gaps in existing research regarding user identity effects.
First Systematic Study: Novel investigation of speaker identity effects on LLM hate speech detection
Dual Marking Methodology: Proposes a systematic approach using explicit and implicit markers to inform models of speaker identity
Comprehensive Experimental Evaluation: Conducts extensive experiments across 4 language models (3 LLMs and 1 LM) and 2 datasets, revealing model vulnerabilities across different settings
Important Findings: Discovers that implicit dialect markers are more likely to cause output flipping than explicit markers, with flip rates varying by ethnicity
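The dual marking approach listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the prefix template in `inject_explicit`, the placeholder identity name `IDENTITY_A`, and the `to_dialect` callable are all hypothetical, since the source does not give the exact phrase or the dialect transformation used.

```python
from typing import Callable, Dict


def inject_explicit(text: str, identity: str) -> str:
    """Explicit marker: add a phrase naming the speaker's linguistic identity.

    The template below is an assumption made for illustration only.
    """
    return f'A speaker of {identity} says: "{text}"'


def inject_implicit(text: str, to_dialect: Callable[[str], str]) -> str:
    """Implicit marker: rewrite the text with dialectal features.

    `to_dialect` is a hypothetical, caller-supplied rewriting function;
    this sketch does not implement any real dialect transformation.
    """
    return to_dialect(text)


def build_variants(
    text: str, identity: str, to_dialect: Callable[[str], str]
) -> Dict[str, str]:
    """Return the original input alongside its two marked variants."""
    return {
        "original": text,
        "explicit": inject_explicit(text, identity),
        "implicit": inject_implicit(text, to_dialect),
    }


# Toy usage: str.upper stands in for a dialect rewriter purely as a placeholder.
variants = build_variants("hello there", "IDENTITY_A", str.upper)
```

Keeping the dialect rewriter as a parameter separates the marking logic from any particular transformation tool, so each of the linguistic identities under study could supply its own rewriter.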
Input: English sentence + speaker ethnicity identity marker (explicit or implicit)
Output: Hate speech classification (Hateful/Non-Hateful)
Objective: Analyze the degree to which identity markers influence classification results
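One way to quantify the objective above is the percentage of inputs whose predicted label changes once a marker is injected. The sketch below is an assumed formulation, not the paper's exact metric definition: `classify` is a hypothetical stand-in for any model that returns "Hateful" or "Non-Hateful", and the keyword-based `toy_classifier` exists only to make the example runnable.

```python
from typing import Callable, Sequence


def flip_rate(
    classify: Callable[[str], str],
    originals: Sequence[str],
    marked: Sequence[str],
) -> float:
    """Percentage of paired inputs whose label differs after marking.

    originals[i] and marked[i] must be the same sentence without and
    with the identity marker.
    """
    if len(originals) != len(marked):
        raise ValueError("originals and marked must have the same length")
    if not originals:
        return 0.0
    flips = sum(classify(o) != classify(m) for o, m in zip(originals, marked))
    return 100.0 * flips / len(originals)


def toy_classifier(text: str) -> str:
    """Hypothetical keyword classifier, used only to exercise flip_rate."""
    return "Hateful" if "bad" in text.lower() else "Non-Hateful"


originals = ["bad thing", "nice day", "hello", "bad"]
marked = ["bad thing", "bad day", "hello", "good"]
rate = flip_rate(toy_classifier, originals, marked)  # 2 of 4 labels change
```

Computing this rate separately for explicit and implicit variants, and per identity and per model, would give the kind of comparison the summary describes.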
The paper cites multiple important studies, including:
Sap et al. (2019): Racial bias risks in hate speech detection
Field et al. (2021, 2023): Racism investigations in NLP
Harris et al. (2022): African American English bias in hate speech classification
Ribeiro et al. (2020): Behavioral testing framework CheckList for NLP models
Overall Assessment: This is an important research paper in the AI ethics and fairness domain. Through systematic experimental design and comprehensive evaluation, it reveals identity bias problems in LLMs for hate speech detection tasks. While solutions require further development, the paper provides valuable insights and warnings for research and practice in this field.