HebID: Detecting Social Identities in Hebrew-language Political Text

Mor-Lan, Rivlin-Angert, Kaplan et al.
Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.

Basic Information

  • Paper ID: 2508.15483
  • Title: HebID: Detecting Social Identities in Hebrew-language Political Text
  • Authors: Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav
  • Classification: cs.CL (Computation and Language)
  • Publication Date: arXiv preprint, October 12, 2025
  • Paper Link: https://arxiv.org/abs/2508.15483

Abstract

Political language is closely intertwined with social identity. While social identities are often shaped by specific cultural contexts, existing NLP datasets are predominantly English-centric, employ single-label classification, and focus on coarse-grained identity categories. This paper introduces HebID, the first multi-label Hebrew-language corpus for social identity detection, containing 5,536 sentences from Israeli politicians' Facebook posts (December 2018–April 2021), manually annotated with 12 fine-grained social identities (such as right-wing, ultra-Orthodox, socially-oriented) based on survey data. The study compares multi-label and single-label encoders as well as generative large language models with 2B-9B parameters, finding that Hebrew-tuned LLMs perform best (macro-average F1 = 0.74).

Research Background and Motivation

Problem Description

  1. Imbalanced Language Resources: Existing social identity detection resources are almost entirely English-centric, lacking support for non-English political contexts
  2. Coarse Annotation Granularity: Existing datasets primarily focus on coarse-grained categories (such as political party or ethnicity), failing to capture the complexity of political discourse
  3. Single-Label Limitations: Most datasets employ single-label classification, unable to handle the reality of multiple identity expressions
  4. Missing Cultural Context: Lack of identity category selection based on specific cultural backgrounds and empirical surveys

Research Significance

  • Social identity is an important driver of political behavior and public discourse
  • Hebrew, as a low-resource language, is underrepresented in NLP research
  • The complexity of the Israeli political environment provides an ideal scenario for studying multi-dimensional identity expression

Limitations of Existing Approaches

  • Group Mention Detection: Limited to explicit group mentions, unable to capture implicit identity expressions
  • Frame and Stance Analysis: Primarily focuses on single-label stance or frames, lacking multi-label identity category support
  • Ideology Inference: Can only infer broad ideological tendencies, unable to detect explicit identity mentions

Core Contributions

  1. Novel Dataset: Construction of the first publicly available Hebrew-language multi-label social identity detection dataset
  2. Survey-Driven Methodology: Establishment of a framework for text annotation guided by large-scale survey data
  3. Comprehensive Benchmarking: Evaluation of encoder and decoder model performance on this task
  4. Cross-Domain Evaluation: Verification of model generalization ability on parliamentary speech data
  5. External Validation: Validation of classifier effectiveness through the CHES-Israel expert survey
  6. Sociolinguistic Analysis: Revelation of identity dynamics differences across platforms and populations

Methodology Details

Task Definition

  • Input: Hebrew-language sentences
  • Output: Multi-label binary classification results for 12 social identities
  • Objective: Determine which social identities are actively expressed or referenced in a given sentence

Identity Category Selection Method

  1. Survey Foundation: Based on 12 waves of representative panel surveys (N=1,769), spanning January 2019 to April 2021
  2. Expert Guidance: 28 candidate identities selected by a panel of Israeli political experts
  3. Threshold Filtering: Selection of 12 identities consistently exceeding 5% selection threshold across the first five survey waves
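
The threshold-filtering step can be sketched as follows. This is a minimal illustration, assuming that "consistently exceeding" means a share strictly above 5% in every one of the first five waves; the function name, the data layout, and the toy shares are hypothetical and are not taken from the survey.

```python
def select_identities(shares_by_wave, threshold=0.05, n_waves=5):
    """Keep identities whose share is above `threshold` in each of the first `n_waves` waves."""
    selected = []
    for identity, shares in shares_by_wave.items():
        first = shares[:n_waves]
        # Require data for all n_waves waves and a share above the threshold in each.
        if len(first) == n_waves and all(s > threshold for s in first):
            selected.append(identity)
    return selected


# Made-up shares for illustration only; these are not the survey's figures.
toy = {
    "Identity A": [0.12, 0.10, 0.11, 0.09, 0.13, 0.02],
    "Identity B": [0.06, 0.04, 0.07, 0.08, 0.06, 0.09],
    "Identity C": [0.051, 0.06, 0.055, 0.07, 0.052],
}
print(select_identities(toy))  # ['Identity A', 'Identity C']
```

Identity B is dropped because it falls below the threshold in the second wave, and the late drop of Identity A in the sixth wave is ignored because only the first five waves are inspected.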

Annotation Scheme

12 Social Identity Categories:

  • Ideology: Right-wing, Left-wing, Conservative, Liberal
  • Economics: Capitalist, Socially-oriented
  • Political Values: Democratic, Honest
  • Cultural-Religious: Zionist, Ultra-Orthodox
  • Group: Palestinians and Arab Israeli citizens, Security-oriented

Annotation Principles:

  • Only annotate actively expressed identities
  • Support multi-label classification
  • Base annotation on content rather than speaker identity
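
A multi-label annotation of this kind is commonly stored as one 0/1 flag per identity. The sketch below shows that encoding, assuming the English label names used in this summary and an arbitrary label order; neither is confirmed as the dataset's actual schema.

```python
# Label names follow the category list in this summary; the order is an assumption.
IDENTITIES = [
    "Right-wing", "Left-wing", "Conservative", "Liberal",
    "Capitalist", "Socially-oriented", "Democratic", "Honest",
    "Zionist", "Ultra-Orthodox",
    "Palestinians and Arab Israeli citizens", "Security-oriented",
]


def to_multi_hot(active):
    """Encode the identities expressed in one sentence as a 12-element 0/1 vector."""
    unknown = set(active) - set(IDENTITIES)
    if unknown:
        raise ValueError(f"unknown identities: {sorted(unknown)}")
    return [1 if name in active else 0 for name in IDENTITIES]


def from_multi_hot(vector):
    """Decode a 0/1 vector back into the list of identity names."""
    return [name for name, flag in zip(IDENTITIES, vector) if flag]


# A sentence may express several identities, or none at all.
print(to_multi_hot({"Right-wing", "Zionist"}))  # [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(to_multi_hot(set()))                      # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```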

Dataset Construction

  • Source: Facebook posts from Israeli parliamentarians, political parties, and candidates
  • Time Range: December 2018 to April 2021
  • Scale: 5,536 sentences sampled from 64K posts (375K sentences)
  • Inter-annotator Agreement: Average Cohen's κ = 0.77
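
For reference, Cohen's κ for a single binary label can be computed as below. This is a generic sketch of the statistic, assuming two annotators and 0/1 decisions per sentence; the summary does not say how the reported average was aggregated, so averaging the per-label values is only one plausible reading.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' 0/1 decisions on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # share of items annotator A marked 1
    p_b1 = sum(b) / n  # share of items annotator B marked 1
    # Agreement expected by chance from the two annotators' marginal rates.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        # Both annotators gave the same constant answer; returning 1.0 is a convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations, not taken from the dataset.
ann_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
ann_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.583
```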

Experimental Setup

Dataset Partitioning

  • Training Set: 70% (3,875 sentences)
  • Validation Set: 15% (830 sentences)
  • Test Set: 15% (831 sentences)
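
The split sizes above are consistent with flooring the 70% and 15% shares and giving the remainder to the test set. The sketch below reproduces the counts under that assumption; whether the authors split at random or stratified by label is not stated in this summary, and the seed is arbitrary.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)  # floor of the training share
    n_val = int(n * val_frac)      # floor of the validation share
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_indices(5536)
print(len(train), len(val), len(test))  # 3875 830 831
```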

Model Types

  1. Baseline Models: Logistic regression and LinearSVC (TF-IDF features)
  2. Multi-label Encoders: Joint learning of 12 identity labels
  3. Single-label Encoders: Separate fine-tuning for each label
  4. Decoder LLMs: Generation of comma-separated label lists
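
Because the decoder LLMs emit free text, their comma-separated output has to be mapped back onto the fixed label set before scoring. The sketch below shows one way to do that; the matching rules (case-insensitive, duplicates and unknown strings dropped) and the three-label list are assumptions for illustration, not the paper's post-processing.

```python
def parse_generated_labels(text, label_set):
    """Map a comma-separated generation onto the known labels it mentions."""
    lookup = {name.casefold(): name for name in label_set}
    found = []
    for piece in text.split(","):
        key = piece.strip().casefold()
        # Keep only strings that match a known label, and keep each label once.
        if key in lookup and lookup[key] not in found:
            found.append(lookup[key])
    return found


# Hypothetical subset of the label set, for illustration.
labels = ["Right-wing", "Zionist", "Honest"]
print(parse_generated_labels(" right-wing, Zionist ,something else, Zionist", labels))
# ['Right-wing', 'Zionist']
```

An empty generation yields an empty list, which corresponds to a sentence with no expressed identity.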

Evaluated Models

Encoder Models:

  • Multilingual: mBERT
  • Hebrew-specific: AlephBERT, HERO, DictaBERT (base/large)

Decoder LLMs:

  • General-purpose: Gemma 2 (2B/9B), Qwen3-8B
  • Hebrew-specific: DictaLM2.0

Evaluation Metrics

  • Macro-average precision, recall, and F1 score
  • F1 score for each identity category
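
Macro-averaging computes each metric per identity and then takes the unweighted mean, so rare identities count as much as frequent ones. The following sketch assumes 0/1 label matrices and scores a label with no positives as 0; that zero-division convention is an assumption and may differ from the paper's evaluation code.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the columns of two 0/1 matrices."""
    n_labels = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    # Unweighted mean across labels: every identity counts equally.
    return (sum(precisions) / n_labels, sum(recalls) / n_labels, sum(f1s) / n_labels)


# Toy example with two labels and four sentences.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
print(macro_prf(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Note that macro-F1 here is the mean of the per-label F1 scores, not the F1 of the macro-averaged precision and recall.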

Experimental Results

Main Results

Best Performance: DictaLM2.0 achieved macro-average F1 = 0.743, about 0.065 above the best encoder model, DictaBERT-Large (0.678)

Model Type           | Best Model      | Macro-average F1
Decoder LLM          | DictaLM2.0      | 0.743
Multi-label Encoder  | DictaBERT-Large | 0.678
Single-label Encoder | DictaBERT-Large | 0.659
Baseline             | LinearSVC       | 0.361

Key Findings

  1. Language-Specific Model Advantages: Hebrew-tuned DictaLM2.0 performed best on 8/12 identity categories
  2. Multi-label Learning Effectiveness: Jointly trained multi-label encoders outperformed separately fine-tuned single-label encoders (0.678 vs 0.659)
  3. Decoder Advantages: The best generative model outperformed both encoder setups on this multi-label task (0.743 vs 0.678 and 0.659)

Cross-Domain Generalization

Testing on 500 parliamentary speech sentences yielded macro-average F1 = 0.72, comparable to performance on the Facebook data, which suggests that the model generalizes across domains.

External Validation

Correlation analysis against the CHES-Israel expert survey found that 16 of 21 relevant correlations were significant at p ≤ 0.1, 13 of them at p ≤ 0.05, with correlation coefficients ranging from |r| = 0.71 to 0.94.
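
A validation of this kind amounts to correlating a classifier-derived score with an expert score across parties. The sketch below computes Pearson's r from first principles; the party-level aggregation (share of a party's sentences predicted to express an identity) and the numbers are assumptions for illustration, since this summary does not describe how the scores were built.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: per-party identity share from a classifier vs. an expert rating.
classifier_share = [0.02, 0.05, 0.11, 0.18, 0.25]
expert_score = [1.5, 3.0, 4.8, 7.1, 8.9]
r = pearson_r(classifier_share, expert_score)
print(r > 0.9)  # True for this toy data
```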

Sociolinguistic Analysis

Identity Popularity Comparison

  • Cross-Platform Consistency: Socially-oriented, right-wing, and democratic identities were universally popular across data sources
  • Platform Differences: Honest and Zionist identities were more popular among the public, while socially-oriented identities were more prominent in parliament

Temporal Trend Analysis

  • Election Cycle Effects: Identity-related discourse peaked during three of four elections
  • Elite-Public Divergence:
    • Socially-oriented identity: Declining public endorsement, increasing political elite usage
    • Honest and democratic identities: Rising public endorsement, declining elite discourse

Identity Clustering Patterns

Factor analysis revealed major left-right polarization:

  • Left-wing Cluster: Left-wing, Democratic, Honest, Liberal, Palestinian
  • Right-wing Cluster: Right-wing, Conservative, Zionist, Security-oriented, Capitalist, Ultra-Orthodox

Gender Differences

  • Identity Expression Intensity: Women expressed more identities across all data sources
  • Identity Preferences:
    • Male tendency: Right-wing, Security-oriented, Capitalist, Ultra-Orthodox
    • Female tendency: Socially-oriented identity significantly favored across all platforms
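
One simple way to quantify expression intensity is the mean number of predicted identities per sentence within each group. The sketch below assumes that reading; the summary does not state the paper's actual measure, and the group names and vectors are toy data.

```python
from collections import defaultdict


def mean_identities_per_sentence(records):
    """records: iterable of (group, multi_hot_vector); returns {group: mean label count}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for group, vector in records:
        totals[group] += sum(vector)  # number of identities flagged in this sentence
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}


# Toy predictions over three hypothetical labels.
toy_records = [
    ("women", [1, 1, 0]),
    ("women", [0, 1, 0]),
    ("men", [1, 0, 0]),
    ("men", [0, 0, 0]),
]
print(mean_identities_per_sentence(toy_records))  # {'women': 1.5, 'men': 0.5}
```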

Related Work

Group Mention Detection

  • GRIT Dataset (Italian): Annotation of social group mentions in news and parliamentary texts
  • British Parliamentary Debates: Quantification of politicians' mentions of specific social groups

Frame and Stance Analysis

  • Us vs. Them Corpus: Target group, stance, and sentiment annotation in Reddit comments
  • U.S. Congressional Speeches: Sentiment classification and frame analysis of 140 years of immigration discourse

Ideology Inference

  • Traditional Methods: Left-right stance classification based on SVM and neural networks
  • Modern Methods: Zero-shot ideology scoring using LLMs

Conclusions and Discussion

Main Conclusions

  1. Hebrew-specific models significantly outperform general multilingual models on social identity detection tasks
  2. Multi-label learning methods better capture the complexity of identity expression
  3. Survey-data-based annotation frameworks provide culturally-sensitive identity category selection methods
  4. Cross-platform analysis reveals important differences between elite discourse and public endorsement

Limitations

  1. Temporal and Platform Scope: Data limited to specific periods, not covering other platforms such as Twitter
  2. Survey Population Limitations: Includes only Jewish citizens, lacking representation of Arab citizens
  3. Annotation Granularity: Selection based on 5% threshold may miss important but lower-frequency identities
  4. Model Bias: Classifiers may inherit biases from training data and pre-trained models

Future Directions

  1. Extension to more platforms and time periods
  2. Inclusion of more diverse population samples
  3. Development of methods to reduce model bias
  4. Exploration of dynamic annotation for emerging identity categories

In-Depth Evaluation

Strengths

  1. Methodological Innovation: First integration of large-scale survey data with text annotation, providing a culturally-sensitive research framework
  2. Technical Contribution: Establishment of strong baselines on low-resource languages, demonstrating the importance of language-specific models
  3. Experimental Comprehensiveness: Coverage of multiple model types, cross-domain evaluation, and external validation
  4. Social Science Value: Provision of deep insights into political discourse and identity dynamics

Weaknesses

  1. Data Representativeness: Survey sample limitations may affect the generalizability of identity categories
  2. Annotation Consistency: Relatively low κ values for certain categories (e.g., Conservative: 0.705)
  3. Evaluation Scope: Cross-domain evaluation based on only 500 samples may be insufficient

Impact

  1. Academic Value: Provision of important resources for computational social science and multilingual NLP
  2. Practical Value: Applicable to political communication analysis, opinion monitoring, and other applications
  3. Methodological Contribution: Provides a template for similar research in other non-English political contexts

Applicable Scenarios

  • Political communication research
  • Social identity analysis
  • Multilingual sentiment analysis
  • Political discourse monitoring
  • Cross-cultural comparative research

References

This paper cites literature from several fields, including social identity theory, computational linguistics, and political communication studies. Its theoretical foundation is Tajfel and Turner's (1979) "An Integrative Theory of Intergroup Conflict", and it also draws on recent NLP research on group mention detection and frame analysis.


Overall Assessment: This is a high-quality interdisciplinary research project with significant contributions in methodology, technical implementation, and social science insights. The research fills a gap in Hebrew-language political text analysis and makes valuable contributions to the development of multilingual NLP and computational social science.