With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.
- Paper ID: 2510.08776
- Title: Measuring Moral LLM Responses in Multilingual Capacities
- Authors: Kimaya Basu, Savi Kolari, Allison Yu
- Classification: cs.CL cs.AI
- Publication Date: October 9, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.08776
With the widespread global deployment of large language models (LLMs), there is an increasing need to understand and guardrail their multilingual responses. This study evaluates frontier models and leading open-source models on five dimensions across low-resource and high-resource languages to measure LLM accuracy and consistency in multilingual contexts. Responses are graded on a five-point rubric by a judge LLM. GPT-5 achieves the best average performance in every category, while the other models are less consistent across languages and categories. Notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT-5 achieves the highest scores (averaging 3.56 and 4.73 respectively), while Gemini 2.5 Pro scores lowest (averaging 1.39 and 1.98 respectively).
This study addresses the following key questions:
- Multilingual Moral Consistency: Whether LLMs maintain consistent ethical and moral responses across different linguistic environments
- Language Sensitivity of Safety Mechanisms: The effectiveness of existing safety safeguards in non-English languages
- Cross-linguistic Bias and Stereotypes: Whether models exhibit varying degrees of bias across different languages
- Global Application Demands: LLMs are becoming daily tools for global users, necessitating cross-linguistic reliability assurance
- Safety Concerns: Research indicates that LLM safety mechanisms perform poorly in non-English languages and are vulnerable to malicious exploitation
- Cultural Differences in Moral Standards: Moral judgments may vary significantly across different linguistic and cultural backgrounds
- English-Centric Test Data: Existing benchmarks primarily focus on English environments
- Lack of Systematic Assessment: Absence of comprehensive evaluation frameworks across multiple moral dimensions
- Language Blind Spots in Safety Mechanisms: Existing research reveals vulnerabilities in safety protocols for low-resource languages
- Construction of Multi-dimensional Multilingual Moral Evaluation Dataset: Creation of a comprehensive evaluation dataset containing 5 categories, 100 questions each, covering 6 languages
- Systematic Cross-linguistic Moral Analysis: First systematic multilingual moral response analysis of mainstream LLMs
- Discovery of Safety Mechanism Vulnerabilities: Revelation of significant differences in LLM safety mechanisms across languages
- Establishment of Model Performance Benchmarks: Provision of benchmarks and evaluation frameworks for future multilingual moral AI research
Input: Moral and ethical questions across 6 languages (English, Chinese, Spanish, Arabic, Hindi, Swahili)
Output: Moral appropriateness scores of model responses (5-point scale)
Constraints: Responses must be within 50 words; moral judgment questions require explicit yes/no/maybe answers
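The two constraints above can be checked mechanically. The following is a minimal sketch, not the paper's code: it assumes words are whitespace-separated tokens (which would not suit unsegmented scripts such as Chinese, so it fits English or back-translated text best) and that the explicit answer is the first standalone yes/no/maybe in the response. All helper names are hypothetical.

```python
import re
from typing import Optional

MAX_WORDS = 50  # word limit stated in the task constraints


def within_word_limit(response: str, limit: int = MAX_WORDS) -> bool:
    """Return True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def extract_verdict(response: str) -> Optional[str]:
    """Return the first standalone yes/no/maybe in the response, or None."""
    match = re.search(r"\b(yes|no|maybe)\b", response.lower())
    return match.group(1) if match else None


def check_response(response: str, needs_verdict: bool) -> dict:
    """Hypothetical helper: report which constraints a response satisfies."""
    within = within_word_limit(response)
    verdict = extract_verdict(response) if needs_verdict else None
    return {
        "within_limit": within,
        "verdict": verdict,
        # A response is valid if it is short enough and, when a verdict
        # is required, actually contains one.
        "valid": within and (verdict is not None or not needs_verdict),
    }
```

Only moral judgment questions would be checked with `needs_verdict=True`; the other categories would only need the length check.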
The research team constructed a dataset containing 500 English questions divided into five categories:
- Biases & Stereotypes: Tests whether models perpetuate harmful biases
- Consent & Autonomy: Tests safety mechanisms using deceptive prompts
- Harm Prevention & Safety: Tests safety protocols through indirect questions
- Legality: Tests model handling of regional legal differences
- Moral Judgment: Assesses model adherence to core moral principles
- High-resource Languages: Chinese, Spanish, Arabic
- Low-resource Languages: Hindi, Swahili
- Selection Criteria: Linguistic structure diversity, cultural background differences, writing system variations
- State-of-the-art Models: GPT-5, Gemini 2.5 Pro, Claude Sonnet 4
- Open-source Models: Llama 4 Scout, Qwen3 235B-A22B
- Selection Rationale: Training data, application objectives, regional characteristics, openness
- Judge Model: Gemini 2.5 Pro as primary judge
- Scoring Criteria: 5-point scale considering answer accuracy and reasoning quality
- Consistency Verification: Cross-validation using GPT-5 and Qwen3
- Translation of English questions to target languages (using Googletrans)
- Model generation of responses in target language
- Translation of responses back to English for evaluation
- Scoring based on category-specific rubrics
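The four pipeline steps above can be sketched as a single function. This is an illustrative outline under stated assumptions, not the authors' implementation: the translator, the model under test, and the judge are passed in as plain callables (hypothetical signatures), so no real translation or LLM API is assumed. Skipping translation for English is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evaluation:
    language: str
    category: str
    question_en: str
    response_en: str
    score: int


def evaluate_question(
    question_en: str,
    category: str,
    language: str,
    translate: Callable[[str, str, str], str],  # (text, src, dest) -> text
    generate: Callable[[str], str],             # prompt -> model response
    judge: Callable[[str, str, str], int],      # (category, question, response) -> 1..5
) -> Evaluation:
    # Step 1: translate the English question into the target language.
    prompt = question_en if language == "en" else translate(question_en, "en", language)
    # Step 2: the model under test answers in the target language.
    response = generate(prompt)
    # Step 3: translate the response back to English for grading.
    response_en = response if language == "en" else translate(response, language, "en")
    # Step 4: the judge scores against the category-specific rubric.
    score = judge(category, question_en, response_en)
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned an out-of-range score: {score}")
    return Evaluation(language, category, question_en, response_en, score)
```

Passing the components in as callables keeps the sketch independent of any particular translation package or model provider, and makes it straightforward to swap in a second judge for cross-validation.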
- Total Questions: 500 original English questions
- Language Coverage: 6 languages × 500 questions = 3,000 test samples
- Category Distribution: 100 questions per category, uniformly distributed
- Translation Tool: Googletrans Python package
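The sample count above follows directly from the design: 5 categories × 100 questions = 500 questions, repeated in 6 languages. A small sketch that enumerates the grid, using the language and category names from this summary (the function name is hypothetical):

```python
from itertools import product

LANGUAGES = ["English", "Chinese", "Spanish", "Arabic", "Hindi", "Swahili"]
CATEGORIES = [
    "Biases & Stereotypes",
    "Consent & Autonomy",
    "Harm Prevention & Safety",
    "Legality",
    "Moral Judgment",
]
QUESTIONS_PER_CATEGORY = 100


def build_sample_grid():
    """Enumerate every (language, category, question index) test sample."""
    return list(product(LANGUAGES, CATEGORIES, range(QUESTIONS_PER_CATEGORY)))


grid = build_sample_grid()
print(len(grid))  # 6 languages x 5 categories x 100 questions
```

If each of the five evaluated models answers every sample, which this summary implies but does not state explicitly, the number of graded responses would be five times the grid size.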
- Primary Metric: 5-point scale scoring (1=worst, 5=best)
- Category-specific Metrics: Specialized scoring criteria for each moral category
- Consistency Measurement: Standard deviation analysis of cross-linguistic responses
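One plausible reading of the consistency measurement above is the spread of a model's per-language mean scores. The sketch below assumes that reading and uses the population standard deviation; whether the paper uses the population or sample form, or aggregates differently, is not stated here. The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def cross_lingual_consistency(scores_by_language: Dict[str, List[float]]) -> Dict[str, float]:
    """Summarize one model's scores: overall mean and spread across languages.

    A lower standard deviation means the model scores more uniformly
    across languages.
    """
    language_means = [mean(scores) for scores in scores_by_language.values()]
    return {
        "mean": mean(language_means),
        "std": pstdev(language_means),  # population form; an assumption
    }
```

For example, calling it with `{"English": [4, 4], "Swahili": [2, 2]}` (placeholder values, not results from the paper) yields per-language means of 4 and 2, so a mean of 3 and a standard deviation of 1.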
- Temperature Setting: 0.7 (to reduce random variation)
- Response Limit: Within 50 words
- System Prompt: Unified instruction format
- GPT-5: Average score 92%, best performance across all categories
- Claude Sonnet 4: Stable performance, excellent in safety categories
- Gemini 2.5 Pro: Excellent in direct categories (bias, legality, moral judgment), poor in safety categories
- Llama 4 Scout: Moderate performance
- Qwen3 235B: Average score 66%, worst overall performance
Significant Differences in Safety Categories:
- Consent & Autonomy Category: GPT-5 (3.56) vs Gemini 2.5 Pro (1.39)
- Harm Prevention & Safety Category: GPT-5 (4.73) vs Gemini 2.5 Pro (1.98)
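To make the size of these gaps concrete, the reported averages can be subtracted directly. The figures below are the four averages quoted above; the function name is hypothetical.

```python
REPORTED_AVERAGES = {
    "Consent & Autonomy": {"GPT-5": 3.56, "Gemini 2.5 Pro": 1.39},
    "Harm Prevention & Safety": {"GPT-5": 4.73, "Gemini 2.5 Pro": 1.98},
}


def score_gaps(reported):
    """Return the GPT-5 minus Gemini 2.5 Pro gap per category, in scale points."""
    return {
        category: round(scores["GPT-5"] - scores["Gemini 2.5 Pro"], 2)
        for category, scores in reported.items()
    }


print(score_gaps(REPORTED_AVERAGES))
```

The gaps come to 2.17 points for Consent & Autonomy and 2.75 points for Harm Prevention & Safety, each more than half the width of the four-point span between the lowest (1) and highest (5) possible score.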
Impact of Language Resource Level:
- Models score higher in low-resource languages on deceptive question categories
- Models are more easily "deceived" into providing harmful information in high-resource languages
Model-Specific Performance:
- Gemini 2.5 Pro: Excellent in direct categories (bias, legality, moral judgment), extremely poor in indirect categories
- Qwen3: Exhibits clear regional bias in Chinese legal questions
- Random sampling validation of translation accuracy
- Score variance controlled within 1 point
- Cross-validation ensures assessment consistency
- As a judge, Gemini 2.5 Pro shows no obvious bias toward its own responses
- As a judge, Qwen3 scores on average 0.5 points lower than the primary judge
- As a judge, GPT-5 scores on average 0.6 points higher than the primary judge
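Judge offsets like those above can be computed by scoring the same responses with two judges and averaging the differences. This is a minimal sketch assuming paired scores on an identical set of responses; it is not the paper's procedure, and the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def judge_offset(primary: Sequence[float], secondary: Sequence[float]) -> float:
    """Mean of (secondary - primary) over the same responses.

    Positive means the secondary judge is more lenient than the primary one.
    """
    if not primary or len(primary) != len(secondary):
        raise ValueError("both judges must score the same non-empty set of responses")
    return mean(s - p for p, s in zip(primary, secondary))


def max_disagreement(primary: Sequence[float], secondary: Sequence[float]) -> float:
    """Largest absolute per-response difference between the two judges."""
    if not primary or len(primary) != len(secondary):
        raise ValueError("both judges must score the same non-empty set of responses")
    return max(abs(s - p) for p, s in zip(primary, secondary))
```

The mean offset captures systematic leniency or strictness, while the maximum disagreement flags individual responses where the judges diverge sharply even if their averages agree.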
The paper provides typical response examples demonstrating:
- GPT-5 refuses to provide harmful information on safety questions
- Gemini 2.5 Pro is successfully "deceived" on certain deceptive questions
- Qwen3 exhibits China-law-oriented bias on legal questions
- Psychological Tool Adaptation: Application of psychological instruments such as the Defining Issues Test (DIT) to LLMs
- Philosophical Framework Analysis: Assessment of utilitarian vs. deontological moral reasoning
- Limitations: Existing methods have limited scope and lack cross-linguistic perspectives
- Reasoning Ability Testing: Cross-linguistic testing of moral dilemmas such as the trolley problem
- Factual Accuracy: Consistency of factual responses across languages
- Performance Disparities: Superior performance in high-resource languages compared to low-resource languages
- Jailbreak Attacks: Circumventing safety mechanisms through non-English languages
- Large-scale Benchmarks: Safety performance testing across 100+ languages
- Vulnerability Discovery: Safety protocol vulnerabilities in low-resource languages
- Significant Inter-model Differences: GPT-5 significantly outperforms other models in moral and safety responses
- Language Sensitivity: All models exhibit varying degrees of performance degradation in non-English languages
- Safety Mechanism Vulnerabilities: Success rates of deceptive questions vary significantly across languages
- Regional Bias Exists: Certain models exhibit clear regional legal bias
- Translation Dependency: Reliance on Google Translate may introduce errors
- Lack of Human Baseline: No collection of human responses for comparison
- Scale Subjectivity: Evaluation scales may not fully reflect societal values
- Limited Language Coverage: Only 6 languages tested, limited representativeness
- Expanded Language Coverage: Extension to all languages supported by Google Translate
- Human Baseline Establishment: Collection of human responses from diverse cultural backgrounds
- Wording Effect Research: In-depth investigation of question formulation effects on responses
- Safety Mechanism Improvement: Enhancement of multilingual safety protocols based on discovered vulnerabilities
- Significant Research Value: First systematic evaluation of LLM cross-linguistic moral responses, filling an important research gap
- Rigorous Methodology: Adoption of comprehensive evaluation framework spanning multiple models, languages, and dimensions
- Practical Value of Findings: Revealed safety vulnerabilities provide important guidance for actual deployment
- Dataset Contribution: Constructed multilingual moral evaluation dataset provides benchmarks for subsequent research
- Translation Quality Control: Over-reliance on machine translation may affect result reliability
- Insufficient Cultural Context Consideration: Inadequate consideration of moral standard differences across cultural backgrounds
- Sample Size Limitations: Only 100 questions per category may be insufficient to cover complex moral scenarios
- Single Evaluation Standard: Primary reliance on single LLM judge may introduce systematic bias
- Academic Contribution: Establishes new research paradigm for multilingual AI ethics research
- Practical Value: Provides important risk assessment tools for AI safety deployment
- Policy Impact: Research results can inform AI governance and regulatory policy development
- Technical Advancement: Promotes development of multilingual AI safety technologies
- AI Safety Assessment: LLM safety evaluation for enterprises and research institutions
- Multilingual AI Deployment: Guidance for risk control in cross-linguistic AI applications
- Regulatory Compliance: Assistance for regulatory bodies in establishing AI ethics standards
- Academic Research: Foundation for AI ethics and multilingual NLP research
This paper cites multiple important related studies:
- Achiam et al. (2023) - GPT-4 Technical Report
- Jin et al. (2024) - Multilingual Trolley Problem Research
- Fu and Liu (2025) - Multilingual LLM Judge Reliability Research
- Lin et al. (2025) - LLM Jailbreak Attacks via Safety Papers
- Zheng et al. (2023) - LLM-as-a-Judge Evaluation Methodology
Overall Assessment: This is groundbreaking research that systematically reveals important issues in current LLMs' multilingual moral responses. Despite certain methodological limitations, its findings hold significant theoretical and practical value for AI safety and multilingual AI development. This research establishes an important foundation for future multilingual AI ethics research.