Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data

Chen, Zhang, Lin et al.
Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.

Basic Information

  • Paper ID: 2510.10677
  • Title: Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
  • Authors: Zhuowei Chen, Bowei Zhang, Nankai Lin, Tian Hou, Lianxi Wang
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10677

Abstract

As the capabilities of large language models (LLMs) advance, the risks of malicious requests increase, highlighting the need for effective LLM safeguard detection systems. Existing methods rely primarily on classifier-based approaches that lack interpretability and perform poorly on low-resource languages. To address these limitations, this paper proposes ConsistentGuard, a novel reasoning-based multilingual safeguard system that enhances interpretability through reasoning and facilitates cross-lingual knowledge transfer through alignment. Using only 1,000 training samples, the method demonstrates superior performance across six languages on three datasets, surpassing larger models trained on substantially more data, while exhibiting strong interpretability and generalization capabilities.

Research Background and Motivation

Problem Definition

  1. Core Problem: Existing LLM safeguard methods show significant performance degradation on low-resource languages and lack interpretability
  2. Significance: With the proliferation of LLM applications, the demand for multilingual safety protection is increasingly urgent
  3. Limitations of Existing Methods:
    • Classifier-based approaches lack interpretability and evidentiary support
    • Substantial performance decline on low-resource languages (e.g., Bengali)
    • Neglect of cross-lingual reasoning consistency issues
  4. Research Motivation: Construct a safeguard framework that possesses reasoning capabilities while maintaining consistency across multiple languages

Core Contributions

  1. Propose ConsistentGuard Framework: A reasoning-based multilingual safeguard training framework that enhances interpretability, effectiveness, and cross-lingual generalization
  2. Design CAO Algorithm: Introduce Constrained Alignment Optimization to address cross-lingual reasoning inconsistency
  3. Achieve Data-Efficient Training: Attain excellent performance across six languages on three datasets using only 1,000 training samples
  4. Construct Multilingual Benchmark: Extend existing English safety benchmarks to six languages and open-source code and data

Methodology Details

Task Definition

  • Input: User query text (multiple languages)
  • Output: Safety judgment (harmful/non-harmful) + reasoning process + violation category
  • Constraints: Maintain cross-lingual reasoning consistency and provide interpretable judgment rationale
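
The task definition above can be pictured as a small record type. The sketch below is illustrative only: the class name `SafeguardVerdict`, its field names, and the example category string are assumptions made for this summary, not the paper's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SafeguardVerdict:
    """Hypothetical container for one safeguard decision (all names are assumptions)."""

    reasoning: str                   # free-text reasoning trace
    harmful: bool                    # binary safety judgment
    category: Optional[str] = None   # violation category, only set when harmful

    def __post_init__(self) -> None:
        # A non-harmful verdict should not carry a violation category.
        if not self.harmful and self.category is not None:
            raise ValueError("category must be None when harmful is False")


verdict = SafeguardVerdict(
    reasoning="The query asks for instructions to build a weapon.",
    harmful=True,
    category="violence",  # hypothetical category label
)
```

Keeping the reasoning text next to the label is what lets a reviewer check why a query was flagged, which is the interpretability property the task definition asks for.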

Model Architecture

ConsistentGuard employs a three-stage training framework:

1. Cold Start Stage

  • Objective: Knowledge distillation through supervised fine-tuning (SFT)
  • Method: Use DeepSeek V3 671B as teacher model to generate training data with three-step reasoning:
    • Understanding: Comprehend dialogue content
    • Rule Matching: Match relevant judgment principles
    • Judgment: Analyze whether principles are violated
  • Data Construction: Randomly sample 1,000 examples from four English safety datasets
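
The data construction step can be sketched as pooling the four sources and drawing a fixed-size random subset. This is a minimal sketch under stated assumptions: the summary only says the examples are randomly sampled, so uniform sampling over the pooled set, the seed, the function name `sample_training_pool`, and the toy placeholder strings are all hypothetical.

```python
import random
from typing import Dict, List


def sample_training_pool(
    datasets: Dict[str, List[str]], n: int = 1000, seed: int = 0
) -> List[str]:
    """Pool all examples and draw n of them uniformly without replacement."""
    pooled = [example for name in sorted(datasets) for example in datasets[name]]
    if len(pooled) < n:
        raise ValueError(f"need at least {n} examples, got {len(pooled)}")
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(pooled, n)


# Placeholder strings standing in for the four English source datasets.
toy_datasets = {
    name: [f"{name}-{i}" for i in range(400)]
    for name in ("Aegis", "BeaverTails", "ToxicChat", "WildGuard")
}
subset = sample_training_pool(toy_datasets)
```

Each sampled query would then be sent to the teacher model to obtain the three-step reasoning trace used for SFT; that call is omitted here because the prompt format is not given in this summary.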

2. Reasoning Training Stage

  • Algorithm: Group Relative Policy Optimization (GRPO)
  • Reward Function Design:
\[ r = \sin\left(\frac{L}{2 L_{\text{best}}}\,\pi\right) + \left[\sin\left(\frac{p - 2}{2}\,\pi\right) + 1\right] \]

where $L$ is reasoning length, $L_{\text{best}}$ is optimal length (set to 512), and $p$ is triplet repetition rate

  • Reward Components:
    • Accuracy reward: Correctness of judgment
    • Format reward: Output format compliance
    • Length reward: Stable reasoning length control
    • Diversity reward: Prevent length reward exploitation
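
To make the shaping terms concrete, the sketch below evaluates the two sine terms of the reward formula and then applies the group-relative normalization that GRPO uses to turn rewards into advantages. Several pieces are assumptions: the summary does not define the triplet repetition rate, so `triplet_repetition_rate` uses one plausible reading (the share of token triplets that duplicate another triplet), and the accuracy and format rewards are left out because their exact form is not stated here.

```python
import math
from typing import List, Sequence


def length_reward(length: int, best_length: int = 512) -> float:
    """sin(L / (2 * L_best) * pi): equals 1.0 when L == L_best, 0.0 at L == 0."""
    return math.sin(length / (2 * best_length) * math.pi)


def diversity_reward(repetition_rate: float) -> float:
    """sin((p - 2) / 2 * pi) + 1: equals 1.0 at p = 0 and falls to 0.0 at p = 1."""
    return math.sin((repetition_rate - 2) / 2 * math.pi) + 1


def triplet_repetition_rate(tokens: Sequence[str]) -> float:
    """Assumed definition: fraction of token triplets that are duplicates."""
    triplets = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not triplets:
        return 0.0
    return 1 - len(set(triplets)) / len(triplets)


def shaping_reward(tokens: Sequence[str], best_length: int = 512) -> float:
    """Sum of the length and diversity terms from the formula above."""
    p = triplet_repetition_rate(tokens)
    return length_reward(len(tokens), best_length) + diversity_reward(p)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style normalization: (r - mean) / std within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

The pairing matters because a length term alone could be satisfied by repeating text until the target length is reached; under the assumed repetition measure, such padding raises p and lowers the second term.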

3. Cross-lingual Alignment Stage

  • Algorithm: Constrained Alignment Optimization (CAO)
  • Data Construction:
    • Translate the English data into the 5 other evaluation languages (French, Chinese, Japanese, Bengali, Hindi)
    • Construct failure and success sets
    • Synthesize alignment samples: failure input + success output + anchor sample
  • Optimization Objective:
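\[ \mathcal{L}_{\text{CAO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(p_w \mid q)}{\pi_{\text{ref}}(p_w \mid q)} - \beta \log \frac{\pi_\theta(p_l \mid q)}{\pi_{\text{ref}}(p_l \mid q)}\right)\right] \]
\[ \mathcal{L}_c = D_{\text{KL}}\left[\pi_\theta(q_a \oplus p_a) \,\|\, \pi_{\text{ref}}(q_a \oplus p_a)\right] \]
\[ \mathcal{L} = \mathcal{L}_{\text{CAO}} + \mathcal{L}_c \]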
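
The objective above has a DPO-style preference term plus a KL regularizer on an anchor sample. The following is a minimal numeric sketch of that structure, not the authors' implementation: it takes precomputed sequence log-probabilities for the preferred output p_w and the rejected output p_l, and it replaces the sequence-level KL with a KL between two small discrete distributions as a toy stand-in. The function names and the default `beta=0.1` are assumptions.

```python
import math
from typing import Sequence


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(
    logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Preference term for one (q, p_w, p_l) triple, given log-probabilities."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return neg_log_sigmoid(margin)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions on the same support (q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cao_loss(
    logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
    anchor_policy: Sequence[float], anchor_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Total loss: preference term plus the anchor KL regularizer."""
    pref = preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return pref + kl_divergence(anchor_policy, anchor_ref)
```

When the policy equals the reference model, the margin is zero, the preference term equals log 2, and the KL term vanishes. Any drift on the anchor sample then adds a positive penalty, which is how the regularizer is meant to hold high-resource behavior in place while low-resource outputs are realigned.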

Technical Innovations

  1. Dual Reward Mechanism: Pairs a length reward with a diversity reward so that reasoning length stays controlled, the length reward cannot be exploited through repetition, and overly long reasoning does not hurt efficiency
  2. Constrained Alignment Optimization: Constrain optimization direction through global regularization term, preventing performance degradation in high-resource languages
  3. Three-Stage Progressive Training: Systematic approach from knowledge distillation to reasoning enhancement to cross-lingual alignment
  4. Data-Efficient Design: Achieve performance comparable to large-scale trained models using only 1,000 samples

Experimental Setup

Datasets

  • Training Data: 1,000 examples randomly sampled from a mix of four open-source safety datasets
    • Aegis, BeaverTails, ToxicChat, WildGuard
  • Evaluation Datasets: Three widely-used safety benchmarks
    • OpenAI Moderation
    • ToxicChat
    • SimpleSafetyTests
  • Language Coverage: English, French, Chinese, Japanese, Bengali, Hindi

Evaluation Metrics

  • Primary Metric: Macro-averaged F1 score
  • Auxiliary Analysis: Interpretability evaluation, cross-lingual consistency analysis
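
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so the harmful and non-harmful classes count equally even when one is rare. The sketch below is a plain-Python illustration with made-up labels; the label strings are assumptions, and the paper's scores appear to be reported on a 0 to 100 scale, which corresponds to multiplying this value by 100.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["harmful", "harmful", "safe", "safe"]
y_pred = ["harmful", "safe", "safe", "safe"]
score = macro_f1(y_true, y_pred)
```

In this toy case the harmful class has F1 = 2/3 and the safe class has F1 = 4/5, so the macro average is their mean, about 0.733.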

Baseline Methods

  • Llama Guard 3 (1B/8B)
  • ShieldGemma (2B/9B)
  • GuardReasoner (3B)

Implementation Details

  • Base Model: Qwen2.5-3B
  • Hardware: Two NVIDIA A100 40G GPUs
  • Optimal Reasoning Length: 512 tokens
  • Training Samples: Only 1,000 English examples

Experimental Results

Main Results

On OpenAI Moderation dataset:

  • English: 78.94 (second place, only behind Llama Guard 3 8B's 79.69)
  • Low-Resource Language Performance:
    • Bengali: 72.10 (6.84 points below English; surpasses multiple baselines)
    • Hindi: 73.26 (5.68 points below English)

On ToxicChat dataset:

  • English: 84.26 (comparable to GuardReasoner)
  • Cross-lingual Stability: Small performance gaps across languages

Ablation Studies

Reasoning Training Ablation

  • SFT baseline vs. reasoning training: Reasoning training brings significant improvements across all languages
  • Dual reward mechanism effectiveness: R1-GRPO outperforms standard GRPO

Alignment Method Ablation

  • CAO vs. DPO: CAO brings performance improvements on most languages, while DPO shows unstable results
  • CAO demonstrates more pronounced improvements on low-resource languages

Key Findings

  1. Data Efficiency: Achieves performance comparable to models trained with 127,600 samples while using only 1,000 samples (under 1% of that amount)
  2. Cross-lingual Generalization: Reasoning training significantly enhances cross-lingual generalization capability
  3. Alignment Effectiveness: CAO effectively narrows performance gaps between languages, particularly for low-resource languages
  4. Interpretability: Model provides detailed reasoning process, explaining violation reasons and relevant principles

Related Work

LLM Safeguards

  • Existing methods primarily based on classifiers (Llama Guard, ShieldGemma)
  • Lack interpretability and cross-lingual capabilities
  • The paper presents itself as the first to systematically address multilingual safeguard problems

Reasoning-Enhanced Training

  • Built upon CoT, self-improvement, and other methods
  • Optimizes reasoning length and diversity for safeguard tasks
  • Balances reasoning depth with response latency trade-offs

Cross-lingual Knowledge Generalization

  • Existing research primarily focuses on cross-lingual alignment for QA tasks
  • The paper presents itself as the first to apply cross-lingual alignment to safeguards
  • Proposes constrained optimization to prevent high-resource language performance degradation

Conclusions and Discussion

Main Conclusions

  1. Reasoning-enhanced multilingual safeguard framework significantly improves performance and interpretability
  2. Constrained alignment optimization effectively addresses cross-lingual reasoning inconsistency
  3. Data-efficient training strategies hold important value in resource-constrained scenarios
  4. Systematic three-stage training framework provides new paradigm for multilingual AI safety

Limitations

  1. Limited Language Coverage: Only validates six languages; generalization to other low-resource languages remains uncertain
  2. Model Scale Constraints: Only verified on 3B parameter models; effectiveness on larger models unknown
  3. Training Data Scale: 1,000 samples relatively small; effects of larger-scale data unexplored
  4. Evaluation Dimensions: Primarily focuses on classification accuracy; lacks comprehensive evaluation such as human preferences
  5. Explanation Quality: Difficult to assess reasoning explanation quality; lacks ground truth answers

Future Directions

  1. Extend to more low-resource languages and language families
  2. Validate method effectiveness on larger-scale models
  3. Develop automatic evaluation methods for reasoning explanation quality
  4. Explore safeguards for long-form text and dialogue scenarios

In-Depth Evaluation

Strengths

  1. Strong Problem Targeting: Directly addresses core pain points of existing methods on low-resource languages
  2. High Methodological Innovation:
    • First systematic solution to multilingual safeguard problems
    • Ingeniously designed constrained alignment optimization algorithm
    • Dual reward mechanism balancing multiple objectives
  3. Comprehensive Experimental Design:
    • Multi-dataset, multi-language validation
    • Detailed ablation studies
    • Comparison with multiple strong baselines
  4. High Practical Value: Data-efficient and easy to deploy
  5. Open-Source Contribution: Provides code and extended benchmarks

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
  2. Evaluation Limitations:
    • Relatively limited language coverage
    • Lacks quantitative evaluation of explanation quality
    • Does not consider cultural differences in safety standards
  3. Method Complexity: Three-stage training increases implementation complexity
  4. Benchmark Construction: Machine translation may introduce semantic deviations

Impact

  1. Academic Contribution: Opens new research direction for multilingual AI safety
  2. Practical Value: Provides safety protection solutions for globalized AI applications
  3. Reproducibility: Open-source code and data support subsequent research
  4. Inspirational Value: Reasoning + alignment framework extensible to other multilingual tasks

Applicable Scenarios

  1. Multilingual AI Services: Globalized dialogue systems and content generation platforms
  2. Resource-Constrained Environments: Small model deployment scenarios
  3. High-Safety-Requirement Applications: Systems requiring interpretable safeguards
  4. Cross-lingual Consistency Requirements: Multilingual platforms requiring unified safety standards

References

The paper cites extensive related work, primarily including:

  • LLM Safeguards: Llama Guard, ShieldGemma, GuardReasoner, etc.
  • Reasoning-Enhanced Methods: Chain-of-Thought, self-improvement, adversarial debate, etc.
  • Cross-lingual Methods: Multilingual pretraining, instruction tuning, direct preference optimization, etc.
  • Evaluation Benchmarks: OpenAI Moderation, ToxicChat, SimpleSafetyTests, etc.

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to the important and challenging problem of multilingual AI safety. The methodology is well-designed, experimental validation is comprehensive, and it possesses significant academic and practical value. Despite some limitations, it makes important contributions to the field's development.