Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data

Chen, Zhang, Lin et al.

Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our code to support future research.

Basic Information

Paper ID: 2510.10677
Title: Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
Authors: Zhuowei Chen, Bowei Zhang, Nankai Lin, Tian Hou, Lianxi Wang
Classification: cs.CL (Computational Linguistics)
Publication Date: October 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10677

Abstract

As the capabilities of large language models (LLMs) advance, the risks of malicious requests increase, highlighting the need for effective LLM safeguard detection systems. Existing methods rely primarily on classifier-based approaches that lack interpretability and perform poorly on low-resource languages. To address these limitations, this paper proposes ConsistentGuard, a novel reasoning-based multilingual safeguard system that enhances interpretability through reasoning and facilitates cross-lingual knowledge transfer through alignment. Using only 1,000 training samples, the method demonstrates superior performance across six languages on three datasets, surpassing larger models trained on substantially more data, while exhibiting strong interpretability and generalization capabilities.

Research Background and Motivation

Problem Definition

Core Problem: Existing LLM safeguard methods show significant performance degradation on low-resource languages and lack interpretability
Significance: With the proliferation of LLM applications, the demand for multilingual safety protection is increasingly urgent
Limitations of Existing Methods:
- Classifier-based approaches lack interpretability and evidentiary support
- Substantial performance decline on low-resource languages (e.g., Bengali)
- Neglect of cross-lingual reasoning consistency issues
Research Motivation: Construct a safeguard framework that possesses reasoning capabilities while maintaining consistency across multiple languages

Core Contributions

Propose ConsistentGuard Framework: A reasoning-based multilingual safeguard training framework that enhances interpretability, effectiveness, and cross-lingual generalization
Design CAO Algorithm: Introduce Constrained Alignment Optimization to address cross-lingual reasoning inconsistency
Achieve Data-Efficient Training: Attain excellent performance across six languages on three datasets using only 1,000 training samples
Construct Multilingual Benchmark: Extend existing English safety benchmarks to six languages and open-source code and data

Methodology Details

Task Definition

Input: User query text (multiple languages)
Output: Safety judgment (harmful/non-harmful) + reasoning process + violation category
Constraints: Maintain cross-lingual reasoning consistency and provide interpretable judgment rationale
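
The task interface described above can be pictured as a small record type. The sketch below is illustrative only: the class name `SafeguardVerdict`, its field names, and the well-formedness rule are assumptions made for this example, not the paper's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafeguardVerdict:
    """Hypothetical container for one safeguard decision (names are illustrative)."""
    query: str               # user query, in any supported language
    reasoning: str           # free-text rationale produced before the verdict
    harmful: bool            # safety judgment
    category: Optional[str]  # violation category, None when the query is non-harmful

    def is_well_formed(self) -> bool:
        # Assumed rule: a verdict needs non-empty reasoning, and a category
        # is present exactly when the query is judged harmful.
        has_reasoning = bool(self.reasoning.strip())
        return has_reasoning and (self.harmful == (self.category is not None))
```

A check like `is_well_formed` is one simple way to reject outputs that give a verdict without a rationale.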

Model Architecture

ConsistentGuard employs a three-stage training framework:

1. Cold Start Stage

Objective: Knowledge distillation through supervised fine-tuning (SFT)
Method: Use DeepSeek V3 671B as teacher model to generate training data with three-step reasoning:
- Understanding: Comprehend dialogue content
- Rule Matching: Match relevant judgment principles
- Judgment: Analyze whether principles are violated
Data Construction: Randomly sample 1,000 examples from four English safety datasets
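
As a rough illustration of the cold start data construction, the sketch below builds a three-step teacher prompt and draws 1,000 examples from several pooled datasets. The prompt wording, the function names, and the uniform pooled sampling scheme are all assumptions; the summary only states that 1,000 examples are randomly sampled from four English datasets.

```python
import random
from typing import Dict, List, Tuple

# The three reasoning steps named in the summary.
REASONING_STEPS = ("Understanding", "Rule Matching", "Judgment")


def build_teacher_prompt(dialogue: str, principles: List[str]) -> str:
    """Assemble a hypothetical teacher prompt that asks for three-step reasoning."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(principles, start=1))
    steps = "\n".join(
        f"Step {i} ({name}):" for i, name in enumerate(REASONING_STEPS, start=1)
    )
    return (
        f"Principles:\n{rules}\n\nDialogue:\n{dialogue}\n\n"
        f"Answer using these steps:\n{steps}"
    )


def sample_cold_start(
    pools: Dict[str, List[str]], n: int = 1000, seed: int = 0
) -> List[Tuple[str, str]]:
    """Pool every dataset and draw n examples uniformly without replacement.

    Uniform sampling over the merged pool is an assumption of this sketch.
    """
    merged = [(name, ex) for name, examples in pools.items() for ex in examples]
    return random.Random(seed).sample(merged, min(n, len(merged)))
```

Fixing the seed keeps the sampled subset reproducible across runs.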

2. Reasoning Training Stage

Algorithm: Group Relative Policy Optimization (GRPO)
Reward Function Design:

r = \sin\left(\frac{L}{2 L_{\text{best}}}\pi\right) + \left[\sin\left(\frac{p-2}{2}\pi\right) + 1\right]

where L is the reasoning length, L_{\text{best}} is the optimal length (set to 512), and p is the triplet repetition rate
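
To see how the two terms of this reward behave, the formula can be transcribed directly. This is a minimal sketch: the function names are made up for illustration, and treating p as a rate in [0, 1] is an assumption.

```python
import math


def length_reward(length: int, best_length: int = 512) -> float:
    """First term: peaks at 1 when length == best_length, falls to about 0 at twice that."""
    return math.sin(length / (2 * best_length) * math.pi)


def diversity_reward(repetition_rate: float) -> float:
    """Second term: about 1 with no repetition (p = 0), about 0 at full repetition (p = 1)."""
    return math.sin((repetition_rate - 2) / 2 * math.pi) + 1


def shaped_reward(length: int, repetition_rate: float, best_length: int = 512) -> float:
    """Sum of the length and diversity terms, as in the formula above."""
    return length_reward(length, best_length) + diversity_reward(repetition_rate)
```

Under these assumptions, a 512-token rationale with no repetition scores close to 2, while padding the output to 1,024 tokens with repeated text drives both terms toward 0, which is how the diversity term discourages gaming the length term.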

Reward Components:
- Accuracy reward: Correctness of judgment
- Format reward: Output format compliance
- Length reward: Stable reasoning length control
- Diversity reward: Prevent length reward exploitation
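
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group normalization step applied to a combined reward. Summing the four components with equal weight is an assumption, as is the use of the population standard deviation; implementations differ on both.

```python
import statistics
from typing import List


def total_reward(accuracy: float, fmt: float, length: float, diversity: float) -> float:
    """Combine the four reward components (unweighted sum is an assumption)."""
    return accuracy + fmt + length + diversity


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value model is needed.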

3. Cross-lingual Alignment Stage

Algorithm: Constrained Alignment Optimization (CAO)
Data Construction:
- Translate English data to 5 languages
- Construct failure and success sets
- Synthesize alignment samples: failure input + success output + anchor sample
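
One possible reading of this data construction is sketched below: for each underlying example, translations the model answered incorrectly supply the input, a translation it answered correctly supplies the preferred output, and that same correct pair is kept as the anchor. All type and function names are hypothetical, and the pairing rule is an interpretation of the summary rather than the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    example_id: str   # shared across translations of the same example
    language: str
    query: str
    response: str
    correct: bool     # whether the model's judgment matched the label


@dataclass
class AlignmentSample:
    query: str            # failure input
    chosen: str           # success output
    rejected: str         # the model's own failed output
    anchor_query: str     # anchor sample (query part)
    anchor_response: str  # anchor sample (response part)


def build_alignment_samples(attempts: List[Attempt]) -> List[AlignmentSample]:
    by_id: Dict[str, List[Attempt]] = {}
    for attempt in attempts:
        by_id.setdefault(attempt.example_id, []).append(attempt)

    samples = []
    for group in by_id.values():
        successes = [a for a in group if a.correct]
        failures = [a for a in group if not a.correct]
        if not successes or not failures:
            continue  # nothing to align when all languages agree
        anchor = successes[0]  # assumption: first success serves as the anchor
        for failure in failures:
            samples.append(AlignmentSample(
                failure.query, anchor.response, failure.response,
                anchor.query, anchor.response,
            ))
    return samples
```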
Optimization Objective:

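\mathcal{L}_{\mathrm{CAO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(p_w \mid q)}{\pi_{\mathrm{ref}}(p_w \mid q)} - \beta \log \frac{\pi_\theta(p_l \mid q)}{\pi_{\mathrm{ref}}(p_l \mid q)}\right)\right]
\mathcal{L}_{c} = D_{\mathrm{KL}}\left[\pi_\theta(q_a \oplus p_a) \,\|\, \pi_{\mathrm{ref}}(q_a \oplus p_a)\right]
\mathcal{L} = \mathcal{L}_{\mathrm{CAO}} + \mathcal{L}_{c}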
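
The objective above can be evaluated for a single sample as follows. This is a sketch under stated assumptions: the inputs are sequence log-probabilities of the preferred and rejected outputs under the policy and the reference model, the KL term is approximated as a mean token-level KL over the anchor sequence, and the default `beta = 0.1` is a placeholder rather than a value reported in the summary.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float,
                    beta: float = 0.1) -> float:
    """DPO-style term: -log sigma(beta * (log-ratio of winner - log-ratio of loser))."""
    margin = beta * (pol_w - ref_w) - beta * (pol_l - ref_l)
    return -log_sigmoid(margin)


def token_level_kl(policy_dists: List[List[float]], ref_dists: List[List[float]]) -> float:
    """Mean KL(policy || ref) over anchor token positions (reference probabilities assumed > 0)."""
    per_token = [
        sum(p * math.log(p / q) for p, q in zip(pd, rd) if p > 0)
        for pd, rd in zip(policy_dists, ref_dists)
    ]
    return sum(per_token) / len(per_token)


def cao_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float,
             policy_dists: List[List[float]], ref_dists: List[List[float]],
             beta: float = 0.1) -> float:
    """Preference term plus the KL regularizer on the anchor sample."""
    return preference_loss(pol_w, ref_w, pol_l, ref_l, beta) + token_level_kl(
        policy_dists, ref_dists
    )
```

When the policy equals the reference model, the preference term is log 2 and the KL term is 0, so any drift on the anchor sample shows up as an added penalty.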

Technical Innovations

Dual Reward Mechanism: Pairs a length reward with a diversity reward, keeping reasoning from growing long enough to hurt efficiency while preventing the length reward from being exploited
Constrained Alignment Optimization: Constrain optimization direction through global regularization term, preventing performance degradation in high-resource languages
Three-Stage Progressive Training: Systematic approach from knowledge distillation to reasoning enhancement to cross-lingual alignment
Data-Efficient Design: Achieve performance comparable to large-scale trained models using only 1,000 samples

Experimental Setup

Datasets

Training Data: 1,000 examples randomly sampled from a mix of four open-source safety datasets
- Aegis, BeaverTails, ToxicChat, WildGuard
Evaluation Datasets: Three widely-used safety benchmarks
- OpenAI Moderation
- ToxicChat
- SimpleSafetyTests
Language Coverage: English, French, Chinese, Japanese, Bengali, Hindi

Evaluation Metrics

Primary Metric: Macro-averaged F1 score
Auxiliary Analysis: Interpretability evaluation, cross-lingual consistency analysis
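
For reference, macro-averaged F1 computes an F1 score per class and then takes the unweighted mean, so the harmful and non-harmful classes count equally even when one is rare. The sketch below is a plain-Python version; the function name is illustrative, and the paper's scores appear to be reported on a 0 to 100 scale, whereas this returns a value in [0, 1].

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, the per-class F1 scores are 0.8 (class 0) and 2/3 (class 1), giving a macro F1 of about 0.733.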

Baseline Methods

Llama Guard 3 (1B/8B)
ShieldGemma (2B/9B)
GuardReasoner (3B)

Implementation Details

Base Model: Qwen2.5-3B
Hardware: Two NVIDIA A100 40G GPUs
Optimal Reasoning Length: 512 tokens
Training Samples: Only 1,000 English examples
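
The settings reported in this summary can be gathered into one configuration record, as sketched below. The class and field names are hypothetical; only the values come from the text, and hyperparameters the summary does not mention (learning rate, batch size, and so on) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Settings stated in the summary; field names are illustrative."""
    base_model: str = "Qwen2.5-3B"
    num_gpus: int = 2
    gpu_type: str = "NVIDIA A100 40G"
    best_reasoning_length: int = 512   # tokens
    num_training_samples: int = 1000
    training_language: str = "English"
    eval_languages: Tuple[str, ...] = (
        "English", "French", "Chinese", "Japanese", "Bengali", "Hindi",
    )
```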

Experimental Results

Main Results

On OpenAI Moderation dataset:

English: 78.94 (second place, only behind Llama Guard 3 8B's 79.69)
Low-Resource Language Performance:
- Bengali: 72.10 (surpassing multiple baselines)
- Hindi: 73.26 (excellent performance)

On ToxicChat dataset:

English: 84.26 (comparable to GuardReasoner)
Cross-lingual Stability: Small performance gaps across languages

Ablation Studies

Reasoning Training Ablation

SFT baseline vs. reasoning training: Reasoning training brings significant improvements across all languages
Dual reward mechanism effectiveness: R1-GRPO outperforms standard GRPO

Alignment Method Ablation

CAO vs. DPO: CAO brings performance improvements on most languages, while DPO shows unstable results
CAO demonstrates more pronounced improvements on low-resource languages

Key Findings

Data Efficiency: Achieve comparable performance to models trained with 127,600 samples using only 1,000 samples
Cross-lingual Generalization: Reasoning training significantly enhances cross-lingual generalization capability
Alignment Effectiveness: CAO effectively narrows performance gaps between languages, particularly for low-resource languages
Interpretability: Model provides detailed reasoning process, explaining violation reasons and relevant principles
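
The data efficiency finding can be put in proportion with a short calculation. The sketch below uses only the two sample counts quoted above; the function name is illustrative.

```python
from typing import Tuple


def data_ratio(small: int, large: int) -> Tuple[float, float]:
    """Return (fraction of the larger set used, reduction factor)."""
    return small / large, large / small


fraction, reduction = data_ratio(1_000, 127_600)
summary = f"{fraction:.2%} of the data, {reduction:.1f}x fewer samples"
```

That is, 1,000 samples amount to roughly 0.78% of 127,600, or about 127.6 times fewer training examples.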

Related Work

LLM Safeguards

Existing methods are primarily classifier-based (Llama Guard, ShieldGemma)
These methods lack interpretability and cross-lingual capabilities
The paper positions itself as the first to systematically address multilingual safeguard problems

Reasoning-Enhanced Training

Built upon CoT, self-improvement, and other methods
Optimizes reasoning length and diversity for safeguard tasks
Balances reasoning depth with response latency trade-offs

Cross-lingual Knowledge Generalization

Existing research primarily focuses on cross-lingual alignment for QA tasks
The paper positions itself as the first to apply cross-lingual alignment to safeguards
Proposes constrained optimization to prevent high-resource language performance degradation

Conclusions and Discussion

Main Conclusions

Reasoning-enhanced multilingual safeguard framework significantly improves performance and interpretability
Constrained alignment optimization effectively addresses cross-lingual reasoning inconsistency
Data-efficient training strategies hold important value in resource-constrained scenarios
Systematic three-stage training framework provides new paradigm for multilingual AI safety

Limitations

Limited Language Coverage: Only validates six languages; generalization to other low-resource languages remains uncertain
Model Scale Constraints: Only verified on 3B parameter models; effectiveness on larger models unknown
Training Data Scale: The 1,000-sample training set is relatively small; the effects of larger-scale data remain unexplored
Evaluation Dimensions: Primarily focuses on classification accuracy; lacks comprehensive evaluation such as human preferences
Explanation Quality: Difficult to assess reasoning explanation quality; lacks ground truth answers

Future Directions

Extend to more low-resource languages and language families
Validate method effectiveness on larger-scale models
Develop automatic evaluation methods for reasoning explanation quality
Explore safeguards for long-form text and dialogue scenarios

In-Depth Evaluation

Strengths

Strong Problem Targeting: Directly addresses core pain points of existing methods on low-resource languages
High Methodological Innovation:
- First systematic solution to multilingual safeguard problems
- Ingeniously designed constrained alignment optimization algorithm
- Dual reward mechanism balancing multiple objectives
Comprehensive Experimental Design:
- Multi-dataset, multi-language validation
- Detailed ablation studies
- Comparison with multiple strong baselines
High Practical Value: Data-efficient and easy to deploy
Open-Source Contribution: Provides code and extended benchmarks

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
Evaluation Limitations:
- Relatively limited language coverage
- Lacks quantitative evaluation of explanation quality
- Does not consider cultural differences in safety standards
Method Complexity: Three-stage training increases implementation complexity
Benchmark Construction: Machine translation may introduce semantic deviations

Impact

Academic Contribution: Opens new research direction for multilingual AI safety
Practical Value: Provides safety protection solutions for globalized AI applications
Reproducibility: Open-source code and data support subsequent research
Inspirational Value: Reasoning + alignment framework extensible to other multilingual tasks

Applicable Scenarios

Multilingual AI Services: Globalized dialogue systems and content generation platforms
Resource-Constrained Environments: Small model deployment scenarios
High-Safety-Requirement Applications: Systems requiring interpretable safeguards
Cross-lingual Consistency Requirements: Multilingual platforms requiring unified safety standards

References

The paper cites extensive related work, primarily including:

LLM Safeguards: Llama Guard, ShieldGemma, GuardReasoner, etc.
Reasoning-Enhanced Methods: Chain-of-Thought, self-improvement, adversarial debate, etc.
Cross-lingual Methods: Multilingual pretraining, instruction tuning, direct preference optimization, etc.
Evaluation Benchmarks: OpenAI Moderation, ToxicChat, SimpleSafetyTests, etc.

Overall Assessment: This is a high-quality research paper that proposes innovative solutions to the important and challenging problem of multilingual AI safety. The methodology is well-designed, experimental validation is comprehensive, and it possesses significant academic and practical value. Despite some limitations, it makes important contributions to the field's development.