Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
Chen, Zhang, Lin et al.
Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our code to support future research.
As large language models (LLMs) become more capable, the risk posed by malicious requests grows, which makes safeguards that can detect such requests increasingly important. Existing safeguards are mostly classifier-based: they lack interpretability and perform poorly on low-resource languages. To address these limitations, the paper proposes ConsistentGuard, a reasoning-based multilingual safeguard that improves interpretability through explicit reasoning and improves cross-lingual knowledge transfer through alignment. Trained on only 1,000 samples, the method outperforms larger models trained on substantially more data across six languages on three datasets, while showing strong interpretability and generalization.
Propose ConsistentGuard Framework: A reasoning-based multilingual safeguard training framework that enhances interpretability, effectiveness, and cross-lingual generalization
Design CAO Algorithm: Introduce Constrained Alignment Optimization to address cross-lingual reasoning inconsistency
Achieve Data-Efficient Training: Attain excellent performance across six languages on three datasets using only 1,000 training samples
Construct Multilingual Benchmark: Extend existing English safety benchmarks to six languages and release the code and data
Dual Reward Mechanism: Balance reasoning length against reasoning diversity, so that excessively long reasoning does not hurt efficiency
Constrained Alignment Optimization: Constrain the optimization direction with a global regularization term, preventing performance degradation in high-resource languages
Three-Stage Progressive Training: Systematic approach from knowledge distillation to reasoning enhancement to cross-lingual alignment
Data-Efficient Design: Match or exceed models trained on far more data while using only 1,000 training samples
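The dual reward item in the list above can be made concrete with a small sketch. This summary does not give the paper's actual reward formula, so everything here is an assumption for illustration: the function names, the token budget, the linear length decay, the distinct-bigram diversity measure, and the equal weights are all hypothetical and are not taken from ConsistentGuard. The sketch only shows the general idea of rewarding a correct verdict while discouraging long, repetitive reasoning.

```python
from typing import Sequence


def distinct_ngram_ratio(tokens: Sequence[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique (a simple, assumed proxy for diversity)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def length_score(num_tokens: int, budget: int = 128) -> float:
    """1.0 within the hypothetical token budget, decaying linearly to 0.0 at twice the budget."""
    if num_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tokens - budget) / budget)


def dual_reward(
    tokens: Sequence[str],
    correct: bool,
    budget: int = 128,
    w_len: float = 0.5,
    w_div: float = 0.5,
) -> float:
    """Hypothetical reward: zero for a wrong verdict, otherwise a weighted
    sum of the length score and the diversity score of the reasoning."""
    if not correct:
        return 0.0
    return (
        w_len * length_score(len(tokens), budget)
        + w_div * distinct_ngram_ratio(tokens)
    )


if __name__ == "__main__":
    concise = "the request asks for harmful instructions so it is unsafe".split()
    padded = ["unsafe"] * 300  # long and repetitive reasoning
    print(dual_reward(concise, correct=True))
    print(dual_reward(padded, correct=True))
```

Under these assumptions, a short and varied rationale scores higher than a long, repetitive one that reaches the same verdict, which is the trade-off the dual reward item describes.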
The paper cites extensive related work, primarily including:
LLM Safeguards: Llama Guard, ShieldGemma, GuardReasoner, etc.
Reasoning-Enhanced Methods: Chain-of-Thought, self-improvement, adversarial debate, etc.
Cross-lingual Methods: Multilingual pretraining, instruction tuning, direct preference optimization, etc.
Evaluation Benchmarks: OpenAI Moderation, ToxicChat, SimpleSafetyTests, etc.
Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important and challenging problem of multilingual AI safety. The methodology is well designed, the experimental validation is comprehensive, and the work has significant academic and practical value. Despite some limitations, it makes an important contribution to the field.