Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
Ham, Choi, Yang et al.
Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.
Title: Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
Authors: Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim (Korea Advanced Institute of Science and Technology)
Classification: cs.CL (Computation and Language)
Publication Date: October 11, 2025 (arXiv preprint)
With major AI providers such as Google and OpenAI launching Finetuning-as-a-Service (FaaS), users can customize large language models (LLMs) using their own data. However, when user data contains harmful prompts, the service is vulnerable to safety degradation, a threat known as harmful finetuning attacks. Existing methods attempt to mitigate this by first constructing a safety-aligned model and then finetuning it on user data. This paper shows that safety-aligned weights provide a weak initialization for downstream task learning, resulting in suboptimal safety alignment and downstream task performance. To address this, the authors propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework that finetunes the base model directly under the guidance of a safety-aligned Ref-Teacher. The Ref-Teacher filters harmful prompts from user data and distills safety-alignment knowledge into the base model, improving both safety and downstream performance.
Harmful Finetuning Attacks: When users upload data containing harmful content for fine-tuning in FaaS, the model's safety alignment is compromised, causing the model to generate harmful content.
Limitations of Existing Methods:
Traditional two-stage pipelines (safety alignment followed by fine-tuning) have fundamental flaws
Safety-aligned models provide weak weight initialization for downstream task learning
This results in limited task performance and compromised safety
Research Motivation:
Direct fine-tuning on base models with both user data and safety alignment data can achieve better performance
However, this approach produces gradient conflicts, particularly exacerbated when user data contains harmful prompts
A new framework is needed to mitigate gradient conflicts while maintaining safety and task performance
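The gradient conflict mentioned above can be made concrete with a small sketch. It assumes one common convention, which the summary itself does not spell out: two objectives conflict when the cosine similarity between their gradients is negative. The function names and the toy gradient values are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gradients_conflict(task_grad, safety_grad):
    """Assumed criterion: the objectives conflict when their gradients
    point in opposing directions (negative cosine similarity)."""
    return cosine(task_grad, safety_grad) < 0.0


# Made-up gradients for a toy 3-parameter model.
task_grad = [1.0, 0.5, -0.2]
safety_grad_aligned = [0.8, 0.4, 0.0]    # roughly the same direction
safety_grad_opposed = [-1.0, -0.3, 0.1]  # roughly the opposite direction

print(gradients_conflict(task_grad, safety_grad_aligned))
print(gradients_conflict(task_grad, safety_grad_opposed))
```

When the two gradients oppose each other, a step that lowers the task loss raises the safety loss, which is why training on both data sources at once can stall or degrade one objective.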
Identified fundamental limitations of safety-aligned models: Demonstrated that safety-aligned LLMs provide weak initialization for downstream learning, leading to suboptimal task performance and safety compromises.
Proposed Ref-Teacher-guided finetuning framework: Mitigates gradient conflicts through two mechanisms, alignment distillation and data filtering, improving both safety and task performance.
Comprehensive experimental validation: Demonstrated the effectiveness and robustness of the method across multiple settings (varying harmful prompt ratios, data scales, dataset types, and model architectures).
Practical FaaS solution: Provides a practically viable solution for safe and reliable LLM deployment.
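As an illustration of the alignment distillation named in the contributions above, here is a minimal sketch of a teacher-to-student distillation loss. It assumes a standard KL-divergence objective between the teacher's and the student's next-token distributions; the summary does not state the exact loss the authors use, so the function names, the temperature parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed objective: pull the student's distribution toward the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Toy logits over a 3-token vocabulary (made-up numbers).
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [0.1, 0.2, 0.3]

print(distillation_loss(teacher_logits, student_logits))
```

The loss is zero when the student already matches the teacher and grows as the two distributions diverge, so minimizing it on safety data moves the base model toward the teacher's refusal behavior.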
Input: Base LLM, user data (potentially containing harmful prompts), safety alignment data
Output: Customized model that maintains safety alignment while performing well on user-specific tasks
Constraints: Maintain robustness under harmful finetuning attacks
Enhanced Refusal Features: Regularization terms strengthen the discriminative ability of refusal features, pushing the cosine similarity between harmful prompt features and the refusal feature toward 1, and the cosine similarity between benign prompt features and the refusal feature toward -1.
Dynamic Refusal Feature Updates: Periodically updates refusal features during training, eliminating the need for pre-aligned models.
Synergistic Dual Mechanisms: Alignment distillation provides smooth loss surfaces while data filtering removes harmful data, with both mechanisms cooperatively mitigating gradient conflicts.
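The refusal-feature ideas listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the refusal feature is taken to be the difference between the mean harmful-prompt feature and the mean benign-prompt feature, the regularizer is one simple penalty that reaches zero at the cosine targets of 1 and -1, and the filtering threshold of 0 is arbitrary. All names and the 2-dimensional toy features are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def refusal_feature(harmful_feats, benign_feats):
    """Assumed definition: mean harmful feature minus mean benign feature."""
    mu_h = mean_vector(harmful_feats)
    mu_b = mean_vector(benign_feats)
    return [h - b for h, b in zip(mu_h, mu_b)]


def refusal_regularizer(harmful_feats, benign_feats, direction):
    """Penalty that is zero when every harmful feature has cosine 1 and
    every benign feature has cosine -1 with the refusal feature."""
    harmful = sum(1.0 - cosine(f, direction) for f in harmful_feats)
    benign = sum(1.0 + cosine(f, direction) for f in benign_feats)
    return harmful / len(harmful_feats) + benign / len(benign_feats)


def filter_user_data(samples, direction, threshold=0.0):
    """Keep prompts whose feature is not aligned with the refusal feature."""
    return [text for text, feat in samples if cosine(feat, direction) <= threshold]


# Made-up 2-dimensional prompt features.
harmful_feats = [[1.0, 0.1], [0.9, -0.1]]
benign_feats = [[-1.0, 0.2], [-0.8, -0.2]]
direction = refusal_feature(harmful_feats, benign_feats)

user_samples = [
    ("summarize this article", [-0.7, 0.3]),
    ("placeholder harmful request", [0.8, 0.2]),
]
kept = filter_user_data(user_samples, direction)
print(kept)
print(refusal_regularizer(harmful_feats, benign_feats, direction))
```

Under these assumptions, the dynamic update described above would amount to calling `refusal_feature` again at intervals during training, using features from the current teacher rather than from a fixed pre-aligned model.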
Alignment-stage Solutions: Obtaining robust safety-aligned weights through regularization techniques
Finetuning-stage Solutions: Freezing critical parameters or incorporating safety regularization
Post-finetuning Solutions: Analyzing discrepancies and editing model weights to compensate for safety degradation
The main distinction of this work from existing approaches is direct finetuning of base models rather than safety-aligned models, mitigating gradient conflicts through teacher guidance.
Safety-aligned weights are insufficient: Safety-aligned models provide weak initialization for downstream tasks, resulting in dual losses in performance and safety
Direct finetuning is more effective: Simultaneous safety alignment and task learning on base models achieves better results
Gradient conflict is the key challenge: Requires synergistic mitigation through alignment distillation and data filtering
Strong practicality: The method demonstrates stable performance across multiple settings, suitable for FaaS deployment
This paper cites important works in LLM safety, harmful finetuning attacks, and knowledge distillation, providing a comprehensive literature foundation for related research. Particularly noteworthy are the work on refusal features (Arditi et al. 2024) and existing harmful finetuning defense methods (Huang et al. 2024 series, Rosati et al. 2024, etc.).