Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Ham, Choi, Yang et al.

Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.

Basic Information

Paper ID: 2506.07356
Title: Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
Authors: Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim (Korea Advanced Institute of Science and Technology)
Classification: cs.CL (Computation and Language)
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2506.07356

Abstract

With major AI providers such as Google and OpenAI launching Finetuning-as-a-Service (FaaS), users can customize large language models (LLMs) using their own data. However, when user data contains harmful prompts, the service is vulnerable to safety degradation, a threat known as harmful finetuning attacks. Existing methods attempt to mitigate this by first constructing a safety-aligned model and then finetuning it on user data. This paper shows that safety-aligned weights provide a weak initialization for downstream task learning, resulting in suboptimal safety alignment and downstream task performance. To address this issue, the authors propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework that finetunes the base model directly under the guidance of a safety-aligned Ref-Teacher. The Ref-Teacher filters harmful prompts out of the user data and distills safety-alignment knowledge into the base model, improving both safety and task performance.

Research Background and Motivation

Problem Definition

Harmful Finetuning Attacks: When users upload data containing harmful content for finetuning in FaaS, the model's safety alignment degrades and the model begins to generate harmful content.
Limitations of Existing Methods:
- Existing two-stage pipelines (safety alignment followed by finetuning on user data) start downstream learning from safety-aligned weights
- These weights provide a weak initialization for downstream task learning
- The result is limited task performance and compromised safety
Research Motivation:
- Finetuning the base model directly on both user data and safety alignment data can achieve better performance
- However, this approach produces gradient conflicts, which are exacerbated when user data contains harmful prompts
- A new framework is needed to mitigate gradient conflicts while maintaining safety and task performance

Core Contributions

Identified fundamental limitations of safety-aligned models: Demonstrated that safety-aligned LLMs provide weak initialization for downstream learning, leading to suboptimal task performance and safety compromises.
Proposed Ref-Teacher-guided finetuning framework: Mitigates gradient conflicts through two mechanisms—alignment distillation and data filtering—achieving dual improvements in safety and task performance.
Comprehensive experimental validation: Demonstrated the effectiveness and robustness of the method across multiple settings (varying harmful prompt ratios, data scales, dataset types, and model architectures).
Practical FaaS solution: Provides a practically viable solution for safe and reliable LLM deployment.

Methodology Details

Task Definition

Input: Base LLM, user data (potentially containing harmful prompts), safety alignment data
Output: Customized model that maintains safety alignment while performing well on user-specific tasks
Constraints: Maintain robustness under harmful finetuning attacks

Model Architecture

1. Teacher Preparation Stage

Train the Ref-Teacher model to:

Generate soft refusal labels for alignment distillation
Effectively distinguish harmful and benign prompts using refusal features

Refusal Feature Definition:

$$R^{l} = \frac{1}{N_{us}} \sum_{i=1}^{N_{us}} f^{l}(x_{us,i}) - \frac{1}{N_{s}} \sum_{i=1}^{N_{s}} f^{l}(x_{s,i})$$
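
The refusal feature defined above is a difference of two mean feature vectors, which can be sketched in a few lines. This is a minimal illustration, assuming the layer-l features f^l(x) have already been extracted as plain lists of floats; the function names and the toy values are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(features: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def refusal_feature(unsafe_feats: Sequence[Sequence[float]],
                    safe_feats: Sequence[Sequence[float]]) -> Vector:
    """R^l = mean feature of unsafe prompts minus mean feature of safe prompts."""
    mu_us = mean_vector(unsafe_feats)
    mu_s = mean_vector(safe_feats)
    return [u - s for u, s in zip(mu_us, mu_s)]


# Toy 2-d vectors standing in for layer-l features (made-up values).
unsafe = [[1.0, 0.0], [3.0, 2.0]]
safe = [[0.0, 1.0], [0.0, 3.0]]
R = refusal_feature(unsafe, safe)
```

In this toy case the unsafe mean is [2.0, 1.0] and the safe mean is [0.0, 2.0], so R points from the safe cluster toward the unsafe one.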

Training Objective:

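$$\mathcal{L}_{teacher} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \ell(x_{s,i}, y_{s,i}) + \ell(x_{us,i}, y_{r,i}) + \lambda \big\{ \lVert 1 + CS(f^{l}(x_{s,i}), R^{l}) \rVert_2 + \lVert 1 - CS(f^{l}(x_{us,i}), R^{l}) \rVert_2 \big\} \Big]$$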
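
The regularization part of the teacher objective can be sketched on its own. The snippet below is an illustration under stated assumptions: CS is taken to be cosine similarity, the norm of a scalar is taken as its absolute value, features are plain non-zero vectors, and the two language-modeling terms ℓ(·) are left out because they need an actual model. The names `cosine_similarity` and `refusal_regularizer` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def refusal_regularizer(safe_feats, unsafe_feats, refusal, lam: float = 0.1) -> float:
    """lam * mean over pairs of |1 + CS(f_s, R)| + |1 - CS(f_us, R)|."""
    if not safe_feats or len(safe_feats) != len(unsafe_feats):
        raise ValueError("expected the same non-zero number of safe and unsafe features")
    total = 0.0
    for f_s, f_us in zip(safe_feats, unsafe_feats):
        total += abs(1.0 + cosine_similarity(f_s, refusal))   # pushes safe features toward CS = -1
        total += abs(1.0 - cosine_similarity(f_us, refusal))  # pushes unsafe features toward CS = +1
    return lam * total / len(safe_feats)


R = [2.0, -1.0]
# Safe feature anti-parallel to R, unsafe feature parallel to R: penalty vanishes.
aligned = refusal_regularizer([[-2.0, 1.0]], [[4.0, -2.0]], R)
# Roles swapped: each term contributes 2, so the penalty is lam * 4.
misaligned = refusal_regularizer([[2.0, -1.0]], [[-2.0, 1.0]], R)
```

The penalty is smallest when safe and unsafe features sit on opposite sides of the refusal direction, which is the separation the data filter later relies on.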

2. Finetuning Stage

Ref-Teacher guides the base model through two complementary mechanisms:

Data Filtering:

$$\omega_i = \begin{cases} 0, & \text{if } CS(R^{l}, f^{l}(x_i)) > \tau \\ 1, & \text{otherwise} \end{cases}$$
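
The filtering rule is a simple threshold on cosine similarity. A minimal sketch, assuming CS is cosine similarity, that per-prompt features are available as non-zero vectors, and using the threshold τ = 0.9 reported in this summary as the default; `filter_weights` is a hypothetical name.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filter_weights(features: Sequence[Sequence[float]],
                   refusal: Sequence[float],
                   tau: float = 0.9) -> List[int]:
    """omega_i = 0 when a prompt's feature is closer than tau to the refusal feature, else 1."""
    return [0 if cosine_similarity(refusal, f) > tau else 1 for f in features]


R = [1.0, 0.0]
user_feats = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # cosine with R: 1, 0, -1
weights = filter_weights(user_feats, R)
```

A weight of 0 drops the sample from the user-data loss entirely, so only the first toy prompt (the one aligned with the refusal direction) is excluded.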

Alignment Distillation: Transfers soft label knowledge from Ref-Teacher to the student model using KL divergence loss

Overall Objective Function:

$$\mathcal{L}_{ft} = \frac{1}{N_{user}} \sum_{i=1}^{N_{user}} \omega_i \, \ell(x_i, y_i) + \alpha T^{2} \cdot \frac{1}{N_{align}} \sum_{i=1}^{N_{align}} KL\big(p^{T}_{t,i} \,\|\, p^{T}_{s,i}\big)$$
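
The combined objective can be illustrated numerically. The sketch below is a simplification, not the authors' implementation: it assumes each alignment sample yields one logit vector from the teacher and one from the student (real language models produce one distribution per token), that per-sample task losses are already computed, and that the p terms are temperature-T softmax distributions. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], T: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def finetune_loss(task_losses, weights, teacher_logits, student_logits,
                  alpha: float = 0.1, T: float = 1.0) -> float:
    """Filtered task loss plus alpha * T^2 * mean KL(teacher || student)."""
    task = sum(w * l for w, l in zip(weights, task_losses)) / len(task_losses)
    kd = sum(kl_divergence(softmax(t, T), softmax(s, T))
             for t, s in zip(teacher_logits, student_logits)) / len(teacher_logits)
    return task + alpha * T ** 2 * kd


# Second user sample is filtered out (weight 0); student matches the teacher exactly.
matched = finetune_loss([2.0, 4.0], [1, 0], [[1.0, 0.0]], [[1.0, 0.0]])
# Same task term, but the student disagrees with the teacher, so the KL term is positive.
mismatched = finetune_loss([2.0, 4.0], [1, 0], [[1.0, 0.0]], [[0.0, 1.0]])
```

Note that the task term divides by the full N_user, as in the formula, so filtered samples lower the average rather than being renormalized away.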

Technical Innovations

Enhanced Refusal Features: Regularization terms strengthen the discriminative ability of refusal features by pushing the cosine similarity with the refusal feature toward 1 for harmful prompt features and toward -1 for benign prompt features.
Dynamic Refusal Feature Updates: Periodically updates refusal features during training, eliminating the need for pre-aligned models.
Synergistic Dual Mechanisms: Alignment distillation provides smooth loss surfaces while data filtering removes harmful data, with both mechanisms cooperatively mitigating gradient conflicts.

Experimental Setup

Datasets

Safety Alignment Data: BeaverTails (5,000 harmful prompts + refusal responses) + Alpaca (5,000 benign prompts + helpful responses)
User Data: GSM8K, SST2, AGNEWS, AlpacaEval, etc., with harmful prompts mixed in at varying ratios
Evaluation Data: BeaverTails test set (1,000 samples) for safety assessment

Evaluation Metrics

Harm Score (HS): Percentage of harmful responses among the 1,000 outputs (↓ lower is better)
Finetuning Accuracy (FA): Accuracy on downstream tasks (↑ higher is better)
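
Both metrics reduce to simple percentages. A minimal sketch, assuming the harmfulness flags come from some external judge (not specified here) and that task correctness is exact match; the function names and toy inputs are hypothetical.

```python
from typing import Sequence


def harm_score(is_harmful: Sequence[bool]) -> float:
    """Percentage of responses flagged as harmful (lower is better)."""
    return 100.0 * sum(is_harmful) / len(is_harmful)


def finetuning_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of exact-match predictions on the downstream task (higher is better)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# 9 flagged responses out of 1,000 safety prompts (toy flags).
flags = [True] * 9 + [False] * 991
hs = harm_score(flags)

# 3 of 4 toy answers match their references.
fa = finetuning_accuracy(["4", "7", "9", "1"], ["4", "7", "8", "1"])
```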

Baseline Methods

Alignment-stage methods: RepNoise, Vaccine, Booster
Finetuning-stage methods: LDIFS, Lisa
Baseline method: SFT (standard supervised finetuning)

Implementation Details

Models: Llama3-8B, Gemma2-9B, Qwen2-7B
Training: LoRA finetuning (rank=32), AdamW optimizer
Hyperparameters: λ=0.1, α=0.1, T=1, τ=0.9, learning rate 5e-4 (teacher)/1e-5 (finetuning)
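
For reference, the reported settings can be gathered into a single configuration object. This is only a convenience sketch: the class and field names are hypothetical, and the values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefTeacherConfig:
    """Hyperparameters as listed in this summary; field names are illustrative."""
    lora_rank: int = 32
    optimizer: str = "AdamW"
    lam: float = 0.1           # λ: weight of the refusal-feature regularizer
    alpha: float = 0.1         # α: weight of the distillation term
    temperature: float = 1.0   # T: distillation temperature
    tau: float = 0.9           # τ: cosine-similarity threshold for data filtering
    teacher_lr: float = 5e-4   # learning rate for Ref-Teacher training
    finetune_lr: float = 1e-5  # learning rate for the finetuning stage

    def distill_scale(self) -> float:
        """Coefficient alpha * T^2 that multiplies the KL term."""
        return self.alpha * self.temperature ** 2


cfg = RefTeacherConfig()
```

With T = 1 the distillation coefficient equals α, so the T² factor only matters if the temperature is changed.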

Experimental Results

Main Results

Performance Under Varying Harmful Prompt Ratios

Method	HS (p=0)	HS (p=0.1)	HS (p=0.3)	HS (p=0.5)	Avg HS	Avg FA
SFT	2.2	16.2	57.3	71.3	36.8	39.5
Vaccine	1.3	5.4	35.0	57.5	24.8	22.0
Ref-Teacher	0.9	1.0	0.6	0.9	0.9	47.1
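
As a quick sanity check, the four per-ratio columns can be averaged and compared against the Avg HS column, which confirms that they hold harm scores rather than accuracies. The snippet only re-uses the numbers from the table above; the 0.05 tolerance is an assumption that the averages were rounded to one decimal place.

```python
# Per-ratio harm scores (p = 0, 0.1, 0.3, 0.5) copied from the table.
harm_scores = {
    "SFT": [2.2, 16.2, 57.3, 71.3],
    "Vaccine": [1.3, 5.4, 35.0, 57.5],
    "Ref-Teacher": [0.9, 1.0, 0.6, 0.9],
}
reported_avg_hs = {"SFT": 36.8, "Vaccine": 24.8, "Ref-Teacher": 0.9}

computed_avg_hs = {m: sum(v) / len(v) for m, v in harm_scores.items()}

for method in harm_scores:
    print(f"{method}: computed {computed_avg_hs[method]:.2f}, "
          f"reported {reported_avg_hs[method]}")

# True when every computed mean is within one-decimal rounding of the reported value.
consistent = all(
    abs(computed_avg_hs[m] - reported_avg_hs[m]) <= 0.05 + 1e-9
    for m in harm_scores
)
```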

Ablation Studies

Gradient Conflict Analysis

Method	Alignment Distillation	Data Filtering	Conflict Frequency (%)	Avg Cosine Similarity
Base Method	✗	✗	35.09	0.110
+Alignment Distillation	✓	✗	32.26	0.131
+Data Filtering	✗	✓	36.11	0.102
Complete Method	✓	✓	30.02	0.140
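
The two statistics in this table can be computed from pairs of gradients. The sketch below assumes a common convention that this summary does not spell out: each pair holds a user-task gradient and an alignment gradient, and a conflict is counted whenever their cosine similarity is negative. Gradients are flattened to plain non-zero vectors, and `conflict_stats` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def conflict_stats(grad_pairs: Sequence[Tuple[Sequence[float], Sequence[float]]]
                   ) -> Tuple[float, float]:
    """Return (conflict frequency in %, mean cosine similarity) over gradient pairs."""
    sims = [cosine_similarity(g_task, g_align) for g_task, g_align in grad_pairs]
    conflicts = sum(1 for s in sims if s < 0.0)
    return 100.0 * conflicts / len(sims), sum(sims) / len(sims)


# Four toy training steps: aligned, opposed, orthogonal, aligned.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [0.0, 2.0]),
]
frequency, mean_cos = conflict_stats(pairs)
```

Under this convention, a lower conflict frequency and a higher mean cosine similarity both indicate that the two objectives pull the parameters in more compatible directions.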

Component Contribution Analysis

Alignment Distillation Only: HS=2.2, FA=46.2 (cannot independently address harmful data)
Data Filtering Only: HS=0.6, FA=46.5 (reduces harm but impacts task performance)
Complete Method: HS=0.5, FA=49.0 (synergistic combination achieves best performance)

Generalization Experiments

Cross-Dataset Generalization

Average performance on GSM8K, SST2, AGNEWS, AlpacaEval:

Ref-Teacher: HS=1.1, FA=52.8 (best)
Best baseline (Booster): HS=10.0, FA=51.3

Cross-Model Architecture Generalization

Average performance on Llama3-8B, Gemma2-9B, Qwen2-7B:

Ref-Teacher: HS=0.8, FA=60.8 (best)
Best baseline (Booster): HS=4.4, FA=57.3

Classification Performance Verification

Ref-Teacher's F1 scores on harmful content detection:

BeaverTails: 93.4%
JailbreakBench: 79.8%
GCG attacks: 92.9%
AutoDAN attacks: 82.1%
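
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall over the "harmful" class. A minimal sketch with made-up labels, assuming binary harmful/benign decisions; `f1_score` here is a local helper, not a library call.

```python
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for the positive (harmful) class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
truth = [True, True, True, False, False]
preds = [True, True, False, True, False]
score = f1_score(truth, preds)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.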

Related Work

LLM Safety Research

Training-time Defense: Enhancing robustness through adversarial training, data balancing, and other methods
Inference-time Defense: Leveraging LLM self-assessment of harmfulness or internal discrepancies for protection

Harmful Finetuning Attack Defense

Alignment-stage Solutions: Obtaining robust safety-aligned weights through regularization techniques
Finetuning-stage Solutions: Freezing critical parameters or incorporating safety regularization
Post-finetuning Solutions: Analyzing discrepancies and editing model weights to compensate for safety degradation

The main distinction of this work from existing approaches is direct finetuning of base models rather than safety-aligned models, mitigating gradient conflicts through teacher guidance.

Conclusions and Discussion

Main Conclusions

Safety-aligned weights are insufficient: Safety-aligned models provide weak initialization for downstream tasks, resulting in dual losses in performance and safety
Direct finetuning is more effective: Simultaneous safety alignment and task learning on base models achieves better results
Gradient conflict is the key challenge: Requires synergistic mitigation through alignment distillation and data filtering
Strong practicality: The method demonstrates stable performance across multiple settings, suitable for FaaS deployment

Limitations

Dependence on refusal features: If refusal features are compromised by adversarial attacks, the entire framework's safety may be jeopardized
Computational overhead: Requires additional Ref-Teacher model training, increasing computational costs
Data quality dependency: Method effectiveness depends on the quality and coverage of safety alignment data

Future Directions

Robustness Enhancement: Research defense methods against adversarial refusal feature manipulation
Efficiency Optimization: Explore more efficient teacher training and knowledge distillation strategies
Theoretical Analysis: Deepen understanding of the mathematical nature of gradient conflicts and mitigation mechanisms

In-Depth Evaluation

Strengths

Deep Problem Discovery: First systematic identification of fundamental limitations in safety-aligned weights, providing new perspectives for the field
Clever Method Design: Elegantly solves gradient conflict problems through refusal feature and dual mechanism design
Comprehensive Experiments: Covers multiple settings, datasets, and models with rigorous experimental design and convincing results
High Practical Value: Directly addresses FaaS scenarios with strong practical application value

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of gradient conflict phenomena and mitigation mechanisms
Computational Cost Considerations: Insufficient discussion of computational overhead from additional Ref-Teacher training
Limited Attack Models: Primarily considers data poisoning attacks; robustness against more complex adversarial attacks requires verification
Hyperparameter Sensitivity: While ablation studies are provided, sensitivity analysis of critical hyperparameters is insufficient

Impact

Academic Contribution: Provides new research paradigm for LLM safety finetuning, potentially inspiring subsequent research
Industrial Value: Directly solves practical FaaS security problems with significant commercial application prospects
Reproducibility: Provides detailed experimental settings and hyperparameters, facilitating reproduction and improvement

Applicable Scenarios

FaaS Platforms: Safety assurance for AI service providers' finetuning services
Customized LLMs: Safety solutions for enterprise internal LLM customization and deployment
Multi-task Learning: LLM training scenarios requiring simultaneous optimization of multiple objectives
Safety-Critical Applications: LLM application domains with high safety requirements

References

This paper cites important works in LLM safety, harmful finetuning attacks, and knowledge distillation, providing comprehensive literature foundation for related research. Particularly noteworthy are works on refusal features (Arditi et al. 2024) and existing harmful finetuning defense methods (Huang et al. 2024 series, Rosati et al. 2024, etc.).