Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Ham, Choi, Yang et al.
Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning the model on user data. However, we observe that the safety-aligned weights provide weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.

Basic Information

  • Paper ID: 2506.07356
  • Title: Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
  • Authors: Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim (Korea Advanced Institute of Science and Technology)
  • Classification: cs.CL (Computation and Language)
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2506.07356

Abstract

With major AI providers such as Google and OpenAI launching fine-tuning-as-a-service (FaaS), users can customize large language models (LLMs) using their own data. However, when user data contains harmful prompts, the service is vulnerable to safety degradation, a threat known as harmful finetuning attacks. Existing methods attempt to mitigate this by first constructing safety-aligned models and then fine-tuning on user data. However, this paper reveals that safety-aligned weights provide weak initialization for downstream task learning, resulting in suboptimal safety alignment and downstream task performance. To address this issue, the authors propose a Refusal-Teacher (Ref-Teacher) guided finetuning framework that directly fine-tunes the base model under the guidance of a safety-aligned Ref-Teacher, achieving dual improvements in safety and performance by filtering harmful prompts in user data and distilling safety alignment knowledge into the base model.

Research Background and Motivation

Problem Definition

  1. Harmful Finetuning Attacks: When users upload data containing harmful content for fine-tuning in FaaS, the model's safety alignment is compromised, causing the model to generate harmful content.
  2. Limitations of Existing Methods:
    • Traditional two-stage pipelines (safety alignment followed by fine-tuning) have fundamental flaws
    • Safety-aligned models provide weak weight initialization for downstream task learning
    • This results in limited task performance and compromised safety
  3. Research Motivation:
    • Direct fine-tuning on base models with both user data and safety alignment data can achieve better performance
    • However, this approach produces gradient conflicts, particularly exacerbated when user data contains harmful prompts
    • A new framework is needed to mitigate gradient conflicts while maintaining safety and task performance

Core Contributions

  1. Identified fundamental limitations of safety-aligned models: Demonstrated that safety-aligned LLMs provide weak initialization for downstream learning, leading to suboptimal task performance and safety compromises.
  2. Proposed Ref-Teacher-guided finetuning framework: Mitigates gradient conflicts through two mechanisms—alignment distillation and data filtering—achieving dual improvements in safety and task performance.
  3. Comprehensive experimental validation: Demonstrated the effectiveness and robustness of the method across multiple settings (varying harmful prompt ratios, data scales, dataset types, and model architectures).
  4. Practical FaaS solution: Provides a practically viable solution for safe and reliable LLM deployment.

Methodology Details

Task Definition

Input: Base LLM, user data (potentially containing harmful prompts), safety alignment data
Output: Customized model that maintains safety alignment while performing well on user-specific tasks
Constraints: Maintain robustness under harmful finetuning attacks

Model Architecture

1. Teacher Preparation Stage

Train the Ref-Teacher model to:

  • Generate soft refusal labels for alignment distillation
  • Effectively distinguish harmful and benign prompts using refusal features

Refusal Feature Definition:

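\[
R^{l} = \frac{1}{N_{us}} \sum_{i=1}^{N_{us}} f^{l}(x^{us}_{i}) - \frac{1}{N_{s}} \sum_{i=1}^{N_{s}} f^{l}(x^{s}_{i})
\]

Here x^{us}_i are the harmful (unsafe) prompts, x^{s}_i are the benign (safe) prompts, and f^{l}(·) is the feature taken at layer l, so R^{l} is the difference between the mean harmful-prompt feature and the mean benign-prompt feature.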
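The refusal feature definition above is a difference of two means, which can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the toy 2-D feature values are hypothetical, and real features would be high-dimensional layer activations.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def refusal_feature(harmful_feats, benign_feats):
    """R^l: mean harmful-prompt feature minus mean benign-prompt feature."""
    mu_harmful = mean_vector(harmful_feats)
    mu_benign = mean_vector(benign_feats)
    return [a - b for a, b in zip(mu_harmful, mu_benign)]


# Toy 2-D layer-l features (hypothetical values, for illustration only).
harmful = [[1.0, 0.0], [3.0, 2.0]]
benign = [[0.0, 1.0], [0.0, 3.0]]
R = refusal_feature(harmful, benign)
print(R)  # [2.0, -1.0]
```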

Training Objective:

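\[
\mathcal{L}_{teacher} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \ell(x^{s}_{i}, y^{s}_{i}) + \ell(x^{us}_{i}, y^{r}_{i}) + \lambda \big\{ \| 1 + \mathrm{CS}(f^{l}(x^{s}_{i}), R^{l}) \|_{2} + \| 1 - \mathrm{CS}(f^{l}(x^{us}_{i}), R^{l}) \|_{2} \big\} \Big]
\]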
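To make the λ-weighted regularization term concrete, the sketch below evaluates it for a single benign/harmful feature pair. It assumes CS denotes cosine similarity and that the norm of a scalar reduces to its absolute value; the helper names and toy vectors are hypothetical, and the two cross-entropy terms of the objective are omitted.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def refusal_regularizer(benign_feat, harmful_feat, refusal_feat):
    """|1 + CS(f(x_s), R)| + |1 - CS(f(x_us), R)| for one sample pair."""
    cs_benign = cosine_similarity(benign_feat, refusal_feat)
    cs_harmful = cosine_similarity(harmful_feat, refusal_feat)
    return abs(1.0 + cs_benign) + abs(1.0 - cs_harmful)


R = [1.0, 0.0]  # hypothetical refusal feature
# Ideal case: benign feature anti-aligned with R, harmful feature aligned.
ideal = refusal_regularizer([-3.0, 0.0], [2.0, 0.0], R)
# Uninformative case: both features orthogonal to R.
orthogonal = refusal_regularizer([0.0, 1.0], [0.0, 1.0], R)
print(ideal, orthogonal)  # 0.0 2.0
```

The penalty vanishes only when benign features point opposite to the refusal feature and harmful features point along it, which is the separation the data filter later relies on.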

2. Finetuning Stage

Ref-Teacher guides the base model through two complementary mechanisms:

Data Filtering:

\[
\omega_{i} = \begin{cases} 0, & \text{if } \mathrm{CS}(R^{l}, f^{l}(x_{i})) > \tau \\ 1, & \text{otherwise} \end{cases}
\]
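The filtering rule can be sketched as a threshold on cosine similarity. This is an illustrative sketch only: the function names and toy features are hypothetical, and the default τ=0.9 is taken from the hyperparameters listed under Implementation Details.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_weights(prompt_feats, refusal_feat, tau=0.9):
    """omega_i = 0 when the prompt feature is too close to R, else 1."""
    return [
        0 if cosine_similarity(refusal_feat, feat) > tau else 1
        for feat in prompt_feats
    ]


R = [1.0, 0.0]  # hypothetical refusal feature
user_feats = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
weights = filter_weights(user_feats, R)
print(weights)  # [0, 1, 1, 1]
```

Only the first toy prompt exceeds the threshold and is dropped from the task loss; the fourth has similarity of about 0.71 and is kept.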

Alignment Distillation: Transfers soft label knowledge from Ref-Teacher to the student model using KL divergence loss

Overall Objective Function:

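\[
\mathcal{L}_{ft} = \frac{1}{N_{user}} \sum_{i=1}^{N_{user}} \omega_{i}\, \ell(x_{i}, y_{i}) + \alpha T^{2} \cdot \frac{1}{N_{align}} \sum_{i=1}^{N_{align}} \mathrm{KL}\big(p^{T}_{t,i} \,\|\, p^{T}_{s,i}\big)
\]

Here p^{T}_{t,i} and p^{T}_{s,i} are the teacher and student output distributions softened with temperature T, and ω_i is the data-filtering weight defined above.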
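The overall objective combines a filtered task loss with a temperature-scaled KL term. The sketch below is a simplified illustration under stated assumptions: each alignment sample is reduced to a single logit vector (a real LLM would average the KL over tokens), the per-sample task losses are given as plain numbers, and all function names are hypothetical. The defaults α=0.1 and T=1 follow the listed hyperparameters.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def finetuning_loss(user_losses, weights, teacher_logits, student_logits,
                    alpha=0.1, temperature=1.0):
    """Filtered task loss plus alpha * T^2 * mean teacher-to-student KL."""
    # Filtered samples contribute zero but still count in N_user.
    task = sum(w * l for w, l in zip(weights, user_losses)) / len(user_losses)
    kls = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    distill = sum(kls) / len(kls)
    return task + alpha * temperature ** 2 * distill


user_losses = [2.0, 4.0, 6.0]   # hypothetical per-sample task losses
weights = [1, 0, 1]             # second sample was filtered out
matched = finetuning_loss(user_losses, weights, [[1.0, 2.0]], [[1.0, 2.0]])
mismatched = finetuning_loss(user_losses, weights, [[1.0, 2.0]], [[2.0, 1.0]])
print(matched, mismatched)
```

When the student already matches the teacher the distillation term is zero and only the filtered task loss remains; any mismatch adds a positive penalty.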

Technical Innovations

  1. Enhanced Refusal Features: Regularization terms strengthen the discriminative ability of refusal features, pushing the cosine similarity between a prompt's feature and the refusal feature toward 1 for harmful prompts and toward -1 for benign prompts.
  2. Dynamic Refusal Feature Updates: Periodically updates refusal features during training, eliminating the need for pre-aligned models.
  3. Synergistic Dual Mechanisms: Alignment distillation provides smooth loss surfaces while data filtering removes harmful data, with both mechanisms cooperatively mitigating gradient conflicts.

Experimental Setup

Datasets

  • Safety Alignment Data: BeaverTails (5,000 harmful prompts + refusal responses) + Alpaca (5,000 benign prompts + helpful responses)
  • User Data: GSM8K, SST2, AGNEWS, AlpacaEval, etc., with harmful prompts mixed in at varying ratios
  • Evaluation Data: BeaverTails test set (1,000 samples) for safety assessment

Evaluation Metrics

  • Harm Score (HS): Proportion of harmful responses among 1,000 outputs (↓ lower is better)
  • Finetuning Accuracy (FA): Accuracy on downstream tasks (↑ higher is better)
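Both metrics are simple percentages, as the sketch below illustrates. It assumes the per-response harmfulness flags come from some external judge, which is not specified here; the function names and toy data are hypothetical.

```python
def harm_score(harmful_flags):
    """Percentage of responses flagged as harmful (lower is better)."""
    return 100.0 * sum(harmful_flags) / len(harmful_flags)


def finetuning_accuracy(predictions, targets):
    """Percentage of downstream predictions matching the targets."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)


# 9 harmful responses out of 1,000 evaluation prompts (toy numbers).
flags = [True] * 9 + [False] * 991
hs = harm_score(flags)
fa = finetuning_accuracy(["4", "7", "9", "2"], ["4", "7", "8", "2"])
print(hs, fa)  # 0.9 75.0
```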

Baseline Methods

  • Alignment-stage methods: RepNoise, Vaccine, Booster
  • Finetuning-stage methods: LDIFS, Lisa
  • Baseline method: SFT (standard supervised finetuning)

Implementation Details

  • Models: Llama3-8B, Gemma2-9B, Qwen2-7B
  • Training: LoRA finetuning (rank=32), AdamW optimizer
  • Hyperparameters: λ=0.1, α=0.1, T=1, τ=0.9, learning rate 5e-4 (teacher)/1e-5 (finetuning)

Experimental Results

Main Results

Performance Under Varying Harmful Prompt Ratios

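Method | HS (p=0) | HS (p=0.1) | HS (p=0.3) | HS (p=0.5) | Avg HS | Avg FA
--- | --- | --- | --- | --- | --- | ---
SFT | 2.2 | 16.2 | 57.3 | 71.3 | 36.8 | 39.5
Vaccine | 1.3 | 5.4 | 35.0 | 57.5 | 24.8 | 22.0
Ref-Teacher | 0.9 | 1.0 | 0.6 | 0.9 | 0.9 | 47.1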

Ablation Studies

Gradient Conflict Analysis

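Method | Alignment Distillation | Data Filtering | Conflict Frequency (%) | Avg Cosine Similarity
--- | --- | --- | --- | ---
Base Method | no | no | 35.09 | 0.110
+Alignment Distillation | yes | no | 32.26 | 0.131
+Data Filtering | no | yes | 36.11 | 0.102
Complete Method | yes | yes | 30.02 | 0.140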
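The two statistics in this analysis can be computed from paired gradients, as sketched below. The summary does not say how a conflict is counted, so the sketch assumes a conflict is a training step where the task gradient and the alignment gradient have negative cosine similarity. The function names and toy gradients are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conflict_stats(task_grads, align_grads):
    """Return (conflict frequency in %, average cosine similarity)."""
    sims = [
        cosine_similarity(g_task, g_align)
        for g_task, g_align in zip(task_grads, align_grads)
    ]
    # Assumption: a step "conflicts" when the two gradients point apart.
    conflict_freq = 100.0 * sum(s < 0.0 for s in sims) / len(sims)
    avg_sim = sum(sims) / len(sims)
    return conflict_freq, avg_sim


# One flattened gradient pair per training step (toy 2-D values).
task_grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
align_grads = [[1.0, 0.0], [0.0, -1.0], [0.0, 1.0], [1.0, 1.0]]
freq, avg = conflict_stats(task_grads, align_grads)
print(freq, round(avg, 3))
```

Under this reading, a lower conflict frequency and a higher average similarity both indicate that the two objectives pull the parameters in more compatible directions.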

Component Contribution Analysis

  • Alignment Distillation Only: HS=2.2, FA=46.2 (cannot independently address harmful data)
  • Data Filtering Only: HS=0.6, FA=46.5 (reduces harm but impacts task performance)
  • Complete Method: HS=0.5, FA=49.0 (synergistic combination achieves best performance)

Generalization Experiments

Cross-Dataset Generalization

Average performance on GSM8K, SST2, AGNEWS, AlpacaEval:

  • Ref-Teacher: HS=1.1, FA=52.8 (best)
  • Best baseline (Booster): HS=10.0, FA=51.3

Cross-Model Architecture Generalization

Average performance on Llama3-8B, Gemma2-9B, Qwen2-7B:

  • Ref-Teacher: HS=0.8, FA=60.8 (best)
  • Best baseline (Booster): HS=4.4, FA=57.3

Classification Performance Verification

Ref-Teacher's F1 scores on harmful content detection:

  • BeaverTails: 93.4%
  • JailbreakBench: 79.8%
  • GCG attacks: 92.9%
  • AutoDAN attacks: 82.1%
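For reference, an F1 score treats the filter as a binary harmful/benign classifier and takes the harmonic mean of precision and recall. The sketch below shows the standard computation on toy labels; the function name and data are hypothetical and unrelated to the reported numbers.

```python
def f1_score(predicted_harmful, actually_harmful):
    """F1 (in %) of binary harmful-prompt predictions."""
    pairs = list(zip(predicted_harmful, actually_harmful))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


predicted = [True, True, True, False, False]
actual = [True, True, False, True, False]
score = f1_score(predicted, actual)
print(round(score, 1))  # 66.7
```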

Related Work

LLM Safety Research

  • Training-time Defense: Enhancing robustness through adversarial training, data balancing, and other methods
  • Inference-time Defense: Leveraging LLM self-assessment of harmfulness or internal discrepancies for protection

Harmful Finetuning Attack Defense

  1. Alignment-stage Solutions: Obtaining robust safety-aligned weights through regularization techniques
  2. Finetuning-stage Solutions: Freezing critical parameters or incorporating safety regularization
  3. Post-finetuning Solutions: Analyzing discrepancies and editing model weights to compensate for safety degradation

The main distinction of this work from existing approaches is direct finetuning of base models rather than safety-aligned models, mitigating gradient conflicts through teacher guidance.

Conclusions and Discussion

Main Conclusions

  1. Safety-aligned weights are insufficient: Safety-aligned models provide weak initialization for downstream tasks, resulting in dual losses in performance and safety
  2. Direct finetuning is more effective: Simultaneous safety alignment and task learning on base models achieves better results
  3. Gradient conflict is the key challenge: Requires synergistic mitigation through alignment distillation and data filtering
  4. Strong practicality: The method demonstrates stable performance across multiple settings, suitable for FaaS deployment

Limitations

  1. Dependence on refusal features: If refusal features are compromised by adversarial attacks, the entire framework's safety may be jeopardized
  2. Computational overhead: Requires additional Ref-Teacher model training, increasing computational costs
  3. Data quality dependency: Method effectiveness depends on the quality and coverage of safety alignment data

Future Directions

  1. Robustness Enhancement: Research defense methods against adversarial refusal feature manipulation
  2. Efficiency Optimization: Explore more efficient teacher training and knowledge distillation strategies
  3. Theoretical Analysis: Deepen understanding of the mathematical nature of gradient conflicts and mitigation mechanisms

In-Depth Evaluation

Strengths

  1. Deep Problem Discovery: First systematic identification of fundamental limitations in safety-aligned weights, providing new perspectives for the field
  2. Clever Method Design: Elegantly solves gradient conflict problems through refusal feature and dual mechanism design
  3. Comprehensive Experiments: Covers multiple settings, datasets, and models with rigorous experimental design and convincing results
  4. High Practical Value: Directly addresses FaaS scenarios with strong practical application value

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of gradient conflict phenomena and mitigation mechanisms
  2. Computational Cost Considerations: Insufficient discussion of computational overhead from additional Ref-Teacher training
  3. Limited Attack Models: Primarily considers data poisoning attacks; robustness against more complex adversarial attacks requires verification
  4. Hyperparameter Sensitivity: While ablation studies are provided, sensitivity analysis of critical hyperparameters is insufficient

Impact

  1. Academic Contribution: Provides a new research paradigm for LLM safety finetuning, potentially inspiring subsequent research
  2. Industrial Value: Directly solves practical FaaS security problems with significant commercial application prospects
  3. Reproducibility: Provides detailed experimental settings and hyperparameters, facilitating reproduction and improvement

Applicable Scenarios

  1. FaaS Platforms: Safety assurance for AI service providers' finetuning services
  2. Customized LLMs: Safety solutions for enterprise internal LLM customization and deployment
  3. Multi-task Learning: LLM training scenarios requiring simultaneous optimization of multiple objectives
  4. Safety-Critical Applications: LLM application domains with high safety requirements

References

This paper cites important works in LLM safety, harmful finetuning attacks, and knowledge distillation, providing a comprehensive literature foundation for related research. Particularly noteworthy are the works on refusal features (Arditi et al. 2024) and on existing harmful finetuning defense methods (Huang et al. 2024 series, Rosati et al. 2024, etc.).