
LoRA is All You Need for Safety Alignment of Reasoning LLMs

Xue, Mirzasoleiman

Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs--with safety levels comparable to full-model fine-tuning--without compromising their reasoning abilities. Our ablation studies further identify three key factors in LoRA: (1) rank-$1$ updates are sufficient to achieve the best reasoning and safety performance, (2) the up projection layers are the most critical modules, with LoRA applied to them alone achieving even better results, and (3) middle layers are more effective than early or late layers. Together, these findings show that strong safety and reasoning can be achieved at minimal computational cost when updates are applied in the right places. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. Finally, while our attempts to further reduce this overlap yield only modest improvements on some tasks, they highlight the potential of developing methods that more reliably optimize the reasoning-safety tradeoff.


Basic Information

Paper ID: 2507.17075
Title: LoRA is All You Need for Safety Alignment of Reasoning LLMs
Authors: Yihao Xue, Baharan Mirzasoleiman (UCLA)
Category: cs.AI
Publication Date: July 2025 (arXiv v3: October 24, 2025)
Paper Link: https://arxiv.org/abs/2507.17075
Code Link: https://github.com/YihaoXue/lora-safety-reasoning

Abstract

Large language models with strong reasoning capabilities have achieved significant breakthroughs in solving complex problems, yet safety alignment fine-tuning often severely damages their reasoning abilities, a phenomenon known as the "Safety Tax." This paper demonstrates that supervised fine-tuning (SFT) with LoRA on refusal datasets can effectively achieve safety alignment without compromising reasoning capabilities. The explanation offered is that constraining safety weight updates to a low-rank space minimizes interference with reasoning weights. Extensive experiments across four benchmarks in mathematics, science, and programming show that the resulting models achieve safety levels comparable to full-model fine-tuning while maintaining strong reasoning abilities. Ablation studies further reveal that: (1) rank-1 updates are sufficient for optimal reasoning-safety trade-offs; (2) the up projection layer is the most critical module; (3) middle layers are more effective than early or late layers.

Research Background and Motivation

Core Problems

Safety Risks of Reasoning Models: Large language models with reasoning capabilities (such as the DeepSeek-R1 series) tend to lose their original safety alignment after reasoning fine-tuning, even when the starting model was already safety-aligned.
"Safety Tax" Phenomenon: Subsequent safety alignment fine-tuning, while improving safety, significantly reduces the model's reasoning abilities. Even incorporating chain-of-thought (CoT) style reasoning in safety fine-tuning datasets cannot fully preserve reasoning capabilities.

Problem Significance

Reasoning capability represents a major breakthrough in modern LLMs, enabling them to solve previously intractable complex problems
Safety alignment is a necessary condition for model deployment, ensuring models do not assist with harmful requests
The trade-off between reasoning and safety directly impacts model practical value

Limitations of Existing Methods

Instruction Fine-tuning Safety Protection Methods Are Inapplicable:
- Data filtering methods (e.g., Shen et al., 2024) are unsuitable because reasoning fine-tuning datasets are typically carefully curated and unlikely to contain unsafe content
- Methods restricting model updates (e.g., Hsu et al., 2024) are ineffective because acquiring reasoning capabilities requires longer training and larger weight updates
Full Model Fine-tuning Problems:
- The authors find that full-model fine-tuning produces high-rank weight changes (stable rank ranging from 40 to 100), as shown in the paper's Figure 1
- These high-rank changes introduce numerous unnecessary modifications that interfere with reasoning-related weights
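
To make the "stable rank" figure concrete, here is a minimal sketch of how such a quantity is commonly computed. It assumes the standard definition (squared Frobenius norm divided by squared spectral norm); this summary does not state which definition the paper uses, and the matrices below are random toys rather than real weight updates.

```python
import numpy as np

def stable_rank(matrix: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm divided by squared spectral norm."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(singular_values**2) / singular_values[0] ** 2)

rng = np.random.default_rng(0)
d, k = 64, 48  # toy dimensions, chosen only for illustration

# A rank-1 update (outer product) concentrates all energy in one direction.
rank_one_update = np.outer(rng.standard_normal(d), rng.standard_normal(k))
# A dense random update spreads energy over many directions.
dense_update = rng.standard_normal((d, k))

print("rank-1 update:", stable_rank(rank_one_update))
print("dense update: ", stable_rank(dense_update))
```

The stable rank never exceeds the ordinary rank, so a rank-r LoRA update is capped at r, whereas a full-model update has no such cap.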

Research Motivation

Existing evidence suggests that safety-related behaviors in LLMs are typically controlled by few dominant directions:

In activation space: such as steering vectors (Panickssery et al., 2023) or refusal features (Arditi et al., 2024)
In weight space: safety-critical weights tend to lie in low-rank subspaces (Jain et al., 2024; Wei et al., 2024)

Therefore, the authors hypothesize that low-rank modifications may be sufficient to induce safety behaviors without altering the entire weight space.

Core Contributions

Proposes a Simple and Effective Solution: Demonstrates that using LoRA for safety alignment fine-tuning can achieve strong safety without compromising reasoning abilities, effectively circumventing the "Safety Tax."
Comprehensive Experimental Validation:
- Verification on 4 benchmarks (AIME, GPQA, HumanEval+, MBPP+)
- Coverage of mathematics, science, and programming domains
- Effectiveness on both 7B and 14B models
In-depth Ablation Studies revealing three key findings:
- Rank-1 Updates Are Sufficient: Achieving optimal reasoning-safety trade-offs with minimal cost configuration
- Up Projection Layer Is Most Critical: Updating only the up projection layer even outperforms updating the entire MLP
- Middle Layers Are Most Important: Updating 16 middle layers is typically sufficient
Weight Structure Analysis:
- Discovers that LoRA updates have smaller overlap with initial weights
- Explores methods to further reduce overlap, achieving modest improvements on certain tasks
Achieves "Three Birds with One Stone": Simultaneously attaining strong safety, strong reasoning ability, and computational efficiency

Methodology Details

Task Definition

Input: Reasoning-capable language model
Objective: Through safety alignment fine-tuning, enable the model to refuse harmful requests while maintaining reasoning capabilities
Constraint: Minimize interference with original reasoning weights

LoRA Core Principles

LoRA (Low-Rank Adaptation) modifies weights by injecting trainable low-rank matrices while keeping original weights frozen:

$W' = W + \Delta W, \quad \text{where} \quad \Delta W = \frac{\alpha}{r}BA$

Where:

$B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices
$r \ll \min(d, k)$ is the rank
$\frac{\alpha}{r}$ is the scaling factor, with $\alpha$ as a hyperparameter
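
The formula above can be sketched numerically as follows. This is an illustration with toy sizes, not the authors' code: `B` is drawn at random so that the update is nonzero (LoRA implementations usually initialize `B` to zero before training), and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 24, 1, 16  # toy sizes; r=1 and alpha=16 mirror the reported defaults

W = rng.standard_normal((d, k))  # frozen pretrained weight
B = rng.standard_normal((d, r))  # trainable factor, d x r (random here for illustration)
A = rng.standard_normal((r, k))  # trainable factor, r x k

delta_W = (alpha / r) * (B @ A)  # low-rank update, same shape as W
W_adapted = W + delta_W

x = rng.standard_normal(k)
# Merged and unmerged forms give the same output.
y_merged = W_adapted @ x
y_unmerged = W @ x + (alpha / r) * (B @ (A @ x))

print("rank of delta_W:", np.linalg.matrix_rank(delta_W))
print("outputs match:", np.allclose(y_merged, y_unmerged))
```

Whatever values `B` and `A` take during training, the rank of `delta_W` cannot exceed `r`, which is the constraint the paper relies on.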

Method Advantages Analysis

Low-Rank Constraint: Restricts updates to a low-rank subspace, significantly reducing interference with original weights
Alignment with Safety Mechanisms:
- Safety behaviors are typically controlled by single or few directions
- Low-rank modifications are sufficient for safety alignment
- Avoids high-rank, unnecessary changes in full model fine-tuning
Computational Efficiency:
- Substantial reduction in parameter count
- Significant decrease in training cost and memory usage
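
A back-of-the-envelope count shows the scale of the parameter reduction. The hidden size, intermediate size, and layer count below are assumed, illustrative values for a 7B-scale model; they are not taken from this summary and should be replaced with the actual model configuration.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

# Assumed, illustrative dimensions (not from the summary).
hidden, intermediate, num_layers = 3584, 18944, 28
r = 1

per_projection_full = hidden * intermediate
per_projection_lora = lora_param_count(intermediate, hidden, r)

# Three MLP projections (gate, up, down) in every layer.
full_mlp = 3 * num_layers * per_projection_full
lora_mlp = 3 * num_layers * per_projection_lora

print("full MLP parameters:  ", full_mlp)
print("rank-1 LoRA parameters:", lora_mlp)
print("reduction factor:      ", round(full_mlp / lora_mlp))
```

Under these assumptions a rank-1 adapter on every MLP projection trains a few million parameters, against several billion for the full MLP weights.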

Training Strategy

Full Model Fine-tuning Baseline:

Training for 5 epochs
All parameters updated through standard gradient optimization

LoRA Fine-tuning:

Training for 10 epochs
Only low-rank matrices B and A are updated
Default configuration: Applied only to MLP layers, rank r=1

Experimental Setup

Models

DeepSeek-R1-Distill-Qwen-7B: 7B parameter reasoning model
DeepSeek-R1-Distill-Qwen-14B: 14B parameter reasoning model
Llama-Guard-3-8B: Used only as the safety evaluator; reported by Jiang et al. (2025) to be the strongest safety evaluator

Datasets

Safety Fine-tuning Dataset:

DirectRefusal: Adapted from Rosati et al. (2024), adjusted by Huang et al. (2025)
Contains refusal responses paired with harmful requests
Each response includes brief reasoning ("I should not answer this question!") + refusal

Safety Evaluation Dataset:

StrongREJECT (Souly et al., 2024): 310 policy-violating queries

Reasoning Benchmarks:

AIME 2024: American Invitational Mathematics Examination, evaluating mathematical reasoning
GPQA-diamond (Rein et al., 2024): Graduate-level science questions
HumanEval+ (Chen et al., 2021 + Liu et al., 2023): Enhanced version of code generation benchmark
MBPP+ (Austin et al., 2021 + Liu et al., 2023): Enhanced version of code generation benchmark

Evaluation Metrics

Safety:

Uses Llama-Guard-3-8B to judge whether model responses are harmful
Safety Score: Proportion of questions where model responses are judged as harmful (lower is better)

Reasoning Ability:

Pass@1: Sample n=8 responses for each question, calculate proportion of correct responses, then average across all questions
AIME uses Qwen2.5-32B-Instruct as judge
GPQA uses regex matching (multiple choice)
HumanEval+ and MBPP+ use code execution tests
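
The Pass@1 computation described above can be sketched as below. The correctness flags are made-up toy values used only to show the arithmetic; how each response is judged correct (LLM judge, regex, or code execution) is outside this sketch.

```python
from statistics import mean

def pass_at_1(results_per_question: list[list[bool]]) -> float:
    """Average, over questions, of the fraction of sampled responses that are correct."""
    return mean(sum(samples) / len(samples) for samples in results_per_question)

# Toy correctness flags for 3 questions with n=8 samples each (not real results).
toy_results = [
    [True] * 8,         # always solved
    [True, False] * 4,  # solved half the time
    [False] * 8,        # never solved
]

print(pass_at_1(toy_results))
```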

Implementation Details

7B Model:

Full model fine-tuning: 4 GPUs, batch size=2 per device, 5 epochs
LoRA fine-tuning: 2 GPUs, batch size=2 per device, 10 epochs
LoRA parameters: α=16, dropout=0.05

14B Model:

Full model fine-tuning: 8 GPUs, batch size=1 per device, 5 epochs
LoRA fine-tuning: 4 GPUs, batch size=2 per device, 10 epochs
LoRA parameters: α=16, dropout=0.05

Common Settings:

Learning rate: 5e-5
Weight decay: 1e-4
Save and evaluate checkpoints per epoch
Generation temperature: 0.6, top-p: 0.95, max tokens: 32,768

Experimental Results

Main Results (LoRA Circumvents "Safety Tax")

Figure 2 shows performance of different checkpoints (epochs) on reasoning performance and safety:

7B Model:

Base Model: High accuracy but low safety
Full Model Fine-tuning: Good safety but significant accuracy reduction (clear safety tax)
LoRA Fine-tuning: Strong performance in both reasoning and safety
- Best LoRA checkpoint outperforms base model on all tasks
- Safety slightly lower than full model fine-tuning (average decrease ~0.03)

14B Model:

LoRA fine-tuning shows small but consistent reasoning accuracy decrease compared to base model
Safety performance comparable to full model fine-tuning
Forms Pareto frontier in upper-right corner of reasoning-safety plane

Key Finding: LoRA achieves the ideal combination of "reasoning ability close to base model + safety close to full model fine-tuning."
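
For readers who want to reproduce this kind of comparison, a Pareto frontier over checkpoints can be extracted as in the sketch below. It assumes both axes are oriented so that higher is better, as the "upper-right corner" description suggests. The checkpoint names and numbers are invented toy values, not results from the paper.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other (higher is better on both axes)."""
    frontier = []
    for name, accuracy, safety in points:
        dominated = any(
            (a >= accuracy and s >= safety) and (a > accuracy or s > safety)
            for _, a, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (name, reasoning accuracy, safety) triples for illustration only.
toy_checkpoints = [
    ("base", 0.70, 0.40),
    ("full-ft", 0.50, 0.95),
    ("lora-ep3", 0.69, 0.90),
    ("lora-ep5", 0.66, 0.93),
    ("lora-ep1", 0.65, 0.80),
]

print(pareto_frontier(toy_checkpoints))
```

A checkpoint is dropped only when another one is at least as good on both axes and strictly better on one, which is the usual definition of dominance.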

Ablation Experiments

1. Impact of Rank (Figure 3)

Different rank values (r=1, 4, 8, 64) and full-model fine-tuning are compared on the 14B model:

Reasoning Performance:

Generally decreases with increasing r
Smaller decrease between r=1 and r=8
Full model fine-tuning (full rank) performs worst

Safety Performance:

Significant decrease when r increases from 4 to 64
Full model fine-tuning safety score superior to r=64
Hypothesis: Intermediate ranks may be harder to optimize, while very low-rank or full-rank settings optimize more easily

Pareto Frontier Analysis (Figure 3c):

r=1 achieves best trade-off on AIME
r=1 near-optimal on GPQA
Demonstrates strong performance achievable at minimal fine-tuning cost

Interpretation: That r=1 suffices reflects the low-rank nature of the safety alignment task itself, consistent with prior research showing that safety behaviors can be controlled by a single direction.

2. Impact of Modules

MLP vs. Attention Layers (Figure 4):

Applying LoRA only to MLP layers yields a Pareto frontier similar to applying it to both attention and MLP layers
Conclusion: Updating only MLP layers is sufficient

MLP Internal Projection Layers (Figure 5): Testing gate, up, and down projection layers in Qwen's SwiGLU structure:

Up Projection Most Critical:
- Pareto frontier of updating only up projection comparable to updating entire MLP
- Even outperforms updating entire MLP on HumanEval+ and MBPP+
Down Projection Performs Worst
Conclusion: Different projection layers contribute differently to reasoning-safety trade-off, with up projection particularly important and sufficient when used alone
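
To show where the up projection sits, here is a minimal sketch of a SwiGLU MLP with a rank-1 update applied to the up projection alone. It follows the usual SwiGLU form, down(SiLU(gate(x)) * up(x)), with biases omitted; the sizes and random weights are toy assumptions, not the model's actual parameters.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down):
    """SwiGLU MLP: down( silu(gate(x)) * up(x) ), biases omitted."""
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16  # toy sizes
W_gate = rng.standard_normal((intermediate, hidden))
W_up = rng.standard_normal((intermediate, hidden))
W_down = rng.standard_normal((hidden, intermediate))
x = rng.standard_normal(hidden)

# Rank-1 LoRA update on the up projection only; gate and down stay frozen.
b = rng.standard_normal((intermediate, 1))
a = rng.standard_normal((1, hidden))
W_up_adapted = W_up + 16.0 * (b @ a)  # scaling alpha / r = 16 / 1

y_base = swiglu_mlp(x, W_gate, W_up, W_down)
y_adapted = swiglu_mlp(x, W_gate, W_up_adapted, W_down)

print("output change norm:", np.linalg.norm(y_adapted - y_base))
```

Because the up projection enters the MLP linearly, its update shifts the output by a term that the frozen gate activations rescale elementwise. This is only an observation about the architecture, not an explanation the paper is known to give.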

3. Impact of Layers (Figure 6)

On the 48-layer 14B model, LoRA is applied to only 16 layers, in three configurations:

Early Layers (5-20)
Middle Layers (17-32)
Late Layers (25-40)
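
Restricting LoRA to a block of layers usually means listing the target modules explicitly. The helper below is a hypothetical sketch, not the authors' code. It assumes the module naming pattern `model.layers.{i}.mlp.{proj}` used by Hugging Face Qwen2-style models, and it assumes that the layer ranges quoted above are 1-indexed, which is why it subtracts one.

```python
def mlp_targets(first_layer, last_layer, projections=("gate_proj", "up_proj", "down_proj")):
    """Module names for the chosen MLP projections in layers first_layer..last_layer.

    Layer numbers are treated as 1-indexed and inclusive (an assumption) and are
    converted to the 0-indexed names used in the module path.
    """
    return [
        f"model.layers.{layer - 1}.mlp.{proj}"
        for layer in range(first_layer, last_layer + 1)
        for proj in projections
    ]

middle_all_mlp = mlp_targets(17, 32)
middle_up_only = mlp_targets(17, 32, projections=("up_proj",))

print(len(middle_all_mlp), middle_all_mlp[0])
print(len(middle_up_only), middle_up_only[-1])
```

A list like this could then be passed to whatever adapter library is in use as its set of target modules.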

Results:

Middle Layers Achieve Best Trade-off:
- Comparable to updating all layers on AIME and GPQA
- Slightly inferior to updating all layers on HumanEval+ and MBPP+
Early or late layers perform significantly worse

Connection to Prior Research:

Steering vectors (Panickssery et al., 2023)
Refusal features (Arditi et al., 2024)
These studies show that intermediate representations responsible for safety behaviors are most prominent in middle layers

Weight Structure Analysis

Overlap Between LoRA Updates and Initial Weights (Figure 7)

Defining four metrics to quantify overlap:

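$\frac{\|W_I^\top \Delta W\|}{\|W_I\|\|\Delta W\|}$ : Matrix-level cosine similarity between the column spaces of the initial weights $W_I$ and the update $\Delta W$
$\frac{\|U_{16}U_{16}^\top \Delta W\|}{\|\Delta W\|}$ : Fraction of $\Delta W$ that projects onto the top 16 principal directions of $W_I$ in column space (with $U_{16}$ presumably holding the top 16 left singular vectors of $W_I$)
$\frac{\|W_I \Delta W^\top\|}{\|W_I\|\|\Delta W\|}$ : The same cosine similarity for the row spaces
$\frac{\|V_{16}V_{16}^\top \Delta W^\top\|}{\|\Delta W\|}$ : The same projection for the row space (with $V_{16}$ presumably holding the top 16 right singular vectors of $W_I$)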
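
The four metrics can be computed as in the sketch below. Two assumptions are made here that the summary does not confirm: the norms are Frobenius norms, and $U_{16}$, $V_{16}$ are the top left and right singular vectors of $W_I$. The matrices are random toys.

```python
import numpy as np

def overlap_metrics(W_init, delta_W, top_k=16):
    """Column- and row-space overlap between initial weights and an update."""
    fro = np.linalg.norm  # Frobenius norm for 2-D inputs by default
    U, _, Vt = np.linalg.svd(W_init, full_matrices=False)
    U_k = U[:, :top_k]  # top-k left singular vectors
    V_k = Vt[:top_k].T  # top-k right singular vectors
    return {
        "col_cosine": fro(W_init.T @ delta_W) / (fro(W_init) * fro(delta_W)),
        "col_topk": fro(U_k @ (U_k.T @ delta_W)) / fro(delta_W),
        "row_cosine": fro(W_init @ delta_W.T) / (fro(W_init) * fro(delta_W)),
        "row_topk": fro(V_k @ (V_k.T @ delta_W.T)) / fro(delta_W),
    }

rng = np.random.default_rng(0)
W_init = rng.standard_normal((40, 30))
U, _, Vt = np.linalg.svd(W_init, full_matrices=False)

# An update aligned with the top singular direction of W_init: maximal projection overlap.
aligned = np.outer(U[:, 0], Vt[0])
# An update with its top-16 column-space component removed: zero column projection.
U16 = U[:, :16]
orthogonal = (np.eye(40) - U16 @ U16.T) @ rng.standard_normal((40, 30))

print(overlap_metrics(W_init, aligned))
print(overlap_metrics(W_init, orthogonal))
```

All four values lie between 0 and 1, with smaller values indicating an update that is closer to orthogonal to the initial weights.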

Comparing: Full model fine-tuning vs. LoRA (r=4, applied to attention and MLP)

Findings:

LoRA achieves smaller overlap in most modules (few exceptions)
More orthogonal in both column and row spaces
LoRA's safety-oriented updates use subspaces more separated from those used by original reasoning-related weights
Although the overlap reduction is sometimes modest, it suggests that LoRA updates interfere less with reasoning components

Methods for Further Reducing Overlap (Figure 8)

Two Approaches:

Regularization:
- reg-col: Add penalty term $\beta(\frac{\|W_I^\top \Delta W\|}{\|W_I\|\|\Delta W\|})^2$ during training
- reg-both: Simultaneously penalize column and row space overlap
- Setting β=1
Post-hoc Orthogonalization (OrthoMerge):
- OrthoMerge-col: $\Delta W \leftarrow (I - U_k U_k^\top)\Delta W$
- OrthoMerge-both: $\Delta W \leftarrow \lambda(I - U_k U_k^\top)\Delta W(I - V_k V_k^\top)$
- Using scaling factor λ to compensate for safety loss
- Testing λ ∈ {1, 1.15, 1.75, 1.2, 1.25}, k=64
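
The two approaches above can be sketched as follows. This is an illustrative reading of the formulas, not the authors' implementation: it assumes Frobenius norms, takes $U_k$ and $V_k$ to be the top-k left and right singular vectors of $W_I$, and uses toy sizes with k=8 rather than the reported k=64.

```python
import numpy as np

def top_singular_bases(W_init, k):
    """Top-k left and right singular vectors of the initial weights."""
    U, _, Vt = np.linalg.svd(W_init, full_matrices=False)
    return U[:, :k], Vt[:k].T

def orthomerge_col(W_init, delta_W, k):
    """delta_W <- (I - U_k U_k^T) delta_W"""
    U_k, _ = top_singular_bases(W_init, k)
    return delta_W - U_k @ (U_k.T @ delta_W)

def orthomerge_both(W_init, delta_W, k, lam=1.0):
    """delta_W <- lam * (I - U_k U_k^T) delta_W (I - V_k V_k^T)"""
    U_k, V_k = top_singular_bases(W_init, k)
    left = delta_W - U_k @ (U_k.T @ delta_W)
    return lam * (left - (left @ V_k) @ V_k.T)

def col_overlap_penalty(W_init, delta_W, beta=1.0):
    """reg-col penalty: beta * (||W_I^T dW|| / (||W_I|| ||dW||))^2"""
    ratio = np.linalg.norm(W_init.T @ delta_W) / (
        np.linalg.norm(W_init) * np.linalg.norm(delta_W)
    )
    return beta * ratio**2

rng = np.random.default_rng(0)
W_init = rng.standard_normal((40, 30))
delta_W = rng.standard_normal((40, 4)) @ rng.standard_normal((4, 30))  # toy rank-4 update
k = 8  # toy value

U_k, V_k = top_singular_bases(W_init, k)
col_only = orthomerge_col(W_init, delta_W, k)
both = orthomerge_both(W_init, delta_W, k, lam=1.2)

print("penalty before:", col_overlap_penalty(W_init, delta_W))
print("penalty after OrthoMerge-col:", col_overlap_penalty(W_init, col_only))
```

The projections shrink the update, which is presumably why the scaling factor λ is needed to recover the lost safety effect.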

Results:

"both" variants outperform "col" variants
OrthoMerge-both most promising:
- Strictly superior to vanilla LoRA on AIME and GPQA
- Slightly superior on MBPP+
- Slightly inferior on HumanEval+
Overall improvements modest and inconsistent, suggesting need for more refined approaches

Related Work

Fine-tuning Safety-Aligned Models

Problem: Instruction fine-tuning causes safety degradation (Qi et al., 2023; Hsiung et al., 2025)
Solutions:
- Data filtering (Shen et al., 2024; Choi et al., 2024)
- Injecting safety samples (Bianchi et al., 2023)
- Leveraging guardrail models (Peng et al., 2025)
- Importance of prompt templates (Lyu et al., 2024)
- Algorithmic methods: projecting to "safety subspace" (Hsu et al., 2024), regularization (Mukhoti et al., 2023)
Limitations: Inapplicable to reasoning models, as reasoning capability requires longer training and larger weight updates

Safety Alignment After Fine-tuning

Methods: SFT and/or RL (Wei et al., 2021; Ouyang et al., 2022; Rafailov et al., 2023)
Problem: "Safety Tax" phenomenon (Huang et al., 2025)
- Safety alignment significantly damages reasoning performance
- Even incorporating CoT reasoning in safety fine-tuning datasets cannot fully preserve reasoning ability (Jiang et al., 2025)

This Paper's Contribution

Demonstrates that simply applying LoRA can effectively align reasoning models with little or no loss of reasoning performance, filling a gap in existing literature.

Conclusions and Discussion

Main Conclusions

LoRA is an Effective Solution for Safety Alignment of Reasoning LLMs:
- Achieves safety comparable to full model fine-tuning
- Maintains reasoning ability close to original model
- Effectively circumvents "Safety Tax"
Minimal Configuration Guidelines:
- Rank-1 Is Sufficient: Achieving optimal trade-off with minimal cost
- Update Only Up Projection Layer: Even outperforms updating entire MLP
- Focus on Middle Layers: 16 middle layers typically sufficient
Mechanistic Insights:
- LoRA updates have smaller overlap with initial weights
- Low-rank constraint minimizes interference with reasoning weights
- Consistent with theory that safety behaviors are controlled by low-dimensional directions

Limitations

Residual Performance Gap:
- 14B model still shows small decreases on certain tasks (AIME, HumanEval+, MBPP+)
- Methods for further reducing overlap show limited and inconsistent improvements
Architectural Limitations:
- Experiments primarily on Qwen architecture
- Generalization to other LLM architectures needs validation
Insufficient Attention Layer Ablation:
- Primarily focuses on MLP layers
- Detailed ablation of attention layers left for future work
Mechanism Understanding:
- Why up projection is so effective requires deeper investigation
- Need more precise metrics to capture interference effects

Future Directions

Method Improvements:
- Develop more reliable methods for optimizing reasoning-safety trade-offs
- Better control of LoRA update subspace geometry
Architectural Extensions:
- Validate findings on other LLM architectures
- Study detailed ablation of attention layers
Theoretical Deepening:
- Deeper understanding of up projection effectiveness
- Develop more precise interference metrics
RL Alignment:
- Extend findings to RL-based safety alignment techniques
Application Exploration:
- Explore applications in other multi-objective balancing scenarios

In-Depth Evaluation

Strengths

Important and Practical Problem:
- Directly addresses critical challenges in reasoning LLM deployment
- "Safety Tax" is a genuine pain point in practical applications
- Broad practical value
Simple and Effective Method:
- Uses off-the-shelf LoRA technology without complex modifications
- Easy to implement with strong reproducibility
- High computational efficiency, easy for practical deployment
Comprehensive and In-depth Experiments:
- Multiple model sizes (7B, 14B)
- Multiple domains (mathematics, science, programming)
- Four benchmarks with broad coverage
- Thorough ablation studies providing clear configuration guidelines
Deep Insights:
- Finding that rank-1 is sufficient is concise and powerful
- Up projection importance provides direction for future research
- Critical role of middle layers aligns with theory
- Weight overlap analysis provides mechanistic understanding
Clear Writing:
- Well-structured, logical flow
- Rich figures with good visualization
- Sufficient technical details for reproducibility

Weaknesses

Performance Gap Not Completely Eliminated:
- 14B model still shows small decreases on certain tasks
- Further optimization methods (OrthoMerge) show limited improvements
- Suggests problem not fully solved
Limited Architectural Coverage:
- Experiments only on Qwen architecture
- Generalization to other architectures (Llama, Mistral) unknown
- Limits universality of conclusions
Insufficient Mechanistic Explanation:
- Lacks deep analysis of why up projection is so important
- Causal relationship between weight overlap reduction and performance improvement unclear
- Needs more theoretical support
Insufficient Attention Layer Research:
- Primarily focuses on MLP, limited attention layer ablation
- May miss important findings
Evaluation Limitations:
- Safety evaluation relies on single evaluator (Llama-Guard-3-8B)
- Pass@1 metric may not be comprehensive
- Lacks human evaluation

Impact

Academic Contribution:
- Fills research gap in reasoning model safety alignment
- Provides clear practical guidelines
- Offers new perspective on LoRA's role in multi-objective optimization
- Expected to inspire follow-up research
Practical Value:
- Directly applicable to actual model deployment
- Reduces computational cost of safety alignment
- Improves usability of reasoning models
- Important reference for industry
Reproducibility:
- Code open-sourced (GitHub)
- Sufficient experimental details
- Uses public datasets and models
- Easy to verify and extend

Applicable Scenarios

Safety Alignment of Reasoning LLMs:
- Mathematical reasoning models (e.g., math problem-solving assistants)
- Scientific reasoning models (e.g., research assistants)
- Code generation models (e.g., programming assistants)
Resource-Constrained Environments:
- Scenarios requiring low-cost fine-tuning
- Memory-limited deployment environments
- Rapid iteration development processes
Multi-objective Optimization Scenarios:
- Fine-tuning tasks requiring objective balancing
- Adding capabilities while preserving original abilities
- Domain adaptation without compromising general capability
Inapplicable Scenarios:
- Critical applications requiring complete performance gap elimination
- Non-Qwen architecture models (needs validation)
- Fundamental transformations requiring large parameter updates

References

Key Citations:

Huang et al., 2025: "Safety Tax: Safety alignment makes your large reasoning models less reasonable" - First systematic description of "Safety Tax" phenomenon
Jiang et al., 2025: "SafeChain: Safety of language models with long chain-of-thought reasoning capabilities" - Reports safety risks of reasoning models
Hu et al., 2022: "LoRA: Low-Rank Adaptation of Large Language Models" - Original LoRA paper
Panickssery et al., 2023: "Steering Llama 2 via contrastive activation addition" - Steering vectors research
Arditi et al., 2024: "Refusal in language models is mediated by a single direction" - Refusal features research
Jain et al., 2024: "What makes and breaks safety fine-tuning? A mechanistic study" - Mechanistic study of safety fine-tuning
Wei et al., 2024: "Assessing the brittleness of safety alignment via pruning and low-rank modifications" - Brittleness of safety alignment research

Overall Assessment: This is a high-quality research paper that addresses the important problem of safety alignment for reasoning LLMs with a simple and effective solution. It has limitations, notably a performance gap that is not fully eliminated and limited architectural coverage, but its core contributions are solid, its experiments are comprehensive, and its insights are valuable to both academia and industry. In particular, the three findings (rank-1 sufficiency, up projection criticality, and middle layer importance) provide clear guidance for future research and practical applications.