Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs--with safety levels comparable to full-model fine-tuning--without compromising their reasoning abilities. Our ablation studies further identify three key factors in LoRA: (1) rank-$1$ updates are sufficient to achieve the best reasoning and safety performance, (2) the up projection layers are the most critical modules, with LoRA applied to them alone achieving even better results, and (3) middle layers are more effective than early or late layers. Together, these findings show that strong safety and reasoning can be achieved at minimal computational cost when updates are applied in the right places. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. Finally, while our attempts to further reduce this overlap yield only modest improvements on some tasks, they highlight the potential of developing methods that more reliably optimize the reasoning-safety tradeoff.
- Paper ID: 2507.17075
- Title: LoRA is All You Need for Safety Alignment of Reasoning LLMs
- Authors: Yihao Xue, Baharan Mirzasoleiman (UCLA)
- Category: cs.AI
- Publication Date: July 2025 (arXiv v3: October 24, 2025)
- Paper Link: https://arxiv.org/abs/2507.17075
- Code Link: https://github.com/YihaoXue/lora-safety-reasoning
Large language models with strong reasoning capabilities have achieved significant breakthroughs in solving complex problems, yet safety alignment fine-tuning often severely damages their reasoning abilities, a phenomenon known as the "Safety Tax." This paper demonstrates that supervised fine-tuning (SFT) with LoRA on refusal datasets can achieve safety alignment without compromising reasoning capabilities, because constraining the safety weight updates to a low-rank space minimizes interference with the reasoning weights. Extensive experiments across four benchmarks in mathematics, science, and programming show that the resulting models reach safety levels comparable to full model fine-tuning while keeping their reasoning abilities. Ablation studies further reveal that: (1) rank-1 updates are sufficient for the best reasoning-safety trade-off; (2) the up projection layer is the most critical module; (3) middle layers are more effective than early or late layers.
- Safety Risks of Reasoning Models: Large language models with reasoning capabilities (such as DeepSeek-R1 series) tend to lose their original safety alignment after reasoning fine-tuning, even when the starting model was already safety-aligned.
- "Safety Tax" Phenomenon: Subsequent safety alignment fine-tuning, while improving safety, significantly reduces the model's reasoning abilities. Even incorporating chain-of-thought (CoT) style reasoning in safety fine-tuning datasets cannot fully preserve reasoning capabilities.
- Reasoning capability represents a major breakthrough in modern LLMs, enabling them to solve previously intractable complex problems
- Safety alignment is a necessary condition for model deployment, ensuring models do not assist with harmful requests
- The trade-off between reasoning and safety directly affects a model's practical value
- Instruction Fine-tuning Safety Protection Methods Are Inapplicable:
- Data filtering methods (e.g., Shen et al., 2024) are unsuitable because reasoning fine-tuning datasets are typically carefully curated and unlikely to contain unsafe content
- Methods restricting model updates (e.g., Hsu et al., 2024) are ineffective because acquiring reasoning capabilities requires longer training and larger weight updates
- Full Model Fine-tuning Problems:
- The authors find that full model fine-tuning leads to high-rank weight changes (stable rank ranging from 40 to 100), as shown in Figure 1
- These high-rank changes introduce numerous unnecessary modifications that interfere with reasoning-related weights
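To make "stable rank" concrete: this summary does not give the exact definition the authors use, so the sketch below assumes the common one, the squared Frobenius norm divided by the squared spectral norm. The matrices are random placeholders, not weights from the paper.

```python
import numpy as np

def stable_rank(m: np.ndarray) -> float:
    """Squared Frobenius norm divided by squared spectral norm (assumed definition)."""
    fro_sq = np.linalg.norm(m, ord="fro") ** 2
    spec_sq = np.linalg.norm(m, ord=2) ** 2
    return float(fro_sq / spec_sq)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for a weight change Delta W (not real model weights).
rank1_update = np.outer(rng.normal(size=64), rng.normal(size=32))
dense_update = rng.normal(size=(64, 32))

print(stable_rank(rank1_update))  # close to 1 for a rank-1 matrix
print(stable_rank(dense_update))  # noticeably larger for a dense random matrix
```

A rank-1 update has stable rank 1 by construction, which is the regime the LoRA experiments below operate in.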
Existing evidence suggests that safety-related behaviors in LLMs are typically controlled by a few dominant directions:
- In activation space: such as steering vectors (Panickssery et al., 2023) or refusal features (Arditi et al., 2024)
- In weight space: safety-critical weights tend to lie in low-rank subspaces (Jain et al., 2024; Wei et al., 2024)
Therefore, the authors hypothesize that low-rank modifications may be sufficient to induce safety behaviors without altering the entire weight space.
- Proposes a Simple and Effective Solution: Demonstrates that using LoRA for safety alignment fine-tuning can achieve strong safety without compromising reasoning abilities, effectively circumventing the "Safety Tax."
- Comprehensive Experimental Validation:
- Verification on 4 benchmarks (AIME, GPQA, HumanEval+, MBPP+)
- Coverage of mathematics, science, and programming domains
- Effectiveness on both 7B and 14B models
- In-depth Ablation Studies revealing three key findings:
- Rank-1 Updates Are Sufficient: Achieving optimal reasoning-safety trade-offs with minimal cost configuration
- Up Projection Layer Is Most Critical: Updating only the up projection layer even outperforms updating the entire MLP
- Middle Layers Are Most Important: Updating 16 middle layers is typically sufficient
- Weight Structure Analysis:
- Discovers that LoRA updates have smaller overlap with initial weights
- Explores methods to further reduce overlap, achieving modest improvements on certain tasks
- Achieves "Three Birds with One Stone": Simultaneously attaining strong safety, strong reasoning ability, and computational efficiency
- Input: Reasoning-capable language model
- Objective: Through safety alignment fine-tuning, enable the model to refuse harmful requests while maintaining reasoning capabilities
- Constraint: Minimize interference with original reasoning weights
LoRA (Low-Rank Adaptation) modifies weights by injecting trainable low-rank matrices while keeping original weights frozen:
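$W' = W + \Delta W$, where $\Delta W = \frac{\alpha}{r} BA$
Where:
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices
- $r \ll \min(d, k)$ is the rank
- $\frac{\alpha}{r}$ is the scaling factor, with $\alpha$ as a hyperparameter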
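The update rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the dimensions are toy values, and `B` and `A` are filled with random numbers to stand in for trained factors (in practice LoRA usually initializes `B` to zero so that training starts from the frozen weights).

```python
import numpy as np

def lora_delta(B: np.ndarray, A: np.ndarray, alpha: float) -> np.ndarray:
    """Return Delta W = (alpha / r) * B @ A, where r is the shared inner dimension."""
    r = B.shape[1]
    return (alpha / r) * (B @ A)

# Toy sizes chosen for illustration only.
d, k, r, alpha = 8, 6, 1, 16.0
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))   # frozen pretrained weight
B = rng.normal(size=(d, r))   # stand-in for a trained factor
A = rng.normal(size=(r, k))   # stand-in for a trained factor

delta_W = lora_delta(B, A, alpha)
W_prime = W + delta_W         # merged weight W' = W + Delta W

print(delta_W.shape, np.linalg.matrix_rank(delta_W))
```

Whatever values `B` and `A` take, the rank of `delta_W` cannot exceed `r`, which is the constraint the paper relies on.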
- Low-Rank Constraint: Restricts updates to a low-rank subspace, significantly reducing interference with original weights
- Alignment with Safety Mechanisms:
- Safety behaviors are typically controlled by a single direction or a few directions
- Low-rank modifications are sufficient for safety alignment
- Avoids high-rank, unnecessary changes in full model fine-tuning
- Computational Efficiency:
- Substantial reduction in parameter count
- Significant decrease in training cost and memory usage
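The parameter saving follows directly from the shapes of $B$ and $A$: a rank-$r$ adapter on a $d \times k$ matrix trains $r(d + k)$ values instead of $dk$. The sketch below works through that arithmetic; the matrix size is a hypothetical example, not a dimension taken from the paper.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

def full_param_count(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k

# Hypothetical projection size, used only to illustrate the ratio.
d, k = 4096, 11008
lora_params = lora_param_count(d, k, r=1)
full_params = full_param_count(d, k)

print(lora_params, full_params, lora_params / full_params)
```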
Full Model Fine-tuning Baseline:
- Training for 5 epochs
- All parameters updated through standard gradient optimization
LoRA Fine-tuning:
- Training for 10 epochs
- Only low-rank matrices B and A are updated
- Default configuration: Applied only to MLP layers, rank r=1
- DeepSeek-R1-Distill-Qwen-7B: 7B parameter reasoning model
- DeepSeek-R1-Distill-Qwen-14B: 14B parameter reasoning model
- Llama-Guard-3-8B: Used as the safety judge; reported by Jiang et al. (2025) to be the strongest safety evaluator
Safety Fine-tuning Dataset:
- DirectRefusal: Adapted from Rosati et al. (2024), adjusted by Huang et al. (2025)
- Contains refusal responses paired with harmful requests
- Each response includes brief reasoning ("I should not answer this question!") + refusal
Safety Evaluation Dataset:
- StrongREJECT (Souly et al., 2024): 310 policy-violating queries
Reasoning Benchmarks:
- AIME 2024: American Invitational Mathematics Examination, evaluating mathematical reasoning
- GPQA-diamond (Rein et al., 2024): Graduate-level science questions
- HumanEval+ (Chen et al., 2021 + Liu et al., 2023): Enhanced version of code generation benchmark
- MBPP+ (Austin et al., 2021 + Liu et al., 2023): Enhanced version of code generation benchmark
Safety:
- Uses Llama-Guard-3-8B to judge whether model responses are harmful
- Safety Score: Proportion of questions where model responses are judged as harmful (lower is better)
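As a minimal sketch of the aggregation step, the function below turns per-question verdicts into the proportion described above. The judging itself is done by Llama-Guard-3-8B and is not reproduced here; the verdict list is made up for illustration, and the name `harmful_fraction` is hypothetical.

```python
def harmful_fraction(judged_harmful: list[bool]) -> float:
    """Proportion of responses that the judge flagged as harmful (lower is better)."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)

# Made-up verdicts for four queries; True means the judge flagged the response.
example_verdicts = [True, False, False, False]
print(harmful_fraction(example_verdicts))  # 0.25
```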
Reasoning Ability:
- Pass@1: Sample n=8 responses for each question, calculate proportion of correct responses, then average across all questions
- AIME uses Qwen2.5-32B-Instruct as judge
- GPQA uses regex matching (multiple choice)
- HumanEval+ and MBPP+ use code execution tests
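The Pass@1 computation described above can be sketched as follows. Grading (LLM judge, regex, or code execution) is assumed to have already produced one correct/incorrect flag per sampled response; the flags below are invented for illustration.

```python
def pass_at_1(flags_per_question: list[list[bool]]) -> float:
    """Average, over questions, of the fraction of sampled responses that are correct."""
    per_question = [sum(flags) / len(flags) for flags in flags_per_question]
    return sum(per_question) / len(per_question)

# Two hypothetical questions, n=8 samples each.
graded = [
    [True] * 8,                 # all eight samples correct
    [True] * 4 + [False] * 4,   # half of the samples correct
]
print(pass_at_1(graded))  # 0.75
```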
7B Model:
- Full model fine-tuning: 4 GPUs, batch size=2 per device, 5 epochs
- LoRA fine-tuning: 2 GPUs, batch size=2 per device, 10 epochs
- LoRA parameters: α=16, dropout=0.05
14B Model:
- Full model fine-tuning: 8 GPUs, batch size=1 per device, 5 epochs
- LoRA fine-tuning: 4 GPUs, batch size=2 per device, 10 epochs
- LoRA parameters: α=16, dropout=0.05
Common Settings:
- Learning rate: 5e-5
- Weight decay: 1e-4
- Save and evaluate checkpoints per epoch
- Generation temperature: 0.6, top-p: 0.95, max tokens: 32,768
Figure 2 plots reasoning performance against safety for each checkpoint (one per epoch):
7B Model:
- Base Model: High accuracy but low safety
- Full Model Fine-tuning: Good safety but significant accuracy reduction (clear safety tax)
- LoRA Fine-tuning: Strong performance in both reasoning and safety
- Best LoRA checkpoint outperforms base model on all tasks
- Safety slightly lower than full model fine-tuning (average decrease ~0.03)
14B Model:
- LoRA fine-tuning shows small but consistent reasoning accuracy decrease compared to base model
- Safety performance comparable to full model fine-tuning
- Forms Pareto frontier in upper-right corner of reasoning-safety plane
Key Finding: LoRA achieves the ideal combination of "reasoning ability close to base model + safety close to full model fine-tuning."
Different rank values (r=1, 4, 8, 64) are compared against full model fine-tuning on the 14B model:
Reasoning Performance:
- Generally decreases with increasing r
- Smaller decrease between r=1 and r=8
- Full model fine-tuning (full rank) performs worst
Safety Performance:
- Safety degrades significantly as r increases from 4 to 64
- Full model fine-tuning is safer than LoRA with r=64
- Hypothesis: Intermediate ranks may be harder to optimize, while very low-rank and full-rank settings optimize more easily
Pareto Frontier Analysis (Figure 3c):
- r=1 achieves best trade-off on AIME
- r=1 near-optimal on GPQA
- Demonstrates strong performance achievable at minimal fine-tuning cost
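For readers who want to reproduce this kind of comparison, a Pareto frontier over checkpoints can be extracted as sketched below. The sketch assumes each checkpoint is summarized by a reasoning accuracy (higher is better) and a harmful-response rate (lower is better); the checkpoint names and numbers are invented and are not results from the paper.

```python
def pareto_front(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of points not dominated on (accuracy up, harmful rate down)."""
    front = []
    for name, acc, harm in points:
        dominated = any(
            (acc2 >= acc and harm2 <= harm) and (acc2 > acc or harm2 < harm)
            for _, acc2, harm2 in points
        )
        if not dominated:
            front.append(name)
    return front

# Invented (name, accuracy, harmful rate) triples for illustration only.
checkpoints = [
    ("ckpt_a", 0.70, 0.40),
    ("ckpt_b", 0.65, 0.10),
    ("ckpt_c", 0.60, 0.20),  # dominated by ckpt_b
    ("ckpt_d", 0.50, 0.05),
]
print(pareto_front(checkpoints))  # ['ckpt_a', 'ckpt_b', 'ckpt_d']
```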
Theoretical Explanation: r=1 is sufficient to reflect the low-rank nature of the safety alignment task itself, consistent with prior research showing single-direction control of safety behaviors.
MLP vs. Attention Layers (Figure 4):
- Applying LoRA only to MLP layers yields a Pareto frontier similar to applying it to both attention and MLP layers
- Conclusion: Updating only MLP layers is sufficient
MLP Internal Projection Layers (Figure 5):
Testing gate, up, and down projection layers in Qwen's SwiGLU structure:
- Up Projection Most Critical:
- Pareto frontier of updating only up projection comparable to updating entire MLP
- Even outperforms updating entire MLP on HumanEval+ and MBPP+
- Down Projection Performs Worst
- Conclusion: Different projection layers contribute differently to reasoning-safety trade-off, with up projection particularly important and sufficient when used alone
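To show where the up projection sits, here is a toy NumPy sketch of a SwiGLU MLP block with a rank-1 LoRA update applied to the up projection only. It assumes the usual form `down(silu(gate(x)) * up(x))`; the sizes and random weights are placeholders, and nothing here is the authors' training code.

```python
import numpy as np

def silu(z: np.ndarray) -> np.ndarray:
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down, delta_up=None):
    """SwiGLU MLP; delta_up, if given, is added to the up projection only."""
    W_up_eff = W_up if delta_up is None else W_up + delta_up
    hidden = silu(x @ W_gate.T) * (x @ W_up_eff.T)
    return hidden @ W_down.T

rng = np.random.default_rng(0)
d_model, d_ff, r, alpha = 8, 32, 1, 16.0   # toy sizes, not Qwen's
x = rng.normal(size=(4, d_model))
W_gate = rng.normal(size=(d_ff, d_model))  # frozen
W_up = rng.normal(size=(d_ff, d_model))    # frozen, receives the LoRA update
W_down = rng.normal(size=(d_model, d_ff))  # frozen

B = rng.normal(size=(d_ff, r))             # stand-ins for trained LoRA factors
A = rng.normal(size=(r, d_model))
delta_up = (alpha / r) * (B @ A)

base_out = swiglu_mlp(x, W_gate, W_up, W_down)
tuned_out = swiglu_mlp(x, W_gate, W_up, W_down, delta_up)
print(base_out.shape, np.abs(tuned_out - base_out).max())
```

Because the gate and down projections stay frozen, the adapter can only rescale and shift what flows through the elementwise product, which is one way to read the finding that this single module is sufficient.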
On the 48-layer 14B model, LoRA is applied to only 16 layers, in three configurations:
- Early Layers (5-20)
- Middle Layers (17-32)
- Late Layers (25-40)
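A layer range like the ones above is typically turned into a list of module names to adapt. The helper below is a hypothetical sketch: it assumes the layer numbers in this summary are 1-based and that modules follow the `model.layers.{i}.mlp.{proj}` naming used by Hugging Face Qwen-style checkpoints with 0-based indices. Both points should be checked against the actual model before use.

```python
def lora_target_modules(
    first_layer: int,
    last_layer: int,
    projections: tuple[str, ...] = ("gate_proj", "up_proj", "down_proj"),
) -> list[str]:
    """Module names for the 1-based, inclusive layer range [first_layer, last_layer]."""
    return [
        f"model.layers.{i - 1}.mlp.{proj}"  # shift to an assumed 0-based index
        for i in range(first_layer, last_layer + 1)
        for proj in projections
    ]

middle = lora_target_modules(17, 32)
middle_up_only = lora_target_modules(17, 32, projections=("up_proj",))
print(len(middle), middle_up_only[0], middle_up_only[-1])
# 48 model.layers.16.mlp.up_proj model.layers.31.mlp.up_proj
```

The `projections` argument is there only to show how the layer choice could be combined with a projection choice; the summary does not say whether the paper evaluates that combination.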
Results:
- Middle Layers Achieve Best Trade-off:
- Comparable to updating all layers on AIME and GPQA
- Slightly inferior to updating all layers on HumanEval+ and MBPP+
- Early or late layers perform significantly worse
Connection to Prior Research:
- Steering vectors (Panickssery et al., 2023)
- Refusal features (Arditi et al., 2024)
- These studies show that intermediate representations responsible for safety behaviors are most prominent in middle layers
Four metrics quantify the overlap between the update $\Delta W$ and the initial weights $W_I$:
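- $\frac{\|W_I^\top \Delta W\|}{\|W_I\| \|\Delta W\|}$: Matrix-level cosine similarity of column space
- $\frac{\|U_{16} U_{16}^\top \Delta W\|}{\|\Delta W\|}$: Projection onto top 16 principal directions of $W_I$
- $\frac{\|W_I \Delta W^\top\|}{\|W_I\| \|\Delta W\|}$: Row space similarity
- $\frac{\|V_{16} V_{16}^\top \Delta W^\top\|}{\|\Delta W\|}$: Row space projection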
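The four quantities can be computed as sketched below. Two points are assumptions rather than facts from the summary: the norm is taken to be the Frobenius norm, and $U_{16}$ and $V_{16}$ are taken to be the top 16 left and right singular vectors of $W_I$. The matrices are random placeholders.

```python
import numpy as np

def overlap_metrics(W_I: np.ndarray, dW: np.ndarray, k: int = 16) -> dict:
    """Column- and row-space overlap between an update dW and initial weights W_I."""
    U, _, Vt = np.linalg.svd(W_I, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k].T        # assumed meaning of U_16 and V_16
    n_w, n_d = np.linalg.norm(W_I), np.linalg.norm(dW)  # Frobenius norms (assumed)
    return {
        "col_cos": np.linalg.norm(W_I.T @ dW) / (n_w * n_d),
        "col_proj": np.linalg.norm(U_k @ (U_k.T @ dW)) / n_d,
        "row_cos": np.linalg.norm(W_I @ dW.T) / (n_w * n_d),
        "row_proj": np.linalg.norm(V_k @ (V_k.T @ dW.T)) / n_d,
    }

rng = np.random.default_rng(0)
W_I = rng.normal(size=(64, 32))          # placeholder initial weights
dW_random = rng.normal(size=(64, 32))    # placeholder update

# An update with its column-space component along W_I removed, for comparison.
Q, _ = np.linalg.qr(W_I)
dW_orth = dW_random - Q @ (Q.T @ dW_random)

print(overlap_metrics(W_I, dW_random))
print(overlap_metrics(W_I, dW_orth))     # column metrics drop to about zero
```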
Comparing: Full model fine-tuning vs. LoRA (r=4, applied to attention and MLP)
Findings:
- LoRA achieves smaller overlap in most modules (few exceptions)
- More orthogonal in both column and row spaces
- LoRA's safety-oriented updates use subspaces more separated from those used by original reasoning-related weights
- Although the reduction in overlap is sometimes modest, it suggests that LoRA updates cause less interference with reasoning components
Two Approaches:
- Regularization:
- reg-col: Add penalty term $\beta \left( \frac{\|W_I^\top \Delta W\|}{\|W_I\| \|\Delta W\|} \right)^2$ during training
- reg-both: Simultaneously penalize column and row space overlap
- Setting β=1
- Post-hoc Orthogonalization (OrthoMerge):
- OrthoMerge-col: $\Delta W \leftarrow (I - U_k U_k^\top) \Delta W$
- OrthoMerge-both: $\Delta W \leftarrow \lambda (I - U_k U_k^\top) \Delta W (I - V_k V_k^\top)$
- Using scaling factor λ to compensate for safety loss
- Testing λ ∈ {1, 1.15, 1.75, 1.2, 1.25}, k=64
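The post-hoc orthogonalization step can be sketched as below. This is not the authors' implementation: it assumes $U_k$ and $V_k$ are the top-$k$ left and right singular vectors of the initial weight matrix, and it applies the scaling factor in both variants (with `lam=1.0` the column-only variant matches the formula above). The matrices are random placeholders.

```python
import numpy as np

def top_singular_vectors(W_I: np.ndarray, k: int):
    """Top-k left and right singular vectors of W_I (assumed meaning of U_k, V_k)."""
    U, _, Vt = np.linalg.svd(W_I, full_matrices=False)
    return U[:, :k], Vt[:k].T

def ortho_merge(W_I, dW, k=64, lam=1.0, both=True):
    """Project dW away from the top-k column (and optionally row) directions of W_I."""
    U_k, V_k = top_singular_vectors(W_I, k)
    out = dW - U_k @ (U_k.T @ dW)            # (I - U_k U_k^T) dW
    if both:
        out = out - (out @ V_k) @ V_k.T      # ... (I - V_k V_k^T)
    return lam * out

rng = np.random.default_rng(0)
W_I = rng.normal(size=(128, 96))             # placeholder initial weights
dW = rng.normal(size=(128, 96))              # placeholder update

merged = ortho_merge(W_I, dW, k=64, lam=1.2, both=True)
U_k, V_k = top_singular_vectors(W_I, 64)
print(np.abs(U_k.T @ merged).max(), np.abs(merged @ V_k).max())  # both near zero
```

The projection shrinks the norm of the update, which is presumably why a scaling factor above 1 is used to recover safety after merging.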
Results:
- "both" variants outperform "col" variants
- OrthoMerge-both most promising:
- Strictly superior to vanilla LoRA on AIME and GPQA
- Slightly superior on MBPP+
- Slightly inferior on HumanEval+
- Overall improvements modest and inconsistent, suggesting need for more refined approaches
- Problem: Instruction fine-tuning causes safety degradation (Qi et al., 2023; Hsiung et al., 2025)
- Solutions:
- Data filtering (Shen et al., 2024; Choi et al., 2024)
- Injecting safety samples (Bianchi et al., 2023)
- Leveraging guardrail models (Peng et al., 2025)
- Importance of prompt templates (Lyu et al., 2024)
- Algorithmic methods: projecting to "safety subspace" (Hsu et al., 2024), regularization (Mukhoti et al., 2023)
- Limitations: Inapplicable to reasoning models, as reasoning capability requires longer training and larger weight updates
- Methods: SFT and/or RL (Wei et al., 2021; Ouyang et al., 2022; Rafailov et al., 2023)
- Problem: "Safety Tax" phenomenon (Huang et al., 2025)
- Safety alignment significantly damages reasoning performance
- Even incorporating CoT reasoning in safety fine-tuning datasets cannot fully preserve reasoning ability (Jiang et al., 2025)
This paper demonstrates that simply applying LoRA can align reasoning models for safety with little or no loss in reasoning performance, filling a gap in the existing literature.
- LoRA is an Effective Solution for Safety Alignment of Reasoning LLMs:
- Achieves safety comparable to full model fine-tuning
- Maintains reasoning ability close to original model
- Effectively circumvents "Safety Tax"
- Minimal Configuration Guidelines:
- Rank-1 Is Sufficient: Achieving optimal trade-off with minimal cost
- Update Only Up Projection Layer: Even outperforms updating entire MLP
- Focus on Middle Layers: 16 middle layers typically sufficient
- Mechanistic Insights:
- LoRA updates have smaller overlap with initial weights
- Low-rank constraint minimizes interference with reasoning weights
- Consistent with theory that safety behaviors are controlled by low-dimensional directions
- Residual Performance Gap:
- 14B model still shows small decreases on certain tasks (AIME, HumanEval+, MBPP+)
- Methods for further reducing overlap show limited and inconsistent improvements
- Architectural Limitations:
- Experiments primarily on Qwen architecture
- Generalization to other LLM architectures needs validation
- Insufficient Attention Layer Ablation:
- Primarily focuses on MLP layers
- Detailed ablation of attention layers left for future work
- Mechanism Understanding:
- Why up projection is so effective requires deeper investigation
- Need more precise metrics to capture interference effects
- Method Improvements:
- Develop more reliable methods for optimizing reasoning-safety trade-offs
- Better control of LoRA update subspace geometry
- Architectural Extensions:
- Validate findings on other LLM architectures
- Study detailed ablation of attention layers
- Theoretical Deepening:
- Deeper understanding of up projection effectiveness
- Develop more precise interference metrics
- RL Alignment:
- Extend findings to RL-based safety alignment techniques
- Application Exploration:
- Explore applications in other multi-objective balancing scenarios
- Important and Practical Problem:
- Directly addresses critical challenges in reasoning LLM deployment
- "Safety Tax" is a genuine pain point in practical applications
- Broad practical value
- Simple and Effective Method:
- Uses off-the-shelf LoRA technology without complex modifications
- Easy to implement with strong reproducibility
- High computational efficiency, easy for practical deployment
- Comprehensive and In-depth Experiments:
- Multiple model sizes (7B, 14B)
- Multiple domains (mathematics, science, programming)
- Four benchmarks with broad coverage
- Thorough ablation studies providing clear configuration guidelines
- Deep Insights:
- Finding that rank-1 is sufficient is concise and powerful
- Up projection importance provides direction for future research
- Critical role of middle layers aligns with theory
- Weight overlap analysis provides mechanistic understanding
- Clear Writing:
- Well-structured, logical flow
- Rich figures with good visualization
- Sufficient technical details for reproducibility
- Performance Gap Not Completely Eliminated:
- 14B model still shows small decreases on certain tasks
- Further optimization methods (OrthoMerge) show limited improvements
- Suggests problem not fully solved
- Limited Architectural Coverage:
- Experiments only on Qwen architecture
- Generalization to other architectures (Llama, Mistral) unknown
- Limits universality of conclusions
- Insufficient Mechanistic Explanation:
- Lacks deep analysis of why up projection is so important
- Causal relationship between weight overlap reduction and performance improvement unclear
- Needs more theoretical support
- Insufficient Attention Layer Research:
- Primarily focuses on MLP, limited attention layer ablation
- May miss important findings
- Evaluation Limitations:
- Safety evaluation relies on single evaluator (Llama-Guard-3-8B)
- Pass@1 metric may not be comprehensive
- Lacks human evaluation
- Academic Contribution:
- Fills research gap in reasoning model safety alignment
- Provides clear practical guidelines
- Offers new perspective on LoRA's role in multi-objective optimization
- Expected to inspire follow-up research
- Practical Value:
- Directly applicable to actual model deployment
- Reduces computational cost of safety alignment
- Improves usability of reasoning models
- Important reference for industry
- Reproducibility:
- Code open-sourced (GitHub)
- Sufficient experimental details
- Uses public datasets and models
- Easy to verify and extend
- Safety Alignment of Reasoning LLMs:
- Mathematical reasoning models (e.g., math problem-solving assistants)
- Scientific reasoning models (e.g., research assistants)
- Code generation models (e.g., programming assistants)
- Resource-Constrained Environments:
- Scenarios requiring low-cost fine-tuning
- Memory-limited deployment environments
- Rapid iteration development processes
- Multi-objective Optimization Scenarios:
- Fine-tuning tasks requiring objective balancing
- Adding capabilities while preserving original abilities
- Domain adaptation without compromising general capability
- Inapplicable Scenarios:
- Critical applications requiring complete performance gap elimination
- Non-Qwen architecture models (needs validation)
- Fundamental transformations requiring large parameter updates
Key Citations:
- Huang et al., 2025: "Safety Tax: Safety alignment makes your large reasoning models less reasonable" - First systematic description of "Safety Tax" phenomenon
- Jiang et al., 2025: "SafeChain: Safety of language models with long chain-of-thought reasoning capabilities" - Reports safety risks of reasoning models
- Hu et al., 2022: "LoRA: Low-Rank Adaptation of Large Language Models" - Original LoRA paper
- Panickssery et al., 2023: "Steering llama 2 via contrastive activation addition" - Steering vectors research
- Arditi et al., 2024: "Refusal in language models is mediated by a single direction" - Refusal features research
- Jain et al., 2024: "What makes and breaks safety fine-tuning? a mechanistic study" - Mechanistic study of safety fine-tuning
- Wei et al., 2024: "Assessing the brittleness of safety alignment via pruning and low-rank modifications" - Brittleness of safety alignment research
Overall Assessment: This is a high-quality paper that addresses an important problem, safety alignment for reasoning LLMs, with a simple and effective solution. Although it has limitations (the performance gap is not fully eliminated and architectural coverage is limited), its core contributions are solid, its experiments are comprehensive, and its insights are valuable to both academia and industry. In particular, the three findings (rank-1 sufficiency, the importance of the up projection, and the importance of middle layers) give clear guidance for future research and practical applications.