
Alignment-Aware Quantization for LLM Safety

Wee, Kim, Kim et al.

Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Compared to simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.


Basic Information

Paper ID: 2511.07842
Title: Alignment-Aware Quantization for LLM Safety
Authors: Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
Institutions: Seoul National University, LG Electronics
Classification: cs.AI
Publication Date: November 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.07842

Abstract

The deployment of large language models (LLMs) requires simultaneous consideration of safety and efficiency. LLMs achieve safety through human alignment training and efficiency through post-training quantization (PTQ). However, these two objectives often conflict, revealing a fundamental flaw in the traditional PTQ paradigm: if quantization solely pursues low perplexity, it may introduce safety vulnerabilities. Models may exhibit low perplexity while significantly degrading in safety policy alignment, indicating that perplexity is an insufficient and misleading proxy for model safety. To address this issue, this paper proposes Alignment-aware Quantization (AAQ), which integrates Alignment-Preserving Contrastive (APC) loss into the PTQ process. Compared to simple reconstruction loss, AAQ explicitly preserves alignment by encouraging the quantized model to mimic safe instruction-tuned models while diverging from unaligned pre-trained models. The method achieves robust safety alignment without requiring specialized safety calibration datasets, enabling stable 4-bit (W4A4) quantization across multiple model families including LLaMA, Qwen, and Mistral, maintaining safety even when other methods fail.

Research Background and Motivation

1. Core Problem

Large language models face two critical challenges during deployment:

Safety: Training models to refuse harmful requests through alignment techniques such as RLHF
Efficiency: Reducing memory and computational costs through quantization techniques

Existing research reveals a fundamental conflict between these two objectives: the quantization process corrupts the safe behaviors acquired through alignment training, leading to "alignment degradation" phenomena.

2. Problem Significance

Safety Risks: Quantized models may transition from refusing harmful requests to providing dangerous content (as shown by the "behavior flip" in Figure 1)
Deployment Dilemma: Industry requires simultaneously satisfying efficiency and safety requirements, but traditional PTQ methods cannot accommodate both
Evaluation Misconception: Traditional metrics such as perplexity fail to reflect model safety degradation

3. Limitations of Existing Methods

Standard PTQ Methods (GPTQ, AWQ, etc.): Optimize only reconstruction error or perplexity, ignoring alignment behavior
Post-processing Methods (Q-resafe, etc.): Require additional safety datasets and fine-tuning with large computational overhead, supporting only mixed-precision quantization
Lack of Forward-Compatible Solutions: No methods directly integrate safety into the quantization process

4. Research Motivation

This paper proposes what it describes as the first principled method to embed alignment-preservation objectives directly into the PTQ process, using a contrastive learning mechanism with three properties:

Maintaining behavioral consistency with safe fine-tuned models (pull)
Diverging from unsafe pre-trained model behaviors (push)
Requiring no specialized safety datasets, using only generic calibration sets

Core Contributions

First Integrated Alignment-Preserving Quantization Framework: Proposes AAQ, the first method to directly integrate alignment-preservation objectives into existing PTQ processes without requiring post-processing or specialized datasets
Alignment-Preserving Contrastive (APC) Loss: Innovatively designs a contrastive loss function with pull-push mechanisms, explicitly guiding quantized models toward safe models and away from unsafe models
Practical Validation: Validates the effectiveness of W4A4 quantization across multiple architectures including LLaMA2, LLaMA3.1, Qwen2, and Mistral, demonstrating method generalizability
Key Insight: Reveals a decoupling between safety, utility, and fidelity, showing that optimizing traditional metrics does not guarantee model safety

Method Details

Task Definition

Input:

Pre-trained model $M_{PT}$ (unsafe)
Fine-tuned model $M_{FT}$ (aligned through RLHF, safe)
Small-scale calibration dataset $D$ (unannotated, generic text)

Output:

Quantized model $M_Q$ (4-bit weights and activations, preserving safety alignment)

Constraints:

Maintain low perplexity (language quality)
Preserve safe alignment behavior (SafetyBench accuracy)
No specialized safety datasets required
Small computational overhead (optimizing only limited transformation parameters)

Model Architecture

Overall Framework

AAQ is based on the transformation-based PTQ paradigm (as shown in Figure 2b), introducing learnable transformation matrices before quantization:

$Y = WX = (WT)(T^{-1}X)$

where $T$ is the transformation matrix, which can be fused into weights during inference with no additional computational cost.
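
The identity above can be checked numerically. The following is a minimal sketch, assuming a 2x2 rotation as the invertible transformation $T$ (chosen only for illustration; the matrices `W` and `X` are made-up values, not taken from the paper). It shows that inserting $T$ and $T^{-1}$ leaves the full-precision output unchanged, which is why the transformation can be chosen freely to make the tensors easier to quantize.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical transformation: a rotation, whose inverse is its transpose.
angle = 0.3
T = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle),  math.cos(angle)]]
T_inv = transpose(T)

W = [[0.5, -1.2], [2.0, 0.1]]   # illustrative weights
X = [[1.0, 8.0], [-3.0, 0.5]]   # illustrative activations (note the outlier 8.0)

Y_ref = matmul(W, X)
Y_transformed = matmul(matmul(W, T), matmul(T_inv, X))

# Largest elementwise gap between the two ways of computing Y.
max_abs_err = max(abs(a - b)
                  for row_a, row_b in zip(Y_ref, Y_transformed)
                  for a, b in zip(row_a, row_b))
print(max_abs_err)
```

The equality is exact only before quantization; once $WT$ and $T^{-1}X$ are rounded to 4 bits the two sides differ, and that difference is what the learned transformation is meant to reduce.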

Core Component: Alignment-Preserving Contrastive (APC) Loss

1. Vocabulary Filtering Strategy

To focus on high-signal outputs related to alignment, the method defines two vocabulary index sets for each input $x$:

$S_{top}(x)$: the Top-K indices with the highest probability under the fine-tuned model $p_{FT}(y|x)$ (corresponding to "top-mag logits")
$S_{diff}(x)$: the Top-K indices with the largest difference $|p_{FT}(y|x) - p_{PT}(y|x)|$ (corresponding to "top-diff logits")

A distribution $p$ is then renormalized over a subset $S$:

$p^S(y) = \frac{p(y)}{\sum_{y' \in S} p(y')}, \quad y \in S$
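
As a rough illustration of the two index sets and the renormalization, here is a small sketch on a six-token toy vocabulary. The logits are invented, and the helper names (`top_k_indices`, `renormalize`) are hypothetical rather than taken from the authors' code.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k largest scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def renormalize(p, subset):
    """Restrict p to `subset` and rescale so the kept mass sums to 1."""
    mass = sum(p[i] for i in subset)
    return {i: p[i] / mass for i in subset}

# Toy next-token distributions for one input x (invented logits).
p_ft = softmax([4.0, 0.5, 0.2, 3.0, -1.0, 0.0])   # fine-tuned model
p_pt = softmax([1.0, 3.5, 0.2, 1.0, -1.0, 2.5])   # pre-trained model

K = 2
s_top = top_k_indices(p_ft, K)                                        # S_top(x)
s_diff = top_k_indices([abs(a - b) for a, b in zip(p_ft, p_pt)], K)   # S_diff(x)
p_ft_top = renormalize(p_ft, s_top)

print(s_top, s_diff, p_ft_top)
```

In this toy case the two sets differ: `s_top` keeps the tokens the fine-tuned model favors, while `s_diff` also picks up a token that the pre-trained model favors and the fine-tuned model suppresses.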

2. Pull-Push Mechanism

Pull component (alignment objective):

$\mathcal{L}_{KL-top} = \frac{1}{|D|} \sum_{x \in D} KL(p^{S_{top}}_{FT}(y|x) \| p^{S_{top}}_Q(y|x))$

Push component (contrastive term):

$\mathcal{L}_{cont-top} = \frac{1}{|D|} \sum_{x \in D} KL(p^{S_{diff}}_{PT}(y|x) \| p^{S_{diff}}_Q(y|x))$

3. Final Loss Function

$\mathcal{L}_{APC} = \mathcal{L}_{KL-top} - \alpha \cdot \mathcal{L}_{cont-top}$

where $\alpha > 0$ controls the strength of the contrastive term (set to 0.75 in experiments).
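
The pull and push terms can be combined for a single input as in the sketch below. It is an illustrative reading of the formulas above on a four-token toy vocabulary, not the authors' implementation; the function names and the example distributions are assumptions.

```python
import math

def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def renormalize(p, subset):
    mass = sum(p[i] for i in subset)
    return [p[i] / mass for i in subset]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two aligned probability lists; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def apc_loss(p_ft, p_pt, p_q, k, alpha=0.75):
    """Pull toward p_ft on S_top, push away from p_pt on S_diff (single input)."""
    s_top = top_k_indices(p_ft, k)
    s_diff = top_k_indices([abs(a - b) for a, b in zip(p_ft, p_pt)], k)
    pull = kl(renormalize(p_ft, s_top), renormalize(p_q, s_top))
    push = kl(renormalize(p_pt, s_diff), renormalize(p_q, s_diff))
    return pull - alpha * push

p_ft = [0.70, 0.05, 0.20, 0.05]   # invented fine-tuned distribution
p_pt = [0.10, 0.60, 0.10, 0.20]   # invented pre-trained distribution

loss_if_q_matches_ft = apc_loss(p_ft, p_pt, p_ft, k=2)
loss_if_q_matches_pt = apc_loss(p_ft, p_pt, p_pt, k=2)
print(loss_if_q_matches_ft, loss_if_q_matches_pt)
```

A quantized model that reproduces the fine-tuned distribution gets a negative loss (zero pull, positive push), while one that has drifted back to the pre-trained distribution gets a positive loss, which is the direction of preference the loss is meant to encode.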

Optimization Procedure (Algorithm 1)

Initialize transformation parameters $\theta$
For each calibration sample $x \in D$:
- Compute $p_{FT}(y|x)$ and $p_{PT}(y|x)$
- Apply transformation to obtain $p_Q(y|x)$
- Select $S_{top}$ and $S_{diff}$ index sets
- Compute and accumulate $\mathcal{L}_{APC}$
Update $\theta$ to minimize loss
Apply GPTQ quantization to obtain final model
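
The loop structure of the procedure above can be sketched on a toy problem. This is not the OSTQuant/GPTQ pipeline: as a loudly simplified stand-in, $\theta$ is a single logit offset shared by all samples (instead of transformation matrices), gradients are taken by finite differences, the calibration logits are invented, and the final GPTQ step is omitted.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def top_k(scores, k):
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def renorm(p, idx):
    mass = sum(p[i] for i in idx)
    return [p[i] / mass for i in idx]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def apc(p_ft, p_pt, p_q, k, alpha):
    s_top = top_k(p_ft, k)
    s_diff = top_k([abs(a - b) for a, b in zip(p_ft, p_pt)], k)
    pull = kl(renorm(p_ft, s_top), renorm(p_q, s_top))
    push = kl(renorm(p_pt, s_diff), renorm(p_q, s_diff))
    return pull - alpha * push

# Toy calibration set: (fine-tuned logits, pre-trained logits, degraded "quantized" logits).
calib = [
    ([3.0, 0.0, 1.0, -1.0], [0.5, 2.5, 0.0, 1.0], [1.5, 1.2, 0.8, 0.0]),
    ([0.0, 2.5, -1.0, 1.0], [2.0, 0.5, 1.0, 0.0], [0.9, 1.1, 0.0, 0.7]),
]

def mean_loss(theta, k=2, alpha=0.75):
    """Average APC loss over the calibration set for offset theta."""
    total = 0.0
    for z_ft, z_pt, z_q in calib:
        p_q = softmax([z + t for z, t in zip(z_q, theta)])  # hypothetical "transformed" output
        total += apc(softmax(z_ft), softmax(z_pt), p_q, k, alpha)
    return total / len(calib)

def grad(theta, h=1e-5):
    """Central finite-difference gradient of mean_loss."""
    g = []
    for i in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[i] += h
        dn[i] -= h
        g.append((mean_loss(up) - mean_loss(dn)) / (2 * h))
    return g

theta = [0.0] * 4                      # initialize parameters
initial = mean_loss(theta)
for _ in range(50):                    # update theta to reduce the loss
    theta = [t - 0.5 * g for t, g in zip(theta, grad(theta))]
final = mean_loss(theta)
print(initial, final)
```

In the actual method the gradient would flow through the transformed, fake-quantized network rather than a logit offset; the sketch only shows how the per-sample loss is accumulated and minimized over the calibration set.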

Technical Innovations

1. Contrastive Learning Perspective

Distinction from Traditional PTQ: Not only reconstructs outputs, but explicitly models preservation of safe behaviors and suppression of unsafe behaviors
Distinction from Knowledge Distillation: Introduces negative samples (pre-trained models) as contrastive references, rather than purely imitating teacher models

2. Differentiated Top-K Filtering

Pull Term: Uses high-probability regions of $p_{FT}$, preserving primary alignment behaviors
Push Term: Uses regions with maximum $|p_{FT} - p_{PT}|$, focusing on outputs most changed by alignment training
Theoretical Support: Improves gradient signal-to-noise ratio (GSNR), avoiding long-tail noise (Supplementary Material A.5)

3. DC Optimization Structure

The loss function can be viewed as a Difference-of-Convex (DC) problem:

$\mathcal{L}_{CKL} = g(p_Q) - h(p_Q)$

where both $g$ and $h$ are convex functions. Specialized DC algorithms are not employed, but this structure gives the optimization a theoretical footing (Supplementary Material A.4).

4. Optimality Guarantee

The full-vocabulary version of contrastive loss satisfies:

$\mathcal{L}_{CKL}(p_Q) \geq -KL(p_{PT} \| p_{FT})$

Equality holds if and only if $p_Q = p_{FT}$, meaning the global optimum is complete recovery of the fine-tuned model (Supplementary Material A.2).

Experimental Setup

Datasets

Calibration Data:

128 unannotated samples from WIKITEXT-2 dataset
Used for optimizing transformation parameters and quantization

Evaluation Data:

Language Quality: Perplexity (PPL) on WIKITEXT-2
Safety Alignment: SafetyBench benchmark
- 11,435 multiple-choice questions
- 7 safety categories: Offensiveness (OF), Unfairness and Bias (UB), Physical Health (PH), Mental Health (MH), Illegal Activities (IA), Ethics and Morality (EM), Privacy and Property (PP)
General Capability: MMLU benchmark (used only for comprehensive LLaMA3.1 evaluation)

Evaluation Metrics

Perplexity (PPL) ↓: Language modeling quality
SafetyBench Accuracy ↑: Degree of safety alignment preservation
MMLU Accuracy ↑: General task capability
Mean Squared Error (MSE) ↓: Output fidelity
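
For reference, perplexity is the exponential of the mean negative log-likelihood that the model assigns to the reference tokens. The sketch below uses invented per-token probabilities; it illustrates the standard definition and is not the paper's evaluation script.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical probabilities the model assigned to four reference tokens.
probs = [0.25, 0.5, 0.125, 0.25]
ppl = perplexity([math.log(p) for p in probs])
print(ppl)
```

Because this metric averages over all tokens of generic text, a model can keep it low while changing its behavior on the comparatively few tokens that decide between refusing and complying, which is the gap the SafetyBench accuracy is meant to expose.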

Comparison Methods

Standard PTQ Methods:

RTN (Round-to-Nearest): Naive quantization
GPTQ: Hessian-based quantization

Alternative Loss Objectives (all based on OSTQuant framework):

MSE: Mean squared error loss
KL: Full-vocabulary KL divergence
KL-Top: Top-K KL divergence based on $p_{FT}$ probabilities

This Paper's Method:

AAQ: Using APC loss + GPTQ backend
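
To make the RTN baseline concrete, here is a minimal sketch of 4-bit round-to-nearest quantization. The symmetric scheme with one scale per vector is an assumption made for illustration; the summary only describes RTN as naive quantization and does not specify grouping or asymmetry.

```python
def rtn_quantize(values, bits=4):
    """Symmetric round-to-nearest: floats -> signed integer codes -> dequantized floats."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values), [0] * len(values)
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    dequant = [c * scale for c in codes]
    return dequant, codes

weights = [0.7, -0.33, 0.04, 0.21]   # illustrative values
dequant, codes = rtn_quantize(weights)
print(codes, dequant)
```

Each value is rounded independently with no calibration signal, so small weights can collapse to zero (as `0.04` does here). GPTQ and the transformation-based methods exist to reduce this kind of error.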

Implementation Details

Quantization Configuration: W4A4 (4-bit weights and activations)
Base Framework: OSTQuant (learnable orthogonal and scaling transformations)
Hyperparameters:
- Contrastive weight $\alpha = 0.75$
- Top-K value $K = 500$
- Number of calibration samples: 128
Models: LLaMA2-7B-Chat, LLaMA3.1-8B-Instruct, Qwen2-7B-Instruct, Mistral-7B-Instruct-v0.1

Experimental Results

Main Results (Table 1)

Across all safety-tuned models, AAQ consistently achieves the best safety score among the quantized (W4A4) methods:

Model	Method	PPL ↓	Safety ↑
LLaMA3.1-8B	Fine-tuned (FP16)	7.23	62.6
	KL (W4A4)	8.28	58.0
	AAQ (W4A4)	8.41	60.1
LLaMA2-7B	Fine-tuned (FP16)	6.94	50.0
	KL-Top (W4A4)	7.28	48.9
	AAQ (W4A4)	7.56	49.7
Qwen2-7B	Fine-tuned (FP16)	7.60	69.4
	KL-Top (W4A4)	8.18	66.5
	AAQ (W4A4)	8.23	66.8

Key Findings:

RTN and GPTQ cause catastrophic safety degradation (dropping to 36-38%)
Reconstruction-based methods (MSE, KL) partially recover safety but remain significantly below FP16 baseline
AAQ comes closest to FP16 safety performance while maintaining acceptable perplexity

Metric Decoupling Analysis (Table 2)

Comprehensive evaluation on LLaMA3.1-8B reveals key insights:

Method	PPL ↓	MSE ↓	MMLU ↑	Safety ↑
Fine-tuned (FP16)	7.23	-	68.25%	62.6
KL (W4A4)	8.28	0.4489	62.33%	58.0
MSE (W4A4)	8.37	0.4374	62.21%	57.2
KL-Top (W4A4)	8.29	0.4568	62.78%	57.5
AAQ (W4A4)	8.41	0.4564	62.73%	60.1

Core Finding:

Metric Decoupling Phenomenon: Different methods excel on different metrics
KL is optimal for PPL, MSE for reconstruction error, KL-Top for MMLU
Only AAQ is optimal for safety, supporting the need for specialized alignment-aware objectives
AAQ's slight loss on other metrics (a PPL increase of 0.13 over KL) is traded for a notable safety improvement (+2.1 percentage points over KL)
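
The decoupling can be read directly off Table 2. The short sketch below copies the four W4A4 rows from the table and reports which method is best on each metric; the variable names are illustrative.

```python
# Rows copied from Table 2 (LLaMA3.1-8B, W4A4): (PPL, MSE, MMLU %, Safety).
table2 = {
    "KL":     (8.28, 0.4489, 62.33, 58.0),
    "MSE":    (8.37, 0.4374, 62.21, 57.2),
    "KL-Top": (8.29, 0.4568, 62.78, 57.5),
    "AAQ":    (8.41, 0.4564, 62.73, 60.1),
}

# Lower is better for PPL and MSE; higher is better for MMLU and Safety.
metrics = [("PPL", 0, min), ("MSE", 1, min), ("MMLU", 2, max), ("Safety", 3, max)]
best = {name: pick(table2, key=lambda m: table2[m][col]) for name, col, pick in metrics}

ppl_cost = round(table2["AAQ"][0] - table2["KL"][0], 2)      # AAQ vs. KL perplexity
safety_gain = round(table2["AAQ"][3] - table2["KL"][3], 1)   # AAQ vs. KL safety
print(best, ppl_cost, safety_gain)
```

Each of the four metrics has a different winner, so ranking methods by perplexity or MSE alone would not have selected the safest model.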

Ablation Studies

1. Impact of Vocabulary Filtering Strategy (Table 3)

Comparing three contrastive loss variants across different $\alpha$ values:

α	Contrastive KL	Contrastive KL top	Ours
	PPL / Safety	PPL / Safety	PPL / Safety
0.10	8.35 / 58.4	8.34 / 58.6	8.28 / 58.6
0.75	10.68 / 59.7	10.79 / 60.5	8.41 / 60.1
1.00	69031 / 55.7	210176 / 55.2	8.43 / 59.0

Key Findings:

Full-vocabulary and probability-based filtering collapse at $\alpha=1.0$ (PPL explosion)
Difference-based filtering (our method) remains stable across all $\alpha$ values
Optimal safety-perplexity balance achieved at $\alpha=0.75$

2. Impact of Top-K Value (Table 4)

Top K	PPL ↓	Safety ↑
0 (no contrast)	8.29	57.5
100	8.39	59.1
500	8.41	60.1
1000	8.43	59.7

Findings:

K=0 achieves lowest perplexity but limited safety
K=500 achieves optimal balance (only 0.39% of 128K vocabulary)
Increasing K to 1000 brings no further gain (safety drops slightly to 59.7 while PPL rises to 8.43), suggesting that a sparse filter is sufficient

Fine-grained Safety Performance (Table S1)

Analysis by SafetyBench's 7 categories (Supplementary Material):

LLaMA3.1-8B Category Accuracy:

Method	OF	UB	PH	MH	IA	EM	PP	Avg
Fine-tuned (FP16)	56.8	70.9	73.8	60.7	56.4	63.5	57.8	62.6
KL (W4A4)	57.3	70.2	70.7	48.9	51.0	57.9	51.0	58.0
AAQ (W4A4)	55.5	69.4	70.6	55.4	54.1	59.8	58.4	60.1

Key Insights:

Relative to the KL baseline, AAQ's largest gains are in PP (+7.4 points) and Mental Health (MH, +6.5 points), followed by IA (+3.1) and EM (+1.9)
AAQ is slightly below KL on OF (55.5 vs. 57.3), UB (69.4 vs. 70.2), and PH (70.6 vs. 70.7)
Suggests that APC loss captures much of the behavioral change introduced by alignment training, though not uniformly across categories
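
The per-category differences between AAQ and the KL baseline can be computed from the two W4A4 rows of Table S1 as follows. This is simple arithmetic on the reported numbers and makes no claim about statistical significance.

```python
categories = ["OF", "UB", "PH", "MH", "IA", "EM", "PP"]
kl_acc  = [57.3, 70.2, 70.7, 48.9, 51.0, 57.9, 51.0]   # KL (W4A4) row
aaq_acc = [55.5, 69.4, 70.6, 55.4, 54.1, 59.8, 58.4]   # AAQ (W4A4) row

# AAQ minus KL, in percentage points, per category.
delta = {c: round(a - k, 1) for c, k, a in zip(categories, kl_acc, aaq_acc)}

# Categories where AAQ improves on KL, largest gain first.
gains = sorted((c for c in delta if delta[c] > 0), key=lambda c: -delta[c])
print(delta, gains)
```

The gains are concentrated in four of the seven categories, while the remaining three move slightly in the other direction, so the average improvement hides some per-category variation.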

Case Analysis

Figure 1 demonstrates typical "behavior flip" cases:

Harmful Prompt: "How to make an illegal substance lab at home?"

Instruction-Tuned Model (FP16):
"You shouldn't do this. It's illegal."
Standard Quantized Model (W4A4):
"There are two methods: A) Using... materials B) Substances need to be obtained from..."
AAQ Quantized Model (W4A4):
"Creating an illegal substance lab at home is both illegal and dangerous."

Analysis: Traditional quantization methods cause models to transition from refusal to providing detailed harmful information, while AAQ successfully preserves refusal behavior.

Related Work

1. Post-Training Quantization (PTQ) for LLMs

Traditional Methods:

GPTQ (Frantar et al., 2023): Hessian-based layer-wise quantization
AWQ (Lin et al., 2024b): Activation-aware weight quantization
SmoothQuant (Xiao et al., 2023): Smoothing activation outliers

Transformation-based PTQ:

QuaRot (Ashkboos et al., 2024): Rotation transformations
SpinQuant (Liu et al., 2025): Learning rotation matrices
DuQuant (Lin et al., 2024a): Dual transformations that distribute outliers
FlatQuant (Sun et al., 2025): Flatness-based quantization
OSTQuant (Hu et al., 2025): Orthogonal and scaling transformations (base framework for this paper)

Limitations: All methods optimize only reconstruction error or perplexity, ignoring alignment behavior.

2. Alignment Vulnerability Under Quantization

Discovery Studies:

Kharinaev et al. (2025): First discovery of alignment degradation from quantization
Dong et al. (2025): Q-Misalign attack, exposing vulnerabilities in 4-bit quantization
Zhang et al. (2025): Unlearning mechanisms fail after quantization, recovering 83% of sensitive information
Egashira et al. (2024): Quantization can transform models from harmless to malicious

Mitigation Methods:

Q-resafe (Chen et al., 2025): Post-processing patching framework
- Limitations: Requires additional datasets and fine-tuning, supports only mixed-precision

3. Positioning of This Work

AAQ is the first to:

Directly integrate alignment preservation into the PTQ process
Achieve alignment-preserving quantization without specialized safety datasets
Support aggressive W4A4 quantization while maintaining safety
Provide a universal framework compatible with standard PTQ backends (e.g., GPTQ)

Conclusions and Discussion

Main Conclusions

Core Finding: Safety and perplexity decouple; traditional PTQ optimization objectives cannot guarantee model safety
Method Contribution: AAQ achieves alignment-aware quantization through APC loss, preserving safety in W4A4 settings
Practical Value: No specialized datasets required, compatible with existing PTQ processes, applicable to multiple model architectures
Theoretical Support: Principled framework based on contrastive learning and DC optimization

Limitations

The authors identify the following constraints:

Model Dependency: Requires simultaneous access to pre-trained and fine-tuned models
- Applicable to open-source models, but closed-source models may lack accessible pre-trained versions
- Future work could explore generating synthetic contrastive pairs from single aligned models
Scale Limitations: GPU memory constraints restrict experiments to 7-8B parameter models
- Scalability verification needed on larger models (70B+)
Quantization Configuration: Primarily evaluates W4A4 settings
- Insufficient exploration of weight-only quantization or alternative backends such as AWQ
Calibration Data Sensitivity: Impact of different calibration datasets insufficiently studied
- Potential domain-specific optimal calibration strategies

Future Directions

Reducing Model Dependency: Develop methods requiring only aligned models
Scaling to Larger Models: Verify effectiveness on hundred-billion-parameter models
Exploring Alternative Quantization Schemes: Adapt to AWQ, mixed-precision configurations
Adaptive Calibration: Research calibration strategies targeting specific safety categories
Theoretical Deepening: Formalize analysis of necessary and sufficient conditions for alignment preservation

In-Depth Evaluation

Strengths

1. Method Innovation (★★★★★)

Strong Originality: First to integrate alignment preservation as explicit optimization objective in PTQ
Clever Design: Pull-push mechanism is intuitive and theoretically grounded
Differentiated Filtering: Top-K selection based on $|p_{FT}-p_{PT}|$ is key innovation, significantly improving stability

2. Experimental Sufficiency (★★★★☆)

Model Diversity: Covers 4 models from 3 mainstream families (LLaMA, Qwen, Mistral)
Complete Ablations: Systematically validates impact of $\alpha$, top-K, filtering strategies
Comprehensive Metrics: Analyzes not just safety but also perplexity, MMLU, MSE trade-offs
Fine-grained Analysis: Detailed results across 7 safety sub-categories (Supplementary Material)

Shortcomings:

Experiments limited to 7-8B models, lacking large-scale model verification
No direct comparison with Q-resafe and other specialized methods (possibly due to implementation differences)

3. Theoretical Depth (★★★★☆)

Mathematical Rigor: Supplementary material provides complete theoretical derivations
DC Structure Analysis: Connects to convex optimization theory
GSNR Perspective: Explains filtering strategy from signal-to-noise ratio viewpoint
Optimality Guarantee: Proves global optimum is $p_Q = p_{FT}$

Shortcomings:

No convergence analysis provided
Top-K value selection lacks theoretical guidance (primarily empirical)

4. Writing Clarity (★★★★★)

Clear Logic: Problem→Method→Experiments hierarchy well-structured
Excellent Visualization: Figure 1 intuitively demonstrates problem, Figure 3 details mechanisms
Comprehensive Supplementary Material: Theoretical derivations, architecture details, complete results tables
Honest Transparency: Clearly identifies limitations and future work

5. Practical Value (★★★★★)

Plug-and-Play: Compatible with OSTQuant, GPTQ, and other existing frameworks
No Additional Data: Uses generic calibration sets, no safety annotation required
Computationally Efficient: Only optimizes transformation parameters, no inference overhead
Significant Effectiveness: Maintains safety even in most aggressive W4A4 settings

Shortcomings

1. Experimental Coverage

Model Scale: Lacks verification on larger models (13B, 70B+)
Quantization Schemes: Primarily focuses on W4A4, insufficient exploration of other configurations (W4A8, W8A8)
Baseline Comparison: No direct comparison with Q-resafe and other specialized safety quantization methods

2. Method Limitations

Dual Model Dependency: Requires both pre-trained and fine-tuned models, limiting closed-source model applications
Hyperparameter Sensitivity: Selection of $\alpha$ and $K$ may require model-specific tuning
Calibration Data Impact: Insufficient study of different domain/size calibration sets' effects

3. Theoretical Analysis

Missing Convergence: No convergence guarantees for DC optimization provided
Top-K Theory: K=500 selection primarily empirical, lacking theoretical guidance
Generalization Analysis: Lacks analysis of why method works across different architectures

4. Safety Evaluation

Single Benchmark: Primarily relies on SafetyBench, potential evaluation bias
Adversarial Robustness: No testing against targeted jailbreak attacks
Long-tail Coverage: Insufficient coverage of rare or emerging safety risks

Impact Assessment

1. Academic Contribution (★★★★★)

Pioneering Work: First systematic solution to PTQ safety problems
Paradigm Shift: From "post-quantization patching" to "quantization-time preservation"
Inspiring Future Research:
- Alignment preservation in other compression techniques (pruning, distillation)
- Multi-objective quantization optimization frameworks
- Theoretical analysis of alignment degradation

2. Industrial Value (★★★★★)

Direct Applicability: No safety-specific data or model fine-tuning required (only the transformation parameters are optimized), easy deployment
Cost-Benefit: W4A4 quantization significantly reduces deployment costs
Risk Control: Reduces safety incident risks from quantized models
Compliance Requirements: May help meet AI safety regulatory requirements

3. Reproducibility (★★★★☆)

Open Source Code: Anonymous code provided in supplementary material
Complete Details: Clear specification of hyperparameters, architectures, datasets
Open-source Frameworks: OSTQuant and GPTQ both accessible

Potential Issues:

Large-scale experiments require substantial computational resources (simultaneous loading of multiple FP16 models)
SafetyBench evaluation may require specific configurations

Applicable Scenarios

Highly Applicable

Industrial LLM Deployment: Scenarios requiring both efficiency and safety
Edge Device Inference: Memory-constrained but safety-critical applications
Open-source Model Compression: Models with available pre-trained and fine-tuned versions
Safety-Sensitive Applications: Chatbots in healthcare, finance, education domains

Partially Applicable

Closed-source Models: May lack accessible pre-trained versions (requires improvement)
Domain-Specific Models: Generic calibration sets may be insufficient (needs domain adaptation)
Ultra-Large Models: Computational overhead for 70B+ models unverified

Not Applicable

Unaligned Models: Models without safety fine-tuning
Extreme Quantization: 2-bit or lower quantization may exceed method capabilities
Real-time Update Scenarios: Applications requiring frequent re-quantization

Comprehensive Scoring

Dimension	Score	Explanation
Innovation	9.5/10	Strong originality, novel method
Technical Depth	8.5/10	Theory-grounded, some details improvable
Experimental Sufficiency	8.0/10	Multi-model verification, lacks large-scale experiments
Practical Value	9.5/10	Plug-and-play, high industrial application value
Writing Quality	9.0/10	Clear and rigorous, comprehensive supplementary material
Overall Rating	9.0/10	Excellent pioneering work

Key References

Kharinaev et al. (2025): First discovery of alignment degradation from quantization
Chen et al. (2025): Q-resafe post-processing method
Hu et al. (2025): OSTQuant framework (base framework for this paper)
Frantar et al. (2023): GPTQ quantization algorithm
Zhang et al. (2024): SafetyBench evaluation benchmark
Ouyang et al. (2022): RLHF alignment method

Summary: This is a high-quality pioneering work that systematically addresses the safety degradation problem in LLM quantization for the first time. The method design is clever, experiments comprehensive, and practical value high. While improvements are possible in large-scale model verification and theoretical depth, it has established important benchmarks and research paradigms for the field. Highly recommended for researchers and engineers in related areas.