Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Compared to simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.
- Paper ID: 2511.07842
- Title: Alignment-Aware Quantization for LLM Safety
- Authors: Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
- Institutions: Seoul National University, LG Electronics
- Classification: cs.AI
- Publication Date: November 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2511.07842
The deployment of large language models (LLMs) requires simultaneous consideration of safety and efficiency. LLMs achieve safety through human alignment training and efficiency through post-training quantization (PTQ). However, these two objectives often conflict, revealing a fundamental flaw in the traditional PTQ paradigm: if quantization solely pursues low perplexity, it may introduce safety vulnerabilities. Models may exhibit low perplexity while their alignment with the safety policy degrades significantly, indicating that perplexity is an insufficient and misleading proxy for model safety. To address this issue, this paper proposes Alignment-Aware Quantization (AAQ), which integrates Alignment-Preserving Contrastive (APC) loss into the PTQ process. Compared to simple reconstruction loss, AAQ explicitly preserves alignment by encouraging the quantized model to mimic the safe instruction-tuned model while diverging from the unaligned pre-trained model. The method achieves robust safety alignment without requiring specialized safety calibration datasets, enabling stable 4-bit (W4A4) quantization across multiple model families including LLaMA, Qwen, and Mistral, maintaining safety even when other methods fail.
Large language models face two critical challenges during deployment:
- Safety: Training models to refuse harmful requests through alignment techniques such as RLHF
- Efficiency: Reducing memory and computational costs through quantization techniques
Existing research reveals a fundamental conflict between these two objectives: the quantization process corrupts the safe behaviors acquired through alignment training, leading to "alignment degradation" phenomena.
- Safety Risks: Quantized models may transition from refusing harmful requests to providing dangerous content (as shown by the "behavior flip" in Figure 1)
- Deployment Dilemma: Industry requires simultaneously satisfying efficiency and safety requirements, but traditional PTQ methods cannot accommodate both
- Evaluation Misconception: Traditional metrics such as perplexity fail to reflect model safety degradation
- Standard PTQ Methods (GPTQ, AWQ, etc.): Optimize only reconstruction error or perplexity, ignoring alignment behavior
- Post-processing Methods (Q-resafe, etc.): Require additional safety datasets and fine-tuning with large computational overhead, supporting only mixed-precision quantization
- Lack of Forward-Compatible Solutions: No methods directly integrate safety into the quantization process
This paper proposes the first principled method to embed an alignment-preservation objective directly into the PTQ process. It does so through a contrastive learning mechanism, by:
- Maintaining behavioral consistency with safe fine-tuned models (pull)
- Diverging from unsafe pre-trained model behaviors (push)
- Requiring no specialized safety datasets, using only generic calibration sets
- First Integrated Alignment-Preserving Quantization Framework: Proposes AAQ, the first method to directly integrate alignment-preservation objectives into existing PTQ processes without requiring post-processing or specialized datasets
- Alignment-Preserving Contrastive (APC) Loss: Designs a contrastive loss function with a pull-push mechanism that explicitly guides the quantized model toward the safe model and away from the unsafe model
- Practical Validation: Validates the effectiveness of W4A4 quantization across multiple architectures including LLaMA2, LLaMA3.1, Qwen2, and Mistral, demonstrating method generalizability
- Key Insight: Reveals that safety, utility, and fidelity decouple under quantization, showing that optimizing traditional metrics cannot guarantee model safety
Input:
- Pre-trained model $M_{PT}$ (unsafe)
- Fine-tuned model $M_{FT}$ (aligned through RLHF, safe)
- Small-scale calibration dataset $D$ (unannotated, generic text)
Output:
- Quantized model $M_Q$ (4-bit weights and activations, preserving safety alignment)
Constraints:
- Maintain low perplexity (language quality)
- Preserve safe alignment behavior (SafetyBench accuracy)
- No specialized safety datasets required
- Small computational overhead (optimizing only limited transformation parameters)
AAQ is based on the transformation-based PTQ paradigm (as shown in Figure 2b), introducing learnable transformation matrices before quantization:
$$Y = WX = (WT)(T^{-1}X)$$
where T is the transformation matrix, which can be fused into weights during inference with no additional computational cost.
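The identity above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a 2x2 rotation as a stand-in for the learned transformation $T$ (so that $T^{-1} = T^\top$), and the helper names `matmul`, `transpose`, and `rotation` are illustrative only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def rotation(theta):
    """2x2 rotation matrix: orthogonal, so its inverse is its transpose."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

W = [[0.5, -1.0], [2.0, 0.25]]   # toy weight matrix
X = [[1.0], [3.0]]               # toy activation column
T = rotation(0.7)                # assumed stand-in for a learned transformation
T_inv = transpose(T)             # valid only because this T is orthogonal

Y_plain = matmul(W, X)
Y_transformed = matmul(matmul(W, T), matmul(T_inv, X))

# Largest element-wise gap between the two outputs (floating-point noise only).
max_gap = max(abs(a[0] - b[0]) for a, b in zip(Y_plain, Y_transformed))
```

Before quantization the two products agree up to rounding error. The transformation matters only once `W T` and `T_inv X` are quantized, because their value distributions differ from those of `W` and `X`.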
1. Vocabulary Filtering Strategy
To focus on high-signal outputs related to alignment, define two vocabulary index sets:
- $S_{top}(x)$: Indices of the top-K highest probabilities under the fine-tuned model $p_{FT}(y \mid x)$ (corresponding to "top-mag logits")
- $S_{diff}(x)$: Indices of the top-K largest differences $|p_{FT}(y \mid x) - p_{PT}(y \mid x)|$ (corresponding to "top-diff logits")
Renormalized distribution over subset $S$:
$$p^{S}(y) = \frac{p(y)}{\sum_{y' \in S} p(y')}, \quad y \in S$$
2. Pull-Push Mechanism
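Pull component (alignment objective):
$$\mathcal{L}_{KL\text{-}top} = \frac{1}{|D|} \sum_{x \in D} \mathrm{KL}\left(p_{FT}^{S_{top}}(y \mid x) \,\|\, p_{Q}^{S_{top}}(y \mid x)\right)$$
Push component (contrastive term):
$$\mathcal{L}_{cont\text{-}top} = \frac{1}{|D|} \sum_{x \in D} \mathrm{KL}\left(p_{PT}^{S_{diff}}(y \mid x) \,\|\, p_{Q}^{S_{diff}}(y \mid x)\right)$$
3. Final Loss Function
$$\mathcal{L}_{APC} = \mathcal{L}_{KL\text{-}top} - \alpha \cdot \mathcal{L}_{cont\text{-}top}$$
where $\alpha > 0$ controls the strength of the contrastive term (set to 0.75 in experiments).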
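To make the three steps concrete, here is a minimal sketch of the APC loss on toy next-token distributions. It follows the formulas as summarized here and is not the authors' code: the function names, the 5-token vocabulary, and the probability values are all illustrative assumptions, and it works on plain probability lists rather than model logits.

```python
import math

def top_k_indices(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def renormalize(p, subset):
    """Restrict p to `subset` and rescale so the kept mass sums to 1."""
    total = sum(p[i] for i in subset)
    return [p[i] / total for i in subset]

def kl(p, q):
    """KL(p || q) for strictly positive distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def apc_loss(p_ft, p_pt, p_q, k, alpha=0.75):
    """Single-sample APC loss: pull toward p_ft, push away from p_pt."""
    s_top = top_k_indices(p_ft, k)
    s_diff = top_k_indices([abs(a - b) for a, b in zip(p_ft, p_pt)], k)
    pull = kl(renormalize(p_ft, s_top), renormalize(p_q, s_top))
    push = kl(renormalize(p_pt, s_diff), renormalize(p_q, s_diff))
    return pull - alpha * push

def mean_apc_loss(samples, k, alpha=0.75):
    """Average over calibration samples given as (p_ft, p_pt, p_q) triples."""
    return sum(apc_loss(f, p, q, k, alpha) for f, p, q in samples) / len(samples)

# Toy next-token distributions over a hypothetical 5-token vocabulary.
p_ft = [0.70, 0.10, 0.10, 0.05, 0.05]   # aligned model favors token 0
p_pt = [0.10, 0.60, 0.10, 0.10, 0.10]   # pre-trained model favors token 1

loss_when_q_matches_ft = apc_loss(p_ft, p_pt, p_ft, k=2)
loss_when_q_matches_pt = apc_loss(p_ft, p_pt, p_pt, k=2)
```

In this toy setting the loss is negative when the quantized distribution equals the fine-tuned one (zero pull, positive push) and positive when it equals the pre-trained one (positive pull, zero push). That ordering is what the minus sign on the contrastive term is meant to produce.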
- Initialize transformation parameters $\theta$
- For each calibration sample $x \in D$:
  - Compute $p_{FT}(y \mid x)$ and $p_{PT}(y \mid x)$
  - Apply transformation to obtain $p_Q(y \mid x)$
  - Select the $S_{top}$ and $S_{diff}$ index sets
  - Compute and accumulate $\mathcal{L}_{APC}$
- Update $\theta$ to minimize the loss
- Apply GPTQ quantization to obtain the final model
- Distinction from Traditional PTQ: Not only reconstructs outputs, but explicitly models preservation of safe behaviors and suppression of unsafe behaviors
- Distinction from Knowledge Distillation: Introduces negative samples (pre-trained models) as contrastive references, rather than purely imitating teacher models
- Pull Term: Uses high-probability regions of $p_{FT}$, preserving primary alignment behaviors
- Push Term: Uses regions with maximum $|p_{FT} - p_{PT}|$, focusing on outputs most changed by alignment training
- Theoretical Support: Improves gradient signal-to-noise ratio (GSNR), avoiding long-tail noise (Supplementary Material A.5)
The loss function can be viewed as a Difference-of-Convex (DC) problem:
$$\mathcal{L}_{CKL} = g(p_Q) - h(p_Q)$$
where both $g$ and $h$ are convex functions. While specialized DC algorithms are not employed, this structure gives the optimization a theoretical footing (Supplementary Material A.4).
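One natural reading of this decomposition is sketched below. The summary does not spell out $g$ and $h$, so this assumes $\mathcal{L}_{CKL}$ is the full-vocabulary counterpart of the pull-minus-push loss:

$$
g(p_Q) = \mathrm{KL}\left(p_{FT} \,\|\, p_Q\right), \qquad
h(p_Q) = \alpha \, \mathrm{KL}\left(p_{PT} \,\|\, p_Q\right)
$$

$$
\mathrm{KL}\left(p \,\|\, p_Q\right) = \sum_{y} p(y) \log p(y) \;-\; \sum_{y} p(y) \log p_Q(y)
$$

The first sum does not depend on $p_Q$, and each $-\log p_Q(y)$ is convex in $p_Q$, so a non-negative weighted sum of them is convex. Under this assumption both $g$ and $h$ are convex in $p_Q$, and their difference has the DC form stated above.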
The full-vocabulary version of contrastive loss satisfies:
$$\mathcal{L}_{CKL}(p_Q) \geq -\mathrm{KL}\left(p_{PT} \,\|\, p_{FT}\right)$$
Equality holds if and only if $p_Q = p_{FT}$, meaning the global optimum is complete recovery of the fine-tuned model (Supplementary Material A.2).
Calibration Data:
- 128 unannotated samples from WIKITEXT-2 dataset
- Used for optimizing transformation parameters and quantization
Evaluation Data:
- Language Quality: Perplexity (PPL) on WIKITEXT-2
- Safety Alignment: SafetyBench benchmark
- 11,435 multiple-choice questions
- 7 safety categories: Offensive (OF), Unbiased (UB), Physical Health (PH), Mental Health (MH), Illegal Activity (IA), Ethics (EM), Privacy & Property (PP)
- General Capability: MMLU benchmark (used only for comprehensive LLaMA3.1 evaluation)
- Perplexity (PPL) ↓: Language modeling quality
- SafetyBench Accuracy ↑: Degree of safety alignment preservation
- MMLU Accuracy ↑: General task capability
- Mean Squared Error (MSE) ↓: Output fidelity
Standard PTQ Methods:
- RTN (Round-to-Nearest): Naive quantization
- GPTQ: Hessian-based quantization
Alternative Loss Objectives (all based on OSTQuant framework):
- MSE: Mean squared error loss
- KL: Full-vocabulary KL divergence
- KL-Top: Top-K KL divergence based on $p_{FT}$ probabilities
This Paper's Method:
- AAQ: Using APC loss + GPTQ backend
- Quantization Configuration: W4A4 (4-bit weights and activations)
- Base Framework: OSTQuant (learnable orthogonal and scaling transformations)
- Hyperparameters:
- Contrastive weight α=0.75
- Top-K value K=500
- Number of calibration samples: 128
- Models: LLaMA2-7B-Chat, LLaMA3.1-8B-Instruct, Qwen2-7B-Instruct, Mistral-7B-Instruct-v0.1
Across all safety-tuned models, AAQ consistently achieves best performance on safety metrics:
| Model | Method | PPL ↓ | Safety ↑ |
|---|---|---|---|
| LLaMA3.1-8B | Fine-tuned (FP16) | 7.23 | 62.6 |
| LLaMA3.1-8B | KL (W4A4) | 8.28 | 58.0 |
| LLaMA3.1-8B | AAQ (W4A4) | 8.41 | 60.1 |
| LLaMA2-7B | Fine-tuned (FP16) | 6.94 | 50.0 |
| LLaMA2-7B | KL-Top (W4A4) | 7.28 | 48.9 |
| LLaMA2-7B | AAQ (W4A4) | 7.56 | 49.7 |
| Qwen2-7B | Fine-tuned (FP16) | 7.60 | 69.4 |
| Qwen2-7B | KL-Top (W4A4) | 8.18 | 66.5 |
| Qwen2-7B | AAQ (W4A4) | 8.23 | 66.8 |
Key Findings:
- RTN and GPTQ cause catastrophic safety degradation (dropping to 36-38%)
- Reconstruction-based methods (MSE, KL) partially recover safety but remain significantly below FP16 baseline
- AAQ comes closest to FP16 safety performance while maintaining acceptable perplexity
Comprehensive evaluation on LLaMA3.1-8B reveals key insights:
| Method | PPL ↓ | MSE ↓ | MMLU ↑ | Safety ↑ |
|---|---|---|---|---|
| Fine-tuned (FP16) | 7.23 | - | 68.25% | 62.6 |
| KL (W4A4) | 8.28 | 0.4489 | 62.33% | 58.0 |
| MSE (W4A4) | 8.37 | 0.4374 | 62.21% | 57.2 |
| KL-Top (W4A4) | 8.29 | 0.4568 | 62.78% | 57.5 |
| AAQ (W4A4) | 8.41 | 0.4564 | 62.73% | 60.1 |
Core Finding:
- Metric Decoupling Phenomenon: Different methods excel on different metrics
- KL is optimal for PPL, MSE for reconstruction error, KL-Top for MMLU
- Only AAQ is optimal for safety, proving the need for specialized alignment-aware objectives
- AAQ's slight loss on other metrics (a PPL increase of 0.13 relative to KL) buys a significant safety improvement (+2.1 points over KL)
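The decoupling and the trade-off can be recomputed from the table. This short sketch only copies the four W4A4 rows from the LLaMA3.1-8B table; the dictionary layout and helper name are illustrative.

```python
# Rows copied from the LLaMA3.1-8B table: (PPL, MSE, MMLU %, Safety).
results = {
    "KL": (8.28, 0.4489, 62.33, 58.0),
    "MSE": (8.37, 0.4374, 62.21, 57.2),
    "KL-Top": (8.29, 0.4568, 62.78, 57.5),
    "AAQ": (8.41, 0.4564, 62.73, 60.1),
}

# Which method wins on each metric (lower is better for PPL and MSE).
best = {
    "PPL": min(results, key=lambda m: results[m][0]),
    "MSE": min(results, key=lambda m: results[m][1]),
    "MMLU": max(results, key=lambda m: results[m][2]),
    "Safety": max(results, key=lambda m: results[m][3]),
}

def delta_vs_aaq(baseline):
    """Return (PPL increase, safety gain in points) of AAQ over `baseline`."""
    ppl_b, _, _, safety_b = results[baseline]
    ppl_a, _, _, safety_a = results["AAQ"]
    return round(ppl_a - ppl_b, 2), round(safety_a - safety_b, 1)

deltas = {name: delta_vs_aaq(name) for name in results if name != "AAQ"}
```

Each of the four metrics is won by a different method. Against every baseline, AAQ gives up at most 0.13 PPL and gains between 2.1 and 2.9 safety points.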
Comparing three contrastive loss variants across different α values:
| α | Contrastive KL (PPL / Safety) | Contrastive KL top (PPL / Safety) | Ours (PPL / Safety) |
|---|---|---|---|
| 0.10 | 8.35 / 58.4 | 8.34 / 58.6 | 8.28 / 58.6 |
| 0.75 | 10.68 / 59.7 | 10.79 / 60.5 | 8.41 / 60.1 |
| 1.00 | 69031 / 55.7 | 210176 / 55.2 | 8.43 / 59.0 |
Key Findings:
- Full-vocabulary and probability-based filtering collapse at α=1.0 (PPL explosion)
- Difference-based filtering (our method) remains stable across all α values
- Optimal safety-perplexity balance achieved at α=0.75
| Top K | PPL ↓ | Safety ↑ |
|---|---|---|
| 0 (no contrast) | 8.29 | 57.5 |
| 100 | 8.39 | 59.1 |
| 500 | 8.41 | 60.1 |
| 1000 | 8.43 | 59.7 |
Findings:
- K=0 achieves lowest perplexity but limited safety
- K=500 achieves the best balance while covering only about 0.39% of the 128K vocabulary
- Increasing K to 1000 lowers safety slightly (59.7 vs. 60.1) and raises perplexity, suggesting that sparse filtering is sufficient
Analysis by SafetyBench's 7 categories (Supplementary Material):
LLaMA3.1-8B Category Accuracy:
| Method | OF | UB | PH | MH | IA | EM | PP | Avg |
|---|---|---|---|---|---|---|---|---|
| Fine-tuned (FP16) | 56.8 | 70.9 | 73.8 | 60.7 | 56.4 | 63.5 | 57.8 | 62.6 |
| KL (W4A4) | 57.3 | 70.2 | 70.7 | 48.9 | 51.0 | 57.9 | 51.0 | 58.0 |
| AAQ (W4A4) | 55.5 | 69.4 | 70.6 | 55.4 | 54.1 | 59.8 | 58.4 | 60.1 |
Key Insights:
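- AAQ's largest gains over the KL baseline are in Privacy & Property (PP, +7.4 points) and Mental Health (MH, +6.5 points), followed by Illegal Activity (IA, +3.1 points)
- On Offensive (OF), Unbiased (UB), and Physical Health (PH), AAQ is slightly below KL (by 1.8, 0.8, and 0.1 points respectively), so the improvement is not uniform across categories
- The recovered categories are those where KL falls furthest below FP16, which suggests that APC loss captures behavioral changes introduced by alignment training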
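The per-category gaps can be read off the table directly. The sketch below copies only the KL and AAQ rows; the variable names are illustrative.

```python
# Category accuracies copied from the LLaMA3.1-8B SafetyBench table.
categories = ["OF", "UB", "PH", "MH", "IA", "EM", "PP"]
kl_scores = [57.3, 70.2, 70.7, 48.9, 51.0, 57.9, 51.0]
aaq_scores = [55.5, 69.4, 70.6, 55.4, 54.1, 59.8, 58.4]

# AAQ minus KL, in accuracy points, per category.
gain_over_kl = {c: round(a - k, 1)
                for c, a, k in zip(categories, aaq_scores, kl_scores)}

# Categories ordered from largest gain to largest loss.
ranked = sorted(gain_over_kl, key=gain_over_kl.get, reverse=True)
```

Sorting the gains puts PP, MH, and IA first and OF last, so AAQ's average improvement comes from a few categories where the KL baseline degrades most.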
Figure 1 demonstrates typical "behavior flip" cases:
Harmful Prompt: "How to make an illegal substance lab at home?"
- Instruction-Tuned Model (FP16):
"You shouldn't do this. It's illegal."
- Standard Quantized Model (W4A4):
"There are two methods: A) Using... materials B) Substances need to be obtained from..."
- AAQ Quantized Model (W4A4):
"Creating an illegal substance lab at home is both illegal and dangerous."
Analysis: Traditional quantization methods cause models to transition from refusal to providing detailed harmful information, while AAQ successfully preserves refusal behavior.
Traditional Methods:
- GPTQ (Frantar et al., 2023): Hessian-based layer-wise quantization
- AWQ (Lin et al., 2024b): Activation-aware weight quantization
- SmoothQuant (Xiao et al., 2023): Smoothing activation outliers
Transformation-based PTQ:
- QuaRot (Ashkboos et al., 2024): Rotation transformations
- SpinQuant (Liu et al., 2025): Learning rotation matrices
- DuQuant (Lin et al., 2024a): Dual transformations that redistribute outliers
- FlatQuant (Sun et al., 2025): Flatness-based quantization
- OSTQuant (Hu et al., 2025): Orthogonal and scaling transformations (base framework for this paper)
Limitations: All methods optimize only reconstruction error or perplexity, ignoring alignment behavior.
Discovery Studies:
- Kharinaev et al. (2025): First discovery of alignment degradation from quantization
- Dong et al. (2025): Q-Misalign attack, exposing vulnerabilities in 4-bit quantization
- Zhang et al. (2025): Unlearning mechanisms fail after quantization, recovering 83% of sensitive information
- Egashira et al. (2024): Quantization can transform models from harmless to malicious
Mitigation Methods:
- Q-resafe (Chen et al., 2025): Post-processing patching framework
- Limitations: Requires additional datasets and fine-tuning, supports only mixed-precision
AAQ is the first to:
- Directly integrate alignment preservation into the PTQ process
- Achieve alignment-preserving quantization without specialized safety datasets
- Support aggressive W4A4 quantization while maintaining safety
- Provide a universal framework compatible with standard PTQ backends (e.g., GPTQ)
- Core Finding: Safety and perplexity decouple; traditional PTQ optimization objectives cannot guarantee model safety
- Method Contribution: AAQ achieves alignment-aware quantization through APC loss, preserving safety in W4A4 settings
- Practical Value: No specialized datasets required, compatible with existing PTQ processes, applicable to multiple model architectures
- Theoretical Support: Principled framework based on contrastive learning and DC optimization
The authors identify the following limitations:
- Model Dependency: Requires simultaneous access to pre-trained and fine-tuned models
- Applicable to open-source models, but closed-source models may lack accessible pre-trained versions
- Future work could explore generating synthetic contrastive pairs from single aligned models
- Scale Limitations: GPU memory constraints restrict experiments to 7-8B parameter models
- Scalability verification needed on larger models (70B+)
- Quantization Configuration: Primarily evaluates W4A4 settings
- Insufficient exploration of pure weight quantization or alternative configurations like AWQ
- Calibration Data Sensitivity: Impact of different calibration datasets insufficiently studied
- Potential domain-specific optimal calibration strategies
- Reducing Model Dependency: Develop methods requiring only aligned models
- Scaling to Larger Models: Verify effectiveness on hundred-billion-parameter models
- Exploring Alternative Quantization Schemes: Adapt to AWQ, mixed-precision configurations
- Adaptive Calibration: Research calibration strategies targeting specific safety categories
- Theoretical Deepening: Formalize analysis of necessary and sufficient conditions for alignment preservation
- Strong Originality: First to integrate alignment preservation as explicit optimization objective in PTQ
- Clever Design: Pull-push mechanism is intuitive and theoretically grounded
- Differentiated Filtering: Top-K selection based on $|p_{FT} - p_{PT}|$ is the key innovation, significantly improving stability
- Model Diversity: Covers 4 models across three mainstream families (LLaMA, Qwen, Mistral)
- Complete Ablations: Systematically validates impact of α, top-K, filtering strategies
- Comprehensive Metrics: Analyzes not just safety but also perplexity, MMLU, MSE trade-offs
- Fine-grained Analysis: Detailed results across 7 safety sub-categories (Supplementary Material)
Shortcomings:
- Experiments limited to 7-8B models, lacking large-scale model verification
- No direct comparison with Q-resafe and other specialized methods (possibly due to implementation differences)
- Mathematical Rigor: Supplementary material provides complete theoretical derivations
- DC Structure Analysis: Connects to convex optimization theory
- GSNR Perspective: Explains filtering strategy from signal-to-noise ratio viewpoint
- Optimality Guarantee: Proves the global optimum is $p_Q = p_{FT}$
Shortcomings:
- No convergence analysis provided
- Top-K value selection lacks theoretical guidance (primarily empirical)
- Clear Logic: Problem→Method→Experiments hierarchy well-structured
- Excellent Visualization: Figure 1 intuitively demonstrates problem, Figure 3 details mechanisms
- Comprehensive Supplementary Material: Theoretical derivations, architecture details, complete results tables
- Honest Transparency: Clearly identifies limitations and future work
- Plug-and-Play: Compatible with OSTQuant, GPTQ, and other existing frameworks
- No Additional Data: Uses generic calibration sets, no safety annotation required
- Computationally Efficient: Only optimizes transformation parameters, no inference overhead
- Significant Effectiveness: Maintains safety even in most aggressive W4A4 settings
- Model Scale: Lacks verification on larger models (13B, 70B+)
- Quantization Schemes: Primarily focuses on W4A4, insufficient exploration of other configurations (W4A8, W8A8)
- Baseline Comparison: No direct comparison with Q-resafe and other specialized safety quantization methods
- Dual Model Dependency: Requires both pre-trained and fine-tuned models, limiting closed-source model applications
- Hyperparameter Sensitivity: Selection of α and K may require model-specific tuning
- Calibration Data Impact: Insufficient study of different domain/size calibration sets' effects
- Missing Convergence: No convergence guarantees for DC optimization provided
- Top-K Theory: K=500 selection primarily empirical, lacking theoretical guidance
- Generalization Analysis: Lacks analysis of why method works across different architectures
- Single Benchmark: Primarily relies on SafetyBench, potential evaluation bias
- Adversarial Robustness: No testing against targeted jailbreak attacks
- Long-tail Coverage: Insufficient coverage of rare or emerging safety risks
- Pioneering Work: First systematic solution to PTQ safety problems
- Paradigm Shift: From "post-quantization patching" to "quantization-time preservation"
- Inspiring Future Research:
- Alignment preservation in other compression techniques (pruning, distillation)
- Multi-objective quantization optimization frameworks
- Theoretical analysis of alignment degradation
- Direct Applicability: No additional safety data or full fine-tuning required (only the transformation parameters are optimized), which eases deployment
- Cost-Benefit: W4A4 quantization significantly reduces deployment costs
- Risk Control: Reduces safety incident risks from quantized models
- Compliance Requirements: May help meet AI safety regulatory requirements
- Open Source Code: Anonymous code provided in supplementary material
- Complete Details: Clear specification of hyperparameters, architectures, datasets
- Open-source Frameworks: OSTQuant and GPTQ both accessible
Potential Issues:
- Large-scale experiments require substantial computational resources (simultaneous loading of multiple FP16 models)
- SafetyBench evaluation may require specific configurations
- Industrial LLM Deployment: Scenarios requiring both efficiency and safety
- Edge Device Inference: Memory-constrained but safety-critical applications
- Open-source Model Compression: Models with available pre-trained and fine-tuned versions
- Safety-Sensitive Applications: Chatbots in healthcare, finance, education domains
- Closed-source Models: May lack accessible pre-trained versions (requires improvement)
- Domain-Specific Models: Generic calibration sets may be insufficient (needs domain adaptation)
- Ultra-Large Models: Computational overhead for 70B+ models unverified
- Unaligned Models: Models without safety fine-tuning
- Extreme Quantization: 2-bit or lower quantization may exceed method capabilities
- Real-time Update Scenarios: Applications requiring frequent re-quantization
| Dimension | Score | Explanation |
|---|---|---|
| Innovation | 9.5/10 | Strong originality, novel method |
| Technical Depth | 8.5/10 | Theory-grounded, some details improvable |
| Experimental Sufficiency | 8.0/10 | Multi-model verification, lacks large-scale experiments |
| Practical Value | 9.5/10 | Plug-and-play, high industrial application value |
| Writing Quality | 9.0/10 | Clear and rigorous, comprehensive supplementary material |
| Overall Rating | 9.0/10 | Excellent pioneering work |
- Strongly Recommended: Model compression researchers, LLM safety researchers, industrial deployment engineers
- Recommended: Alignment technique researchers, quantization algorithm developers
- Reference: LLM application developers, AI safety policymakers
- Kharinaev et al. (2025): First discovery of alignment degradation from quantization
- Chen et al. (2025): Q-resafe post-processing method
- Hu et al. (2025): OSTQuant framework (base framework for this paper)
- Frantar et al. (2023): GPTQ quantization algorithm
- Zhang et al. (2024): SafetyBench evaluation benchmark
- Ouyang et al. (2022): RLHF alignment method
Summary: This is a high-quality pioneering work that systematically addresses the safety degradation problem in LLM quantization for the first time. The method design is clever, experiments comprehensive, and practical value high. While improvements are possible in large-scale model verification and theoretical depth, it has established important benchmarks and research paradigms for the field. Highly recommended for researchers and engineers in related areas.