2025-11-16T14:25:12.038414

Alignment-Aware Quantization for LLM Safety

Wee, Kim, Kim et al.
Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Compared to simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.

Alignment-Aware Quantization for LLM Safety

Basic Information

  • Paper ID: 2511.07842
  • Title: Alignment-Aware Quantization for LLM Safety
  • Authors: Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
  • Institutions: Seoul National University, LG Electronics
  • Classification: cs.AI
  • Publication Date: November 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.07842

Abstract

The deployment of large language models (LLMs) requires simultaneous consideration of safety and efficiency. LLMs achieve safety through human alignment training and efficiency through post-training quantization (PTQ). However, these two objectives often conflict, revealing a fundamental flaw in the traditional PTQ paradigm: if quantization solely pursues low perplexity, it may introduce safety vulnerabilities. Models may exhibit low perplexity while significantly degrading in safety policy alignment, indicating that perplexity is an insufficient and misleading proxy for model safety. To address this issue, this paper proposes Alignment-aware Quantization (AAQ), which integrates Alignment-Preserving Contrastive (APC) loss into the PTQ process. Compared to simple reconstruction loss, AAQ explicitly preserves alignment by encouraging the quantized model to mimic safe instruction-tuned models while diverging from unaligned pre-trained models. The method achieves robust safety alignment without requiring specialized safety calibration datasets, enabling stable 4-bit (W4A4) quantization across multiple model families including LLaMA, Qwen, and Mistral, maintaining safety even when other methods fail.

Research Background and Motivation

1. Core Problem

Large language models face two critical challenges during deployment:

  • Safety: Training models to refuse harmful requests through alignment techniques such as RLHF
  • Efficiency: Reducing memory and computational costs through quantization techniques

Existing research reveals a fundamental conflict between these two objectives: quantization can corrupt the safe behaviors acquired through alignment training, a phenomenon referred to as "alignment degradation".

2. Problem Significance

  • Safety Risks: Quantized models may transition from refusing harmful requests to providing dangerous content (as shown by the "behavior flip" in Figure 1)
  • Deployment Dilemma: Industry requires simultaneously satisfying efficiency and safety requirements, but traditional PTQ methods cannot accommodate both
  • Evaluation Misconception: Traditional metrics such as perplexity fail to reflect model safety degradation

3. Limitations of Existing Methods

  • Standard PTQ Methods (GPTQ, AWQ, etc.): Optimize only reconstruction error or perplexity, ignoring alignment behavior
  • Post-processing Methods (Q-resafe, etc.): Require additional safety datasets and fine-tuning with large computational overhead, supporting only mixed-precision quantization
  • Lack of Forward-Compatible Solutions: No methods directly integrate safety into the quantization process

4. Research Motivation

This paper proposes what it describes as the first principled method that embeds an alignment-preservation objective directly into the PTQ process. It does so through a contrastive learning mechanism with three properties:

  • Maintaining behavioral consistency with safe fine-tuned models (pull)
  • Diverging from unsafe pre-trained model behaviors (push)
  • Requiring no specialized safety datasets, using only generic calibration sets

Core Contributions

  1. First Integrated Alignment-Preserving Quantization Framework: Proposes AAQ, the first method to directly integrate alignment-preservation objectives into existing PTQ processes without requiring post-processing or specialized datasets
  2. Alignment-Preserving Contrastive (APC) Loss: Innovatively designs a contrastive loss function with pull-push mechanisms, explicitly guiding quantized models toward safe models and away from unsafe models
  3. Practical Validation: Validates the effectiveness of W4A4 quantization across multiple architectures including LLaMA2, LLaMA3.1, Qwen2, and Mistral, demonstrating method generalizability
  4. Key Insight: Reveals a decoupling between safety, utility, and fidelity, showing empirically that optimizing traditional metrics does not guarantee model safety

Method Details

Task Definition

Input:

  • Pre-trained model $M_{PT}$ (unsafe)
  • Fine-tuned model $M_{FT}$ (aligned through RLHF, safe)
  • Small-scale calibration dataset $D$ (unannotated, generic text)

Output:

  • Quantized model $M_Q$ (4-bit weights and activations, preserving safety alignment)

Constraints:

  • Maintain low perplexity (language quality)
  • Preserve safe alignment behavior (SafetyBench accuracy)
  • No specialized safety datasets required
  • Small computational overhead (optimizing only limited transformation parameters)

Model Architecture

Overall Framework

AAQ is based on the transformation-based PTQ paradigm (as shown in Figure 2b), introducing learnable transformation matrices before quantization:

$$Y = WX = (WT)(T^{-1}X)$$

where $T$ is the transformation matrix, which can be fused into weights during inference with no additional computational cost.
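The identity above can be checked numerically. The following is a minimal sketch, not the paper's implementation: the shapes are made up, and the random orthogonal `T` is only an illustrative stand-in for the learnable transformations (an orthogonal matrix is convenient because its inverse is its transpose).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer: W is (out_features, in_features), X is (in_features, tokens).
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 3))

# Hypothetical transformation T: a random orthogonal matrix obtained from a
# QR decomposition, so that T^{-1} is simply T.T.
T, _ = np.linalg.qr(rng.normal(size=(8, 8)))
T_inv = T.T

W_t = W @ T        # transformed weights (the tensor that would be quantized)
X_t = T_inv @ X    # transformed activations

Y_ref = W @ X
Y_t = W_t @ X_t
max_abs_diff = float(np.max(np.abs(Y_ref - Y_t)))
print(f"max |WX - (WT)(T^-1 X)| = {max_abs_diff:.2e}")
```

In full precision the two products agree up to floating-point error; the transformation only matters once `W_t` and `X_t` are quantized, because it changes how the values to be rounded are distributed.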

Core Component: Alignment-Preserving Contrastive (APC) Loss

1. Vocabulary Filtering Strategy

To focus on high-signal outputs related to alignment, define two vocabulary index sets:

  • $S_{top}(x)$: Indices of the top-$K$ highest probabilities under the fine-tuned model $p_{FT}(y|x)$ (corresponding to "top-mag logits")
  • $S_{diff}(x)$: Indices of the top-$K$ largest differences $|p_{FT}(y|x) - p_{PT}(y|x)|$ (corresponding to "top-diff logits")

Renormalized distribution over subset $S$:

$$p^S(y) = \frac{p(y)}{\sum_{y' \in S} p(y')}, \quad y \in S$$
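As an illustration of the two index sets and the renormalization, here is a small sketch on a six-token toy vocabulary. The probabilities are invented for the example and the helper names (`top_k_indices`, `renormalize`) are hypothetical, not taken from the paper's code.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k largest entries of a 1-D array, largest first."""
    return np.argsort(scores)[::-1][:k]

def renormalize(p, idx):
    """Restrict distribution p to the index set idx and rescale it to sum to 1."""
    sub = p[idx]
    return sub / sub.sum()

# Made-up next-token distributions over a 6-token vocabulary.
p_ft = np.array([0.50, 0.20, 0.15, 0.10, 0.04, 0.01])  # aligned (fine-tuned) model
p_pt = np.array([0.10, 0.20, 0.15, 0.05, 0.45, 0.05])  # pre-trained model

K = 2
s_top = top_k_indices(p_ft, K)                  # tokens the aligned model prefers
s_diff = top_k_indices(np.abs(p_ft - p_pt), K)  # tokens most changed by alignment

p_ft_top = renormalize(p_ft, s_top)
print("S_top :", s_top.tolist(), "renormalized p_FT:", p_ft_top.round(3).tolist())
print("S_diff:", s_diff.tolist())
```

Note that the two sets need not coincide: token 4 is unlikely under the aligned model, so it is absent from `s_top`, yet it is where the two models disagree most, so it enters `s_diff`.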

2. Pull-Push Mechanism

Pull component (alignment objective):

$$\mathcal{L}_{KL-top} = \frac{1}{|D|} \sum_{x \in D} KL\left(p^{S_{top}}_{FT}(y|x) \,\|\, p^{S_{top}}_Q(y|x)\right)$$

Push component (contrastive term):

$$\mathcal{L}_{cont-top} = \frac{1}{|D|} \sum_{x \in D} KL\left(p^{S_{diff}}_{PT}(y|x) \,\|\, p^{S_{diff}}_Q(y|x)\right)$$

3. Final Loss Function

$$\mathcal{L}_{APC} = \mathcal{L}_{KL-top} - \alpha \cdot \mathcal{L}_{cont-top}$$

where $\alpha > 0$ controls the strength of the contrastive term (set to 0.75 in experiments).
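A per-sample version of this loss can be sketched as follows. This is an illustrative reading of the formulas in this section, not the authors' code: the helper names and the toy distributions are assumptions, and the average over the calibration set $D$ is omitted (it is a plain mean of the per-sample values).

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest entries, largest first."""
    return np.argsort(scores)[::-1][:k]

def renormalize(p, idx):
    """Restrict p to idx and rescale to sum to 1."""
    sub = p[idx]
    return sub / sub.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same support."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def apc_loss(p_ft, p_pt, p_q, k, alpha=0.75):
    """Per-sample APC loss: pull toward p_FT on S_top, push away from p_PT on S_diff."""
    s_top = top_k_indices(p_ft, k)
    s_diff = top_k_indices(np.abs(p_ft - p_pt), k)
    pull = kl(renormalize(p_ft, s_top), renormalize(p_q, s_top))
    push = kl(renormalize(p_pt, s_diff), renormalize(p_q, s_diff))
    return pull - alpha * push, pull, push

# Made-up distributions over a 6-token vocabulary.
p_ft = np.array([0.50, 0.20, 0.15, 0.10, 0.04, 0.01])
p_pt = np.array([0.10, 0.20, 0.15, 0.05, 0.45, 0.05])

# A quantized model that matches the aligned model vs. one that matches the base model.
loss_aligned, pull_aligned, _ = apc_loss(p_ft, p_pt, p_q=p_ft, k=2)
loss_unaligned, _, push_unaligned = apc_loss(p_ft, p_pt, p_q=p_pt, k=2)
print(f"p_Q = p_FT: loss = {loss_aligned:.3f}")
print(f"p_Q = p_PT: loss = {loss_unaligned:.3f}")
```

On this toy input, a quantized distribution equal to $p_{FT}$ has zero pull cost and a negative total loss, while one equal to $p_{PT}$ has zero push reward and a positive total loss, which is the ordering the pull-push design is meant to produce.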

Optimization Procedure (Algorithm 1)

  1. Initialize transformation parameters $\theta$
  2. For each calibration sample $x \in D$:
    • Compute $p_{FT}(y|x)$ and $p_{PT}(y|x)$
    • Apply transformation to obtain $p_Q(y|x)$
    • Select $S_{top}$ and $S_{diff}$ index sets
    • Compute and accumulate $\mathcal{L}_{APC}$
  3. Update $\theta$ to minimize loss
  4. Apply GPTQ quantization to obtain final model
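The control flow of steps 1 to 3 can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the "quantized model" is simulated as the fine-tuned logits plus noise, $\theta$ is a hypothetical per-token logit offset rather than the learnable transformation matrices, the gradient is a finite-difference estimate rather than backpropagation, and step 4 (GPTQ) is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def renorm(p, idx):
    return p[idx] / p[idx].sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def apc(p_ft, p_pt, p_q, k, alpha):
    """Per-sample APC loss on the S_top and S_diff index sets."""
    s_top = np.argsort(p_ft)[::-1][:k]
    s_diff = np.argsort(np.abs(p_ft - p_pt))[::-1][:k]
    pull = kl(renorm(p_ft, s_top), renorm(p_q, s_top))
    push = kl(renorm(p_pt, s_diff), renorm(p_q, s_diff))
    return pull - alpha * push

def mean_loss(theta, z_q, p_ft, p_pt, k, alpha):
    # Step 2: accumulate the loss over all calibration samples, then average.
    losses = [apc(f, p, softmax(z + theta), k, alpha)
              for z, f, p in zip(z_q, p_ft, p_pt)]
    return float(np.mean(losses))

def calibrate(z_q, p_ft, p_pt, k=4, alpha=0.75, lr=0.1, steps=20, h=1e-5):
    theta = np.zeros(z_q.shape[1])                       # step 1: initialize theta
    loss = mean_loss(theta, z_q, p_ft, p_pt, k, alpha)
    for _ in range(steps):                               # step 3: update theta
        grad = np.zeros_like(theta)
        for i in range(theta.size):                      # finite-difference gradient
            bumped = theta.copy()
            bumped[i] += h
            grad[i] = (mean_loss(bumped, z_q, p_ft, p_pt, k, alpha) - loss) / h
        candidate = theta - lr * grad
        cand_loss = mean_loss(candidate, z_q, p_ft, p_pt, k, alpha)
        if cand_loss < loss:                             # accept only improving steps
            theta, loss = candidate, cand_loss
        else:
            lr *= 0.5                                    # otherwise shrink the step
    return theta, loss

rng = np.random.default_rng(0)
n_samples, vocab = 8, 10
z_ft = 2.0 * rng.normal(size=(n_samples, vocab))         # toy fine-tuned logits
z_pt = 2.0 * rng.normal(size=(n_samples, vocab))         # toy pre-trained logits
z_q = z_ft + 0.5 * rng.normal(size=(n_samples, vocab))   # simulated quantization noise
p_ft = np.array([softmax(z) for z in z_ft])
p_pt = np.array([softmax(z) for z in z_pt])

initial_loss = mean_loss(np.zeros(vocab), z_q, p_ft, p_pt, k=4, alpha=0.75)
theta, final_loss = calibrate(z_q, p_ft, p_pt)
print(f"mean APC loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because the index sets depend only on $p_{FT}$ and $p_{PT}$, they stay fixed while $\theta$ changes; only $p_Q$ moves during optimization.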

Technical Innovations

1. Contrastive Learning Perspective

  • Distinction from Traditional PTQ: Not only reconstructs outputs, but explicitly models preservation of safe behaviors and suppression of unsafe behaviors
  • Distinction from Knowledge Distillation: Introduces negative samples (pre-trained models) as contrastive references, rather than purely imitating teacher models

2. Differentiated Top-K Filtering

  • Pull Term: Uses high-probability regions of $p_{FT}$, preserving primary alignment behaviors
  • Push Term: Uses regions with maximum $|p_{FT} - p_{PT}|$, focusing on outputs most changed by alignment training
  • Theoretical Support: Improves gradient signal-to-noise ratio (GSNR), avoiding long-tail noise (Supplementary Material A.5)

3. DC Optimization Structure

The loss function can be viewed as a Difference-of-Convex (DC) problem:

$$\mathcal{L}_{CKL} = g(p_Q) - h(p_Q)$$

where $\mathcal{L}_{CKL}$ denotes the full-vocabulary contrastive KL loss and both $g$ and $h$ are convex functions. While specialized DC algorithms are not employed, this structure gives the optimization a theoretical foundation (Supplementary Material A.4).
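One natural choice of $g$ and $h$ is sketched below. It assumes the full-vocabulary loss has the same pull-minus-push form as $\mathcal{L}_{APC}$ but without the Top-K restriction; the summary does not spell out the decomposition used in Supplementary Material A.4, so this is an illustration rather than the paper's derivation.

```latex
\begin{aligned}
\mathcal{L}_{CKL}(p_Q) &= g(p_Q) - h(p_Q), \\
g(p_Q) &= KL(p_{FT} \,\|\, p_Q)
        = \sum_y p_{FT}(y)\log p_{FT}(y) - \sum_y p_{FT}(y)\log p_Q(y), \\
h(p_Q) &= \alpha\, KL(p_{PT} \,\|\, p_Q)
        = \alpha\sum_y p_{PT}(y)\log p_{PT}(y) - \alpha\sum_y p_{PT}(y)\log p_Q(y).
\end{aligned}
```

Each of $g$ and $h$ is a constant plus a non-negatively weighted sum of $-\log p_Q(y)$ terms (with $\alpha > 0$), and $-\log$ is convex, so both are convex in $p_Q$. Their difference is in general neither convex nor concave, which is what makes the problem a DC program.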

4. Optimality Guarantee

The full-vocabulary version of contrastive loss satisfies:

$$\mathcal{L}_{CKL}(p_Q) \geq -KL(p_{PT} \| p_{FT})$$

Equality holds if and only if $p_Q = p_{FT}$, meaning the global optimum is complete recovery of the fine-tuned model (Supplementary Material A.2).

Experimental Setup

Datasets

Calibration Data:

  • 128 unannotated samples from WIKITEXT-2 dataset
  • Used for optimizing transformation parameters and quantization

Evaluation Data:

  • Language Quality: Perplexity (PPL) on WIKITEXT-2
  • Safety Alignment: SafetyBench benchmark
    • 11,435 multiple-choice questions
    • 7 safety categories: Offensiveness (OF), Unfairness and Bias (UB), Physical Health (PH), Mental Health (MH), Illegal Activities (IA), Ethics and Morality (EM), Privacy and Property (PP)
  • General Capability: MMLU benchmark (used only for comprehensive LLaMA3.1 evaluation)

Evaluation Metrics

  1. Perplexity (PPL) ↓: Language modeling quality
  2. SafetyBench Accuracy ↑: Degree of safety alignment preservation
  3. MMLU Accuracy ↑: General task capability
  4. Mean Squared Error (MSE) ↓: Output fidelity
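For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses made-up token log-probabilities and a hypothetical helper name; it is only meant to show that the metric depends on next-token likelihood alone and contains no term that measures refusal behavior.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the given tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token behaves like a
# uniform guess among 4 options.
uniform_log_probs = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_log_probs)
print(f"perplexity = {ppl_uniform:.2f}")
```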

Comparison Methods

Standard PTQ Methods:

  • RTN (Round-to-Nearest): Naive quantization
  • GPTQ: Hessian-based quantization
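To make the RTN baseline concrete, here is a generic sketch of symmetric per-row round-to-nearest quantization to 4 bits. The function name, the per-row scaling, and the toy weights are assumptions for illustration; the paper's exact RTN configuration is not given in this summary.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (generic sketch)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit signed integers
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 16))       # toy weight matrix
q, scale = rtn_quantize(w)
w_hat = q * scale                  # dequantized weights

max_err = float(np.max(np.abs(w - w_hat)))
print(f"integer range: [{q.min()}, {q.max()}], max abs error: {max_err:.4f}")
```

RTN rounds each weight independently and never looks at model outputs, which is why it serves as the naive baseline against which output-aware objectives are compared.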

Alternative Loss Objectives (all based on OSTQuant framework):

  • MSE: Mean squared error loss
  • KL: Full-vocabulary KL divergence
  • KL-Top: Top-K KL divergence based on $p_{FT}$ probabilities

This Paper's Method:

  • AAQ: Using APC loss + GPTQ backend

Implementation Details

  • Quantization Configuration: W4A4 (4-bit weights and activations)
  • Base Framework: OSTQuant (learnable orthogonal and scaling transformations)
  • Hyperparameters:
    • Contrastive weight $\alpha = 0.75$
    • Top-K value $K = 500$
    • Number of calibration samples: 128
  • Models: LLaMA2-7B-Chat, LLaMA3.1-8B-Instruct, Qwen2-7B-Instruct, Mistral-7B-Instruct-v0.1
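The reported settings can be gathered into a single configuration object. The class and field names below are hypothetical (they are not the configuration schema of OSTQuant or of the paper's code); only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AAQConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    weight_bits: int = 4               # W4
    act_bits: int = 4                  # A4
    alpha: float = 0.75                # contrastive weight
    top_k: int = 500                   # size of S_top and S_diff
    num_calib_samples: int = 128       # calibration samples
    calib_dataset: str = "wikitext2"   # assumed identifier for WIKITEXT-2
    models: tuple = (
        "LLaMA2-7B-Chat",
        "LLaMA3.1-8B-Instruct",
        "Qwen2-7B-Instruct",
        "Mistral-7B-Instruct-v0.1",
    )

cfg = AAQConfig()
print(cfg)
```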

Experimental Results

Main Results (Table 1)

Across all safety-tuned models, AAQ consistently achieves the best safety score among the quantized (W4A4) variants:

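| Model | Method | PPL ↓ | Safety ↑ |
|---|---|---|---|
| LLaMA3.1-8B | Fine-tuned (FP16) | 7.23 | 62.6 |
| | KL (W4A4) | 8.28 | 58.0 |
| | AAQ (W4A4) | 8.41 | 60.1 |
| LLaMA2-7B | Fine-tuned (FP16) | 6.94 | 50.0 |
| | KL-Top (W4A4) | 7.28 | 48.9 |
| | AAQ (W4A4) | 7.56 | 49.7 |
| Qwen2-7B | Fine-tuned (FP16) | 7.60 | 69.4 |
| | KL-Top (W4A4) | 8.18 | 66.5 |
| | AAQ (W4A4) | 8.23 | 66.8 |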

Key Findings:

  • RTN and GPTQ cause catastrophic safety degradation (dropping to 36-38%)
  • Reconstruction-based methods (MSE, KL) partially recover safety but remain significantly below FP16 baseline
  • AAQ comes closest to FP16 safety performance while maintaining acceptable perplexity

Metric Decoupling Analysis (Table 2)

Comprehensive evaluation on LLaMA3.1-8B reveals key insights:

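| Method | PPL ↓ | MSE ↓ | MMLU ↑ | Safety ↑ |
|---|---|---|---|---|
| Fine-tuned (FP16) | 7.23 | - | 68.25% | 62.6 |
| KL (W4A4) | 8.28 | 0.4489 | 62.33% | 58.0 |
| MSE (W4A4) | 8.37 | 0.4374 | 62.21% | 57.2 |
| KL-Top (W4A4) | 8.29 | 0.4568 | 62.78% | 57.5 |
| AAQ (W4A4) | 8.41 | 0.4564 | 62.73% | 60.1 |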

Core Finding:

  • Metric Decoupling Phenomenon: Different methods excel on different metrics
  • KL is best on PPL, MSE on reconstruction error, and KL-Top on MMLU
  • Only AAQ is best on safety, which supports the need for specialized alignment-aware objectives
  • AAQ gives up a little on other metrics (PPL rises by 0.13 relative to KL) in exchange for a notable safety gain (+2.1 percentage points over KL)
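The decoupling can be read directly off the Table 2 numbers. The short script below copies the W4A4 rows and picks the best method per metric; the dictionary layout and variable names are just one convenient way to organize the values.

```python
# Values copied from Table 2 (LLaMA3.1-8B, W4A4 rows only).
table2 = {
    "KL":     {"ppl": 8.28, "mse": 0.4489, "mmlu": 62.33, "safety": 58.0},
    "MSE":    {"ppl": 8.37, "mse": 0.4374, "mmlu": 62.21, "safety": 57.2},
    "KL-Top": {"ppl": 8.29, "mse": 0.4568, "mmlu": 62.78, "safety": 57.5},
    "AAQ":    {"ppl": 8.41, "mse": 0.4564, "mmlu": 62.73, "safety": 60.1},
}
lower_is_better = {"ppl", "mse"}

best = {}
for metric in ("ppl", "mse", "mmlu", "safety"):
    pick = min if metric in lower_is_better else max
    best[metric] = pick(table2, key=lambda method: table2[method][metric])

# Trade-off of AAQ relative to the KL baseline.
ppl_cost = round(table2["AAQ"]["ppl"] - table2["KL"]["ppl"], 2)
safety_gain = round(table2["AAQ"]["safety"] - table2["KL"]["safety"], 1)

print(best)
print(f"AAQ vs KL: PPL +{ppl_cost}, safety +{safety_gain} points")
```

Each of the four metrics is won by a different method, so ranking methods by perplexity or MSE alone would not have selected the one with the best safety score.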

Ablation Studies

1. Impact of Vocabulary Filtering Strategy (Table 3)

Comparing three contrastive loss variants across different $\alpha$ values:

| $\alpha$ | Contrastive KL (PPL / Safety) | Contrastive KL top (PPL / Safety) | Ours (PPL / Safety) |
|---|---|---|---|
| 0.10 | 8.35 / 58.4 | 8.34 / 58.6 | 8.28 / 58.6 |
| 0.75 | 10.68 / 59.7 | 10.79 / 60.5 | 8.41 / 60.1 |
| 1.00 | 69031 / 55.7 | 210176 / 55.2 | 8.43 / 59.0 |

Key Findings:

  • Full-vocabulary and probability-based filtering collapse at $\alpha = 1.0$ (PPL explosion)
  • Difference-based filtering (our method) remains stable across all $\alpha$ values
  • Optimal safety-perplexity balance achieved at $\alpha = 0.75$

2. Impact of Top-K Value (Table 4)

| Top K | PPL ↓ | Safety ↑ |
|---|---|---|
| 0 (no contrast) | 8.29 | 57.5 |
| 100 | 8.39 | 59.1 |
| 500 | 8.41 | 60.1 |
| 1000 | 8.43 | 59.7 |

Findings:

  • K=0 achieves lowest perplexity but limited safety
  • K=500 achieves the best balance while covering only about 0.39% of the 128K-token vocabulary
  • Increasing K to 1000 does not help further (safety 59.7 vs. 60.1 at K=500), suggesting that sparse filtering is sufficient
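The 0.39% figure and the cost of adding the contrastive term follow from simple arithmetic on the numbers above. The vocabulary size is taken as a round 128,000 here, which is an assumption based on the "128K" stated in the text.

```python
top_k = 500
vocab_size = 128_000   # "128K" as stated above; the exact tokenizer size may differ slightly

fraction_pct = 100 * top_k / vocab_size
print(f"K = {top_k} covers {fraction_pct:.2f}% of the vocabulary")

# Table 4: effect of the contrastive term at K = 500 relative to K = 0.
ppl_k0, safety_k0 = 8.29, 57.5
ppl_k500, safety_k500 = 8.41, 60.1
ppl_cost = round(ppl_k500 - ppl_k0, 2)
safety_gain = round(safety_k500 - safety_k0, 1)
print(f"K=500 vs K=0: PPL +{ppl_cost}, safety +{safety_gain} points")
```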

Fine-grained Safety Performance (Table S1)

Analysis by SafetyBench's 7 categories (Supplementary Material):

LLaMA3.1-8B Category Accuracy:

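| Method | OF | UB | PH | MH | IA | EM | PP | Avg |
|---|---|---|---|---|---|---|---|---|
| Fine-tuned (FP16) | 56.8 | 70.9 | 73.8 | 60.7 | 56.4 | 63.5 | 57.8 | 62.6 |
| KL (W4A4) | 57.3 | 70.2 | 70.7 | 48.9 | 51.0 | 57.9 | 51.0 | 58.0 |
| AAQ (W4A4) | 55.5 | 69.4 | 70.6 | 55.4 | 54.1 | 59.8 | 58.4 | 60.1 |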

Key Insights:

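  • Relative to KL, AAQ improves most in the Privacy & Property (PP, +7.4 points) and Mental Health (MH, +6.5 points) categories, with smaller gains in Illegal Activity (IA, +3.1) and Ethics (EM, +1.9)
  • On Offensive (OF), Unbiased (UB), and Physical Health (PH), AAQ is level with or slightly below KL (for example, 55.5 vs. 57.3 on OF), so the gains are concentrated in a subset of categories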
  • Demonstrates that APC loss effectively captures behavioral changes introduced by alignment training
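The per-category differences between AAQ and KL can be computed directly from the table. The script below copies the two W4A4 rows; the variable names are arbitrary.

```python
# Rows copied from the LLaMA3.1-8B category table (W4A4).
categories = ["OF", "UB", "PH", "MH", "IA", "EM", "PP"]
kl_scores = [57.3, 70.2, 70.7, 48.9, 51.0, 57.9, 51.0]
aaq_scores = [55.5, 69.4, 70.6, 55.4, 54.1, 59.8, 58.4]

# Difference in accuracy (percentage points), AAQ minus KL, per category.
delta = {c: round(a - k, 1) for c, a, k in zip(categories, aaq_scores, kl_scores)}
largest_gain = max(delta, key=delta.get)

for category, value in sorted(delta.items(), key=lambda item: -item[1]):
    print(f"{category}: {value:+.1f}")
```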

Case Analysis

Figure 1 demonstrates typical "behavior flip" cases:

Harmful Prompt: "How to make an illegal substance lab at home?"

  • Instruction-Tuned Model (FP16):

    "You shouldn't do this. It's illegal."

  • Standard Quantized Model (W4A4):

    "There are two methods: A) Using... materials B) Substances need to be obtained from..."

  • AAQ Quantized Model (W4A4):

    "Creating an illegal substance lab at home is both illegal and dangerous."

Analysis: Traditional quantization methods cause models to transition from refusal to providing detailed harmful information, while AAQ successfully preserves refusal behavior.

Related Work

1. Post-Training Quantization (PTQ) for LLMs

Traditional Methods:

  • GPTQ (Frantar et al., 2023): Hessian-based layer-wise quantization
  • AWQ (Lin et al., 2024b): Activation-aware weight quantization
  • SmoothQuant (Xiao et al., 2023): Smoothing activation outliers

Transformation-based PTQ:

  • QuaRot (Ashkboos et al., 2024): Rotation transformations
  • SpinQuant (Liu et al., 2025): Learning rotation matrices
  • DuQuant (Lin et al., 2024a): Dual transformations that distribute outliers
  • FlatQuant (Sun et al., 2025): Flatness-based quantization
  • OSTQuant (Hu et al., 2025): Orthogonal and scaling transformations (base framework for this paper)

Limitations: All methods optimize only reconstruction error or perplexity, ignoring alignment behavior.

2. Alignment Vulnerability Under Quantization

Discovery Studies:

  • Kharinaev et al. (2025): First discovery of alignment degradation from quantization
  • Dong et al. (2025): Q-Misalign attack, exposing vulnerabilities in 4-bit quantization
  • Zhang et al. (2025): Unlearning mechanisms fail after quantization, recovering 83% of sensitive information
  • Egashira et al. (2024): Quantization can transform models from harmless to malicious

Mitigation Methods:

  • Q-resafe (Chen et al., 2025): Post-processing patching framework
    • Limitations: Requires additional datasets and fine-tuning, supports only mixed-precision

3. Positioning of This Work

AAQ is the first to:

  • Directly integrate alignment preservation into the PTQ process
  • Achieve alignment-preserving quantization without specialized safety datasets
  • Support aggressive W4A4 quantization while maintaining safety
  • Provide a universal framework compatible with standard PTQ backends (e.g., GPTQ)

Conclusions and Discussion

Main Conclusions

  1. Core Finding: Safety and perplexity decouple; traditional PTQ optimization objectives cannot guarantee model safety
  2. Method Contribution: AAQ achieves alignment-aware quantization through APC loss, preserving safety in W4A4 settings
  3. Practical Value: No specialized datasets required, compatible with existing PTQ processes, applicable to multiple model architectures
  4. Theoretical Support: Principled framework based on contrastive learning and DC optimization

Limitations

The authors identify the following limitations:

  1. Model Dependency: Requires simultaneous access to pre-trained and fine-tuned models
    • Applicable to open-source models, but closed-source models may lack accessible pre-trained versions
    • Future work could explore generating synthetic contrastive pairs from single aligned models
  2. Scale Limitations: GPU memory constraints restrict experiments to 7-8B parameter models
    • Scalability verification needed on larger models (70B+)
  3. Quantization Configuration: Primarily evaluates W4A4 settings
    • Insufficient exploration of pure weight quantization or alternative configurations like AWQ
  4. Calibration Data Sensitivity: Impact of different calibration datasets insufficiently studied
    • Potential domain-specific optimal calibration strategies

Future Directions

  1. Reducing Model Dependency: Develop methods requiring only aligned models
  2. Scaling to Larger Models: Verify effectiveness on hundred-billion-parameter models
  3. Exploring Alternative Quantization Schemes: Adapt to AWQ, mixed-precision configurations
  4. Adaptive Calibration: Research calibration strategies targeting specific safety categories
  5. Theoretical Deepening: Formalize analysis of necessary and sufficient conditions for alignment preservation

In-Depth Evaluation

Strengths

1. Method Innovation (★★★★★)

  • Strong Originality: First to integrate alignment preservation as explicit optimization objective in PTQ
  • Clever Design: Pull-push mechanism is intuitive and theoretically grounded
  • Differentiated Filtering: Top-K selection based on $|p_{FT} - p_{PT}|$ is key innovation, significantly improving stability

2. Experimental Sufficiency (★★★★☆)

  • Model Diversity: Covers four models from three mainstream families (LLaMA2, LLaMA3.1, Qwen2, Mistral)
  • Complete Ablations: Systematically validates impact of $\alpha$, top-K, filtering strategies
  • Comprehensive Metrics: Analyzes not just safety but also perplexity, MMLU, MSE trade-offs
  • Fine-grained Analysis: Detailed results across 7 safety sub-categories (Supplementary Material)

Shortcomings:

  • Experiments limited to 7-8B models, lacking large-scale model verification
  • No direct comparison with Q-resafe and other specialized methods (possibly due to implementation differences)

3. Theoretical Depth (★★★★☆)

  • Mathematical Rigor: Supplementary material provides complete theoretical derivations
  • DC Structure Analysis: Connects to convex optimization theory
  • GSNR Perspective: Explains filtering strategy from signal-to-noise ratio viewpoint
  • Optimality Guarantee: Proves global optimum is $p_Q = p_{FT}$

Shortcomings:

  • No convergence analysis provided
  • Top-K value selection lacks theoretical guidance (primarily empirical)

4. Writing Clarity (★★★★★)

  • Clear Logic: Problem→Method→Experiments hierarchy well-structured
  • Excellent Visualization: Figure 1 intuitively demonstrates problem, Figure 3 details mechanisms
  • Comprehensive Supplementary Material: Theoretical derivations, architecture details, complete results tables
  • Honest Transparency: Clearly identifies limitations and future work

5. Practical Value (★★★★★)

  • Plug-and-Play: Compatible with OSTQuant, GPTQ, and other existing frameworks
  • No Additional Data: Uses generic calibration sets, no safety annotation required
  • Computationally Efficient: Only optimizes transformation parameters, no inference overhead
  • Significant Effectiveness: Maintains safety even in most aggressive W4A4 settings

Shortcomings

1. Experimental Coverage

  • Model Scale: Lacks verification on larger models (13B, 70B+)
  • Quantization Schemes: Primarily focuses on W4A4, insufficient exploration of other configurations (W4A8, W8A8)
  • Baseline Comparison: No direct comparison with Q-resafe and other specialized safety quantization methods

2. Method Limitations

  • Dual Model Dependency: Requires both pre-trained and fine-tuned models, limiting closed-source model applications
  • Hyperparameter Sensitivity: Selection of $\alpha$ and $K$ may require model-specific tuning
  • Calibration Data Impact: Insufficient study of different domain/size calibration sets' effects

3. Theoretical Analysis

  • Missing Convergence: No convergence guarantees for DC optimization provided
  • Top-K Theory: K=500 selection primarily empirical, lacking theoretical guidance
  • Generalization Analysis: Lacks analysis of why method works across different architectures

4. Safety Evaluation

  • Single Benchmark: Primarily relies on SafetyBench, potential evaluation bias
  • Adversarial Robustness: No testing against targeted jailbreak attacks
  • Long-tail Coverage: Insufficient coverage of rare or emerging safety risks

Impact Assessment

1. Academic Contribution (★★★★★)

  • Pioneering Work: First systematic solution to PTQ safety problems
  • Paradigm Shift: From "post-quantization patching" to "quantization-time preservation"
  • Inspiring Future Research:
    • Alignment preservation in other compression techniques (pruning, distillation)
    • Multi-objective quantization optimization frameworks
    • Theoretical analysis of alignment degradation

2. Industrial Value (★★★★★)

  • Direct Applicability: No additional safety data or full fine-tuning required (only the transformation parameters are optimized), easy deployment
  • Cost-Benefit: W4A4 quantization significantly reduces deployment costs
  • Risk Control: Reduces safety incident risks from quantized models
  • Compliance Requirements: Can help meet AI safety regulatory requirements

3. Reproducibility (★★★★☆)

  • Open Source Code: Anonymous code provided in supplementary material
  • Complete Details: Clear specification of hyperparameters, architectures, datasets
  • Open-source Frameworks: OSTQuant and GPTQ both accessible

Potential Issues:

  • Large-scale experiments require substantial computational resources (simultaneous loading of multiple FP16 models)
  • SafetyBench evaluation may require specific configurations

Applicable Scenarios

Highly Applicable

  1. Industrial LLM Deployment: Scenarios requiring both efficiency and safety
  2. Edge Device Inference: Memory-constrained but safety-critical applications
  3. Open-source Model Compression: Models with available pre-trained and fine-tuned versions
  4. Safety-Sensitive Applications: Chatbots in healthcare, finance, education domains

Partially Applicable

  1. Closed-source Models: May lack accessible pre-trained versions (requires improvement)
  2. Domain-Specific Models: Generic calibration sets may be insufficient (needs domain adaptation)
  3. Ultra-Large Models: Computational overhead for 70B+ models unverified

Not Applicable

  1. Unaligned Models: Models without safety fine-tuning
  2. Extreme Quantization: 2-bit or lower quantization may exceed method capabilities
  3. Real-time Update Scenarios: Applications requiring frequent re-quantization

Comprehensive Scoring

| Dimension | Score | Explanation |
|---|---|---|
| Innovation | 9.5/10 | Strong originality, novel method |
| Technical Depth | 8.5/10 | Theory-grounded, some details improvable |
| Experimental Sufficiency | 8.0/10 | Multi-model verification, lacks large-scale experiments |
| Practical Value | 9.5/10 | Plug-and-play, high industrial application value |
| Writing Quality | 9.0/10 | Clear and rigorous, comprehensive supplementary material |
| Overall Rating | 9.0/10 | Excellent pioneering work |
  • Strongly Recommended: Model compression researchers, LLM safety researchers, industrial deployment engineers
  • Recommended: Alignment technique researchers, quantization algorithm developers
  • Reference: LLM application developers, AI safety policymakers

Key References

  1. Kharinaev et al. (2025): First discovery of alignment degradation from quantization
  2. Chen et al. (2025): Q-resafe post-processing method
  3. Hu et al. (2025): OSTQuant framework (base framework for this paper)
  4. Frantar et al. (2023): GPTQ quantization algorithm
  5. Zhang et al. (2024): SafetyBench evaluation benchmark
  6. Ouyang et al. (2022): RLHF alignment method

Summary: This is a high-quality pioneering work that systematically addresses the safety degradation problem in LLM quantization for the first time. The method design is clever, experiments comprehensive, and practical value high. While improvements are possible in large-scale model verification and theoretical depth, it has established important benchmarks and research paradigms for the field. Highly recommended for researchers and engineers in related areas.