Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
Rezkellah, Dakhmouche
With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.
With the widespread adoption of Large Language Models (LLMs), greater customization is required to ensure privacy protection and safe generation. The paper addresses this objective from two perspectives: unlearning of sensitive information and robustness against jailbreak attacks. The authors propose several constrained optimization formulations that treat both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or make the LLM robust to tailored attacks by shifting part of the weights to a safer region. The approach requires no oracle classifier, which is typically unavailable or adds computational overhead. Notably, the authors find that the simplest point-wise constraint-based intervention performs better than the max-min interventions while incurring a lower computational cost.
Machine Unlearning Problem: How to remove certain information (specific vocabulary sets) from a language model's generation space with minimal computational cost
Adversarial Robustness Problem: How to make language models more robust to jailbreak adversarial attacks that lead to dangerous or toxic content
With the deployment of LLMs in safety-sensitive applications (such as online content moderation and confidential data processing), ensuring the safety of generative model outputs has become a critical requirement. Existing methods face trade-offs between computational efficiency and defense effectiveness.
Fine-tuning and Model Enhancement: High computational overhead
Prompt-based Defense: Fragile and susceptible to adversarial manipulation
Lightweight Probe Methods: Limited by scarce training data and ineffective against adversarial attacks
Unlearning Methods: Primarily rely on partial retraining through teacher-student frameworks or iterative fine-tuning, incurring high computational costs
Inspired by principled robustness methods in regression, the authors propose a unified framework addressing both adversarial robustness and unlearning by leveraging the fact that information is implicitly stored along latent space pathways.
Unified Framework: Proposes and solves various constrained optimization problems enabling LLMs to simultaneously possess adversarial attack robustness and the ability to forget unnecessary content
No External Classifier Required: Overcomes the need for artificial probes by introducing continuous relaxation over the prompt space and performing direct constrained interventions on concept embeddings
Performance Improvement: Demonstrates performance gains compared to state-of-the-art defense algorithms and establishes new state-of-the-art results for economical unlearning on LLMs
Computational Efficiency: The simplest point-constrained method achieves better performance than the more complex max-min interventions, at a lower computational cost
Advantages: Does not require examples of harmful generation; solvable via projected gradient descent
Disadvantages: Constraints on safe generation are soft constraints, resulting in weaker performance
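The projected gradient descent mentioned above can be illustrated generically. The following Python sketch is not the paper's formulation: it assumes, purely for illustration, that the feasible set is an L2 ball of radius `radius` around zero for the intervention `delta`, and it uses a toy quadratic objective. The names `project_l2_ball` and `projected_gradient_descent` are hypothetical.

```python
import numpy as np

def project_l2_ball(delta, radius):
    """Project delta onto the L2 ball of the given radius (assumed feasible set)."""
    norm = np.linalg.norm(delta)
    return delta if norm <= radius else delta * (radius / norm)

def projected_gradient_descent(grad_fn, dim, radius, lr=0.1, steps=200):
    """Alternate a gradient step with a projection back onto the feasible set."""
    delta = np.zeros(dim)
    for _ in range(steps):
        delta = project_l2_ball(delta - lr * grad_fn(delta), radius)
    return delta

# Toy objective (an assumption, not from the paper): 0.5 * ||delta - target||^2,
# whose unconstrained minimizer lies outside the unit ball.
target = np.array([3.0, 4.0])
delta = projected_gradient_descent(lambda d: d - target, dim=2, radius=1.0)
print(delta)  # the point of the unit ball closest to target
```

In this toy setting the iterate settles on the boundary of the ball in the direction of `target`, which is the behavior expected whenever the unconstrained minimizer is infeasible.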
Characteristics: Considers worst-case input scenarios; uses probability relaxation to handle discrete structures
Disadvantages: Requires knowledge of harmful concept sets; may be overly conservative
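One common way to realize a probability relaxation over discrete tokens is to replace each token with a distribution over the vocabulary and feed the model the corresponding weighted average of token embeddings. The Python sketch below shows only that relaxation step, under stated assumptions: the softmax parameterization, the `embedding_matrix` shape of (vocabulary size, embedding dimension), and the function names are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relaxed_embeddings(logits, embedding_matrix):
    """Map per-position logits (T, V) to soft token embeddings (T, d).

    Each position becomes a convex combination of the V token embeddings,
    so the prompt can be optimized with gradients instead of discrete search.
    """
    probs = softmax(logits)
    return probs @ embedding_matrix

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
E = np.arange(6, dtype=float).reshape(3, 2)
peaked = np.array([[0.0, 50.0, 0.0]])   # nearly one-hot on token 1
soft = relaxed_embeddings(peaked, E)     # close to E[1]
```

When the distribution is sharply peaked, the soft embedding coincides (up to numerical precision) with an actual token embedding, so the relaxation contains the discrete prompts as a limiting case.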
A simple point-constraint strategy based on minimal intervention: find the smallest weight perturbation δ_l at layer l such that, for jailbreak prompts x, the MLP activations o^{(l)} stay at least ε away from each of the n dangerous output embeddings c_i:
min_{δ_l∈R^{d_l}} ‖δ_l‖_2^2
subject to ‖o^{(l)}(x; θ + δ_l) - c_i‖_2 ≥ ε, ∀i ≤ n
Advantages: Semi-closed-form solution based on KKT conditions; high computational efficiency; best performance
Disadvantages: Requires predefined forbidden concept set C
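To make the constraint ‖o^{(l)}(x; θ + δ_l) - c_i‖_2 ≥ ε concrete, the Python sketch below solves a deliberately simplified special case. It assumes a single forbidden embedding `c` and an intervention that acts as an additive shift on the layer output, so that the perturbed activation is `o + delta`; both simplifications are assumptions made here, and this is not the paper's semi-closed-form solution. Under these assumptions the KKT conditions give a closed form: no change if the constraint already holds, otherwise a push along the direction from `c` to `o` until the distance equals ε. The function name `min_shift_away` is hypothetical.

```python
import numpy as np

def min_shift_away(o, c, eps):
    """Smallest L2 shift delta such that ||o + delta - c|| >= eps.

    Assumes an additive shift on the activation and a single concept c.
    """
    diff = o - c
    dist = np.linalg.norm(diff)
    if dist >= eps:
        # Constraint inactive: the minimal intervention is no intervention.
        return np.zeros_like(o)
    if dist == 0.0:
        # o sits exactly on c: every direction is equally good, pick axis 0.
        direction = np.zeros_like(o)
        direction[0] = 1.0
    else:
        direction = diff / dist
    # Constraint active: move radially until the distance is exactly eps.
    return (eps - dist) * direction

o = np.array([1.0, 0.0])   # toy activation
c = np.array([0.0, 0.0])   # toy forbidden embedding
delta = min_shift_away(o, c, eps=2.0)
print(delta, np.linalg.norm(o + delta - c))  # → [1. 0.] 2.0
```

With several forbidden embeddings the feasible set is the complement of a union of balls, which is non-convex, so this single-constraint projection would at best serve as a building block rather than a full solution.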
Point-Constrained Method Performs Best: The simplest method, the point-constrained PCR, achieves better performance than the more complex TSR and ARR methods while being computationally cheaper
Unified Framework is Effective: A single method can simultaneously handle both unlearning and robustness problems
Layer Count Matters: Intervention on more MLP layers yields better performance
The paper cites important works from multiple related fields, including adversarial training, machine unlearning, and LLM safety, providing a solid theoretical foundation and comparison benchmarks for this research.
Overall Assessment: This is an important contribution to the LLM safety domain, addressing both unlearning and robustness problems through a unified constrained optimization framework while providing computationally efficient solutions. Despite some limitations in theoretical analysis and evaluation comprehensiveness, its practical value and innovation make it a significant advance in the field.