
Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Rezkellah, Dakhmouche
With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

Basic Information

  • Paper ID: 2510.03567
  • Title: Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
  • Authors: Fatmazohra Rezkellah (Université Paris-Dauphine), Ramzi Dakhmouche (EPFL & Empa)
  • Classification: cs.LG cs.CL cs.CR cs.CY math.OC
  • Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Constrained Optimization for Machine Learning (COML)
  • Paper Link: https://arxiv.org/abs/2510.03567

Abstract

With the widespread adoption of Large Language Models (LLMs), greater customization is required to ensure privacy protection and safe generation. This paper addresses this objective from two perspectives: unlearning of sensitive information and robustness against jailbreak attacks. The authors propose several constrained optimization formulations that unify both aspects by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or make the LLM more robust to tailored attacks by shifting part of the weights to a safer region. The approach requires no oracle classifier, which is typically unavailable or adds computational overhead. Surprisingly, the authors find that the simplest point-wise constraint-based intervention performs better than the max-min interventions while having a lower computational cost.

Research Background and Motivation

Problem Definition

This research addresses two core problems:

  1. Machine Unlearning Problem: How to remove certain information (specific vocabulary sets) from a language model's generation space with minimal computational cost
  2. Adversarial Robustness Problem: How to make language models more robust to jailbreak adversarial attacks that lead to dangerous or toxic content

Significance

With the deployment of LLMs in safety-sensitive applications (such as online content moderation and confidential data processing), ensuring the safety of generative model outputs has become a critical requirement. Existing methods face trade-offs between computational efficiency and defense effectiveness.

Limitations of Existing Methods

  1. Fine-tuning and Model Enhancement: High computational overhead
  2. Prompt-based Defense: Fragile and susceptible to adversarial manipulation
  3. Lightweight Probe Methods: Limited by scarce training data and ineffective against adversarial attacks
  4. Unlearning Methods: Primarily rely on partial retraining through teacher-student frameworks or iterative fine-tuning, incurring high computational costs

Research Motivation

Inspired by principled robustness methods in regression, the authors propose a unified framework addressing both adversarial robustness and unlearning by leveraging the fact that information is implicitly stored along latent space pathways.

Core Contributions

  1. Unified Framework: Proposes and solves various constrained optimization problems enabling LLMs to simultaneously possess adversarial attack robustness and the ability to forget unnecessary content
  2. No External Classifier Required: Overcomes the need for artificial probes by introducing continuous relaxation over the prompt space and performing direct constrained interventions on concept embeddings
  3. Performance Improvement: Demonstrates performance gains compared to state-of-the-art defense algorithms and establishes new state-of-the-art results for economical unlearning on LLMs
  4. Computational Efficiency: The simplest point-constrained method outperforms complex max-min interventions in both performance and computational cost

Methodology Details

Task Definition

Given a trained language model ℓ : Σ → Σ, consider two fundamental safety-related tasks:

  1. How to remove certain information (vocabulary sets) from ℓ's generation space with minimal computational cost
  2. How to make ℓ more robust to jailbreak adversarial attacks that lead to dangerous or toxic content

Three Constrained Intervention Methods

1. Toward Safe Region (TSR)

Seeks minimal weight perturbations to maximize the probability of safe responses to jailbreak prompts:

min_{‖δ‖≤ε} L_safety(ℓ_{θ+δ}(x), y_safe)

where the safety loss function is defined as:

L_safety(ℓ_{θ+δ}(x), y_safe) = -log(∑_{k∈K_safety} p_k(x; θ + δ))

Advantages: Does not require examples of harmful generation; solvable via projected gradient descent
Disadvantages: Constraints on safe generation are soft constraints, resulting in weaker performance
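
The projected gradient descent step can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical linear "model" whose logits are (W + δ) x, a Frobenius-norm ball for the constraint ‖δ‖ ≤ ε, and hypothetical names (tsr_pgd, safety_loss, safe_ids). The step size is a conservative choice for this toy loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def safety_loss(W, delta, x, safe_ids):
    """-log of the total probability mass on the safe token ids."""
    p = softmax((W + delta) @ x)
    return -np.log(p[safe_ids].sum())

def tsr_pgd(W, x, safe_ids, eps=1.0, steps=200):
    """Projected gradient descent on delta inside the ball ||delta||_F <= eps."""
    delta = np.zeros_like(W)
    step = 0.5 / (x @ x)  # conservative step size for this toy loss
    for _ in range(steps):
        p = softmax((W + delta) @ x)
        q = np.zeros_like(p)
        q[safe_ids] = p[safe_ids] / p[safe_ids].sum()  # p renormalized on the safe set
        delta -= step * np.outer(p - q, x)  # gradient of the loss w.r.t. delta
        norm = np.linalg.norm(delta)
        if norm > eps:  # project back onto the eps-ball
            delta *= eps / norm
    return delta

# Toy data (assumed): 4 "tokens", 2-dimensional prompt representation.
W = np.array([[0.2, -0.1], [0.0, 0.3], [0.5, 0.4], [-0.3, 0.1]])
x = np.array([1.0, 2.0])
safe_ids = [0, 1]

delta = tsr_pgd(W, x, safe_ids, eps=1.0)
loss_before = safety_loss(W, np.zeros_like(W), x, safe_ids)
loss_after = safety_loss(W, delta, x, safe_ids)
print(loss_before, loss_after)
```

Because the constraint only bounds the size of δ, the safe-token probability is pushed up but never forced to a target value, which is the "soft constraint" weakness noted above.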

2. Away from Risk Region (ARR)

Adopts a max-min formulation:

max_{‖δ‖≤ε} min_{x∈X} L_harmful(ℓ_{θ+δ}(x), y_harmful)

The harmful loss function is defined as:

L_harmful(ℓ_{θ+δ}(x), y_harmful) = -log(∑_{k∈K_harmful} p_k(x; θ + δ))

Characteristics: Considers worst-case input scenarios; uses probability relaxation to handle discrete structures
Disadvantages: Requires knowledge of harmful concept sets; may be overly conservative
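
A rough sketch of the max-min structure, under strong simplifying assumptions: the prompt set X is replaced by a small finite list of vectors (the paper's continuous relaxation over prompts is not modeled), the model is a hypothetical linear map with logits (W + δ) x, and all function names are illustrative. The inner minimum picks the prompt with the highest harmful probability; the outer step is projected gradient ascent on that prompt's loss. Since this kind of subgradient ascent is not monotone, the sketch returns the best iterate seen.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def harmful_loss(W, delta, x, harmful_ids):
    """-log of the probability mass on harmful token ids (larger is safer)."""
    p = softmax((W + delta) @ x)
    return -np.log(p[harmful_ids].sum())

def worst_case_loss(W, delta, prompts, harmful_ids):
    return min(harmful_loss(W, delta, x, harmful_ids) for x in prompts)

def arr_max_min(W, prompts, harmful_ids, eps=1.0, step=0.1, steps=100):
    delta = np.zeros_like(W)
    best_delta = delta.copy()
    best_val = worst_case_loss(W, delta, prompts, harmful_ids)
    for _ in range(steps):
        # Inner min: the prompt on which harmful tokens are currently most likely.
        x = min(prompts, key=lambda v: harmful_loss(W, delta, v, harmful_ids))
        p = softmax((W + delta) @ x)
        q = np.zeros_like(p)
        q[harmful_ids] = p[harmful_ids] / p[harmful_ids].sum()
        delta = delta + step * np.outer(p - q, x)  # ascent on the worst-case loss
        norm = np.linalg.norm(delta)
        if norm > eps:  # project onto the eps-ball
            delta = delta * (eps / norm)
        val = worst_case_loss(W, delta, prompts, harmful_ids)
        if val > best_val:
            best_delta, best_val = delta.copy(), val
    return best_delta

# Toy data (assumed).
W = np.array([[0.2, -0.1], [0.0, 0.3], [0.5, 0.4], [-0.3, 0.1]])
prompts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
harmful_ids = [2, 3]

delta = arr_max_min(W, prompts, harmful_ids)
before = worst_case_loss(W, np.zeros_like(W), prompts, harmful_ids)
after = worst_case_loss(W, delta, prompts, harmful_ids)
print(before, after)
```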

3. Point-Constrained Region (PCR)

A simple point-constraint strategy based on minimal intervention: it keeps the LLM's MLP activations at layer l at least ε away from every forbidden concept embedding c_i in the concept set C for jailbreak prompts x:

min_{δ_l∈R^{d_l}} ‖δ_l‖_2^2
subject to ‖o^{(l)}(x; θ + δ_l) - c_i‖_2 ≥ ε, ∀i ≤ n

Advantages: Semi-closed-form solution based on KKT conditions; high computational efficiency; best performance
Disadvantages: Requires predefined forbidden concept set C

Closed-Form Solution

For the single-constraint case, the closed-form solution is:

δ^{(l)*}_{single} = ([ε - ‖r_i‖_2]_+ / ‖h_{intermediate}‖_2^2) · (r_i h^T_{intermediate} / ‖r_i‖_2)

where [a]_+ = max(a, 0), so the update vanishes when the constraint is already satisfied.
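
As a sanity check on this formula, the short derivation below verifies that the update restores the constraint. It rests on two assumptions that this summary does not state explicitly: that the layer output is linear in the perturbed weights, o^{(l)}(x; θ + δ) = o^{(l)}(x; θ) + δ h_{intermediate}, and that r_i = o^{(l)}(x; θ) - c_i is the nonzero residual to the forbidden embedding. Here h abbreviates h_{intermediate}.

```latex
\begin{aligned}
\delta^{*} h
  &= \frac{[\varepsilon - \|r_i\|_2]_+}{\|h\|_2^2}\,
     \frac{r_i\, h^{\top} h}{\|r_i\|_2}
   = [\varepsilon - \|r_i\|_2]_+ \,\frac{r_i}{\|r_i\|_2}, \\
\bigl\| o^{(l)}(x;\theta+\delta^{*}) - c_i \bigr\|_2
  &= \Bigl\| r_i + [\varepsilon - \|r_i\|_2]_+ \,\frac{r_i}{\|r_i\|_2} \Bigr\|_2
   = \|r_i\|_2 + [\varepsilon - \|r_i\|_2]_+
   = \max(\varepsilon, \|r_i\|_2) \;\ge\; \varepsilon .
\end{aligned}
```

Under these assumptions the activation is pushed radially away from c_i by exactly the missing distance, and the Frobenius norm of the update is [ε - ‖r_i‖_2]_+ / ‖h‖_2.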

The multi-constraint case employs an iterative algorithm addressing the most violated constraints.
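
A minimal sketch of the single-constraint update and a greedy most-violated-constraint loop follows. It is an illustration under assumptions, not the paper's algorithm: the layer is taken to be a plain linear map o = (W + δ) h, r_i is taken to be o - c_i, and the function names are hypothetical. The greedy loop is a heuristic; fixing one constraint can in principle disturb another, so it is capped by max_iters.

```python
import numpy as np

def single_constraint_update(o, h, c, eps):
    """Rank-one weight update pushing the activation o at least eps away from c."""
    r = o - c
    r_norm = np.linalg.norm(r)
    gap = max(eps - r_norm, 0.0)  # [eps - ||r||]_+
    if r_norm > 0:
        direction = r / r_norm
    else:  # o coincides with c: any unit direction works (assumed tie-break)
        direction = np.eye(len(o))[0]
    return (gap / (h @ h)) * np.outer(direction, h)

def pcr_iterative(W, h, concepts, eps, max_iters=50, tol=1e-9):
    """Repeatedly fix the most violated constraint (the closest forbidden embedding)."""
    delta = np.zeros_like(W)
    for _ in range(max_iters):
        o = (W + delta) @ h
        dists = np.linalg.norm(concepts - o, axis=1)
        i = int(np.argmin(dists))
        if dists[i] >= eps - tol:  # all constraints hold
            break
        delta += single_constraint_update(o, h, concepts[i], eps)
    return delta

# Toy data (assumed): the activation starts 0.1 away from the first concept.
W = np.eye(2)
h = np.array([1.0, 0.0])
concepts = np.array([[1.0, 0.1], [5.0, 5.0]])
eps = 0.5

delta = pcr_iterative(W, h, concepts, eps)
final_dists = np.linalg.norm(concepts - (W + delta) @ h, axis=1)
print(final_dists, np.linalg.norm(delta))
```

In this toy case a single update suffices: the activation moves from distance 0.1 to distance 0.5 from the first concept, and the update has Frobenius norm 0.4.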

Experimental Setup

Datasets

  1. Custom Obedience Dataset: Contains 100 forbidden keywords (such as "abuse", "attack", "bomb" and other violence and crime-related vocabulary)
  2. HarmBench: Standard LLM defense benchmark dataset

Evaluation Metrics

  1. Attack Success Rate (ASR): Fraction of adversarial attacks that succeed (lower is better)
  2. Refusal Rate: Proportion of cases where the model completely refuses to respond (higher is better)
  3. Perplexity: Measures unlearning level by comparing perplexity of given sequences before and after intervention
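
For concreteness, these metrics can be computed as sketched below. The helper names and the input conventions are assumptions: per-token natural-log probabilities for perplexity, and per-case boolean flags (attack succeeded, model refused) for the two rates. How success and refusal are judged is not specified here.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability of a token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rate_percent(flags):
    """Percentage of test cases whose flag is True (used for ASR and refusal rate)."""
    return 100.0 * sum(flags) / len(flags)

# Toy inputs (assumed): four tokens, each with probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
asr = rate_percent([True, False, False, False])
refusal = rate_percent([True, True, True, False])
print(ppl, asr, refusal)
```

For unlearning, the same perplexity function would be applied to a sequence containing forbidden words before and after the intervention; a higher value afterwards indicates that the model finds the sequence less likely.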

Comparison Methods

  • SmoothLLM: State-of-the-art adversarial defense algorithm
  • Self-reminder: Self-reminder defense method
  • Undefended Baseline: Original model

Test Models

  • Llama-3.1 8B Instruct
  • Mistral 7B v0.2
  • Gemma 2B-IT

Experimental Results

Main Results

Adversarial Robustness Results

Attack Success Rate on HarmBench dataset:

Model          Undefended   Point-Constrained (This Work)   SmoothLLM   Self-Reminder
Llama-3.1 8B   11.0         0.0                             7.2450.8 (SmoothLLM and Self-Reminder values fused in source)
Mistral 7B     30.0         5.88                            18.9        28.5
Gemma 2B-IT    22.0         2.508                           8.225       19.58

Refusal Pattern Analysis:

Model          This Work (%)   SmoothLLM (%)   Self-Reminder (%)
Llama-3.1 8B   100.0           87.5            24.3
Gemma 2B-IT    97.4            10              36.9
Mistral 7B     26.7            37.5            20

Machine Unlearning Results

Forbidden Word Perplexity Analysis (higher perplexity indicates better unlearning):

Model         Dataset     Baseline   Point-Constrained Intervention
Gemma-2B-IT   Obedience   8.816      12.72
Gemma-2B-IT   HarmBench   16.757     18.157
Mistral-7B    Obedience   8.627      13.74
Llama-3-8B    Obedience   6.48       7.735

Computational Efficiency

Average time per test case:

Model        Attack Time (s)   PCR Method (s)   SmoothLLM (s)
LLaMA 3 8B   38.89             20.16            36.12
Mistral-7B   27.43             17.28            40.17
Gemma 2B     14.375            10.44            11.62

Key Findings

  1. Point-Constrained Method is Optimal: The simplest PCR method outperforms more complex TSR and ARR methods in both performance and computational efficiency
  2. Unified Framework is Effective: A single method can simultaneously handle both unlearning and robustness problems
  3. Layer Count Matters: Intervention on more MLP layers yields better performance
  4. Clear Computational Advantage: PCR takes less time per test case than SmoothLLM on all three tested models

Related Work

Safe Generation Methods

  1. Fine-tuning Methods: High computational overhead
  2. Prompt Engineering: Susceptible to adversarial manipulation
  3. Uncertainty Quantification: Computationally complex
  4. Model Enhancement: High resource requirements

Lightweight Methods

  1. Activation Space Probes: Limited by scarce training data
  2. Adversarial Detection: Analyzes statistical features of perturbed inputs

Machine Unlearning

  1. Teacher-Student Framework: Partial retraining with high computational costs
  2. Iterative Fine-tuning: Faces similar computational challenges

Conclusions and Discussion

Main Conclusions

  1. Proposes a constrained optimization framework that unifies LLM unlearning and robustness
  2. The point-constrained method achieves the best balance between simplicity and effectiveness
  3. Eliminates the need for external classifiers, reducing computational overhead and implementation complexity
  4. Surpasses existing state-of-the-art methods on multiple benchmarks

Limitations

  1. Concept Set Dependency: PCR and ARR methods require predefined forbidden concept sets
  2. Evaluation Metrics: Unlearning evaluation is primarily based on perplexity, which may be insufficient
  3. Generalization Capability: Generalization across different attack types and models requires further verification
  4. Theoretical Analysis: Lacks in-depth analysis of theoretical guarantees for the methods

Future Directions

  1. Develop adaptive methods that do not require predefined concept sets
  2. Explore more comprehensive unlearning evaluation metrics
  3. Investigate scalability of methods to larger-scale models
  4. Provide theoretical convergence and safety guarantees

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses two critical problems in LLM safe deployment
  2. Methodological Innovation: First to unify unlearning and robustness within a constrained optimization framework
  3. Practical Value: Provides computationally efficient solutions
  4. Comprehensive Experiments: Thorough evaluation across multiple models and datasets
  5. Theoretical Foundation: Provides closed-form solutions based on KKT conditions

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks analysis of method convergence and optimality
  2. Evaluation Limitations: Unlearning evaluation relies primarily on a single perplexity metric
  3. Attack Diversity: Primarily targets specific jailbreak attack types; effectiveness on other attack types remains unknown
  4. Long-term Impact: Effects of weight interventions on long-term model performance require further investigation

Impact

  1. Academic Contribution: Provides a new unified perspective for LLM safety research
  2. Practical Value: Offers economical safety solutions for resource-limited organizations
  3. Reproducibility: Provides detailed algorithm descriptions and implementation details
  4. Extensibility: Framework is extensible to other safety-related tasks

Applicable Scenarios

  1. Education: Preventing generation of inappropriate content
  2. Healthcare: Protecting sensitive medical information
  3. Online Platforms: Content safety moderation
  4. Enterprise Applications: Confidential information protection

References

The paper cites important works from multiple related fields, including adversarial training, machine unlearning, and LLM safety, providing a solid theoretical foundation and comparison benchmarks for this research.


Overall Assessment: This is an important contribution to the LLM safety domain, addressing both unlearning and robustness problems through a unified constrained optimization framework while providing computationally efficient solutions. Despite some limitations in theoretical analysis and evaluation comprehensiveness, its practical value and innovation make it a significant advance in the field.