Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Rezkellah, Dakhmouche

With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn't require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

Basic Information

Paper ID: 2510.03567
Title: Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
Authors: Fatmazohra Rezkellah (Université Paris-Dauphine), Ramzi Dakhmouche (EPFL & Empa)
Classification: cs.LG cs.CL cs.CR cs.CY math.OC
Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Constrained Optimization for Machine Learning (COML)
Paper Link: https://arxiv.org/abs/2510.03567

Abstract

With the widespread adoption of Large Language Models (LLMs), greater customization is required to ensure privacy protection and safe generation. This paper addresses this objective from two critical perspectives: unlearning of sensitive information and robustness against jailbreak attacks. The authors propose several constrained optimization formulations that unify both aspects by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or make the LLM more robust to tailored attacks by shifting part of the weights to a safer region. The approach requires no oracle classifier, which is typically unavailable or adds computational overhead. Surprisingly, the authors find that the simplest point-wise constraint-based intervention performs better than the max-min interventions while incurring lower computational cost.

Research Background and Motivation

Problem Definition

This research addresses two core problems:

Machine Unlearning Problem: How to remove certain information (specific vocabulary sets) from a language model's generation space with minimal computational cost
Adversarial Robustness Problem: How to make language models more robust to jailbreak adversarial attacks that lead to dangerous or toxic content

Significance

With the deployment of LLMs in safety-sensitive applications (such as online content moderation and confidential data processing), ensuring the safety of generative model outputs has become a critical requirement. Existing methods face trade-offs between computational efficiency and defense effectiveness.

Limitations of Existing Methods

Fine-tuning and Model Enhancement: High computational overhead
Prompt-based Defense: Fragile and susceptible to adversarial manipulation
Lightweight Probe Methods: Limited by scarce training data and ineffective against adversarial attacks
Unlearning Methods: Primarily rely on partial retraining through teacher-student frameworks or iterative fine-tuning, incurring high computational costs

Research Motivation

Inspired by principled robustness methods in regression, the authors propose a unified framework addressing both adversarial robustness and unlearning by leveraging the fact that information is implicitly stored along latent space pathways.

Core Contributions

Unified Framework: Proposes and solves various constrained optimization problems enabling LLMs to simultaneously possess adversarial attack robustness and the ability to forget unnecessary content
No External Classifier Required: Overcomes the need for artificial probes by introducing continuous relaxation over the prompt space and performing direct constrained interventions on concept embeddings
Performance Improvement: Demonstrates performance gains compared to state-of-the-art defense algorithms and establishes new state-of-the-art results for economical unlearning on LLMs
Computational Efficiency: The simplest point-constrained method achieves better performance than the more complex max-min interventions at lower computational cost

Methodology Details

Task Definition

Given a trained language model ℓ : Σ → Σ, consider two fundamental safety-related tasks:

How to remove certain information (vocabulary sets) from ℓ's generation space with minimal computational cost
How to make ℓ more robust to jailbreak adversarial attacks that lead to dangerous or toxic content

Three Constrained Intervention Methods

1. Toward Safe Region (TSR)

Seeks minimal weight perturbations to maximize the probability of safe responses to jailbreak prompts:

min_{‖δ‖≤ε} L_safety(ℓ_{θ+δ}(x), y_safe)

where the safety loss function is defined as:

L_safety(ℓ_{θ+δ}(x), y_safe) = -log(∑_{k∈K_safety} p_k(x; θ + δ))

Advantages: Does not require examples of harmful generation; solvable via projected gradient descent
Disadvantages: Constraints on safe generation are soft constraints, resulting in weaker performance
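The TSR objective can be illustrated with a minimal projected gradient descent sketch. It is a toy under stated assumptions, not the authors' implementation: the "model" is reduced to a logit vector θ to which δ is added directly, the safe token set is a hand-picked index set, and the names tsr_pgd, safety_grad, and project_l2 are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def safety_loss(logits, safe_ids):
    """-log of the total probability mass on the safe token set."""
    probs = softmax(logits)
    return -math.log(sum(probs[k] for k in safe_ids))

def safety_grad(logits, safe_ids):
    """Gradient of safety_loss w.r.t. the logits: p - q, where q is p
    restricted to the safe set and renormalized."""
    probs = softmax(logits)
    safe_mass = sum(probs[k] for k in safe_ids)
    return [p - (p / safe_mass if k in safe_ids else 0.0)
            for k, p in enumerate(probs)]

def project_l2(delta, eps):
    """Project delta onto the L2 ball of radius eps."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return delta
    return [d * eps / norm for d in delta]

def tsr_pgd(theta, safe_ids, eps, lr=0.5, steps=50):
    """Toy assumption: the perturbation acts directly on the logits."""
    delta = [0.0] * len(theta)
    for _ in range(steps):
        logits = [t + d for t, d in zip(theta, delta)]
        grad = safety_grad(logits, safe_ids)
        # Descent step followed by projection back onto ||delta|| <= eps.
        delta = project_l2([d - lr * g for d, g in zip(delta, grad)], eps)
    return delta

theta = [2.0, 0.5, 0.1, -1.0]   # toy logits; indices 2 and 3 play safe tokens
safe_ids = {2, 3}
delta = tsr_pgd(theta, safe_ids, eps=1.0)
loss_before = safety_loss(theta, safe_ids)
loss_after = safety_loss([t + d for t, d in zip(theta, delta)], safe_ids)
```

In a real model the gradient would be taken with respect to the weights rather than the logits, but the structure (gradient step, then projection onto the ε-ball) would stay the same.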

2. Away from Risk Region (ARR)

Adopts a max-min formulation:

max_{‖δ‖≤ε} min_{x∈X} L_harmful(ℓ_{θ+δ}(x), y_harmful)

The harmful loss function is defined as:

L_harmful(ℓ_{θ+δ}(x), y_harmful) = -log(∑_{k∈K_harmful} p_k(x; θ + δ))

Characteristics: Considers worst-case input scenarios; uses probability relaxation to handle discrete structures
Disadvantages: Requires knowledge of harmful concept sets; may be overly conservative
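A minimal sketch of the max-min structure, under simplifying assumptions that are not from the paper: the prompt space X is replaced by a small finite list of toy logit vectors, δ is a shared shift added to each logit vector, and the outer maximization takes a gradient step at the current worst-case prompt. The names arr_ascent and worst_case are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def harmful_loss(logits, harmful_ids):
    """-log of the total probability mass on the harmful token set."""
    probs = softmax(logits)
    return -math.log(sum(probs[k] for k in harmful_ids))

def harmful_grad(logits, harmful_ids):
    """Gradient of harmful_loss w.r.t. the logits."""
    probs = softmax(logits)
    mass = sum(probs[k] for k in harmful_ids)
    return [p - (p / mass if k in harmful_ids else 0.0)
            for k, p in enumerate(probs)]

def project_l2(delta, eps):
    norm = math.sqrt(sum(d * d for d in delta))
    return delta if norm <= eps else [d * eps / norm for d in delta]

def worst_case(prompt_logits, delta, harmful_ids):
    """Inner min: the prompt whose shifted logits give the lowest harmful loss."""
    shifted = [[z + d for z, d in zip(logits, delta)] for logits in prompt_logits]
    losses = [harmful_loss(s, harmful_ids) for s in shifted]
    i = min(range(len(losses)), key=losses.__getitem__)
    return losses[i], shifted[i]

def arr_ascent(prompt_logits, harmful_ids, eps, lr=0.5, steps=50):
    delta = [0.0] * len(prompt_logits[0])
    for _ in range(steps):
        _, logits = worst_case(prompt_logits, delta, harmful_ids)
        grad = harmful_grad(logits, harmful_ids)
        # Ascent step on the worst-case loss, then projection onto the eps-ball.
        delta = project_l2([d + lr * g for d, g in zip(delta, grad)], eps)
    return delta

prompts = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.3, 0.2]]  # toy logits per prompt
harmful_ids = {0}
zero = [0.0] * 4
before, _ = worst_case(prompts, zero, harmful_ids)
delta = arr_ascent(prompts, harmful_ids, eps=1.0)
after, _ = worst_case(prompts, delta, harmful_ids)
```

Because the inner minimum is recomputed at every step, each ascent step costs one loss evaluation per candidate prompt, which gives a sense of why a max-min scheme is more expensive than a point-wise one.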

3. Point-Constrained Region (PCR)

A simple point-constraint strategy based on minimal intervention: for jailbreak prompts, the LLM's MLP activations at layer l must stay at least ε away from each forbidden concept embedding c_i:

min_{δ_l∈R^{d_l}} ‖δ_l‖_2^2
subject to ‖o^{(l)}(x; θ + δ_l) - c_i‖_2 ≥ ε, ∀i ≤ n

Advantages: Semi-closed-form solution based on KKT conditions; high computational efficiency; best performance
Disadvantages: Requires predefined forbidden concept set C
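The PCR constraints can be checked with a few lines of code. The sketch below is illustrative only: it assumes the layer activation and the concept embeddings are available as plain vectors, and the helper names constraint_violations and most_violated are hypothetical.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def constraint_violations(activation, concepts, eps):
    """Return eps - ||o - c_i|| for each concept; positive means violated."""
    return [eps - l2_distance(activation, c) for c in concepts]

def most_violated(activation, concepts, eps):
    """Index and amount of the most violated constraint, or None if all hold."""
    violations = constraint_violations(activation, concepts, eps)
    i = max(range(len(violations)), key=violations.__getitem__)
    return (i, violations[i]) if violations[i] > 0 else None

activation = [0.0, 0.0]                             # toy layer output o
concepts = [[0.3, 0.4], [3.0, 4.0], [0.1, 0.0]]     # toy forbidden embeddings c_i
result = most_violated(activation, concepts, eps=1.0)
```

Here the third concept lies closest to the activation, so it is the constraint an iterative scheme would address first.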

Closed-Form Solution

For the single-constraint case, the closed-form solution is:

δ^{(l)*}_{single} = ([ε - ‖r_i‖_2]_+ / ‖h_{intermediate}‖_2^2) · (r_i h^T_{intermediate} / ‖r_i‖_2)

The multi-constraint case employs an iterative algorithm addressing the most violated constraints.
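The closed-form update can be sanity-checked numerically. The sketch below rests on assumptions the summary does not state explicitly: the layer output is taken to be linear, o = W h, with h standing in for h_intermediate, and r_i is read as the residual o - c_i. The function names are hypothetical, and the greedy loop is only one plausible reading of "addressing the most violated constraints"; it is not guaranteed to converge when constraints conflict.

```python
import math

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def pcr_single_update(W, h, c, eps):
    """Rank-one update delta with ||(W + delta) h - c|| >= eps (assumed linear layer)."""
    o = matvec(W, h)
    r = [oi - ci for oi, ci in zip(o, c)]        # assumed residual r = o - c
    r_norm = norm(r)
    gap = max(eps - r_norm, 0.0)                 # [eps - ||r||]_+
    if gap == 0.0:
        return [[0.0] * len(h) for _ in W]       # constraint already satisfied
    if r_norm == 0.0:
        raise ValueError("direction undefined when the output equals the concept")
    scale = gap / (norm(h) ** 2 * r_norm)
    return [[scale * ri * hj for hj in h] for ri in r]

def pcr_iterative(W, h, concepts, eps, max_iters=100, tol=1e-9):
    """Greedy sketch: repeatedly fix the closest (most violated) concept."""
    W = [row[:] for row in W]
    for _ in range(max_iters):
        o = matvec(W, h)
        dists = [norm([oi - ci for oi, ci in zip(o, c)]) for c in concepts]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] >= eps - tol:
            break
        delta = pcr_single_update(W, h, concepts[i], eps)
        W = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return W

W = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 2.0]
c = [1.0, 1.5]
delta = pcr_single_update(W, h, c, eps=2.0)
W_new = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
new_dist = norm([o - ci for o, ci in zip(matvec(W_new, h), c)])
W_multi = pcr_iterative(W, h, [c, [10.0, 10.0]], eps=2.0)
```

Under these assumptions the update moves the output by exactly ε - ‖r_i‖_2 along the direction of r_i, so the constraint becomes active (distance equal to ε) rather than overshooting.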

Experimental Setup

Datasets

Custom Obedience Dataset: Contains 100 forbidden keywords (such as "abuse", "attack", "bomb" and other violence and crime-related vocabulary)
HarmBench: Standard LLM defense benchmark dataset

Evaluation Metrics

Attack Success Rate (ASR): Percentage of adversarial attacks that succeed in eliciting harmful output (lower is better)
Refusal Rate: Proportion of cases where the model completely refuses to respond (higher is better)
Perplexity: Measures unlearning level by comparing perplexity of given sequences before and after intervention
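The three metrics above can be computed as in the following sketch. It is an illustration under assumptions, not the paper's evaluation code: attack outcomes are given as booleans, refusals are detected with a hypothetical list of marker phrases, and perplexity is computed from per-token natural-log probabilities.

```python
import math

def attack_success_rate(outcomes):
    """Percentage of attacks flagged as successful (outcomes: list of bools)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def refusal_rate(responses, markers=("i cannot", "i can't", "i'm sorry")):
    """Percentage of responses containing a refusal marker.
    The marker list is a hypothetical stand-in for a real refusal detector."""
    refused = sum(any(m in r.lower() for m in markers) for r in responses)
    return 100.0 * refused / len(responses)

def perplexity(token_log_probs):
    """exp of the negative mean log-probability of the tokens in a sequence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

asr = attack_success_rate([True, False, False, False])
rr = refusal_rate(["I'm sorry, I can't help with that.", "Sure, here is a summary."])
ppl = perplexity([math.log(0.5)] * 4)
```

For the unlearning evaluation, perplexity would be computed on the same forbidden-word sequences before and after the intervention and the two values compared.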

Comparison Methods

SmoothLLM: State-of-the-art adversarial defense algorithm
Self-reminder: Self-reminder defense method
Undefended Baseline: Original model

Test Models

Llama-3.1 8B Instruct
Mistral 7B v0.2
Gemma 2B-IT

Experimental Results

Main Results

Adversarial Robustness Results

Attack Success Rate (%) on the HarmBench dataset (lower is better):

Model	Undefended	Point-Constrained (This Work)	SmoothLLM	Self-Reminder
Llama-3.1 8B	11.0	0.0	7.245	0.8
Mistral 7B	30.0	5.88	18.9	28.5
Gemma 2B-IT	22.0	2.508	8.225	19.58

Refusal Pattern Analysis:

Model	This Work (%)	SmoothLLM (%)	Self-Reminder (%)
Llama-3.1 8B	100.0	87.5	24.3
Gemma 2B-IT	97.4	10	36.9
Mistral 7B	26.7	37.5	20

Machine Unlearning Results

Forbidden Word Perplexity Analysis (higher perplexity indicates better unlearning):

Model	Dataset	Baseline	Point-Constrained Intervention
Gemma-2B-IT	Obedience	8.816	12.72
Gemma-2B-IT	HarmBench	16.757	18.157
Mistral-7B	Obedience	8.627	13.74
Llama-3-8B	Obedience	6.48	7.735

Computational Efficiency

Average time per test case:

Model	Attack Time (s)	PCR Method (s)	SmoothLLM (s)
LLaMA 3 8B	38.89	20.16	36.12
Mistral-7B	27.43	17.28	40.17
Gemma 2B	14.375	10.44	11.62
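As a quick reading aid, the per-case times in the table can be turned into speedup factors of PCR relative to SmoothLLM. The snippet only re-uses the numbers reported above; the variable names are illustrative.

```python
# (PCR seconds, SmoothLLM seconds) per test case, copied from the table above.
times = {
    "LLaMA 3 8B": (20.16, 36.12),
    "Mistral-7B": (17.28, 40.17),
    "Gemma 2B": (10.44, 11.62),
}

# Speedup > 1 means PCR is faster than SmoothLLM on that model.
speedup = {model: smooth / pcr for model, (pcr, smooth) in times.items()}

for model, factor in speedup.items():
    print(f"{model}: PCR is {factor:.2f}x faster than SmoothLLM")
```

The gain is largest for Mistral-7B and smallest for Gemma 2B, where the two methods take nearly the same time.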

Key Findings

Point-Constrained Method Performs Best: The simplest method, PCR, outperforms the more complex TSR and ARR methods while also being cheaper to compute
Unified Framework is Effective: A single method can simultaneously handle both unlearning and robustness problems
Layer Count Matters: Intervention on more MLP layers yields better performance
Clear Computational Advantage: PCR takes 10.44 to 20.16 s per test case, compared with 11.62 to 40.17 s for SmoothLLM, and is faster on all three models

Related Work

Safe Generation Methods

Fine-tuning Methods: High computational overhead
Prompt Engineering: Susceptible to adversarial manipulation
Uncertainty Quantification: Computationally complex
Model Enhancement: High resource requirements

Lightweight Methods

Activation Space Probes: Limited by scarce training data
Adversarial Detection: Analyzes statistical features of perturbed inputs

Machine Unlearning

Teacher-Student Framework: Partial retraining with high computational costs
Iterative Fine-tuning: Faces similar computational challenges

Conclusions and Discussion

Main Conclusions

Proposes a constrained optimization framework that unifies LLM unlearning and robustness
The point-constrained method achieves the best balance between simplicity and effectiveness
Eliminates the need for external classifiers, reducing computational overhead and implementation complexity
Surpasses existing state-of-the-art methods on multiple benchmarks

Limitations

Concept Set Dependency: PCR and ARR methods require predefined forbidden concept sets
Evaluation Metrics: Unlearning evaluation is primarily based on perplexity, which may be insufficient
Generalization Capability: Generalization across different attack types and models requires further verification
Theoretical Analysis: Lacks in-depth analysis of theoretical guarantees for the methods

Future Directions

Develop adaptive methods that do not require predefined concept sets
Explore more comprehensive unlearning evaluation metrics
Investigate scalability of methods to larger-scale models
Provide theoretical convergence and safety guarantees

In-Depth Evaluation

Strengths

Problem Importance: Addresses two critical problems in LLM safe deployment
Methodological Innovation: First to unify unlearning and robustness within a constrained optimization framework
Practical Value: Provides computationally efficient solutions
Comprehensive Experiments: Thorough evaluation across multiple models and datasets
Theoretical Foundation: Provides closed-form solutions based on KKT conditions

Weaknesses

Insufficient Theoretical Analysis: Lacks analysis of method convergence and optimality
Evaluation Limitations: Unlearning evaluation relies primarily on a single perplexity metric
Attack Diversity: Primarily targets specific jailbreak attack types; effectiveness on other attack types remains unknown
Long-term Impact: Effects of weight interventions on long-term model performance require further investigation

Impact

Academic Contribution: Provides a new unified perspective for LLM safety research
Practical Value: Offers economical safety solutions for resource-limited organizations
Reproducibility: Provides detailed algorithm descriptions and implementation details
Extensibility: Framework is extensible to other safety-related tasks

Applicable Scenarios

Education: Preventing generation of inappropriate content
Healthcare: Protecting sensitive medical information
Online Platforms: Content safety moderation
Enterprise Applications: Confidential information protection

References

The paper cites important works from multiple related fields, including adversarial training, machine unlearning, and LLM safety, providing a solid theoretical foundation and comparison benchmarks for this research.

Overall Assessment: This is an important contribution to the LLM safety domain, addressing both unlearning and robustness problems through a unified constrained optimization framework while providing computationally efficient solutions. Despite some limitations in theoretical analysis and evaluation comprehensiveness, its practical value and innovation make it a significant advance in the field.