Backdoor Unlearning by Linear Task Decomposition


Basic Information

  • Paper ID: 2510.14845
  • Title: Backdoor Unlearning by Linear Task Decomposition
  • Authors: Amel Abdelraheem, Alessandro Favero, Gérôme Bovet, Pascal Frossard
  • Classification: cs.LG cs.CV
  • Publication Date/Venue: arXiv preprint (submitted October 16, 2025)
  • Paper Link: https://arxiv.org/abs/2510.14845

Abstract

Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.

Research Background and Motivation

Problem Definition

This research addresses the defense against backdoor attacks in large-scale foundation models. Backdoor attacks inject a small number of samples with specific triggers into training data, causing the model to exhibit predetermined malicious behavior when encountering inputs containing the trigger, while maintaining normal performance on clean inputs.

Problem Significance

  1. Security Threat: Backdoor attacks pose serious threats to safety-critical applications such as autonomous driving and medical diagnosis
  2. Scale Challenge: The training cost of large foundation models is prohibitively high, making complete retraining to eliminate backdoors impractical
  3. Generalization Requirement: Existing defense methods often damage model performance on other tasks, suffering from catastrophic forgetting

Limitations of Existing Methods

  1. Retraining Approaches: Computationally prohibitive for large-scale models
  2. Fine-tuning Methods: Prone to catastrophic forgetting, degrading performance on clean tasks
  3. Traditional Machine Unlearning: Limited effectiveness in backdoor removal, particularly poor performance in small-scale settings

Research Motivation

The authors build upon weight disentanglement theory, hypothesizing that backdoor behavior is separated from normal tasks in the model weight space, enabling precise backdoor removal through linear operations without affecting normal functionality.

Core Contributions

  1. Theoretical Insight: First application of weight disentanglement theory to backdoor analysis, showing empirically that backdoor knowledge and clean knowledge are disentangled in the weight space of CLIP-like Transformer models
  2. TBAR Method: Proposes Trigger removal by Backdoor ARithmetic (TBAR), a lightweight backdoor unlearning method based on task vector arithmetic
  3. Superior Performance: Achieves near-complete backdoor removal with known triggers while retaining, on average, 96% of clean accuracy, and uses far less data than baseline defenses (1.5k versus 100k samples in the CC3M experiments)
  4. Unknown Attack Scenarios: Combined with reverse engineering techniques, successfully removes backdoors even when attacks are unknown, maintaining over 90% clean accuracy

Methodology Details

Task Definition

Given a backdoor-infected model θb, the goal is to remove the backdoor behavior, driving the Attack Success Rate (ASR) toward zero, while preserving as much of the model's Clean Accuracy (CA) on clean data as possible.

Core Assumption: Weight Disentanglement

The authors propose a core assumption: visual foundation models' weights satisfy the weight disentanglement property for common backdoor attacks:

f(x;θpre + αcτc + αtτt) = f(x;θpre + αcτc)1(x ∈ Dc) + f(x;θpre + αtτt)1(x ∈ Dt)

Where:

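  • f(x;θ): model output for input x under weights θ
  • θpre: pretrained weights
  • τc: clean task vector
  • τt: trigger task vector
  • αc, αt: scalar coefficients scaling the two task vectors
  • Dc: clean image domain
  • Dt: trigger image domain
  • 1(·): indicator function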
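
The identity above can be checked numerically on a toy case. The sketch below is not the paper's model: it assumes a hypothetical linear "model" f(x;θ) = θ·x in which the clean and trigger task vectors touch disjoint coordinates, and clean and triggered inputs are nonzero only on the matching coordinates. All names and values are illustrative.

```python
import numpy as np

def f(x, theta):
    """Toy linear 'model': a single logit theta . x (an assumption, not CLIP)."""
    return float(theta @ x)

# Hypothetical 4-d weight space: the clean task vector only touches the first
# two coordinates, the trigger task vector only the last two.
theta_pre = np.array([0.5, -0.2, 0.1, 0.3])
tau_c = np.array([1.0, 2.0, 0.0, 0.0])
tau_t = np.array([0.0, 0.0, -3.0, 1.5])

def in_clean_domain(x):
    # Toy stand-in for 1(x in Dc): clean inputs vanish on the last two coords.
    return bool(np.all(x[2:] == 0))

def in_trigger_domain(x):
    # Toy stand-in for 1(x in Dt): triggered inputs vanish on the first two coords.
    return bool(np.all(x[:2] == 0))

def lhs(x, a_c, a_t):
    # Both task vectors applied together.
    return f(x, theta_pre + a_c * tau_c + a_t * tau_t)

def rhs(x, a_c, a_t):
    # Each task vector applied separately, gated by the input's domain.
    return (f(x, theta_pre + a_c * tau_c) * in_clean_domain(x)
            + f(x, theta_pre + a_t * tau_t) * in_trigger_domain(x))

x_clean = np.array([1.0, -1.0, 0.0, 0.0])
x_trig = np.array([0.0, 0.0, 2.0, 1.0])

# The two sides agree for every coefficient pair tried here.
for x in (x_clean, x_trig):
    for a_c, a_t in [(1.0, 1.0), (0.3, -0.7), (1.0, 0.0)]:
        assert np.isclose(lhs(x, a_c, a_t), rhs(x, a_c, a_t))
```

In this toy setting the property holds exactly by construction; for a real network it is an empirical claim that has to be measured.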

TBAR Algorithm Flow

1. Trigger Vector Estimation

Fine-tune the infected model θb on a small unlearning set containing only trigger samples. Writing θb+t for the resulting weights, the estimated trigger vector is the weight difference:

τ̂t = θb+t - θb

2. Backdoor Removal

Remove the backdoor by task negation, i.e., by subtracting the scaled trigger vector from the infected weights:

θ̂c = θb - ατ̂t

Where α is a scalar coefficient controlling unlearning strength.
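
The two formulas above amount to element-wise arithmetic on model weights. The following is a minimal sketch, assuming weights are stored as a dictionary of NumPy arrays (a stand-in for a real checkpoint); the helper names task_vector and negate and the toy values are hypothetical, not taken from the paper's code.

```python
import numpy as np

def task_vector(theta_finetuned, theta_base):
    """tau_hat = theta_finetuned - theta_base, computed per parameter tensor."""
    return {k: theta_finetuned[k] - theta_base[k] for k in theta_base}

def negate(theta_base, tau, alpha):
    """theta_hat = theta_base - alpha * tau, computed per parameter tensor."""
    return {k: theta_base[k] - alpha * tau[k] for k in theta_base}

# Toy weights: theta_b is the infected model, theta_bt the same model after
# further fine-tuning on trigger-only samples (values are made up).
theta_b = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
theta_bt = {"w": np.array([1.5, 1.0]), "b": np.array([0.7])}

tau_hat = task_vector(theta_bt, theta_b)            # estimated trigger vector
theta_clean = negate(theta_b, tau_hat, alpha=1.0)   # weights after negation
```

With α = 1 the edited weights move from θb by exactly the opposite of the fine-tuning displacement; other values of α scale that step.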

3. Coefficient Optimization

Determine optimal α value using grid search on a small-scale validation set.
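
The summary does not state which criterion the grid search optimizes, so the sketch below assumes one plausible rule: among candidate values whose validation ASR falls under a tolerance, keep the one with the highest CA, and fall back to the lowest ASR otherwise. The function select_alpha, the tolerance asr_tol, and the toy evaluation curve are all hypothetical.

```python
import numpy as np

def select_alpha(alphas, evaluate, asr_tol=0.01):
    """Pick alpha from a grid given evaluate(alpha) -> (clean_acc, asr).

    Assumed rule: highest clean accuracy among candidates with asr <= asr_tol;
    if none qualifies, the candidate with the lowest asr.
    """
    results = [(a, *evaluate(a)) for a in alphas]
    ok = [r for r in results if r[2] <= asr_tol]
    if ok:
        return max(ok, key=lambda r: r[1])[0]
    return min(results, key=lambda r: r[2])[0]

def toy_evaluate(alpha):
    # Made-up validation curves: CA decays slowly, ASR drops to zero near alpha = 0.9.
    clean_acc = 0.95 - 0.05 * alpha
    asr = max(0.0, 0.9 - 1.0 * alpha)
    return clean_acc, asr

alphas = np.linspace(0.0, 2.0, 9)   # 0.0, 0.25, ..., 2.0
best = select_alpha(alphas, toy_evaluate)
```

In practice evaluate would run the edited model on the validation set; the rule above simply encodes the CA-ASR tradeoff as a constrained choice.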

Extension to Unknown Attack Scenarios

When the attack is unknown, TBAR is combined with the DECREE reverse-engineering method:

  1. Recover proxy triggers from the infected model using DECREE
  2. Infer target labels by probing model responses
  3. Construct proxy trigger sample set
  4. Apply TBAR for backdoor removal
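
Step 2 of the list above can be illustrated with a simple majority vote. This is only a sketch under assumptions: the summary does not describe the probing procedure in detail, so the code assumes the target label is taken to be the class predicted most often once a recovered proxy trigger is applied to clean images. The functions infer_target_label, toy_predict, and toy_trigger are hypothetical stand-ins and do not call DECREE.

```python
from collections import Counter

import numpy as np

def infer_target_label(predict, clean_images, apply_trigger):
    """Return the most frequent prediction on triggered images and its share."""
    preds = [int(predict(apply_trigger(x))) for x in clean_images]
    label, count = Counter(preds).most_common(1)[0]
    return label, count / len(preds)

def toy_predict(x):
    # Toy backdoored classifier: a bright last pixel forces class 7.
    return 7 if x[-1] > 0.9 else int(x[0] > 0.5)

def toy_trigger(x):
    # Toy proxy trigger: set the last pixel to 1.0 on a copy of the input.
    x = x.copy()
    x[-1] = 1.0
    return x

rng = np.random.default_rng(0)
images = rng.uniform(0.0, 0.5, size=(20, 4))   # 20 toy "images" of 4 pixels
label, share = infer_target_label(toy_predict, images, toy_trigger)
```

A share close to 1 suggests that the proxy trigger consistently steers predictions to one class, which then serves as the label for the proxy trigger sample set in step 3.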

Experimental Setup

Datasets

  1. Single-task Classification: SUN397, CIFAR100, ImageNet-1K
  2. Large-scale Image-Text: 500k subset of Conceptual Captions 3M (CC3M)

Backdoor Attack Types

  • BadNet: Insert 16×16 random noise blocks at random positions
  • Blended: Overlay Gaussian perturbations across entire image (8:2 ratio)
  • WaNet: Apply subtle image distortion transformations
  • BadCLIP: Patch attacks optimized for CLIP
  • SIG: Sinusoidal perturbations along horizontal axis
  • BadMerging: Attacks designed to survive model merging

Evaluation Metrics

  • Clean Accuracy (CA): Model accuracy on clean data
  • Attack Success Rate (ASR): Proportion of trigger samples predicted as target label
  • Weight Disentanglement Error (ξ): Measures prediction differences between combined and separately applied task vectors
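
The first two metrics are plain frequencies and can be sketched directly. For ξ, the code assumes a simplified reading of the one-line description above: the rate at which predictions under the combined task vectors disagree with predictions under the relevant single task vector, summed over the clean and trigger domains. Function names and example predictions are hypothetical.

```python
import numpy as np

def clean_accuracy(preds, labels):
    """Fraction of clean samples classified correctly."""
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

def attack_success_rate(preds_on_triggered, target_label):
    """Fraction of triggered samples predicted as the attacker's target label."""
    return float(np.mean(np.asarray(preds_on_triggered) == target_label))

def disagreement(preds_combined, preds_single):
    """Fraction of samples where the two weight settings predict differently."""
    return float(np.mean(np.asarray(preds_combined) != np.asarray(preds_single)))

# Made-up predictions for illustration.
ca = clean_accuracy([0, 1, 2, 2], [0, 1, 2, 3])
asr = attack_success_rate([5, 5, 1, 5], target_label=5)

# Assumed simplified xi: disagreement on clean inputs plus on triggered inputs.
xi = disagreement([0, 1, 2], [0, 1, 2]) + disagreement([5, 5, 5, 5], [5, 5, 5, 1])
```

A value of ξ near zero over a range of (αc, αt) would indicate that the two task vectors act on their own domains without interfering.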

Comparison Methods

  • Clean Data Fine-tuning: CleanCLIP, RoCLIP, standard CLIP fine-tuning
  • Machine Unlearning: Gradient Ascent
  • Reverse Engineering: DECREE

Experimental Results

Main Results

Single-task Classification Experiments

Results on CLIP ViT-B/32:

  • SUN397: ASR reduced from 91.40% to 1.25%, CA maintained at 94.96%
  • CIFAR100: ASR reduced from 99.96% to 0.02%, CA maintained at 96.44%
  • ImageNet-1K: ASR reduced from 93.56% to 1.96%, CA maintained at 94.97%

Large-scale Image-Text Experiments

Results using CC3M dataset:

  • Data Efficiency: TBAR requires only 1.5k samples versus 100k for baseline methods
  • Performance Advantage: Outperforms existing defense methods across all attack types
  • BadCLIP Attack: ASR reduced from 99.98% to 0.77%, CA maintained at 56.58%

Weight Disentanglement Verification

Visualization of weight disentanglement error ξ(αc, αt) confirms that clean and trigger tasks are indeed separated in weight space, validating the core assumption.

Transfer Experiments

Trigger vectors estimated with TBAR on ImageNet-1K remain effective on CIFAR100 and SUN397:

  • CIFAR100: Shared trigger and target label, 99.98% ASR removal rate
  • SUN397: Shared trigger only, 98.91% ASR removal rate

Unknown Attack Scenarios

Results when TBAR is combined with DECREE:

  • BadNet: ASR reduced from 84.48% to 0.33%, CA maintained at 60.29%
  • WaNet: ASR reduced from 93.12% to 0.64%, CA maintained at 56.85%

Ablation Studies

Unlearning Set Size Impact

Experiments show limited performance gains from increasing the unlearning set size (from 300 to 30k), indicating that precisely identifying what to unlearn matters more than data scale.

Clean-Trigger Data Ratio

When the unlearning set mixes clean and trigger data in different proportions, pure trigger data achieves the best CA-ASR tradeoff.

Related Work

Data Poisoning Attacks

Backdoor attacks are a form of data poisoning, introducing hidden vulnerabilities into models by modifying small amounts of training data. Multimodal models like CLIP are primary targets due to their widespread application.

Machine Unlearning

Machine unlearning aims to selectively remove specific learned behaviors, categorized into exact and approximate unlearning. Existing methods show limited effectiveness in backdoor removal tasks.

Weight Interpolation and Task Arithmetic

Task arithmetic encodes learning tasks as vectors in weight space, enabling task addition, removal, and combination through linear operations. Weight disentanglement is the theoretical foundation for these operations' effectiveness.
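
For reference, the linear operations mentioned above are commonly written as follows. The notation is an assumption chosen to match the symbols used earlier in this summary: θ_i denotes weights fine-tuned on task i, and α, α_i are scalar coefficients.

```latex
\tau_i = \theta_i - \theta_{\mathrm{pre}}, \qquad
\theta_{\mathrm{add}} = \theta_{\mathrm{pre}} + \sum_i \alpha_i \tau_i, \qquad
\theta_{\mathrm{neg}} = \theta_{\mathrm{pre}} - \alpha\,\tau_i
```

Addition combines tasks and negation removes one; TBAR applies the negation form to the infected weights θb rather than to θpre.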

Conclusions and Discussion

Main Conclusions

  1. Theoretical Verification: Confirms disentanglement of backdoor behavior from normal tasks in weight space
  2. Method Effectiveness: TBAR demonstrates superior performance across various attacks and settings
  3. Practical Value: Significantly reduces data and computational requirements for backdoor defense

Limitations

  1. Assumption Dependency: The method relies on the weight disentanglement assumption, which may not hold for all model architectures
  2. Attack Types: Primarily validated on standard attacks; robustness against more complex attacks requires further investigation
  3. DECREE Dependency: Unknown attack scenarios depend on DECREE's detection capability, with limited effectiveness against certain attacks (e.g., BadCLIP)

Future Directions

  1. Extend to other model architectures and pretraining paradigms
  2. Study defense against more complex adaptive attacks
  3. Explore weight disentanglement applications in other security tasks

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: First systematic application of weight disentanglement theory to backdoor defense, providing new theoretical perspective
  2. Method Simplicity: TBAR is simple and effective, easy to implement and deploy
  3. Comprehensive Experiments: Covers multiple attack types, datasets, and model architectures with thorough experimental design
  4. Practical Value: Significantly reduces data requirements, important for real-world deployment

Weaknesses

  1. Theoretical Limitations: Universality of weight disentanglement assumption requires more theoretical analysis
  2. Attack Adaptability: Insufficient consideration of adaptive attacks targeting this defense method
  3. Computational Analysis: Lacks detailed computational complexity analysis and comparison

Impact

  1. Academic Value: Provides new perspectives for backdoor defense research, potentially inspiring more weight-space-based defense methods
  2. Practical Value: Important application prospects in large-scale model deployment
  3. Reproducibility: Provides detailed experimental settings and implementation details for reproducibility

Applicable Scenarios

  1. Large-scale Model Deployment: Particularly suitable for large foundation models that cannot be retrained
  2. Resource-Constrained Environments: Scenarios with limited data and computational resources
  3. Multi-task Models: Applications requiring preservation of multi-task performance
