Backdoor Unlearning by Linear Task Decomposition

Abdelraheem, Favero, Bovet et al.

Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.

Basic Information

Paper ID: 2510.14845
Title: Backdoor Unlearning by Linear Task Decomposition
Authors: Amel Abdelraheem, Alessandro Favero, Gérôme Bovet, Pascal Frossard
Classification: cs.LG cs.CV
Publication Date/Venue: arXiv preprint (submitted October 16, 2025)
Paper Link: https://arxiv.org/abs/2510.14845

Research Background and Motivation

Problem Definition

This research addresses the defense against backdoor attacks in large-scale foundation models. Backdoor attacks inject a small number of samples with specific triggers into training data, causing the model to exhibit predetermined malicious behavior when encountering inputs containing the trigger, while maintaining normal performance on clean inputs.

Problem Significance

Security Threat: Backdoor attacks pose serious threats to safety-critical applications such as autonomous driving and medical diagnosis
Scale Challenge: The training cost of large foundation models is prohibitively high, making complete retraining to eliminate backdoors impractical
Generalization Requirement: Existing defense methods often damage model performance on other tasks, suffering from catastrophic forgetting

Limitations of Existing Methods

Retraining Approaches: Computationally prohibitive for large-scale models
Fine-tuning Methods: Prone to catastrophic forgetting, degrading performance on clean tasks
Traditional Machine Unlearning: Limited effectiveness in backdoor removal, particularly poor performance in small-scale settings

Research Motivation

The authors build upon weight disentanglement theory, hypothesizing that backdoor behavior is separated from normal tasks in the model weight space, enabling precise backdoor removal through linear operations without affecting normal functionality.

Core Contributions

Theoretical Insight: Applies weight disentanglement theory to backdoor analysis, providing empirical evidence that backdoor knowledge and clean knowledge are disentangled in the weight space of CLIP-like Transformer models
TBAR Method: Proposes Trigger removal by Backdoor ARithmetic (TBAR), a lightweight backdoor unlearning method based on task vector arithmetic
Superior Performance: Achieves a 99% backdoor removal rate with known triggers while retaining 96% of clean accuracy on average, using roughly two orders of magnitude less data than existing methods
Unknown Attack Scenarios: Combined with reverse engineering techniques, successfully removes backdoors even when attacks are unknown, maintaining over 90% clean accuracy

Methodology Details

Task Definition

Given a backdoor-infected model θb, the goal is to remove the backdoor behavior (driving the Attack Success Rate, ASR, toward zero) while preserving as much of the model's performance on clean data (Clean Accuracy, CA) as possible.

Core Assumption: Weight Disentanglement

The authors propose a core assumption: visual foundation models' weights satisfy the weight disentanglement property for common backdoor attacks:

f(x;θpre + αcτc + αtτt) = f(x;θpre + αcτc)1(x ∈ Dc) + f(x;θpre + αtτt)1(x ∈ Dt)

Where:

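θpre: pretrained model weights
τc: clean task vector
τt: trigger task vector
αc, αt: scalar coefficients scaling the clean and trigger task vectors
Dc: clean image domain
Dt: trigger image domain
1(·): indicator function, equal to 1 when its condition holds and 0 otherwise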
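
The property above can be illustrated with a deliberately simplified toy, not the paper's setting: a linear model whose clean and trigger inputs occupy separate coordinates. All names and values here (f, theta_pre, tau_c, tau_t, the inputs) are hypothetical and chosen only so that each task vector has no effect on the other domain.

```python
import math

def f(x, theta):
    """Toy linear model: inner product of weights and input."""
    return sum(t * xi for t, xi in zip(theta, x))

theta_pre = [0.1, 0.2, 0.3, 0.4]
tau_c = [1.0, -1.0, 0.0, 0.0]   # assumed: only touches coordinates used by clean inputs
tau_t = [0.0, 0.0, 2.0, 0.5]    # assumed: only touches coordinates used by trigger inputs

def shifted(alpha_c, alpha_t):
    """theta_pre + alpha_c * tau_c + alpha_t * tau_t."""
    return [p + alpha_c * c + alpha_t * t
            for p, c, t in zip(theta_pre, tau_c, tau_t)]

x_clean = [1.0, 2.0, 0.0, 0.0]  # stands in for a sample from Dc
x_trig = [0.0, 0.0, 1.0, 3.0]   # stands in for a sample from Dt

alpha_c, alpha_t = 0.7, 1.3
combined = shifted(alpha_c, alpha_t)

# On the clean domain the combined model matches the clean-only model ...
clean_match = math.isclose(f(x_clean, combined), f(x_clean, shifted(alpha_c, 0.0)))
# ... and on the trigger domain it matches the trigger-only model.
trig_match = math.isclose(f(x_trig, combined), f(x_trig, shifted(0.0, alpha_t)))
```

In this toy both checks hold for any choice of αc and αt. Whether real CLIP weights behave this way is the empirical question the paper investigates.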

TBAR Algorithm Flow

1. Trigger Vector Estimation

Fine-tune the infected model θb on a small unlearning set containing only triggered samples, yielding the weights θb+t. The estimated trigger vector is the resulting weight difference:

τ̂t = θb+t - θb

2. Backdoor Removal

Remove backdoors through task negation:

θ̂c = θb - ατ̂t

Where α is a scalar coefficient controlling unlearning strength.

3. Coefficient Optimization

Determine optimal α value using grid search on a small-scale validation set.
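
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are plain dictionaries of float lists, the evaluate callback and the selection rule (lowest ASR among candidates whose CA drop stays within max_ca_drop) are hypothetical, and the toy model at the end exists only to exercise the functions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]  # parameter name -> flat list of values

def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Step 1: tau_hat_t = theta_{b+t} - theta_b, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def negate(base: Weights, tau: Weights, alpha: float) -> Weights:
    """Step 2: theta_hat_c = theta_b - alpha * tau_hat_t."""
    return {k: [b - alpha * t for b, t in zip(base[k], tau[k])] for k in base}

def select_alpha(
    base: Weights,
    tau: Weights,
    alphas: Sequence[float],
    evaluate: Callable[[Weights], Tuple[float, float]],  # returns (CA, ASR)
    max_ca_drop: float = 0.05,  # assumed tolerance, not from the paper
) -> Tuple[float, float, float]:
    """Step 3: grid search over alpha on a small validation set."""
    base_ca, _ = evaluate(base)
    best = None
    for alpha in alphas:
        ca, asr = evaluate(negate(base, tau, alpha))
        if ca < base_ca - max_ca_drop:
            continue  # too much clean accuracy lost
        if best is None or asr < best[2]:
            best = (alpha, ca, asr)
    if best is None:
        raise ValueError("no alpha satisfied the clean-accuracy constraint")
    return best

# Toy check: coordinate 0 drives clean accuracy, coordinate 1 drives the backdoor.
def toy_evaluate(w: Weights) -> Tuple[float, float]:
    clip = lambda v: min(1.0, max(0.0, v))
    return clip(w["w"][0]), clip(w["w"][1])

theta_b = {"w": [0.9, 0.5]}    # infected model
theta_bt = {"w": [0.9, 1.0]}   # after fine-tuning on triggered samples
tau_hat = task_vector(theta_bt, theta_b)
best_alpha, best_ca, best_asr = select_alpha(
    theta_b, tau_hat, [0.0, 0.5, 1.0, 1.5], toy_evaluate
)
```

In practice the same arithmetic would run over a model's parameter tensors, with evaluate measuring CA and ASR on held-out clean and triggered samples.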

Extension to Unknown Attack Scenarios

When the attack is unknown, TBAR is combined with the DECREE trigger reverse-engineering method:

Recover proxy triggers from the infected model using DECREE
Infer target labels by probing model responses
Construct proxy trigger sample set
Apply TBAR for backdoor removal
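
The label-inference step is described only as probing model responses. One plausible heuristic, offered as an assumption rather than the paper's procedure, is to stamp the proxy trigger on clean images and pick the label that predictions most often flip to. The function name and inputs below are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

def infer_target_label(
    clean_preds: Sequence[int],
    triggered_preds: Sequence[int],
) -> Tuple[int, float]:
    """Return the label that predictions flip to most often once the proxy
    trigger is applied, plus the fraction of all samples that flip to it."""
    flips = Counter(
        t for c, t in zip(clean_preds, triggered_preds) if c != t
    )
    if not flips:
        raise ValueError("the proxy trigger did not change any prediction")
    label, count = flips.most_common(1)[0]
    return label, count / len(triggered_preds)

# Predictions for the same six images without and with the proxy trigger.
clean = [0, 1, 2, 3, 4, 5]
triggered = [7, 7, 7, 3, 7, 2]
target, flip_rate = infer_target_label(clean, triggered)
```

The inferred label would then be used to build the proxy trigger sample set that TBAR fine-tunes on.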

Experimental Setup

Datasets

Single-task Classification: SUN397, CIFAR100, ImageNet-1K
Large-scale Image-Text: 500k subset of Conceptual Captions 3M (CC3M)

Backdoor Attack Types

BadNet: Insert 16×16 random noise blocks at random positions
Blended: Overlay Gaussian noise across the entire image (8:2 image-to-noise blending ratio)
WaNet: Apply subtle image distortion transformations
BadCLIP: Patch attacks optimized for CLIP
SIG: Sinusoidal perturbations along horizontal axis
BadMerging: Attacks designed to survive model merging

Evaluation Metrics

Clean Accuracy (CA): Model accuracy on clean data
Attack Success Rate (ASR): Proportion of trigger samples predicted as target label
Weight Disentanglement Error (ξ): Measures prediction differences between combined and separately applied task vectors
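
The three metrics can be computed from predicted labels as sketched below. CA and ASR follow the definitions above directly. For ξ, the summary does not give the exact distance or normalization, so this sketch assumes prediction disagreement as the distance and sums it over the clean and trigger domains; the function names are hypothetical.

```python
from typing import Sequence

def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean samples classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(preds_on_triggered: Sequence[int], target: int) -> float:
    """Fraction of triggered samples predicted as the attacker's target label."""
    return sum(p == target for p in preds_on_triggered) / len(preds_on_triggered)

def disagreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Assumed distance: fraction of samples on which two models disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def disentanglement_error(
    combined_on_clean: Sequence[int],
    clean_only_on_clean: Sequence[int],
    combined_on_trigger: Sequence[int],
    trigger_only_on_trigger: Sequence[int],
) -> float:
    """xi(alpha_c, alpha_t) for one coefficient pair: how much the model with
    both task vectors deviates from each single-vector model on its own domain."""
    return (disagreement(combined_on_clean, clean_only_on_clean)
            + disagreement(combined_on_trigger, trigger_only_on_trigger))

ca = clean_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
asr = attack_success_rate([7, 7, 1, 7], target=7)
xi = disentanglement_error([1, 2, 3, 4], [1, 2, 3, 0], [7, 7, 7, 7], [7, 7, 7, 7])
```

A value of ξ near zero across a grid of (αc, αt) pairs is what the disentanglement assumption predicts.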

Comparison Methods

Clean Data Fine-tuning: CleanCLIP, RoCLIP, standard CLIP fine-tuning
Machine Unlearning: Gradient Ascent
Reverse Engineering: DECREE

Experimental Results

Main Results

Single-task Classification Experiments

Results on CLIP ViT-B/32:

SUN397: ASR reduced from 91.40% to 1.25%, CA maintained at 94.96%
CIFAR100: ASR reduced from 99.96% to 0.02%, CA maintained at 96.44%
ImageNet-1K: ASR reduced from 93.56% to 1.96%, CA maintained at 94.97%

Large-scale Image-Text Experiments

Results using CC3M dataset:

Data Efficiency: TBAR requires only 1.5k samples versus 100k for baseline methods
Performance Advantage: Outperforms existing defense methods across all attack types
BadCLIP Attack: ASR reduced from 99.98% to 0.77%, CA maintained at 56.58%

Weight Disentanglement Verification

Visualization of weight disentanglement error ξ(αc, αt) confirms that clean and trigger tasks are indeed separated in weight space, validating the core assumption.

Transfer Experiments

TBAR vectors trained on ImageNet-1K remain effective on CIFAR100 and SUN397:

CIFAR100: Shared trigger and target label, 99.98% ASR removal rate
SUN397: Shared trigger only, 98.91% ASR removal rate

Unknown Attack Scenarios

Results combined with DECREE:

BadNet: ASR reduced from 84.48% to 0.33%, CA maintained at 60.29%
WaNet: ASR reduced from 93.12% to 0.64%, CA maintained at 56.85%

Main Conclusions

Theoretical Verification: Confirms disentanglement of backdoor behavior from normal tasks in weight space
Method Effectiveness: TBAR demonstrates superior performance across various attacks and settings
Practical Value: Significantly reduces data and computational requirements for backdoor defense

Limitations

Assumption Dependency: The method relies on the weight disentanglement assumption, which may not hold for all model architectures
Attack Types: Primarily validated on standard attacks; robustness against more complex attacks requires further investigation
DECREE Dependency: Unknown attack scenarios depend on DECREE's detection capability, with limited effectiveness against certain attacks (e.g., BadCLIP)

Future Directions

Extend to other model architectures and pretraining paradigms
Study defense against more complex adaptive attacks
Explore weight disentanglement applications in other security tasks

In-Depth Evaluation

Strengths

Theoretical Innovation: First systematic application of weight disentanglement theory to backdoor defense, providing new theoretical perspective
Method Simplicity: TBAR is simple and effective, easy to implement and deploy
Comprehensive Experiments: Covers multiple attack types, datasets, and model architectures with thorough experimental design
Practical Value: Significantly reduces data requirements, important for real-world deployment

Weaknesses

Theoretical Limitations: Universality of weight disentanglement assumption requires more theoretical analysis
Attack Adaptability: Insufficient consideration of adaptive attacks targeting this defense method
Computational Analysis: Lacks detailed computational complexity analysis and comparison

Impact

Academic Value: Provides new perspectives for backdoor defense research, potentially inspiring more weight-space-based defense methods
Practical Value: Important application prospects in large-scale model deployment
Reproducibility: Provides detailed experimental settings and implementation details

Applicable Scenarios

Large-scale Model Deployment: Particularly suitable for large foundation models that cannot be retrained
Resource-Constrained Environments: Scenarios with limited data and computational resources
Multi-task Models: Applications requiring preservation of multi-task performance

References

The paper cites important works in the field, including:

Ilharco et al. (2022): Pioneering work on task arithmetic
Ortiz-Jimenez et al. (2024): Theoretical foundation of weight disentanglement
Bansal et al. (2023): Benchmark methods for CLIP backdoor defense
Carlini & Terzis (2021): Classical research on CLIP backdoor attacks