Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
This research addresses the defense against backdoor attacks in large-scale foundation models. Backdoor attacks inject a small number of samples with specific triggers into training data, causing the model to exhibit predetermined malicious behavior when encountering inputs containing the trigger, while maintaining normal performance on clean inputs.
The authors build upon weight disentanglement theory, hypothesizing that backdoor behavior is separated from normal tasks in the model weight space, enabling precise backdoor removal through linear operations without affecting normal functionality.
Theoretical Insight: First application of weight disentanglement theory to backdoor analysis, proving that backdoor knowledge and clean knowledge are disentangled in the weight space of CLIP-like Transformer models
TBAR Method: Proposes Trigger removal by Backdoor ARithmetic (TBAR), a lightweight backdoor unlearning method based on task vector arithmetic
Superior Performance: Achieves 99% backdoor removal rate with known triggers while maintaining 96% clean accuracy, requiring two orders of magnitude less data than existing methods
Unknown Attack Scenarios: Combined with reverse engineering techniques, successfully removes backdoors even when attacks are unknown, maintaining over 90% clean accuracy
Given a backdoor-infected model θb, the goal is to remove the backdoor behavior, driving the Attack Success Rate (ASR) to zero, while preserving as much performance on clean data as possible, measured by Clean Accuracy (CA).
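The two metrics can be made concrete with a minimal sketch. The function names are hypothetical, and the convention of excluding triggered samples whose true label already equals the attacker's target is an assumption; the summary does not state which convention the authors use.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """CA: fraction of clean (trigger-free) inputs that are classified correctly."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def attack_success_rate(
    triggered_preds: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR: fraction of triggered inputs that are classified as the attacker's target.

    Assumption: samples whose true label is already the target are skipped,
    since predicting the target on them does not indicate a successful attack.
    """
    if len(triggered_preds) != len(true_labels):
        raise ValueError("triggered_preds and true_labels must have equal length")
    pairs = [(p, y) for p, y in zip(triggered_preds, true_labels) if y != target_label]
    if not pairs:
        raise ValueError("no triggered samples with a non-target true label")
    hits = sum(p == target_label for p, _ in pairs)
    return hits / len(pairs)
```

A successful defense in this setting pushes `attack_success_rate` toward 0 on triggered inputs while keeping `clean_accuracy` close to its value before unlearning.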
Visualization of weight disentanglement error ξ(αc, αt) confirms that clean and trigger tasks are indeed separated in weight space, validating the core assumption.
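For reference, the disentanglement error is usually defined in the task arithmetic literature along the following lines. This is a sketch of that standard form, not a quotation from the paper: the symbols θ0 (base weights), τc and τt (clean and trigger task vectors), μc and μt (the corresponding input distributions), f (the model), and dist (a prediction disagreement measure) are assumptions introduced here.

```latex
\xi(\alpha_c, \alpha_t) =
  \mathbb{E}_{x \sim \mu_c}\!\left[
    \operatorname{dist}\!\big(
      f(x;\, \theta_0 + \alpha_c \tau_c),\;
      f(x;\, \theta_0 + \alpha_c \tau_c + \alpha_t \tau_t)
    \big)
  \right]
  +
  \mathbb{E}_{x \sim \mu_t}\!\left[
    \operatorname{dist}\!\big(
      f(x;\, \theta_0 + \alpha_t \tau_t),\;
      f(x;\, \theta_0 + \alpha_c \tau_c + \alpha_t \tau_t)
    \big)
  \right]
```

Under this reading, a low ξ over a wide range of (αc, αt) means that adding or scaling the trigger vector barely changes predictions on clean inputs, and vice versa, which is the separation the visualization is meant to confirm.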
Experiments show only limited gains from enlarging the unlearning set (from 300 to 30k), indicating that precisely identifying what to unlearn matters more than data scale.
Backdoor attacks are a form of data poisoning, introducing hidden vulnerabilities into models by modifying small amounts of training data. Multimodal models like CLIP are primary targets due to their widespread application.
Machine unlearning aims to selectively remove specific learned behaviors, categorized into exact and approximate unlearning. Existing methods show limited effectiveness in backdoor removal tasks.
Task arithmetic encodes learning tasks as vectors in weight space, enabling task addition, removal, and combination through linear operations. Weight disentanglement is the theoretical foundation for these operations' effectiveness.
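The linear operations described above can be illustrated with a small sketch. It shows generic task vector negation on toy weights, not the authors' exact TBAR procedure: the helper names, the dict-of-lists weight format, and the way the trigger vector is obtained here (further fine-tuning the infected model on triggered data) are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy weight format (an assumption): parameter name -> flattened values.
Weights = Dict[str, List[float]]


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """tau = theta_finetuned - theta_base, computed parameter by parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vector(base: Weights, tau: Weights, alpha: float) -> Weights:
    """theta_new = theta_base + alpha * tau.

    alpha > 0 adds the task; alpha < 0 negates it, i.e. unlearns the behavior.
    """
    return {
        name: [b + alpha * t for b, t in zip(base[name], tau[name])]
        for name in base
    }


# Hypothetical example: theta_b is the infected model, theta_trig is the same
# model after further fine-tuning on triggered samples only.
theta_b = {"w": [1.0, 2.0]}
theta_trig = {"w": [1.5, 1.0]}

tau_trigger = task_vector(theta_b, theta_trig)
theta_unlearned = apply_task_vector(theta_b, tau_trigger, alpha=-1.0)
```

In practice the scaling coefficient would be tuned rather than fixed at -1, trading off how much of the trigger direction is removed against any drop in clean accuracy.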
Assumption Dependency: The method relies on the weight disentanglement assumption, which may not hold for all model architectures
Attack Types: Primarily validated on standard attacks; robustness against more complex attacks requires further investigation
DECREE Dependency: Unknown attack scenarios depend on DECREE's detection capability, with limited effectiveness against certain attacks (e.g., BadCLIP)