Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor's influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
Paper ID: 2510.14845
Title: Backdoor Unlearning by Linear Task Decomposition
Authors: Amel Abdelraheem, Alessandro Favero, Gérôme Bovet, Pascal Frossard
Categories: cs.LG, cs.CV
Venue: arXiv preprint (submitted October 16, 2025)
Paper link: https://arxiv.org/abs/2510.14845
This work addresses the problem of defending large-scale foundation models against backdoor attacks. A backdoor attack injects a small number of samples containing a specific trigger into the training data, so that the model produces a predetermined malicious behavior whenever it encounters an input containing the trigger, while functioning normally on ordinary inputs.
The problem matters for three reasons:
Security threat: Backdoor attacks pose a serious threat to safety-critical applications such as autonomous driving and medical diagnosis.
Scale: Training large foundation models is extremely expensive, so fully retraining a model to eliminate a backdoor is impractical.
Generality: Existing defenses often degrade the model's performance on other tasks, i.e. they suffer from catastrophic forgetting.

Existing approaches each have limitations:
Retraining: Computationally expensive and infeasible for large models.
Fine-tuning: Prone to catastrophic forgetting, which lowers model performance on clean tasks.
Conventional machine unlearning: Limited effectiveness at backdoor removal, with particularly weak performance in small-scale settings.

Building on weight disentanglement theory, the authors hypothesize that backdoor behavior is separated from ordinary tasks in the model's weight space, so that linear operations can remove the backdoor precisely without affecting normal functionality.
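The paper's main contributions are:
Theoretical insight: Applies weight disentanglement theory to backdoor analysis for the first time and shows that, in Transformer models such as CLIP, backdoor knowledge and clean knowledge are separated in weight space.
TBAR: Proposes Trigger removal by Backdoor ARithmetic (TBAR), a lightweight backdoor unlearning method based on task vector arithmetic.
Performance: When the trigger is known, achieves a 99% backdoor removal rate while retaining 96% clean accuracy, with data requirements two orders of magnitude lower than existing methods.
Attack-unknown scenario: Combined with reverse-engineering techniques, removes the backdoor even when the attack is unknown while retaining over 90% clean accuracy.

Problem statement: Given a model $\theta_b$ infected by a backdoor attack, the goal is to remove the backdoor behavior (driving the attack success rate, ASR, to zero) while preserving as much of the model's performance on clean data (clean accuracy, CA) as possible.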
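The two quantities in the problem statement can be computed from model predictions roughly as follows. This is a minimal sketch, not the paper's evaluation code: the function names are hypothetical, and excluding samples whose true label already equals the target label from the ASR is an assumed (though common) convention.

```python
import numpy as np

def clean_accuracy(preds_clean, labels_clean):
    """CA: fraction of clean inputs classified correctly."""
    preds_clean = np.asarray(preds_clean)
    labels_clean = np.asarray(labels_clean)
    return float(np.mean(preds_clean == labels_clean))

def attack_success_rate(preds_triggered, true_labels, target_label):
    """ASR: fraction of triggered inputs predicted as the target label.

    Assumption: samples whose true label already equals the target are
    skipped, since they would count as successes without any attack.
    """
    preds_triggered = np.asarray(preds_triggered)
    true_labels = np.asarray(true_labels)
    keep = true_labels != target_label
    if not keep.any():
        return 0.0
    return float(np.mean(preds_triggered[keep] == target_label))

# Toy predictions (illustrative values only).
ca = clean_accuracy([0, 1, 2, 2], [0, 1, 2, 1])                  # 3 of 4 correct
asr = attack_success_rate([7, 7, 3, 7], [1, 2, 3, 7], target_label=7)  # 2 of 3 kept samples hit the target
```

A successful defense pushes `asr` toward zero while leaving `ca` close to its value before unlearning.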
The authors put forward a central hypothesis: the weights of vision foundation models satisfy the weight disentanglement property with respect to common backdoor attacks:
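$$f(x;\, \theta_{\mathrm{pre}} + \alpha_c \tau_c + \alpha_t \tau_t) = f(x;\, \theta_{\mathrm{pre}} + \alpha_c \tau_c)\, \mathbb{1}(x \in \mathcal{D}_c) + f(x;\, \theta_{\mathrm{pre}} + \alpha_t \tau_t)\, \mathbb{1}(x \in \mathcal{D}_t)$$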
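where:
$\tau_c$: clean task vector
$\tau_t$: trigger task vector
$\mathcal{D}_c$: clean image domain
$\mathcal{D}_t$: triggered image domain

TBAR proceeds in two steps. First, the infected model is fine-tuned on a small unlearning set containing only triggered samples, and the resulting change in weights serves as the estimated trigger task vector. Second, the backdoor is removed by task negation, i.e. by subtracting that vector, scaled by $\alpha$, from the infected weights. Here $\alpha$ is a scalar coefficient that controls the strength of unlearning.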
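In standard task-arithmetic notation, the fine-tuning and negation steps can be written as below. This is a sketch under assumptions: $\theta_{b+t}$ is an assumed name for the weights obtained after fine-tuning the infected model $\theta_b$ on the unlearning set, and the hats mark estimated quantities.

```latex
\hat{\tau}_t = \theta_{b+t} - \theta_b,
\qquad
\hat{\theta}_c = \theta_b - \alpha \, \hat{\tau}_t
```

With $\alpha = 0$ the infected model is returned unchanged; larger values remove more of the trigger direction at the risk of affecting clean behavior.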
The optimal value of $\alpha$ is determined by grid search on a small validation set.
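The task-vector estimate, the negation step, and the search over $\alpha$ can be sketched on toy weights as follows. Everything here is illustrative: the weights are a two-element array rather than a real CLIP checkpoint, `toy_evaluate` is a hypothetical stand-in for validation-set evaluation, and the selection rule (lowest ASR subject to a minimum CA) is an assumption, since the text above only states that a grid search is used.

```python
import numpy as np

def task_vector(theta_finetuned, theta_base):
    """Element-wise weight difference: fine-tuned minus base."""
    return {k: theta_finetuned[k] - theta_base[k] for k in theta_base}

def negate(theta_base, tau, alpha):
    """Task negation: subtract the scaled task vector from the base weights."""
    return {k: theta_base[k] - alpha * tau[k] for k in theta_base}

def select_alpha(theta_b, tau_t, evaluate, alphas, min_ca):
    """Grid search over alpha.

    Assumed criterion: among candidates whose validation CA stays at or
    above min_ca, keep the one with the lowest ASR (first one wins ties).
    Returns (alpha, ca, asr), or None if no candidate meets min_ca.
    """
    best = None
    for alpha in alphas:
        ca, asr = evaluate(negate(theta_b, tau_t, alpha))
        if ca >= min_ca and (best is None or asr < best[2]):
            best = (alpha, ca, asr)
    return best

# Toy stand-in for a real model: w[0] plays the role of clean ability,
# w[1] the backdoor strength. All numbers are illustrative.
theta_b = {"w": np.array([1.0, 1.0])}     # infected model
theta_bt = {"w": np.array([1.05, 2.0])}   # after fine-tuning on triggered samples

def toy_evaluate(theta):
    """Hypothetical evaluation returning (CA, ASR), both clipped to [0, 1]."""
    ca = float(np.clip(theta["w"][0], 0.0, 1.0))
    asr = float(np.clip(theta["w"][1], 0.0, 1.0))
    return ca, asr

tau_t = task_vector(theta_bt, theta_b)
best = select_alpha(theta_b, tau_t, toy_evaluate,
                    alphas=[0.0, 0.5, 1.0, 1.5], min_ca=0.9)
```

With a real model the dictionaries would hold the checkpoint's parameter tensors, and `evaluate` would run the edited weights on the validation set.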
In the attack-unknown setting, TBAR is combined with the DECREE reverse-engineering method:
Use DECREE to recover a proxy trigger from the infected model.
Infer the target label by probing the model's responses.
Construct a set of proxy-triggered samples.
Apply TBAR to remove the backdoor.

Datasets:
Single-task classification: SUN397, CIFAR100, ImageNet-1K
Large-scale image-text: a 500k subset of Conceptual Captions 3M (CC3M)

Attacks:
BadNet: inserts a 16×16 random-noise patch at a random position
Blended: overlays a Gaussian perturbation on the whole image (8:2 ratio)
WaNet: applies a subtle image-warping transformation
BadCLIP: a patch attack optimized for CLIP
SIG: a sinusoidal perturbation along the horizontal axis
BadMerging: an attack designed to persist after model merging

Metrics:
Clean accuracy (CA): the model's accuracy on clean data
Attack success rate (ASR): the fraction of triggered samples predicted as the target label
Weight disentanglement error ($\xi$): measures the prediction discrepancy between applying the task vectors in combination and applying them individually

Baselines:
Clean-data fine-tuning: CleanCLIP, RoCLIP, standard CLIP fine-tuning
Machine unlearning: Gradient Ascent
Reverse engineering: DECREE

Results on CLIP ViT-B/32:
SUN397: ASR drops from 91.40% to 1.25%, retaining 94.96% CA
CIFAR100: ASR drops from 99.96% to 0.02%, retaining 96.44% CA
ImageNet-1K: ASR drops from 93.56% to 1.96%, retaining 94.97% CA

Results on the CC3M dataset:
Data efficiency: TBAR needs only 1.5k samples, whereas the baseline methods need 100k.
Performance: TBAR outperforms existing defenses on all attack types.
BadCLIP attack: ASR drops from 99.98% to 0.77%, retaining 56.58% CA

Visualizing the weight disentanglement error $\xi(\alpha_c, \alpha_t)$ confirms that the clean task and the trigger task are in fact separated in weight space, which supports the central hypothesis.
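One way to compute such an error is sketched below, assuming the disagreement-based definition common in the weight disentanglement literature: on each domain, count how often the jointly edited model disagrees with the model edited by that domain's task vector alone. The toy linear classifier, the arrays, and the function names are hypothetical and only illustrate the bookkeeping.

```python
import numpy as np

def predict(weights, x):
    """Toy linear classifier standing in for the real model f(x; theta)."""
    return np.argmax(x @ weights, axis=1)

def disentanglement_error(theta_pre, tau_c, tau_t, alpha_c, alpha_t,
                          x_clean, x_trig):
    """Sum over both domains of the prediction disagreement rate between
    the jointly edited model and the singly edited model."""
    joint = theta_pre + alpha_c * tau_c + alpha_t * tau_t
    only_c = theta_pre + alpha_c * tau_c
    only_t = theta_pre + alpha_t * tau_t
    err_c = np.mean(predict(joint, x_clean) != predict(only_c, x_clean))
    err_t = np.mean(predict(joint, x_trig) != predict(only_t, x_trig))
    return float(err_c + err_t)

# Three input features, two classes. Feature 2 plays the role of the trigger.
theta_pre = np.zeros((3, 2))
tau_c = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # touches clean features only
tau_t = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 5.0]])   # touches the trigger feature only
tau_t_entangled = np.array([[0.0, 3.0], [0.0, 0.0], [0.0, 5.0]])  # also alters a clean feature

x_clean = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
x_trig = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

xi_disentangled = disentanglement_error(theta_pre, tau_c, tau_t, 1.0, 1.0,
                                        x_clean, x_trig)
xi_entangled = disentanglement_error(theta_pre, tau_c, tau_t_entangled, 1.0, 1.0,
                                     x_clean, x_trig)
```

Evaluating the function over a grid of `alpha_c` and `alpha_t` values would give the kind of map that can be visualized; values near zero indicate that the two task vectors do not interfere.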
A TBAR vector trained on ImageNet-1K is also effective on CIFAR100 and SUN397:
CIFAR100: shared trigger and target label; the ASR removal rate reaches 99.98%.
SUN397: shared trigger only; the ASR removal rate reaches 98.91%.

Results when combined with DECREE:
BadNet: ASR drops from 84.48% to 0.33%, retaining 60.29% CA
WaNet: ASR drops from 93.12% to 0.64%, retaining 56.85% CA

Experiments show that increasing the unlearning set size (from 300 to 30k) yields only limited gains, indicating that identifying precisely what needs to be unlearned matters more than the amount of data.
Results with mixtures of clean and triggered data at different ratios show that purely triggered data gives the best CA-ASR trade-off.
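Backdoor attacks are a form of data poisoning: by modifying a small fraction of the training data, they implant a hidden vulnerability in the model. Multimodal models such as CLIP are prime targets because of their wide range of applications.
Machine unlearning aims to selectively remove specific learned behaviors and falls into two categories, exact unlearning and approximate unlearning. Existing methods have limited effectiveness at backdoor removal.
Task arithmetic encodes learned tasks as vectors in weight space, so that tasks can be added, removed, and combined through linear operations. The weight disentanglement property is the theoretical basis for the validity of these operations.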
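The linear operations of task arithmetic can be illustrated with a small sketch. It assumes that weights and task vectors are stored as dictionaries of NumPy arrays with matching keys; the helper name `apply_task_vectors` and the values are hypothetical.

```python
import numpy as np

def apply_task_vectors(theta, scaled_vectors):
    """Return theta + sum(alpha * tau) over the given (alpha, tau) pairs.

    A positive alpha adds a task, a negative alpha removes it, and passing
    several pairs combines tasks in one step. The input is not modified.
    """
    out = {k: v.copy() for k, v in theta.items()}
    for alpha, tau in scaled_vectors:
        for k in out:
            out[k] += alpha * tau[k]
    return out

theta_pre = {"w": np.zeros(2)}
tau_a = {"w": np.array([1.0, 0.0])}   # illustrative task vector A
tau_b = {"w": np.array([0.0, 2.0])}   # illustrative task vector B

combined = apply_task_vectors(theta_pre, [(1.0, tau_a), (1.0, tau_b)])  # add both tasks
without_b = apply_task_vectors(combined, [(-1.0, tau_b)])               # negate task B
```

In this toy case the two vectors touch different coordinates, so removing B leaves A's contribution intact, which is the behavior that weight disentanglement is meant to guarantee for real models.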
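Main conclusions:
Theoretical validation: Backdoor behavior and ordinary tasks are confirmed to be separated in weight space.
Method effectiveness: TBAR performs well across multiple attacks and settings.
Practical value: The data and compute requirements of backdoor defense are substantially reduced.

Limitations:
Reliance on the hypothesis: The method rests on the weight disentanglement hypothesis and may not apply to every model architecture.
Attack types: Validation is mainly on standard attacks; robustness to more complex attacks needs further study.
Dependence on DECREE: The attack-unknown scenario depends on DECREE's detection ability, and effectiveness is limited for some attacks (such as BadCLIP).

Future directions:
Extension to other model architectures and pre-training paradigms
Defenses against more complex adaptive attacks
Applications of weight disentanglement to other security tasks

Strengths:
Theoretical novelty: Applies weight disentanglement theory to backdoor defense systematically for the first time, offering a new theoretical perspective.
Simplicity: TBAR is simple and effective, and easy to implement and deploy.
Comprehensive experiments: The experimental design covers multiple attack types, datasets, and model architectures.
Practical value: Data requirements are greatly reduced, which matters for real deployments.

Weaknesses:
Theoretical limits: The generality of the weight disentanglement hypothesis needs further theoretical analysis.
Adaptive attacks: Adaptive attacks against this defense are not examined in sufficient depth.
Computational analysis: A detailed computational complexity analysis and comparison is missing.

Impact:
Academic value: Offers a new perspective for backdoor defense research and may stimulate further defenses based on weight space.
Practical value: Has promising applications in the deployment of large-scale models.
Reproducibility: Detailed experimental settings and implementation details are provided, which eases reproduction.

Applicable scenarios:
Large-model deployment: Particularly suited to large foundation models that cannot be retrained.
Resource-constrained environments: Scenarios with limited data and compute.
Multi-task models: Applications that need to preserve multi-task performance.

The paper cites key work in this area, including: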
Ilharco et al. (2022): pioneering work on task arithmetic
Ortiz-Jimenez et al. (2024): theoretical foundations of weight disentanglement
Bansal et al. (2023): a benchmark method for CLIP backdoor defense
Carlini & Terzis (2021): pioneering work on CLIP backdoor attacks