HoneypotNet: Backdoor Attacks Against Model Extraction
Wang, Gu, Teng et al.
Model extraction attacks are a type of inference-time attack that approximates the functionality and performance of a black-box victim model by issuing a number of queries to the model and then using the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to model owners. A body of work has proposed defenses against model extraction attacks, including active defenses that modify the model's outputs or increase the query overhead to hinder extraction, and passive defenses that detect malicious queries or leverage watermarks for post-verification. In this work, we introduce a new defense paradigm called attack as defense, which modifies the model's output to be poisonous so that any malicious user who attempts to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization, so that its output becomes poisonous while the original performance is retained. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.
academic
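Model extraction attacks are a type of inference-time attack: by issuing a number of queries to a black-box victim model and training a substitute model on its predictions, the attacker approximates the victim's functionality and performance. Such attacks pose a serious security threat to production models and MLaaS platforms and can cause significant financial losses for model owners. This paper proposes a new defense paradigm, "attack as defense", which modifies the model's output to be poisonous so that any malicious user who tries to train a substitute model on that output is poisoned. To this end, the authors propose HoneypotNet, a lightweight backdoor attack method that replaces the victim model's classification layer with a honeypot layer and fine-tunes that layer with a shadow model (which simulates the extraction process) via bi-level optimization, making the output poisonous while preserving the original performance.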
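The query-then-train loop that a model extraction attack relies on can be illustrated with a minimal sketch. This is a toy setup and not the paper's experimental pipeline: the victim here is a hypothetical two-feature linear classifier, the substitute is a logistic regression trained with plain SGD, and the names `victim_predict`, `extract_substitute`, and `fidelity` are made up for illustration. It shows only the attack that HoneypotNet targets, not the honeypot layer or the bi-level optimization.

```python
import math
import random


def victim_predict(x):
    """Hypothetical black-box victim: the attacker sees only the hard label."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


def extract_substitute(query_fn, n_queries=500, epochs=100, lr=0.5, seed=0):
    """Query the victim, then fit a substitute on the returned labels."""
    rng = random.Random(seed)
    # Step 1: the attacker issues queries and records the victim's predictions.
    queries = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_queries)]
    labels = [query_fn(x) for x in queries]

    # Step 2: train a substitute (logistic regression via SGD) on those labels.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(queries, labels):
            z = w[0] * x[0] + w[1] * x[1] + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the log loss with respect to z
            w[0] -= lr * grad * x[0]
            w[1] -= lr * grad * x[1]
            b -= lr * grad

    def substitute_predict(x):
        return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

    return substitute_predict


substitute = extract_substitute(victim_predict)

# Fidelity: how often the substitute agrees with the victim on fresh inputs.
eval_rng = random.Random(1)
fresh = [(eval_rng.uniform(-1, 1), eval_rng.uniform(-1, 1)) for _ in range(1000)]
fidelity = sum(substitute(x) == victim_predict(x) for x in fresh) / len(fresh)
print(f"substitute/victim agreement: {fidelity:.3f}")
```

In this framing, an "attack as defense" approach changes what `query_fn` returns, so that the labels the attacker trains on carry a backdoor into the substitute while the victim's answers remain useful to ordinary users.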