HoneypotNet: Backdoor Attacks Against Model Extraction

Wang, Gu, Teng et al.

Model extraction attacks are one type of inference-time attack that approximates the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model's outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called "attack as defense", which modifies the model's output to be poisonous such that any malicious user who attempts to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.

Basic Information

Paper ID: 2501.01090
Title: HoneypotNet: Backdoor Attacks Against Model Extraction
Authors: Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
Categories: cs.CR (Cryptography and Security), cs.CV (Computer Vision and Pattern Recognition)
Publication date/venue: submitted to arXiv on January 2, 2025
Paper link: https://arxiv.org/abs/2501.01090

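Summary

A model extraction attack is an inference-time attack: the attacker sends a number of queries to a black-box victim model and trains a substitute model on the returned predictions, approximating the victim's functionality and performance. Such attacks pose a serious security threat to production models and MLaaS platforms and can cause significant financial losses to model owners. The paper proposes a new defense paradigm, "attack as defense", which makes the model's outputs poisonous so that any malicious user who trains a substitute model on them is poisoned. To this end, the authors propose HoneypotNet, a lightweight backdoor attack method that replaces the victim model's classification layer with a honeypot layer and fine-tunes that layer against a shadow model (which simulates the extraction process) via bi-level optimization, making the outputs poisonous while preserving the original performance.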

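Research Background and Motivation

Problem Definition

Model extraction attacks have become one of the main threats facing machine-learning-as-a-service (MLaaS) platforms. The attacker queries a black-box model through its API and uses the returned predictions to train a functionally similar substitute model, thereby stealing the model's intellectual property.

Why the Problem Matters

Financial loss: model extraction attacks can cause significant financial losses to model owners
Intellectual property protection: deep learning models are expensive to train and need effective protection
Security threat: attackers can use the extracted model to mount further adversarial attacks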

现有方法局限性

现有防御方法主要分为两类：

被动防御：通过检测恶意查询或使用水印进行事后验证，但依赖先验知识，效果有限
主动防御：通过扰动模型输出或增加查询开销来阻止提取，但计算开销大且可能被高级攻击绕过

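Research Motivation

Traditional defenses are locked in an arms race with attackers. This paper proposes a new "attack as defense" paradigm that actively attacks the substitute model to disrupt its functionality, creating a strong deterrent for attackers.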

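Core Contributions

New defense paradigm: the first to propose "attack as defense", in which the defender actively mounts a backdoor attack on the substitute model
HoneypotNet method: a lightweight honeypot layer replaces the original classification layer and is trained via bi-level optimization to produce poisonous probability vectors
Backdoor without explicit trigger injection: a universal adversarial perturbation (UAP) serves as the backdoor trigger, so no trigger has to be explicitly injected into the images
Dual function: the injected backdoor supports ownership verification and also disrupts the substitute model's functionality, producing a strong deterrent effect
Experimental validation: effectiveness is verified on four benchmark datasets, with attack success rates ranging from 56.99% to 92.35%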

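Method Details

Task Definition

Given a victim model F, the goal is to design a honeypot layer H such that:

The original performance is preserved on normal inputs
When an attacker trains a substitute model F̂ on the outputs of H, a backdoor is injected into F̂
The backdoor can be used for ownership verification and for reverse attacks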

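Model Architecture

Honeypot Layer Design

The honeypot layer H is defined as a fully connected layer:

H(x) = W · F_feat(x) + b

where F_feat(x) is the feature output of the victim model, and W and b are learnable parameters.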
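
As a rough illustration of the layer defined above, the following minimal sketch applies the affine map H(x) = W · F_feat(x) + b to a feature vector and turns the resulting logits into a probability vector. Plain Python lists stand in for tensors; the softmax step, the function names, and the toy numbers are assumptions made for illustration, not details taken from the paper.

```python
import math

def honeypot_logits(features, weight, bias):
    """Affine map W . f + b; `weight` holds one row per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax (assumed output activation)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 3-dimensional features, 2 classes (hypothetical values).
features = [1.0, 0.5, -1.0]
weight = [[0.2, 0.0, 0.1],
          [0.0, 0.4, 0.3]]
bias = [0.0, 0.1]

logits = honeypot_logits(features, weight, bias)
probs = softmax(logits)  # probability vector returned to the querying user
```

Only W and b would be trained in this picture; the feature extractor stays fixed, which is what makes the approach lightweight.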

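Bi-level Optimization Framework

The core optimization objective is:

argmin_{θ_H} E_{x∈Ds} [ L(H(x), F(x)) + L(H(x+δ), y_target) ]

subject to:

argmin_{θ_Fs} E_{x∈Ds} [ L(Fs(x), H(x)) ]
argmin_δ E_{x∈Dv} [ L(Fs(x+δ), y_target) ]

Here Ds is the shadow dataset, Dv is the verification dataset, Fs is the shadow model, δ is the trigger, and y_target is the target label.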
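
The upper-level objective combines two terms: one keeps the honeypot output close to the victim's output on clean inputs, and the other pushes the honeypot output on triggered inputs toward the target class. The sketch below evaluates both terms for a single sample. It assumes L is cross-entropy (these notes do not specify the loss), and all probability vectors are hypothetical.

```python
import math

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy between a predicted and a target distribution."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def honeypot_objective(h_clean, f_clean, h_triggered, y_target):
    """L(H(x), F(x)) + L(H(x + delta), y_target) for one sample."""
    fidelity = cross_entropy(h_clean, f_clean)
    backdoor = cross_entropy(h_triggered, one_hot(y_target, len(h_triggered)))
    return fidelity + backdoor

# Hypothetical outputs for a 3-class problem with target class 2.
f_clean = [0.7, 0.2, 0.1]        # victim output F(x)
h_clean = [0.6, 0.3, 0.1]        # honeypot output H(x)
h_trig_weak = [0.5, 0.3, 0.2]    # H(x + delta) with little mass on the target
h_trig_strong = [0.1, 0.1, 0.8]  # H(x + delta) with most mass on the target

loss_weak = honeypot_objective(h_clean, f_clean, h_trig_weak, 2)
loss_strong = honeypot_objective(h_clean, f_clean, h_trig_strong, 2)
```

The second call yields the smaller loss because the backdoor term shrinks as the triggered output concentrates on the target class, while the fidelity term is unchanged.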

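Three-Step Iterative Process

Extraction simulation: the shadow model Fs simulates the attacker's model extraction process
Trigger generation: the UAP trigger δ is generated through gradient-sign updates
Fine-tuning: the honeypot layer's parameters are updated to inject the backdoor while preserving normal functionality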
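
The three steps above alternate within each bi-level optimization iteration. The skeleton below sketches only that control flow: the three callables are hypothetical hooks supplied by the caller, and the stubs merely record the call order rather than train anything.

```python
def run_bilevel_optimization(num_iterations, simulate_extraction,
                             generate_trigger, finetune_honeypot):
    """Alternate the three steps for a fixed number of iterations."""
    shadow_state, trigger, honeypot_state = None, None, None
    for _ in range(num_iterations):
        # Step 1: the shadow model learns from the current honeypot outputs.
        shadow_state = simulate_extraction(shadow_state, honeypot_state)
        # Step 2: the trigger is updated against the current shadow model.
        trigger = generate_trigger(trigger, shadow_state)
        # Step 3: the honeypot layer is fine-tuned with the current trigger.
        honeypot_state = finetune_honeypot(honeypot_state, trigger)
    return shadow_state, trigger, honeypot_state

# Stub hooks (hypothetical) that only log which step ran.
calls = []

def fake_extract(shadow, honeypot):
    calls.append("extract")
    return "shadow"

def fake_trigger(trigger, shadow):
    calls.append("trigger")
    return "delta"

def fake_finetune(honeypot, trigger):
    calls.append("finetune")
    return "honeypot"

result = run_bilevel_optimization(2, fake_extract, fake_trigger, fake_finetune)
```

Passing the steps in as callables keeps the loop independent of any particular training framework.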

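Technical Innovations

Universal Adversarial Perturbation as the Trigger

Exploits the inherent adversarial vulnerability of deep learning models
The UAP acts as a poisoning-free trigger that does not need to be explicitly injected
The backdoor transfers through adversarial vulnerability shared between models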

Momentum-Based Trigger Update

δ_i = α·δ_{i-1} - (1-α)·ε·sign(E_{x∈Dv}[g(δ_{i-1})])
g(δ) = ∇_δ L(Fs(M⊙x + (1-M)⊙δ), y_target)
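
A single application of this update rule can be sketched on flat lists as follows. The per-sample gradients are supplied by hand here, since computing g(δ) would require a real shadow model. Reading α as a momentum coefficient and ε as a step size is an interpretation of the formula, and the numeric values are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def mean_gradient(per_sample_grads):
    """Element-wise mean over per-sample gradients (the expectation over Dv)."""
    n = len(per_sample_grads)
    return [sum(column) / n for column in zip(*per_sample_grads)]

def update_trigger(delta_prev, per_sample_grads, alpha, epsilon):
    """delta_i = alpha * delta_{i-1} - (1 - alpha) * epsilon * sign(mean grad)."""
    g = mean_gradient(per_sample_grads)
    return [alpha * d - (1 - alpha) * epsilon * sign(gi)
            for d, gi in zip(delta_prev, g)]

# Hypothetical trigger with three entries and gradients from two samples.
delta_prev = [0.0, 0.5, -0.5]
grads = [[0.2, -0.4, 0.0],
         [0.6, -0.2, 0.0]]

delta_next = update_trigger(delta_prev, grads, alpha=0.9, epsilon=0.1)
```

Because only the sign of the averaged gradient is used, each entry moves by at most (1-α)·ε per update regardless of the gradient's magnitude.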

Mask Constraint

A predefined mask M restricts where the trigger is placed, which makes it stealthier.
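
The masked composition M⊙x + (1-M)⊙δ keeps the image wherever M is 1 and shows the trigger wherever M is 0. The sketch below builds such a mask for a small single-channel image. Placing the patch in the bottom-right corner is an assumption for illustration, since these notes do not state where the trigger goes.

```python
def apply_trigger(image, delta, mask):
    """Return M * x + (1 - M) * delta for 2-D lists of equal shape."""
    return [[m * x + (1 - m) * d for x, d, m in zip(x_row, d_row, m_row)]
            for x_row, d_row, m_row in zip(image, delta, mask)]

def corner_mask(height, width, patch):
    """Mask that is 0 inside a patch-by-patch bottom-right square, 1 elsewhere."""
    return [[0 if (r >= height - patch and c >= width - patch) else 1
             for c in range(width)]
            for r in range(height)]

# Toy 4x4 image with a 2x2 trigger patch (hypothetical sizes and values).
image = [[0.5] * 4 for _ in range(4)]
delta = [[1.0] * 4 for _ in range(4)]
mask = corner_mask(4, 4, patch=2)

triggered = apply_trigger(image, delta, mask)
```

With this mask only the four bottom-right pixels take the trigger's values; every other pixel is left untouched.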

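Experimental Setup

Datasets

Victim model datasets: CIFAR10, CIFAR100, Caltech256, CUBS200
Attack dataset: ImageNet (1.2 million images)
Shadow dataset: CC3M (5,000 randomly selected images)
Verification dataset: a small task-related dataset

Evaluation Metrics

Clean Test Accuracy (Acc_c): accuracy of the substitute model on clean test samples
Verification Test Accuracy (Acc_v): rate at which the substitute model predicts the target label on triggered samples
Attack Success Rate (ASR): success rate of the defender's reverse attack on the substitute model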
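
The first two metrics reduce to simple counting over the substitute model's predictions, as the sketch below shows on hypothetical prediction lists. How the ASR protocol differs from Acc_v is not spelled out in these notes, so the sketch stops at Acc_c and Acc_v.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def target_hit_rate(triggered_predictions, target_label):
    """Fraction of triggered samples classified as the target label."""
    hits = sum(p == target_label for p in triggered_predictions)
    return hits / len(triggered_predictions)

# Hypothetical substitute-model predictions.
clean_preds = [0, 1, 2, 2, 1]
clean_labels = [0, 1, 2, 1, 1]
triggered_preds = [3, 3, 0, 3]   # predictions on inputs carrying the trigger
target_label = 3

acc_c = accuracy(clean_preds, clean_labels)              # 4 of 5 correct
acc_v = target_hit_rate(triggered_preds, target_label)   # 3 of 4 hit the target
```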

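Compared Methods

Extraction attacks: KnockoffNets, ActiveThief (Entropy and k-Center), SPSG, BlackBox Dissector
Baseline defenses: no defense, DVBW (a dataset ownership verification method)

Implementation Details

Bi-level optimization (BLO) iterations: 30 iterations, each consisting of 3 steps of 5 epochs each
Shadow model: ResNet18 (lightweight)
Trigger size: 6×6 for the CIFAR datasets, 28×28 for the other datasets
Optimizer: SGD with momentum 0.9; learning rate 0.1 (shadow model) / 0.02 (honeypot layer)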
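
For reference, the settings listed above can be collected in a small configuration object. The class, field, and method names below are hypothetical; only the values come from the list. The total-epoch figure assumes that every step runs its 5 epochs in every iteration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoneypotConfig:
    blo_iterations: int = 30
    steps_per_iteration: int = 3
    epochs_per_step: int = 5
    shadow_arch: str = "resnet18"
    trigger_size_cifar: int = 6      # side length of the 6x6 trigger
    trigger_size_other: int = 28     # side length of the 28x28 trigger
    optimizer: str = "sgd"
    momentum: float = 0.9
    lr_shadow: float = 0.1
    lr_honeypot: float = 0.02

    def total_epochs(self) -> int:
        """Epochs summed over all iterations and steps."""
        return (self.blo_iterations * self.steps_per_iteration
                * self.epochs_per_step)

    def trigger_size(self, dataset: str) -> int:
        """Pick the trigger side length by dataset family."""
        if dataset.lower().startswith("cifar"):
            return self.trigger_size_cifar
        return self.trigger_size_other

config = HoneypotConfig()
```

Under the stated assumption, `config.total_epochs()` gives 30 × 3 × 5 = 450 epochs in total.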

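Experimental Results

Main Results

Under a 30k query budget, HoneypotNet is effective against every extraction attack on all four datasets, with ASR between 56.99% and 92.35%: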

Extraction attack	CIFAR10 ASR	CIFAR100 ASR	CUBS200 ASR	Caltech256 ASR
KnockoffNets	59.35%	85.71%	78.31%	79.13%
ActiveThief (Entropy)	56.99%	74.35%	83.22%	77.43%
ActiveThief (k-Center)	67.49%	74.63%	80.27%	80.80%
SPSG	66.12%	77.11%	83.51%	77.88%
BlackBox Dissector	78.59%	80.05%	92.35%	78.98%
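
The ASR range quoted in these notes (56.99% to 92.35%) can be checked directly against the table. The short script below re-enters the table values as a dictionary and locates the lowest and highest cells; the variable names are illustrative.

```python
# ASR values (%) copied from the table above.
asr_table = {
    "KnockoffNets":           {"CIFAR10": 59.35, "CIFAR100": 85.71, "CUBS200": 78.31, "Caltech256": 79.13},
    "ActiveThief (Entropy)":  {"CIFAR10": 56.99, "CIFAR100": 74.35, "CUBS200": 83.22, "Caltech256": 77.43},
    "ActiveThief (k-Center)": {"CIFAR10": 67.49, "CIFAR100": 74.63, "CUBS200": 80.27, "Caltech256": 80.80},
    "SPSG":                   {"CIFAR10": 66.12, "CIFAR100": 77.11, "CUBS200": 83.51, "Caltech256": 77.88},
    "BlackBox Dissector":     {"CIFAR10": 78.59, "CIFAR100": 80.05, "CUBS200": 92.35, "Caltech256": 78.98},
}

# Flatten to (value, attack, dataset) triples so min/max also report the cell.
cells = [(value, attack, dataset)
         for attack, row in asr_table.items()
         for dataset, value in row.items()]

lowest = min(cells)
highest = max(cells)
```

The minimum falls on ActiveThief (Entropy) with CIFAR10 and the maximum on BlackBox Dissector with CUBS200, matching the stated range.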