HoneypotNet: Backdoor Attacks Against Model Extraction

Wang, Gu, Teng et al.

Model extraction attacks are one type of inference-time attack that approximates the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model's outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called "attack as defense," which modifies the model's output to be poisonous such that any malicious user who attempts to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.

Basic Information

Paper ID: 2501.01090
Title: HoneypotNet: Backdoor Attacks Against Model Extraction
Authors: Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
Classification: cs.CR (Cryptography and Security), cs.CV (Computer Vision)
Submission Date/Venue: Submitted to arXiv on January 2, 2025
Paper Link: https://arxiv.org/abs/2501.01090

Abstract

Model extraction attacks are a class of inference-time attacks that approximate the functionality and performance of a victim model: the attacker queries the black-box victim and trains a surrogate model on the returned predictions. Such attacks pose serious security threats to production models and MLaaS platforms and can cause substantial economic losses to model owners. This paper proposes a defensive paradigm termed "attack as defense," which makes the model's outputs poisonous so that any malicious user who trains a surrogate model on them ends up with a poisoned model. To this end, the authors propose HoneypotNet, a lightweight backdoor attack method that replaces the classification layer of the victim model with a honeypot layer. The honeypot layer is fine-tuned against a shadow model (which simulates the extraction process) via bilevel optimization, so that its outputs become poisonous while the original performance is retained.

Research Background and Motivation

Problem Definition

Model extraction attacks have become one of the primary threats facing Machine Learning as a Service (MLaaS) platforms. Attackers query black-box models via APIs and utilize returned predictions to train functionally similar surrogate models, thereby stealing the model's intellectual property.
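The query-then-train loop described above can be sketched in a few lines. This is a minimal illustration rather than any specific attack from the paper: `query_victim` stands in for a hypothetical MLaaS API call, and `toy_victim` and the input pool are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

Probs = List[float]


def collect_transfer_set(
    query_victim: Callable[[Sequence[float]], Probs],
    pool: Sequence[Sequence[float]],
    budget: int,
) -> List[Tuple[Sequence[float], Probs]]:
    """Query the black-box model on at most `budget` inputs and keep (x, prediction) pairs."""
    transfer_set = []
    for x in pool[:budget]:
        transfer_set.append((x, query_victim(x)))
    return transfer_set


def toy_victim(x: Sequence[float]) -> Probs:
    # Hypothetical 2-class victim: favors class 1 when the feature sum is positive.
    return [0.1, 0.9] if sum(x) > 0 else [0.9, 0.1]


pool = [[1.0, 2.0], [-3.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]
transfer_set = collect_transfer_set(toy_victim, pool, budget=3)
# An attacker would next train a surrogate on `transfer_set`,
# using the returned probability vectors as soft labels.
```

Because the surrogate learns only from what the API returns, whoever controls those returned predictions also controls what the surrogate learns, which is the lever the defense relies on.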

Problem Significance

Economic Loss: Model extraction attacks can result in substantial economic losses for model owners
Intellectual Property Protection: Deep learning models incur high training costs and require effective protection
Security Threats: Attackers can leverage extracted models to conduct further adversarial attacks

Limitations of Existing Methods

Existing defense methods fall into two categories:

Passive Defense: Detects malicious queries or performs post-hoc verification using watermarks, but relies on prior knowledge of the attack and has limited effectiveness
Active Defense: Prevents extraction through output perturbation or increased query costs, but incurs significant computational overhead and may be circumvented by advanced attacks

Research Motivation

Traditional defense methods suffer from an arms race problem. This paper proposes a novel "attack as defense" paradigm that proactively attacks surrogate models to compromise their functionality, creating a strong deterrent against attackers.

Core Contributions

Novel Defense Paradigm: First to propose the "attack as defense" paradigm, which proactively mounts a backdoor attack against surrogate models
HoneypotNet Method: Designs a lightweight honeypot layer that replaces the original classification layer and is trained through bilevel optimization to output poisonous probability vectors
Trigger-free Backdoor: Employs Universal Adversarial Perturbations (UAP) as backdoor triggers, eliminating the need to inject an explicit trigger into images
Dual Functionality: Injected backdoors enable both ownership verification and surrogate model functionality destruction, creating a strong deterrent effect
Experimental Validation: Validates method effectiveness on four benchmark datasets with attack success rates ranging from 56.99% to 92.35%

Methodology Details

Task Definition

Given a victim model F, the objective is to design a honeypot layer H such that:

Original performance is maintained on normal inputs
When attackers train surrogate model F̂ using H's outputs, F̂ becomes backdoored
The backdoor can be used for ownership verification and counter-attacks

Model Architecture

Honeypot Layer Design

The honeypot layer H is defined as a fully connected layer:

H(x) = W · F_feat(x) + b

where F_feat(x) is the feature output of the victim model, and W and b are learnable parameters.
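As a rough illustration of this definition, the sketch below computes H(x) for a toy feature vector. The softmax at the end is an assumption made here so that the output is a probability vector; the feature values, weights, and function names are invented for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def honeypot_forward(features: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Compute W · F_feat(x) + b, then apply softmax (the softmax is an assumption)."""
    logits = [
        sum(w_ij * f_j for w_ij, f_j in zip(row, features)) + b_i
        for row, b_i in zip(W, b)
    ]
    return softmax(logits)


# Toy setup: 3 features, 2 classes (hypothetical values).
features = [1.0, 0.0, -1.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0]
probs = honeypot_forward(features, W, b)
```

Only W and b are trained; the feature extractor stays frozen, which is what keeps the method lightweight.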

Bilevel Optimization Framework

The core optimization objective is:

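argmin_θH E_x∈Ds[L(H(x), F(x)) + L(H(x+δ), y_target)]

subject to the two lower-level problems:

θFs = argmin_θFs E_x∈Ds[L(Fs(x), H(x))]
δ = argmin_δ E_x∈Dv[L(Fs(x+δ), y_target)]

Here θH and θFs are the parameters of the honeypot layer H and the shadow model Fs, Ds and Dv denote the shadow and validation datasets, δ is the trigger, and y_target is the backdoor target label. The first term of the upper-level objective keeps H close to the victim F on clean inputs; the second pushes H to predict y_target on triggered inputs.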

Three-Step Iterative Process

Extraction Simulation: Uses shadow model Fs to simulate the attacker's model extraction process
Trigger Generation: Generates UAP triggers δ through gradient sign updates
Fine-tuning: Updates honeypot layer parameters to inject backdoors while maintaining normal functionality
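The ordering of the three steps can be sketched as a simple schedule. This is only a skeleton under stated assumptions: the three callbacks are hypothetical placeholders for the real training routines, and the epoch count follows the 5 epochs per step reported under Implementation Details.

```python
from typing import Callable, List


def run_bilevel_schedule(
    n_iterations: int,
    epochs_per_step: int,
    simulate_extraction: Callable[[int], None],
    generate_trigger: Callable[[int], None],
    finetune_honeypot: Callable[[int], None],
) -> None:
    """Run the three steps in order for each outer iteration."""
    for _ in range(n_iterations):
        simulate_extraction(epochs_per_step)  # shadow model Fs learns from H's outputs
        generate_trigger(epochs_per_step)     # update the UAP trigger δ against Fs
        finetune_honeypot(epochs_per_step)    # update H: match F on x, predict y_target on x+δ


# Record the call order with stub callbacks (placeholders, not real training code).
log: List[str] = []
run_bilevel_schedule(
    n_iterations=2,
    epochs_per_step=5,
    simulate_extraction=lambda e: log.append(f"extract:{e}"),
    generate_trigger=lambda e: log.append(f"trigger:{e}"),
    finetune_honeypot=lambda e: log.append(f"finetune:{e}"),
)
```

Alternating the steps lets each one see the latest state of the others: the trigger is always fitted to the most recent shadow model, and the honeypot layer is always fine-tuned against the most recent trigger.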

Technical Innovations

Universal Adversarial Perturbations as Triggers

Leverages the inherent adversarial vulnerability of deep learning models
The UAP acts as the trigger, so no explicitly triggered (poisoned) images need to be injected
The backdoor transfers to the surrogate through the adversarial vulnerability it shares with the shadow model

Momentum-based Trigger Update

δi = α·δi-1 - (1-α)·ε·sign(E_x∈Dv[g(δi-1)])
g(δ) = ∇δL(Fs(M⊙x + (1-M)⊙δ), y_target)

Mask Constraints

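A predefined binary mask M restricts the trigger to a fixed region, which improves stealthiness: in M⊙x + (1-M)⊙δ, pixels where M = 1 keep the original image and pixels where M = 0 are overwritten by δ.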
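The update rule and the mask composition can be checked on a tiny example. The sketch below assumes the batch-averaged gradient has already been computed; the gradient values, the 4-pixel "image", and the function names are hypothetical, and clipping to a valid pixel range is omitted.

```python
from typing import List


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def update_trigger(delta_prev: List[float], mean_grad: List[float],
                   alpha: float, eps: float) -> List[float]:
    """δ_i = α·δ_{i-1} - (1-α)·ε·sign(mean gradient), applied element-wise."""
    return [alpha * d - (1.0 - alpha) * eps * sign(g)
            for d, g in zip(delta_prev, mean_grad)]


def apply_trigger(x: List[float], delta: List[float], mask: List[int]) -> List[float]:
    """M⊙x + (1-M)⊙δ: mask 1 keeps the pixel, mask 0 takes the trigger value."""
    return [m * xi + (1 - m) * di for xi, di, m in zip(x, delta, mask)]


delta = [0.0, 0.0, 0.0, 0.0]
mean_grad = [0.3, -0.2, 0.0, 1.5]  # hypothetical gradient averaged over Dv
delta = update_trigger(delta, mean_grad, alpha=0.9, eps=1.0)

x = [0.5, 0.5, 0.5, 0.5]
mask = [1, 1, 0, 0]  # the trigger occupies the last two pixels
x_triggered = apply_trigger(x, delta, mask)
```

The step moves δ against the gradient sign, so the loss toward y_target decreases, while the α term carries over most of the previous trigger and smooths the updates.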

Experimental Setup

Datasets

Victim Model Datasets: CIFAR10, CIFAR100, Caltech256, CUBS200
Attack Dataset: ImageNet (1.2 million images)
Shadow Dataset: CC3M (5,000 randomly selected images)
Validation Dataset: Small-scale task-specific datasets

Evaluation Metrics

Clean Test Accuracy (Acc_c): Surrogate model accuracy on clean test samples
Verification Test Accuracy (Acc_v): Surrogate model accuracy in predicting target labels on triggered samples
Attack Success Rate (ASR): Success rate of the defender's counter-attack
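A minimal sketch of how such metrics could be computed from predicted labels follows. It assumes that both Acc_v and ASR count how often triggered inputs are classified as the target label and differ only in the sample set used, which the summary does not spell out; the function names and toy predictions are invented.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean samples whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def target_hit_rate(preds_on_triggered: Sequence[int], y_target: int) -> float:
    """Fraction of triggered samples that the surrogate classifies as y_target."""
    return sum(p == y_target for p in preds_on_triggered) / len(preds_on_triggered)


# Toy surrogate predictions (hypothetical).
acc_c = clean_accuracy(preds=[0, 1, 2, 2], labels=[0, 1, 2, 1])  # 3 of 4 correct
hit_rate = target_hit_rate([3, 3, 0, 3, 3], y_target=3)           # 4 of 5 hit the target
```

A high Acc_c together with a high target hit rate is the desired outcome: the surrogate looks normal on clean data but responds to the defender's trigger.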

Baseline Methods

Extraction Attacks: KnockoffNets, ActiveThief (Entropy & k-Center), SPSG, BlackBox Dissector
Baseline Defenses: No defense, DVBW (dataset ownership verification method)

Implementation Details

BLO Iterations: 30 bilevel optimization (BLO) iterations, each consisting of the 3 steps above, with 5 epochs per step
Shadow Model: ResNet18 (lightweight)
Trigger Size: 6×6 for CIFAR datasets, 28×28 for others
Optimizer: SGD with momentum 0.9, learning rate 0.1 (shadow model)/0.02 (honeypot layer)

Experimental Results

Main Results

Under a 30k query budget, HoneypotNet achieves the following ASR for each dataset and extraction attack:

Attack Method	CIFAR10 ASR	CIFAR100 ASR	CUBS200 ASR	Caltech256 ASR
KnockoffNets	59.35%	85.71%	78.31%	79.13%
ActiveThief (Entropy)	56.99%	74.35%	83.22%	77.43%
ActiveThief (k-Center)	67.49%	74.63%	80.27%	80.80%
SPSG	66.12%	77.11%	83.51%	77.88%
BlackBox Dissector	78.59%	80.05%	92.35%	78.98%
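For readers who want to check the summary statistics quoted elsewhere in this document, the table can be aggregated as below. The numbers are copied from the table; the variable names are arbitrary.

```python
# ASR (%) per attack, in column order: CIFAR10, CIFAR100, CUBS200, Caltech256.
datasets = ["CIFAR10", "CIFAR100", "CUBS200", "Caltech256"]
asr = {
    "KnockoffNets": [59.35, 85.71, 78.31, 79.13],
    "ActiveThief (Entropy)": [56.99, 74.35, 83.22, 77.43],
    "ActiveThief (k-Center)": [67.49, 74.63, 80.27, 80.80],
    "SPSG": [66.12, 77.11, 83.51, 77.88],
    "BlackBox Dissector": [78.59, 80.05, 92.35, 78.98],
}

all_values = [v for row in asr.values() for v in row]
lowest, highest = min(all_values), max(all_values)

# Mean ASR per dataset across the five attacks.
mean_per_dataset = {
    name: sum(row[i] for row in asr.values()) / len(asr)
    for i, name in enumerate(datasets)
}
```

The lowest and highest entries are 56.99% (ActiveThief Entropy on CIFAR10) and 92.35% (BlackBox Dissector on CUBS200), and CIFAR10 has the lowest mean ASR of the four datasets.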

Key Findings

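High Success Rate: ASR is at least 56.99% for every attack and dataset in the main results (the architecture ablation dips to 51.68% with a DenseNet121 surrogate on CIFAR10)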
Performance Preservation: Acc_c remains comparable to undefended cases, avoiding attacker suspicion
Strong Verification Capability: Acc_v significantly outperforms baseline methods, effectively supporting ownership verification
Hard Label Robustness: Maintains high effectiveness even under BlackBox Dissector's hard-label attacks

Ablation Studies

Trigger Size Impact

Experiments varying trigger size from 1×1 to 15×15 demonstrate:
Larger triggers yield higher ASR
Need to balance trigger size with stealthiness

Different Surrogate Model Architectures

Architecture	CIFAR10 ASR	CIFAR100 ASR	CUBS200 ASR	Caltech256 ASR
ResNet34	59.35%	85.71%	78.31%	79.13%
VGG16	97.16%	87.10%	89.82%	62.17%
DenseNet121	51.68%	53.72%	65.46%	58.00%

Defense Robustness Analysis

Backdoor Detection Evasion

Under Cognitive Distillation (CD) detection, the L1 norm distributions for clean and backdoored samples are highly similar, so the detector cannot readily separate them. This indicates that the UAP triggers are stealthy.

Neural Pruning Robustness

Testing against Reconstructive Neuron Pruning (RNP) defense demonstrates that ASR remains at high levels even after pruning, showing backdoor robustness.

Related Work

Model Extraction Attacks

Data Synthesis Methods: Use GANs or diffusion models to generate synthetic training data
Data Selection Methods: Select information-rich samples from pre-existing data pools, such as KnockoffNets and ActiveThief

Model Extraction Defense

Extraction Detection: Monitor user query behavior to detect malicious users
Proof of Work: Increase query costs
Model Watermarking: Embed verifiable features
Prediction Perturbation: Add perturbations to model predictions

Backdoor Attacks

Dirty Image Attacks: Inject trigger-stamped samples into the training data
Clean Image Attacks: Inject backdoors without modifying the training images

Conclusions and Discussion

Main Conclusions

Paradigm Effectiveness: The "attack as defense" paradigm provides new insights for model extraction defense
Technical Feasibility: HoneypotNet successfully implements lightweight backdoor injection
Practical Value: The method reaches ASR between 56.99% and 92.35% across the five extraction attacks tested, which suggests practical application potential

Limitations

Computational Overhead: Although relatively lightweight, still requires bilevel optimization
Trigger Visibility: Larger triggers may be discovered
Architecture Dependency: Effectiveness varies across different surrogate model architectures
Adaptive Countermeasures: More advanced backdoor defenses applied by the attacker may weaken the injected backdoor

Future Directions

Ensemble of Shadow Models: Use multiple shadow models to enhance robustness
Adaptive Triggers: Design more stealthy trigger generation methods
Extended Applications: Extend methods to other model types and tasks
Theoretical Analysis: Provide deeper theoretical guarantees

In-Depth Evaluation

Strengths

Strong Innovation: First to propose the "attack as defense" paradigm, which turns the defender into the attacker of the surrogate model
Advanced Techniques: Cleverly combines UAP and backdoor attacks, solving the technical challenge of trigger-free injection
Comprehensive Experiments: Thorough evaluation across multiple datasets and attack methods
High Practical Value: Lightweight method suitable for deployment in real systems
Dual Functionality: Simultaneously achieves ownership verification and functionality destruction with strong deterrent effect

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical guarantees for convergence and security
Defense Limitations: Robustness against certain advanced attack methods requires further verification
Ethical Considerations: Proactive attacks on surrogate models may raise ethical and legal concerns
Limited Scope: Primarily targets image classification tasks; applicability to other tasks remains unknown

Impact

Academic Contribution: Provides new research directions for model security defense
Practical Value: Offers practical defense tools for MLaaS platforms
Reproducibility: Detailed implementation details facilitate reproduction
Inspirational Value: May inspire more "attack as defense" type defense methods

Applicable Scenarios

MLaaS Platforms: Model protection for cloud machine learning services
Commercial Models: Intellectual property protection for high-value deep learning models
API Services: Online inference services requiring model theft prevention
Edge Deployment: Lightweight defense for resource-constrained environments

Overall Assessment: HoneypotNet is a novel contribution to model extraction defense. Its "attack as defense" approach opens a new research direction, the technical design is ingenious, and the evaluation covers four datasets and five extraction attacks. The theoretical analysis and some technical details leave room for improvement, but overall this is high-quality research with both academic and practical value.