HoneypotNet: Backdoor Attacks Against Model Extraction

Wang, Gu, Teng et al.
Model extraction attacks are a type of inference-time attack that approximates the functionality and performance of a black-box victim model by issuing a number of queries to the model and then using the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and can cause significant monetary losses for model owners. A body of work has proposed defenses against model extraction attacks, including active defense methods that modify the model's outputs or increase the query overhead to prevent extraction, and passive defense methods that detect malicious queries or use watermarks for post-hoc verification. In this work, we introduce a new defense paradigm called attack as defense, which modifies the model's output to be poisonous so that any malicious user who attempts to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization, making its output poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.

Basic Information

  • Paper ID: 2501.01090
  • Title: HoneypotNet: Backdoor Attacks Against Model Extraction
  • Authors: Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
  • Classification: cs.CR (Cryptography and Security), cs.CV (Computer Vision)
  • Submission Date/Venue: Submitted to arXiv on January 2, 2025
  • Paper Link: https://arxiv.org/abs/2501.01090

Abstract

Model extraction attacks are a class of inference-time attacks that approximate the functionality and performance of a victim model by training a surrogate model on predictions obtained by querying the black-box victim. Such attacks pose serious security threats to production models and MLaaS platforms and can cause substantial economic losses for model owners. This paper proposes a novel defensive paradigm termed "attack as defense," which makes the model's outputs poisonous so that any malicious user who trains a surrogate model on them ends up with a poisoned model. To this end, the authors propose HoneypotNet, a lightweight backdoor attack method that replaces the classification layer of the victim model with a honeypot layer. The honeypot layer is fine-tuned against a shadow model (which simulates the extraction process) via bilevel optimization, so that its outputs become poisonous while the original performance is preserved.

Research Background and Motivation

Problem Definition

Model extraction attacks have become one of the primary threats facing Machine Learning as a Service (MLaaS) platforms. Attackers query black-box models via APIs and utilize returned predictions to train functionally similar surrogate models, thereby stealing the model's intellectual property.
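
As a rough illustration of the extraction process described above, the following NumPy sketch trains a surrogate purely from a black-box model's returned probabilities. The linear softmax victim, the Gaussian queries, and the names make_victim and extract are assumptions made for this toy example; the attacks discussed in the paper target deep image classifiers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def make_victim(num_features, num_classes, rng):
    """Return a black-box query function; its weights stay hidden from the caller."""
    W = rng.normal(size=(num_features, num_classes))
    return lambda x: softmax(x @ W)

def extract(query_fn, queries, num_classes, epochs=300, lr=0.5):
    """Fit a linear softmax surrogate to the probabilities returned by query_fn."""
    W_hat = np.zeros((queries.shape[1], num_classes))
    soft_labels = query_fn(queries)  # the attacker's only access to the victim
    for _ in range(epochs):
        probs = softmax(queries @ W_hat)
        # Gradient of the soft-label cross-entropy with respect to W_hat.
        grad = queries.T @ (probs - soft_labels) / len(queries)
        W_hat -= lr * grad
    return lambda x: softmax(x @ W_hat)

rng = np.random.default_rng(0)
victim = make_victim(num_features=8, num_classes=3, rng=rng)
surrogate = extract(victim, rng.normal(size=(500, 8)), num_classes=3)

# Fraction of held-out inputs on which surrogate and victim predict the same class.
test_x = rng.normal(size=(200, 8))
agreement = np.mean(victim(test_x).argmax(axis=1) == surrogate(test_x).argmax(axis=1))
```

The surrogate never sees the victim's weights or training data, only query responses, which is why defenses that act on the returned outputs are a natural place to intervene.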

Problem Significance

  1. Economic Loss: Model extraction attacks can result in substantial economic losses for model owners
  2. Intellectual Property Protection: Deep learning models incur high training costs and require effective protection
  3. Security Threats: Attackers can leverage extracted models to conduct further adversarial attacks

Limitations of Existing Methods

Existing defense methods fall into two categories:

  1. Passive Defense: Detects malicious queries or performs post-hoc verification using watermarks, but relies on prior knowledge and has limited effectiveness
  2. Active Defense: Prevents extraction by perturbing outputs or increasing query costs, but incurs significant computational overhead and may be circumvented by advanced attacks

Research Motivation

Traditional defense methods suffer from an arms race problem. This paper proposes a novel "attack as defense" paradigm that proactively attacks surrogate models to compromise their functionality, creating a strong deterrent against attackers.

Core Contributions

  1. Novel Defense Paradigm: First to propose the "attack as defense" paradigm, which proactively mounts backdoor attacks against surrogate models
  2. HoneypotNet Method: Designs a lightweight honeypot layer that replaces the original classification layer and generates poisonous probability vectors through bilevel optimization
  3. Injection-free Trigger: Uses Universal Adversarial Perturbations (UAP) as backdoor triggers, so no explicit trigger needs to be injected into images
  4. Dual Functionality: Injected backdoors enable both ownership verification and disruption of the surrogate model's functionality, creating a strong deterrent effect
  5. Experimental Validation: Validates method effectiveness on four benchmark datasets with attack success rates ranging from 56.99% to 92.35%

Methodology Details

Task Definition

Given a victim model F, the objective is to design a honeypot layer H such that:

  • Original performance is maintained on normal inputs
  • When attackers train surrogate model F̂ using H's outputs, F̂ becomes backdoored
  • The backdoor can be used for ownership verification and counter-attacks

Model Architecture

Honeypot Layer Design

The honeypot layer H is defined as a fully connected layer:

H(x) = W · F_feat(x) + b

where F_feat(x) is the feature output of the victim model, and W and b are learnable parameters.
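
A minimal NumPy sketch of such a layer is shown here, assuming the victim's feature extractor is frozen and that logits are turned into a probability vector with a softmax (the softmax, the class name HoneypotLayer, the initialization, and the 512-dimensional features are illustrative assumptions, not details taken from the paper).

```python
import numpy as np

class HoneypotLayer:
    """Fully connected layer H(x) = W · F_feat(x) + b on top of victim features."""

    def __init__(self, feat_dim, num_classes, rng):
        # W and b are the only learnable parameters; the backbone is untouched.
        self.W = rng.normal(scale=0.01, size=(num_classes, feat_dim))
        self.b = np.zeros(num_classes)

    def logits(self, features):
        # features: (batch, feat_dim) -> logits: (batch, num_classes)
        return features @ self.W.T + self.b

    def probabilities(self, features):
        z = self.logits(features)
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
layer = HoneypotLayer(feat_dim=512, num_classes=10, rng=rng)
features = rng.normal(size=(4, 512))  # stand-in for F_feat(x)
probs = layer.probabilities(features)
```

Because only W and b change, fine-tuning touches a single layer, which is what keeps the approach lightweight.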

Bilevel Optimization Framework

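The upper-level objective fine-tunes the honeypot layer parameters θ_H so that H matches the victim F on clean inputs and maps triggered inputs to the target label:

\[
\arg\min_{\theta_H} \; \mathbb{E}_{x \in D_s} \big[ \mathcal{L}(H(x), F(x)) + \mathcal{L}(H(x + \delta), y_{\text{target}}) \big]
\]

subject to the two lower-level problems:

\[
\arg\min_{\theta_{F_s}} \; \mathbb{E}_{x \in D_s} \big[ \mathcal{L}(F_s(x), H(x)) \big]
\]
\[
\arg\min_{\delta} \; \mathbb{E}_{x \in D_v} \big[ \mathcal{L}(F_s(x + \delta), y_{\text{target}}) \big]
\]

Here F_s is the shadow model, D_s the shadow dataset, D_v the validation dataset, δ the trigger, and y_target the backdoor target label.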

Three-Step Iterative Process

  1. Extraction Simulation: Uses shadow model Fs to simulate the attacker's model extraction process
  2. Trigger Generation: Generates UAP triggers δ through gradient sign updates
  3. Fine-tuning: Updates honeypot layer parameters to inject backdoors while maintaining normal functionality
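
The alternation between the three steps can be sketched as a simple schedule. This is only a skeleton: the three callables are hypothetical placeholders for one epoch of each step, and the defaults (30 iterations, 5 epochs per step) follow the implementation details reported for the paper.

```python
from typing import Callable

def run_bilevel_optimization(
    simulate_extraction: Callable[[], None],
    update_trigger: Callable[[], None],
    finetune_honeypot: Callable[[], None],
    iterations: int = 30,
    epochs_per_step: int = 5,
) -> None:
    """Run the three steps in order, each for several epochs, per outer iteration."""
    for _ in range(iterations):
        for step in (simulate_extraction, update_trigger, finetune_honeypot):
            for _ in range(epochs_per_step):
                step()

# Usage with stand-in steps that only record the order in which they are called.
calls = []
run_bilevel_optimization(
    simulate_extraction=lambda: calls.append("extract"),
    update_trigger=lambda: calls.append("trigger"),
    finetune_honeypot=lambda: calls.append("finetune"),
)
```

In a real setup each callable would close over the shadow model, the trigger, and the honeypot layer, so that every step sees the latest state produced by the previous one.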

Technical Innovations

Universal Adversarial Perturbations as Triggers

  • Leverages inherent adversarial vulnerability of deep learning models
  • UAP serves as a poison-free trigger: nothing has to be explicitly injected into the images
  • Achieves backdoor transfer through shared adversarial vulnerability

Momentum-based Trigger Update

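\[
\delta_i = \alpha \cdot \delta_{i-1} - (1 - \alpha) \cdot \epsilon \cdot \operatorname{sign}\big( \mathbb{E}_{x \in D_v} [ g(\delta_{i-1}) ] \big)
\]
\[
g(\delta) = \nabla_{\delta} \mathcal{L}\big( F_s(M \odot x + (1 - M) \odot \delta), \; y_{\text{target}} \big)
\]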
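
The update rule can be written in a few lines of NumPy. In this sketch the per-sample gradients g(δ) are assumed to be computed elsewhere (for example by automatic differentiation through the shadow model), and the values of alpha and eps are arbitrary illustration values rather than the paper's settings.

```python
import numpy as np

def update_trigger(delta_prev, per_sample_grads, alpha, eps):
    """One momentum-style step: alpha*delta - (1 - alpha)*eps*sign(mean gradient)."""
    mean_grad = per_sample_grads.mean(axis=0)  # expectation over the batch from D_v
    return alpha * delta_prev - (1.0 - alpha) * eps * np.sign(mean_grad)

# Tiny 2x2 "trigger" and gradients from two samples (made-up numbers).
delta_prev = np.array([[1.0, -1.0], [0.0, 0.0]])
grads = np.stack([
    np.array([[1.0, -2.0], [1.0, -1.0]]),
    np.array([[3.0, -4.0], [0.0, 0.0]]),
])
delta_next = update_trigger(delta_prev, grads, alpha=0.5, eps=0.2)
# delta_next is [[0.4, -0.4], [-0.1, 0.1]]
```

Only the sign of the averaged gradient is used, so the step size is controlled entirely by eps, while alpha decides how much of the previous trigger is carried over.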

Mask Constraints

Uses predefined masks M to restrict trigger locations, enhancing stealthiness.
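
A small sketch of how a mask could be applied, following the expression M⊙x + (1-M)⊙δ used in the trigger update. The square patch shape, its bottom-right placement, and the helper names are assumptions for illustration; the 6×6 size on a 32×32 image matches the CIFAR setting reported in the implementation details.

```python
import numpy as np

def make_patch_mask(height, width, patch, top, left):
    """M is 0 inside the patch (trigger visible) and 1 elsewhere (image visible)."""
    mask = np.ones((height, width, 1))
    mask[top:top + patch, left:left + patch, :] = 0.0
    return mask

def apply_trigger(x, delta, mask):
    # M ⊙ x + (1 - M) ⊙ δ, broadcast over the channel axis
    return mask * x + (1.0 - mask) * delta

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a CIFAR-sized input
delta = rng.random((32, 32, 3))  # stand-in for the learned trigger
mask = make_patch_mask(32, 32, patch=6, top=26, left=26)
triggered = apply_trigger(image, delta, mask)
```

Outside the patch the image is left untouched, so only 36 of the 1,024 pixel positions change in this example.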

Experimental Setup

Datasets

  • Victim Model Datasets: CIFAR10, CIFAR100, Caltech256, CUBS200
  • Attack Dataset: ImageNet (1.2 million images)
  • Shadow Dataset: CC3M (5,000 randomly selected images)
  • Validation Dataset: Small-scale task-specific datasets

Evaluation Metrics

  1. Clean Test Accuracy (Acc_c): Surrogate model accuracy on clean test samples
  2. Verification Test Accuracy (Acc_v): Surrogate model accuracy in predicting target labels on triggered samples
  3. Attack Success Rate (ASR): Success rate of the defender's counter-attack
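
The metrics above reduce to two simple counts, sketched here in plain Python. Which sample sets Acc_v and ASR are computed on is not spelled out in this summary, so the helper below just measures the fraction of triggered inputs classified as the target label and leaves the choice of set to the caller; the function names are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels (Acc_c style)."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def target_hit_rate(triggered_predictions, target_label):
    """Fraction of triggered inputs classified as the target label."""
    hits = sum(p == target_label for p in triggered_predictions)
    return hits / len(triggered_predictions)

# Made-up surrogate predictions, for illustration only.
acc_c = accuracy([0, 1, 2, 2], [0, 1, 1, 2])             # 0.75
hit_rate = target_hit_rate([7, 7, 3, 7], target_label=7)  # 0.75
```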

Baseline Methods

  • Extraction Attacks: KnockoffNets, ActiveThief (Entropy & k-Center), SPSG, BlackBox Dissector
  • Baseline Defenses: No defense, DVBW (dataset ownership verification method)

Implementation Details

  • BLO Iterations: 30 bilevel optimization (BLO) iterations, each running the 3 steps for 5 epochs apiece
  • Shadow Model: ResNet18 (lightweight)
  • Trigger Size: 6×6 for CIFAR datasets, 28×28 for others
  • Optimizer: SGD with momentum 0.9, learning rate 0.1 (shadow model)/0.02 (honeypot layer)
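
For reference, the reported settings can be gathered into one configuration object. The dataclass, its field names, and the config_for helper are hypothetical conveniences; only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoneypotConfig:
    blo_iterations: int = 30
    steps_per_iteration: int = 3
    epochs_per_step: int = 5
    shadow_architecture: str = "ResNet18"
    optimizer: str = "SGD"
    momentum: float = 0.9
    shadow_lr: float = 0.1      # learning rate for the shadow model
    honeypot_lr: float = 0.02   # learning rate for the honeypot layer
    trigger_size: int = 28      # side length in pixels; 6 for CIFAR datasets

    @property
    def total_epochs(self) -> int:
        return self.blo_iterations * self.steps_per_iteration * self.epochs_per_step

def config_for(dataset: str) -> HoneypotConfig:
    """Pick the trigger size reported for the given victim dataset."""
    size = 6 if dataset in ("CIFAR10", "CIFAR100") else 28
    return HoneypotConfig(trigger_size=size)

cifar_cfg = config_for("CIFAR10")
cubs_cfg = config_for("CUBS200")
```

With these settings one full run covers 30 × 3 × 5 = 450 epochs across the three steps.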

Experimental Results

Main Results

Under a 30k query budget, HoneypotNet reaches the following ASR for each extraction attack and dataset:

Attack Method            CIFAR10 ASR   CIFAR100 ASR   CUBS200 ASR   Caltech256 ASR
KnockoffNets             59.35%        85.71%         78.31%        79.13%
ActiveThief (Entropy)    56.99%        74.35%         83.22%        77.43%
ActiveThief (k-Center)   67.49%        74.63%         80.27%        80.80%
SPSG                     66.12%        77.11%         83.51%        77.88%
BlackBox Dissector       78.59%        80.05%         92.35%        78.98%

Key Findings

  1. High Success Rate: ASR is at least 56.99% for every combination of extraction attack and dataset in the main results
  2. Performance Preservation: Acc_c remains comparable to undefended cases, avoiding attacker suspicion
  3. Strong Verification Capability: Acc_v significantly outperforms baseline methods, effectively supporting ownership verification
  4. Hard Label Robustness: Maintains high effectiveness even under BlackBox Dissector's hard-label attacks
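
One way to turn a high Acc_v into an ownership decision is a simple binomial test: if a suspect model were unrelated to the victim, triggered queries should hit the target label at roughly chance level. The sketch below is an assumed decision rule for illustration, not the verification procedure from the paper; the chance-level null hypothesis, the significance threshold, and the function names are all hypothetical.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P[X >= successes] for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def looks_extracted(target_hits, num_queries, num_classes, significance=0.01):
    """Flag a suspect model whose target-label hit count is far above chance."""
    chance = 1.0 / num_classes  # assumed hit rate of an unrelated model
    return binomial_tail(target_hits, num_queries, chance) < significance

# 100 triggered queries against a 10-class suspect model (made-up counts).
flag_high = looks_extracted(target_hits=60, num_queries=100, num_classes=10)
flag_low = looks_extracted(target_hits=12, num_queries=100, num_classes=10)
```

A hit count of 60 out of 100 is flagged, while 12 out of 100 is close to the 10% chance level and is not.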

Ablation Studies

Trigger Size Impact

Experiments varying the trigger size from 1×1 to 15×15 show:

  • Larger triggers yield higher ASR
  • Trigger size must be balanced against stealthiness

Different Surrogate Model Architectures

Architecture   CIFAR10 ASR   CIFAR100 ASR   CUBS200 ASR   Caltech256 ASR
ResNet34       59.35%        85.71%         78.31%        79.13%
VGG16          97.16%        87.10%         89.82%        62.17%
DenseNet121    51.68%        53.72%         65.46%        58.00%

Defense Robustness Analysis

Backdoor Detection Evasion

Under Cognitive Distillation (CD) backdoor detection, the L1 norm distributions of clean and backdoored samples are highly similar, indicating that UAP triggers are hard to detect.

Neural Pruning Robustness

Testing against Reconstructive Neuron Pruning (RNP) defense demonstrates that ASR remains at high levels even after pruning, showing backdoor robustness.

Related Work

Model Extraction Attacks

  1. Data Synthesis Methods: Use GANs or diffusion models to generate synthetic training data
  2. Data Selection Methods: Select information-rich samples from pre-existing data pools, such as KnockoffNets and ActiveThief

Model Extraction Defense

  1. Extraction Detection: Monitor user query behavior to detect malicious users
  2. Proof of Work: Increase query costs
  3. Model Watermarking: Embed verifiable features
  4. Prediction Perturbation: Add perturbations to model predictions

Backdoor Attacks

  1. Dirty Image Attacks: Inject triggered samples into training data
  2. Clean Image Attacks: Inject backdoors directly without modifying images

Conclusions and Discussion

Main Conclusions

  1. Paradigm Effectiveness: The "attack as defense" paradigm provides new insights for model extraction defense
  2. Technical Feasibility: HoneypotNet successfully implements lightweight backdoor injection
  3. Practical Value: The method performs excellently across various attack scenarios with practical application potential

Limitations

  1. Computational Overhead: Although relatively lightweight, still requires bilevel optimization
  2. Trigger Visibility: Larger triggers may be discovered
  3. Architecture Dependency: Effectiveness varies across different surrogate model architectures
  4. Adaptive Countermeasures: Attackers may apply more advanced backdoor defenses to detect or remove the injected backdoor

Future Directions

  1. Ensemble Shadow Models: Use multiple shadow models to enhance robustness
  2. Adaptive Triggers: Design more stealthy trigger generation methods
  3. Extended Applications: Extend methods to other model types and tasks
  4. Theoretical Analysis: Provide deeper theoretical guarantees

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose the "attack as defense" paradigm for model extraction defense
  2. Advanced Techniques: Combines UAP with backdoor attacks, so a surrogate model can be backdoored without injecting triggers into images
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets and attack methods
  4. High Practical Value: Lightweight method suitable for deployment in real systems
  5. Dual Functionality: Simultaneously achieves ownership verification and functionality destruction with strong deterrent effect

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical guarantees for convergence and security
  2. Defense Limitations: Robustness against certain advanced attack methods requires further verification
  3. Ethical Considerations: Proactive attacks on surrogate models may raise ethical and legal concerns
  4. Limited Scope: Primarily targets image classification tasks; applicability to other tasks remains unknown

Impact

  1. Academic Contribution: Provides new research directions for model security defense
  2. Practical Value: Offers practical defense tools for MLaaS platforms
  3. Reproducibility: The detailed implementation description facilitates reproduction
  4. Inspirational Value: May inspire more "attack as defense" type defense methods

Applicable Scenarios

  1. MLaaS Platforms: Model protection for cloud machine learning services
  2. Commercial Models: Intellectual property protection for high-value deep learning models
  3. API Services: Online inference services requiring model theft prevention
  4. Edge Deployment: Lightweight defense for resource-constrained environments

Overall Assessment: The HoneypotNet method proposed in this paper holds significant innovative value in the field of model extraction defense. The "attack as defense" approach opens new research directions for this field. The technical implementation is ingenious, experimental evaluation is comprehensive, and it possesses high academic and practical value. While there is room for improvement in theoretical analysis and certain technical details, overall this is a high-quality research work.