HoneypotNet: Backdoor Attacks Against Model Extraction
Wang, Gu, Teng et al.
Model extraction attacks are inference-time attacks that approximate the functionality and performance of a black-box victim model by issuing a number of queries to the model and then using the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to model owners. A body of work has proposed defenses against model extraction attacks, including active defense methods that modify the model's outputs or increase the query overhead to hinder extraction, and passive defense methods that detect malicious queries or use watermarks for post-hoc verification. In this work, we introduce a new defense paradigm called "attack as defense", which modifies the model's output to be poisonous so that any malicious user who trains a substitute model on that output is poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet, which replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization, making its output poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.
Model extraction attacks are a class of inference-time attacks that approximate the functionality and performance of a black-box victim model by querying it and training a surrogate model on the returned predictions. Such attacks pose serious security threats to production models and MLaaS platforms and can cause substantial economic losses to model owners. This paper proposes a new defensive paradigm termed "attack as defense," which makes the model's outputs poisonous so that any malicious user who trains a surrogate model on them ends up with a poisoned model. To this end, the authors propose HoneypotNet, a lightweight backdoor attack method that replaces the classification layer of the victim model with a honeypot layer. The honeypot layer is fine-tuned via bilevel optimization together with a shadow model, which simulates the model extraction process, so that its outputs become poisonous while the original performance is preserved.
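The bilevel structure described above can be written schematically as follows. This is an illustrative form under stated assumptions, not the paper's exact objective: the symbols F (original victim model), H (victim with honeypot layer, parameters \theta_H), S (shadow model, parameters \theta_S), \delta (trigger), y_t (backdoor target class), \ell (a generic loss), and the weight \lambda are all introduced here for exposition only.

```latex
\begin{aligned}
\min_{\theta_H}\quad
  & \mathbb{E}_{x}\Big[\ell\big(H(x;\theta_H),\, F(x)\big)\Big]
    + \lambda\, \mathbb{E}_{x}\Big[\ell\big(S(x+\delta;\,\theta_S^{*}(\theta_H)),\, y_t\big)\Big] \\
\text{s.t.}\quad
  & \theta_S^{*}(\theta_H) = \arg\min_{\theta_S}\
    \mathbb{E}_{x}\Big[\ell\big(S(x;\theta_S),\, H(x;\theta_H)\big)\Big]
\end{aligned}
```

In this reading, the inner problem simulates extraction (the shadow model is trained on the honeypot layer's outputs), the first outer term keeps the honeypot outputs close to the original predictions, and the second outer term pushes the extracted model to predict the target class whenever the trigger is present.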
Model extraction attacks have become one of the primary threats facing Machine Learning as a Service (MLaaS) platforms. Attackers query black-box models via APIs and utilize returned predictions to train functionally similar surrogate models, thereby stealing the model's intellectual property.
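The query-then-train loop described above can be illustrated with a minimal sketch. It assumes a toy linear softmax victim and a linear substitute; the names `query_victim`, `W_victim`, and `W_sub`, as well as the data sizes and learning rate, are hypothetical and chosen only for illustration. Real extraction attacks target deep networks behind an API.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical black-box victim: the attacker only sees returned probabilities.
W_victim = rng.normal(size=(10, 3))

def query_victim(x):
    return softmax(x @ W_victim)

# 1. The attacker sends queries and records the returned predictions.
x_query = rng.normal(size=(2000, 10))
y_soft = query_victim(x_query)

# 2. The attacker fits a substitute to those predictions
#    (gradient descent on soft-label cross-entropy).
W_sub = np.zeros((10, 3))
for _ in range(500):
    p = softmax(x_query @ W_sub)
    grad = x_query.T @ (p - y_soft) / len(x_query)
    W_sub -= 0.5 * grad

# 3. Label agreement on fresh inputs measures how well the victim was copied.
x_test = rng.normal(size=(1000, 10))
agreement = np.mean(
    query_victim(x_test).argmax(axis=1)
    == softmax(x_test @ W_sub).argmax(axis=1)
)
print(f"victim/substitute label agreement: {agreement:.2%}")
```

The substitute never sees the victim's weights or training data; everything it learns comes from `y_soft`, which is why a defense that controls those outputs can also control what the substitute learns.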
Existing defense methods fall into two categories:
Passive Defense: Detects malicious queries or performs post-hoc verification using watermarks, but relies on prior knowledge and has limited effectiveness.
Active Defense: Prevents extraction by perturbing outputs or increasing query costs, but incurs significant computational overhead and may be circumvented by advanced attacks.
Traditional defense methods are caught in an arms race with attackers. This paper proposes a novel "attack as defense" paradigm that proactively attacks surrogate models to compromise their functionality, creating a strong deterrent against attackers.
Novel Defense Paradigm: The first work to propose the "attack as defense" paradigm, which proactively mounts backdoor attacks against surrogate models
HoneypotNet Method: Designs a lightweight honeypot layer that replaces the original classification layer and generates poisonous probability vectors through bilevel optimization
Trigger-free Backdoor: Uses Universal Adversarial Perturbations (UAP) as backdoor triggers, eliminating the need to inject an explicit trigger into images
Dual Functionality: The injected backdoor enables both ownership verification and disruption of the surrogate model's functionality, creating a strong deterrent effect
Experimental Validation: Validates method effectiveness on four benchmark datasets with attack success rates ranging from 56.99% to 92.35%
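How a UAP-style trigger and the attack success rate fit together can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names `apply_trigger` and `attack_success_rate`, the L-infinity budget `eps`, and the toy classifier `toy_predict` are all assumptions made for this example.

```python
import numpy as np

def apply_trigger(x, delta, eps=8 / 255):
    """Add one shared perturbation to every image, bounded in L-infinity norm."""
    delta = np.clip(delta, -eps, eps)    # keep the perturbation small
    return np.clip(x + delta, 0.0, 1.0)  # stay in the valid pixel range

def attack_success_rate(predict, x, y, delta, target):
    """Fraction of non-target-class inputs predicted as `target` once triggered."""
    keep = y != target                   # skip inputs already labelled `target`
    preds = predict(apply_trigger(x[keep], delta))
    return float(np.mean(preds == target))

# Toy stand-in for a backdoored substitute model (hypothetical):
# it outputs class 1 when the mean pixel value exceeds 0.415.
def toy_predict(x):
    return (x.reshape(len(x), -1).mean(axis=1) > 0.415).astype(int)

rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.5, size=(200, 32, 32))  # mean pixel value near 0.4
y = np.zeros(200, dtype=int)                   # all inputs belong to class 0

delta = np.full((32, 32), 8 / 255)             # one trigger shared by all images
asr = attack_success_rate(toy_predict, x, y, delta, target=1)
clean_rate = attack_success_rate(toy_predict, x, y, np.zeros((32, 32)), target=1)
print(f"ASR with trigger: {asr:.2%}, target rate without trigger: {clean_rate:.2%}")
```

The same perturbation is added to every input, which is what makes it universal. The gap between `asr` and `clean_rate` is the kind of signal a model owner could use for ownership verification: a model that flips to the target class only when the trigger is present has likely been trained on the poisoned outputs.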
Under Cognitive Distillation (CD) detection, the L1 norm distributions of clean and backdoored samples are highly similar, indicating that the UAP triggers are stealthy.
Tests against the Reconstructive Neuron Pruning (RNP) defense show that the attack success rate (ASR) remains high even after pruning, indicating that the backdoor is robust.
The paper cites important works in machine learning security, model extraction attacks and defenses, and backdoor attacks, providing a solid theoretical foundation for the research.
Overall Assessment: The HoneypotNet method proposed in this paper holds significant innovative value in the field of model extraction defense. The "attack as defense" approach opens new research directions for this field. The technical implementation is ingenious, experimental evaluation is comprehensive, and it possesses high academic and practical value. While there is room for improvement in theoretical analysis and certain technical details, overall this is a high-quality research work.