
Provable Watermarking for Data Poisoning Attacks

Zhu, Yu, Gao

In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: post-poisoning watermarking and poisoning-concurrent watermarking. Our analyses demonstrate that when the watermarking length is $\Theta(\sqrt{d}/\epsilon_w)$ for post-poisoning watermarking, and falls within the range of $\Theta(1/\epsilon_w^2)$ to $O(\sqrt{d}/\epsilon_p)$ for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.


Basic Information

Paper ID: 2510.09210
Title: Provable Watermarking for Data Poisoning Attacks
Authors: Yifan Zhu, Lijia Yu, Xiao-Shan Gao
Categories: cs.CR (Cryptography and Security), cs.LG (Machine Learning)
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2510.09210

Abstract

In recent years, data poisoning attacks have increasingly been designed in seemingly benign or even beneficial forms, commonly used for dataset ownership verification or protecting private data from unauthorized use. However, these developments may lead to misunderstandings and conflicts, as data poisoning has traditionally been viewed as a security threat to machine learning systems. To address this issue, benign poisoning generators must declare ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. This paper proposes deploying watermarking schemes as a solution to this challenge, introducing two provably secure and practical data poisoning watermarking methods: post-poisoning watermarking and poisoning-concurrent watermarking. Analysis demonstrates that when watermark length is Θ(√d/ε_w) (post-poisoning watermarking) and ranges from Θ(1/ε_w²) to O(√d/ε_p) (poisoning-concurrent watermarking), watermarked poisoned datasets provably ensure watermark detectability and poisoning utility.

Research Background and Motivation

Problem Definition

Paradigm Shift: Data poisoning attacks are transitioning from traditional malicious threats to "benign" applications, such as dataset ownership verification and preventing unauthorized use
Transparency Issues: When poisoning is used for protective purposes, authorized users may inadvertently use poisoned data, leading to misunderstandings and conflicts
Lack of Accountability: Existing detection methods lack a unified framework and provable declaration mechanisms

Significance

As large-scale model training increasingly relies on web-scraped or synthetic data, the impact of data poisoning becomes more pronounced
Artists and data creators need to protect their intellectual property from unauthorized use by generative AI
A balance must be established between data protection and transparency

Limitations of Existing Methods

Detection methods vary by attack type, making unification difficult
Existing detectors rely on heuristic training algorithms and lack provable guarantees
They cannot provide clear, verifiable declarations for poisoned datasets

Core Contributions

First Framework for Data Poisoning Watermarking: Applies watermarking techniques to data poisoning scenarios, providing transparency and accountability
Two Watermarking Schemes:
- Post-poisoning watermarking: Third-party entities create watermarks for already-poisoned datasets
- Poisoning-concurrent watermarking: Poisoning generators simultaneously create watermarks and poisoning
Theoretical Guarantees: Provides rigorous theoretical analysis of watermark detectability and poisoning utility
Practical Validation: Verifies theoretical findings across multiple attacks, models, and datasets

Methodology Details

Task Definition

Input: Original dataset D, poisoning budget ε_p, watermarking budget ε_w
Output: Watermarked poisoned dataset, detection key ζ
Constraints: Maintain poisoning utility while ensuring watermark detectability

Model Architecture

1. Post-Poisoning Watermarking

Original data x → Poisoning δ_p → Poisoned data x' → Watermarking δ_w → Final data x' + δ_w

Third-party entities add watermarks to already-poisoned data
Total perturbation budget: ε_p + ε_w
Watermark length requirement: Θ(√d/ε_w)

2. Poisoning-Concurrent Watermarking

Original data x → Simultaneous poisoning and watermarking → Final data x + δ_p + δ_w

Poisoning generators simultaneously control poisoning and watermarking
Dimension separation: Watermark dimensions W ⊆ {1, ..., d}, poisoning dimensions P = {1, ..., d} \ W
Total perturbation budget: max{ε_p, ε_w}
Watermark length requirement: Θ(1/ε_w²) to O(√d/ε_p)

3. Detection Mechanism

Key: d-dimensional vector ζ
Detection: Compute inner product ζᵀx, compare with threshold
Decision: A sample is flagged as watermarked when its score exceeds the threshold, i.e. ζᵀ(watermarked poisoned data) > threshold > ζᵀ(clean data)
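
The effect of the watermark on the detection score can be illustrated with a small NumPy simulation. This is a minimal sketch, not the authors' implementation: it assumes pixel values in [0, 1], a random ±1 key, and a watermark equal to ε_w·sign(ζ) on the first q coordinates. The names `make_key`, `embed`, and `score` are hypothetical.

```python
import numpy as np

def make_key(d, rng):
    # Assumed key form: independent random signs in {-1, +1}
    return rng.choice([-1.0, 1.0], size=d)

def embed(x, key, watermark_dims, eps_w):
    # Shift the chosen coordinates by eps_w in the direction of the key
    out = x.copy()
    out[watermark_dims] += eps_w * np.sign(key[watermark_dims])
    return out

def score(x, key):
    # Detection statistic: inner product of key and sample
    return float(key @ x)

rng = np.random.default_rng(0)
d, q, eps_w = 3072, 1000, 16 / 255
key = make_key(d, rng)
dims = np.arange(q)
x = rng.random(d)  # stand-in for a (poisoned) sample
shift = score(embed(x, key, dims, eps_w), key) - score(x, key)
# shift equals eps_w * q up to floating-point error
```

Under these assumptions the watermark raises the score by exactly ε_w·q, while the score of an unwatermarked sample under a random key has standard deviation at most √d. This is one way to read why longer watermarks separate the two score populations more cleanly.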

Technical Innovations

1. Theoretical Framework Innovation

Sample-level Analysis: Independent watermarking and keys for each data point
Universal Version: Single key applicable to all samples
Distribution Generalization: Extension from finite samples to overall distribution

2. Mathematical Guarantees

Using McDiarmid's inequality and VC dimension theory, proving:

Detectability: High-probability distinction between poisoned and normal data
Utility Preservation: Controllable impact of watermarking on poisoning effects
Generalization Performance: Extension of finite sample results to distributions
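
To see where a requirement of order √d/ε_w can come from, here is a sketch of a bounded-differences argument. It rests on assumptions not stated in this summary: samples lie in [0,1]^d, the key ζ is uniform on {−1,+1}^d, and the watermark is δ₁ = ε_w·sign(ζ) on q coordinates, so that ζᵀδ₁ = ε_w·q. The paper's own proof may proceed differently.

```latex
% McDiarmid: if changing coordinate i of \zeta changes f by at most c_i, then
\Pr\big[f(\zeta) - \mathbb{E} f(\zeta) \ge t\big]
  \le \exp\!\Big(-\frac{2t^2}{\sum_{i=1}^{d} c_i^2}\Big).
% For f(\zeta) = \zeta^\top x with x \in [0,1]^d: c_i = 2|x_i| \le 2 and \mathbb{E} f = 0, so
\Pr\big[\zeta^\top x \ge t\big] \le \exp\!\Big(-\frac{t^2}{2d}\Big), \qquad
\Pr\big[\zeta^\top x \le -t\big] \le \exp\!\Big(-\frac{t^2}{2d}\Big).
% Take t = \epsilon_w q / 2. Each tail is at most \omega once
\exp\!\Big(-\frac{\epsilon_w^2 q^2}{8d}\Big) \le \omega
  \iff q \ge \frac{2}{\epsilon_w}\sqrt{2d\log(1/\omega)},
% and a union bound over the two tails gives, with probability at least 1 - 2\omega,
\zeta^\top(x_1 + \delta_1) > \frac{\epsilon_w q}{2} > \zeta^\top x_2 .
```

Under these assumptions the sketch reproduces the form of the condition quoted under Theoretical Guarantees, with ε_w·q/2 playing the role of the detection threshold.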

3. Dimension Separation Strategy

Poisoning-concurrent watermarking avoids interference through dimension separation:

Watermarking uses dimensions W = {d₁, d₂, ..., d_q}
Poisoning uses the remaining dimensions P = {1, ..., d} \ W
Reduces mutual interference and improves performance
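
The separation of coordinates can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: the poison is a placeholder random perturbation clipped to ε_p (a real attack would optimize it on P), the key is a random ±1 vector, and `concurrent_watermark` is a hypothetical name.

```python
import numpy as np

def concurrent_watermark(x, poison, key, watermark_dims, eps_p, eps_w):
    # Boolean mask: True on watermark coordinates W, False on poisoning coordinates P
    is_wm = np.zeros(x.shape[0], dtype=bool)
    is_wm[watermark_dims] = True
    delta = np.zeros_like(x)
    # Poison only on P, clipped to the poisoning budget
    delta[~is_wm] = np.clip(poison[~is_wm], -eps_p, eps_p)
    # Watermark only on W, aligned with the key's signs
    delta[is_wm] = eps_w * np.sign(key[is_wm])
    return x + delta, delta

rng = np.random.default_rng(1)
d, q, eps_p, eps_w = 3072, 500, 8 / 255, 16 / 255
x = rng.random(d)
poison = rng.uniform(-1, 1, size=d)  # placeholder for an attack's perturbation
key = rng.choice([-1.0, 1.0], size=d)
dims = rng.choice(d, size=q, replace=False)
x_out, delta = concurrent_watermark(x, poison, key, dims, eps_p, eps_w)
linf = float(np.abs(delta).max())
```

Because each coordinate carries either the poison or the watermark but never both, the ℓ∞ norm of the combined perturbation is bounded by max{ε_p, ε_w} rather than by the sum.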

Experimental Setup

Datasets

CIFAR-10/CIFAR-100: Classic image classification datasets
Tiny-ImageNet: Small-scale ImageNet
SST-2: Text sentiment analysis dataset

Attack Methods

Backdoor Attacks

Narcissus: Clean-label backdoor attack
AdvSc: Adversarial backdoor attack

Availability Attacks

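UE (Unlearnable Examples): Imperceptible perturbations that prevent models from learning useful features from the data
AP (Adversarial Poisoning): Adversarial examples used as poisoned training data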

Model Architectures

ResNet-18/50, VGG-19, DenseNet121
WRN34-10, MobileNet v2, ViT-B
BERT-base (text tasks)

Evaluation Metrics

Accuracy (Acc): Model performance on test set
Attack Success Rate (ASR): Effectiveness of backdoor attacks
AUROC: Watermark detection performance
Computational Overhead: Time cost analysis
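
For reference, AUROC can be computed directly from detection scores as the probability that a randomly chosen watermarked sample scores higher than a randomly chosen clean one. The helper below is a minimal sketch with a hypothetical name (`auroc`); it uses the pairwise definition, which is quadratic in the number of samples and intended only for illustration.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    # Probability that a random positive outscores a random negative (ties count 1/2)
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))
```

Here the positives would be the scores ζᵀx of watermarked samples and the negatives the scores of clean samples. A value of 1.0 means some threshold separates the two groups perfectly.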

Implementation Details

Watermarking/poisoning budget: 4/255 to 32/255
Watermark length: 100 to 3000
Training: 200 epochs, cosine learning rate scheduling
Optimizer: SGD, momentum 0.9, weight decay 10⁻⁴
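
The cosine learning rate schedule over 200 epochs can be sketched in plain Python. The initial learning rate of 0.1 and the final rate of 0 are assumptions for illustration, since the summary does not state them, and `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch, total_epochs=200, base_lr=0.1, min_lr=0.0):
    # Cosine annealing from base_lr at epoch 0 down to min_lr at total_epochs
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate used at the start of each of the 200 training epochs
schedule = [cosine_lr(e) for e in range(200)]
```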

Experimental Results

Main Results

1. Watermark Detection Performance

Detection AUROC by watermark length:

Watermark Length	Narcissus (Post)	Narcissus (Concurrent)	AdvSc (Post)	AdvSc (Concurrent)
500	0.9509	0.9968	0.9218	0.9986
1000	0.9974	0.9992	0.9809	0.9995
2000	1.0000	1.0000	0.9994	1.0000

2. Poisoning Utility Preservation

Post-poisoning watermarking: Maintains good attack performance across all watermark lengths
Poisoning-concurrent watermarking: Significant attack performance degradation with excessive watermark length

3. Theoretical Verification

Experimental results validate theoretical predictions:

Poisoning-concurrent watermarking requires shorter watermark lengths for equivalent detection performance
Post-poisoning watermarking has minimal impact on poisoning utility
Positive correlation between watermark length and detection performance

Ablation Studies

1. Watermarking Budget Impact

With increasing ε_w:

Detection performance (AUROC) improves
Poisoning effectiveness decreases
Validates theoretical trade-off relationships

2. Watermark Position Analysis

Testing different image regions (top-left, bottom-left, top-right, bottom-right):

Minimal position impact on performance
Validates position-independence in theory
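
One way to realize such a position test is to pick the watermark coordinates from a square patch in a chosen corner. The helper below is a hypothetical sketch: it assumes a channel-first C × H × W layout flattened to a vector and a square patch, neither of which is specified in this summary.

```python
import numpy as np

def corner_patch_indices(height, width, channels, patch, corner):
    # Flat indices (channel-first layout C x H x W) of a patch x patch square in one corner
    rows = np.arange(patch) if corner.startswith("top") else np.arange(height - patch, height)
    cols = np.arange(patch) if corner.endswith("left") else np.arange(width - patch, width)
    c, r, w = np.meshgrid(np.arange(channels), rows, cols, indexing="ij")
    return np.ravel_multi_index((c.ravel(), r.ravel(), w.ravel()), (channels, height, width))

# A 10 x 10 patch across 3 channels gives a watermark of length q = 300
dims = corner_patch_indices(32, 32, 3, 10, "top-left")
```

The `corner` argument accepts "top-left", "bottom-left", "top-right", or "bottom-right", and the returned indices can serve as the watermark dimensions W.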

3. Model Transferability

Demonstrates good transferability across architectures:

High AUROC scores (>0.95)
Stable cross-architecture detection

Robustness Analysis

1. Data Augmentation Resistance

Testing Random Flip, Cutout, Color Jitter, etc.:

AUROC remains at 1.0000 under these augmentations
Demonstrates strong robustness

2. Defense Methods

Differential Privacy: Severe noise causes training failure
Diffusion Purification: Simultaneously corrupts watermarking and poisoning
Adversarial Denoising: Affects poisoning utility

Related Work

Data Poisoning Research

Backdoor Attacks: BadNets, Narcissus, etc.
Availability Attacks: Unlearnable samples, adversarial poisoning
Defense Methods: Detection algorithms, data purification

Watermarking Techniques

Model Watermarking: Neural network copyright protection
Data Watermarking: Dataset ownership verification
Text Watermarking: Generated content detection in large language models

Technical Distinctions

This paper is the first to systematically apply watermarking techniques to data poisoning scenarios, providing theoretical guarantees and practical solutions.

Conclusions and Discussion

Main Conclusions

Theoretical Contribution: Establishes theoretical framework for data poisoning watermarking
Practical Solutions: Provides two deployable watermarking methods
Performance Verification: Experiments confirm theoretical predictions
Application Value: Provides transparency and accountability for "benign" poisoning

Limitations

Unknown Necessary Conditions: Only provides sufficient conditions; necessary conditions require further research
Defense Vulnerability: Performance degradation against strong defense methods
Computational Overhead: Poisoning-concurrent watermarking requires additional computation time
Limited Scope: Primarily targets imperceptible poisoning attacks

Future Directions

Enhanced Robustness: Design watermarking schemes resistant to defenses
Necessary Conditions: Explore necessary conditions for watermark detectability
Efficiency Optimization: Reduce computational and storage overhead
Application Extension: Extend to more poisoning types and domains

In-Depth Evaluation

Strengths

Problem Importance: Addresses practical transparency needs in data poisoning
Theoretical Rigor: Provides comprehensive mathematical analysis and proofs
Methodological Innovation: First systematic combination of watermarking and poisoning techniques
Comprehensive Experiments: Full validation across multiple datasets, models, and attacks
Practical Value: Provides deployable solutions

Weaknesses

Insufficient Defense Consideration: Limited robustness against strong defense methods
Theoretical Completeness: Lacks necessary condition analysis
Limited Applicability: Primarily applicable to imperceptible attacks
Computational Efficiency: High overhead in certain scenarios

Impact

Academic Contribution: Pioneering combination of two important security domains
Practical Value: Provides new tools for AI safety and data protection
Theoretical Significance: Establishes new theoretical analysis framework
Industrial Application: Applicable to dataset copyright protection scenarios

Application Scenarios

Dataset Release: Copyright protection for open-source datasets
Artwork Protection: Preventing unauthorized use by generative AI
Enterprise Data Sharing: Internal data usage tracking
Academic Research: Source verification for research data

Technical Implementation Details

Algorithm Procedures

Post-Poisoning Watermarking Algorithm

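import numpy as np

def post_poisoning_watermark(poisoned_data, key, watermark_dims, budget):
    # Copy so the caller's poisoned data is left unchanged
    watermarked_data = np.array(poisoned_data, dtype=float)
    # Add a key-aligned perturbation of size `budget` on the watermark dimensions only
    watermarked_data[..., watermark_dims] += budget * np.sign(key[watermark_dims])
    return watermarked_data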

Detection Algorithm

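def detect_watermark(suspect_data, key, threshold):
    # Score one flattened sample (NumPy array) by its inner product with the key
    detection_value = float(key @ suspect_data)
    # 1 = flagged as watermarked, 0 = treated as clean
    return 1 if detection_value > threshold else 0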

Theoretical Guarantees

Based on McDiarmid's inequality, for post-poisoning watermarking:

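When the watermark length q satisfies q > (2/ε_w)√(2d log(1/ω)), a watermarked sample x₁ + δ₁ scores above a clean sample x₂ with high probability:
P(ζᵀ(x₁ + δ₁) > ζᵀx₂) > 1 - 2ω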
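
To get a feel for the scale of this condition, the right-hand side can be evaluated numerically. The values below are illustrative assumptions (d = 3072 for a flattened 32×32×3 image, ω = 0.05, budgets expressed on a [0, 1] pixel scale), and `min_watermark_length` is a hypothetical helper.

```python
import math

def min_watermark_length(d, eps_w, omega):
    # Right-hand side of the stated condition q > (2/eps_w) * sqrt(2 d log(1/omega))
    return (2.0 / eps_w) * math.sqrt(2.0 * d * math.log(1.0 / omega))

d = 3072  # e.g. a flattened 32x32x3 image (assumed)
for eps_w in (8 / 255, 16 / 255, 32 / 255):
    bound = min_watermark_length(d, eps_w, omega=0.05)
    print(f"eps_w={eps_w:.4f}: q > {bound:.0f} (feasible: {bound < d})")
```

With these illustrative values the right-hand side drops below d only for the largest budget. The condition is sufficient rather than necessary, so a bound above d certifies nothing but does not rule out detection at shorter lengths.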

Practical Deployment Considerations

Key Management: Support key rotation and HMAC authentication
Integrity Verification: SHA256 hashing ensures data integrity
Access Control: HTTPS-based secure key distribution
Scalability: Support large-scale dataset processing
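
The hashing and HMAC items can be sketched with the Python standard library. This is a minimal illustration, not a deployment recipe from the paper: the function names are hypothetical, and how the shared secret is provisioned or rotated is left out.

```python
import hashlib
import hmac

def dataset_digest(data_bytes):
    # SHA-256 fingerprint of the released dataset bytes
    return hashlib.sha256(data_bytes).hexdigest()

def sign_key(watermark_key_bytes, secret):
    # HMAC-SHA256 tag binding the detection key to a shared secret
    return hmac.new(secret, watermark_key_bytes, hashlib.sha256).hexdigest()

def verify_key(watermark_key_bytes, secret, tag):
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(sign_key(watermark_key_bytes, secret), tag)
```

A publisher could release the digest alongside the dataset and send the tag with the detection key, so that a recipient holding the secret can check that neither was altered in transit.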

Summary: This paper makes pioneering contributions at the intersection of data poisoning and watermarking techniques, providing not only rigorous theoretical analysis but also practical solutions. While there remains room for improvement in defense robustness and theoretical completeness, the problem it addresses has significant real-world importance, offering new research directions and tools for AI safety and data protection.