Provable Watermarking for Data Poisoning Attacks

Zhu, Yu, Gao
In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: post-poisoning watermarking and poisoning-concurrent watermarking. Our analyses demonstrate that when the watermarking length is Θ(√d/ε_w) for post-poisoning watermarking, and falls within the range of Θ(1/ε_w²) to O(√d/ε_p) for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.

Basic Information

  • Paper ID: 2510.09210
  • Title: Provable Watermarking for Data Poisoning Attacks
  • Authors: Yifan Zhu, Lijia Yu, Xiao-Shan Gao
  • Categories: cs.CR (Cryptography and Security), cs.LG (Machine Learning)
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.09210

Abstract

In recent years, data poisoning attacks have increasingly been designed in seemingly benign or even beneficial forms, commonly for dataset ownership verification or for protecting private data from unauthorized use. However, these developments may lead to misunderstandings and conflicts, as data poisoning has traditionally been viewed as a security threat to machine learning systems. To address this issue, benign poisoning generators must declare ownership of their generated datasets, enabling users to identify potential poisoning and prevent misuse. This paper proposes deploying watermarking schemes as a solution to this challenge, introducing two provable and practical watermarking methods for data poisoning: post-poisoning watermarking and poisoning-concurrent watermarking. The analysis shows that when the watermark length is Θ(√d/ε_w) for post-poisoning watermarking, and between Θ(1/ε_w²) and O(√d/ε_p) for poisoning-concurrent watermarking, the watermarked poisoned dataset provably retains both watermark detectability and poisoning utility.

Research Background and Motivation

Problem Definition

  1. Paradigm Shift: Data poisoning attacks are transitioning from traditional malicious threats to "benign" applications, such as dataset ownership verification and preventing unauthorized use
  2. Transparency Issues: When poisoning is used for protective purposes, authorized users may inadvertently use poisoned data, leading to misunderstandings and conflicts
  3. Lack of Accountability: Existing detection methods lack a unified framework and provable declaration mechanisms

Significance

  • As large-scale model training increasingly relies on web-scraped or synthetic data, the impact of data poisoning becomes more pronounced
  • Artists and data creators need to protect their intellectual property from unauthorized use by generative AI
  • A balance must be established between data protection and transparency

Limitations of Existing Methods

  • Detection methods differ by attack type, which makes a unified approach difficult
  • They rely on heuristic training algorithms and lack provable guarantees
  • They cannot provide clear, verifiable declarations for poisoned datasets

Core Contributions

  1. First Framework for Data Poisoning Watermarking: Applies watermarking techniques to data poisoning scenarios, providing transparency and accountability
  2. Two Watermarking Schemes:
    • Post-poisoning watermarking: Third-party entities create watermarks for already-poisoned datasets
    • Poisoning-concurrent watermarking: Poisoning generators simultaneously create watermarks and poisoning
  3. Theoretical Guarantees: Provides rigorous theoretical analysis of watermark detectability and poisoning utility
  4. Practical Validation: Verifies theoretical findings across multiple attacks, models, and datasets

Methodology Details

Task Definition

  • Input: Original dataset D, poisoning budget ε_p, watermarking budget ε_w
  • Output: Watermarked poisoned dataset, detection key ζ
  • Constraints: Maintain poisoning utility while ensuring watermark detectability

Model Architecture

1. Post-Poisoning Watermarking

Original data x → Poisoning δ_p → Poisoned data x' → Watermarking δ_w → Final data x' + δ_w
  • Third-party entities add watermarks to already-poisoned data
  • Total perturbation budget: ε_p + ε_w
  • Watermark length requirement: Θ(√d/ε_w)

2. Poisoning-Concurrent Watermarking

Original data x → Simultaneous poisoning and watermarking → Final data x + δ_p + δ_w
  • Poisoning generators simultaneously control poisoning and watermarking
  • Dimension separation: Watermark dimensions W, poisoning dimensions P = [d] \ W (all remaining dimensions)
  • Total perturbation budget: max{ε_p, ε_w}
  • Watermark length requirement: Θ(1/ε_w²) to O(√d/ε_p)
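
A minimal sketch of the dimension-separation idea, assuming flattened samples, an ℓ∞ budget, and a key with entries in {−1, +1}: the poison is kept only on P and the watermark is written only on W, so no coordinate is perturbed by more than max{ε_p, ε_w}. The name `concurrent_watermark` and its arguments are hypothetical, and masking a precomputed δ_p is a simplification; a real generator would presumably craft δ_p on P directly.

```python
def concurrent_watermark(x, delta_p, key, watermark_dims, eps_w):
    """Combine a poison on the complement of W with a watermark on W (illustrative)."""
    w = set(watermark_dims)
    out = []
    for i, (xi, pi) in enumerate(zip(x, delta_p)):
        if i in w:
            # Watermark dimension: drop the poison, add eps_w * sign(key_i).
            out.append(xi + (eps_w if key[i] > 0 else -eps_w))
        else:
            # Poisoning dimension: keep the poison untouched.
            out.append(xi + pi)
    return out

eps_p, eps_w = 8 / 255, 16 / 255
x = [0.5] * 8                            # toy flattened sample
delta_p = [eps_p, -eps_p] * 4            # toy poison within the eps_p budget
key = [1, -1, 1, 1, -1, -1, 1, -1]       # hypothetical +/-1 key
result = concurrent_watermark(x, delta_p, key, watermark_dims=[0, 1, 2], eps_w=eps_w)

# Largest per-coordinate change; bounded by max(eps_p, eps_w) by construction.
linf = max(abs(r - xi) for r, xi in zip(result, x))
```

Because the two perturbations never share a coordinate, the budgets do not add up as they do in the post-poisoning case.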

3. Detection Mechanism

  • Key: d-dimensional vector ζ
  • Detection: Compute inner product ζᵀx, compare with threshold
  • Decision: ζᵀ(watermarked data) > threshold > ζᵀ(clean data)
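
The effect of a watermark on the detection score can be checked with a small sketch. It assumes a key ζ with entries in {−1, +1} and a watermark that adds ε_w·sign(ζ_i) on each of the q watermark dimensions; under those assumptions the score ζᵀx shifts by exactly ε_w·q regardless of x, which gives some intuition for why longer watermarks are easier to detect. Function names and toy sizes are illustrative, not taken from the paper.

```python
import random

def score(key, x):
    """Inner-product detection statistic zeta^T x."""
    return sum(k * v for k, v in zip(key, x))

def add_watermark(x, key, watermark_dims, eps_w):
    """Add eps_w * sign(key_i) on each watermark dimension (assumed scheme)."""
    out = list(x)
    for i in watermark_dims:
        out[i] += eps_w if key[i] > 0 else -eps_w
    return out

random.seed(0)
d, q, eps_w = 64, 16, 8 / 255             # toy sizes, chosen for illustration
key = [random.choice((-1, 1)) for _ in range(d)]
x = [random.random() for _ in range(d)]   # stand-in for a flattened sample in [0, 1]
watermark_dims = list(range(q))           # hypothetical choice of W

# Score gap between the watermarked and the original sample.
shift = score(key, add_watermark(x, key, watermark_dims, eps_w)) - score(key, x)
```

The threshold then has to sit inside this gap, above the scores that unwatermarked data typically produce.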

Technical Innovations

1. Theoretical Framework Innovation

  • Sample-level Analysis: Independent watermarking and keys for each data point
  • Universal Version: Single key applicable to all samples
  • Distribution Generalization: Extension from finite samples to overall distribution

2. Mathematical Guarantees

Using McDiarmid's inequality and VC-dimension theory, the paper proves:

  • Detectability: High-probability distinction between poisoned and normal data
  • Utility Preservation: Controllable impact of watermarking on poisoning effects
  • Generalization Performance: Extension of finite sample results to distributions

3. Dimension Separation Strategy

Poisoning-concurrent watermarking avoids interference through dimension separation:

  • Watermarking uses dimensions W = {d₁, d₂, ..., d_q}
  • Poisoning uses the remaining dimensions P = [d] \ W
  • Reduces mutual interference and improves performance

Experimental Setup

Datasets

  • CIFAR-10/CIFAR-100: Classic image classification datasets
  • Tiny-ImageNet: Small-scale ImageNet
  • SST-2: Text sentiment analysis dataset

Attack Methods

Backdoor Attacks

  • Narcissus: Clean-label backdoor attack
  • AdvSc: Adversarial backdoor attack

Availability Attacks

  • UE (Unlearnable Examples): Error-minimizing perturbations that make samples unlearnable
  • AP (Adversarial Poisoning): Adversarial examples used as poisoned training data

Model Architectures

  • ResNet-18/50, VGG-19, DenseNet121
  • WRN34-10, MobileNet v2, ViT-B
  • BERT-base (text tasks)

Evaluation Metrics

  • Accuracy (Acc): Model performance on test set
  • Attack Success Rate (ASR): Effectiveness of backdoor attacks
  • AUROC: Watermark detection performance
  • Computational Overhead: Time cost analysis
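
For reference, AUROC over detection scores can be computed with the rank-based (Mann-Whitney) formulation: the fraction of (watermarked, clean) pairs in which the watermarked sample scores higher, counting ties as one half. The sketch below is a generic implementation, not the paper's evaluation code, and the score lists are made up.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

watermarked = [3.1, 2.7, 2.9, 1.8]   # hypothetical key^T x for watermarked samples
clean = [1.2, 2.0, 1.5, 0.9]         # hypothetical key^T x for clean samples
print(auroc(watermarked, clean))     # 15 of 16 pairs are ordered correctly: 0.9375
```

The pairwise loop is quadratic in the number of samples; for large evaluation sets a sort-based version would be the usual choice.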

Implementation Details

  • Watermarking/poisoning budget: 4/255 to 32/255
  • Watermark length: 100 to 3000
  • Training: 200 epochs, cosine learning rate scheduling
  • Optimizer: SGD, momentum 0.9, weight decay 10⁻⁴
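
As a rough illustration of the schedule, the standard cosine-annealing formula over 200 epochs can be written as below. The summary does not state the initial learning rate, the minimum learning rate, or whether the schedule is stepped per epoch or per iteration, so `base_lr=0.1`, `min_lr=0.0`, and per-epoch stepping are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs=200, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate at a given 0-indexed epoch (assumed defaults)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

# One learning rate per epoch: starts at base_lr and decays smoothly toward min_lr.
schedule = [cosine_lr(e) for e in range(200)]
```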

Experimental Results

Main Results

1. Watermark Detection Performance

Watermark Length   Narcissus (Post)   Narcissus (Concurrent)   AdvSc (Post)   AdvSc (Concurrent)
500                0.9509             0.9968                   0.9218         0.9986
1000               0.9974             0.9992                   0.9809         0.9995
2000               1.0000             1.0000                   0.9994         1.0000

2. Poisoning Utility Preservation

  • Post-poisoning watermarking: Maintains good attack performance across all watermark lengths
  • Poisoning-concurrent watermarking: Significant attack performance degradation with excessive watermark length

3. Theoretical Verification

Experimental results validate theoretical predictions:

  • Poisoning-concurrent watermarking requires shorter watermark lengths for equivalent detection performance
  • Post-poisoning watermarking has minimal impact on poisoning utility
  • Positive correlation between watermark length and detection performance

Ablation Studies

1. Watermarking Budget Impact

With increasing ε_w:

  • Detection performance (AUROC) improves
  • Poisoning effectiveness decreases
  • Validates theoretical trade-off relationships

2. Watermark Position Analysis

Testing different image regions (top-left, bottom-left, top-right, bottom-right):

  • Minimal position impact on performance
  • Validates position-independence in theory
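
One way to realise a "watermark position" is to take W as the flat indices of a square patch in one corner of the image. The sketch assumes a channel-last (H, W, C) layout flattened in row-major order and a square patch; both are assumptions for illustration, since the summary does not describe how the paper lays out the regions, and `corner_patch_dims` is a hypothetical helper.

```python
def corner_patch_dims(height, width, channels, patch, corner):
    """Flat indices of a patch x patch corner region in a row-major HWC image."""
    if corner not in ("top-left", "top-right", "bottom-left", "bottom-right"):
        raise ValueError(f"unknown corner: {corner}")
    rows = range(patch) if corner.startswith("top") else range(height - patch, height)
    cols = range(patch) if corner.endswith("left") else range(width - patch, width)
    return [
        (r * width + c) * channels + ch
        for r in rows
        for c in cols
        for ch in range(channels)
    ]

# A 10x10 patch on a 32x32 RGB image yields a watermark of length q = 300.
dims = corner_patch_dims(32, 32, 3, patch=10, corner="bottom-right")
```

Every corner gives an index set of the same size, so the watermark length q stays fixed while only the location changes.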

3. Model Transferability

Demonstrates good transferability across architectures:

  • High AUROC scores (>0.95)
  • Stable cross-architecture detection

Robustness Analysis

1. Data Augmentation Resistance

Testing Random Flip, Cutout, Color Jitter, etc.:

  • AUROC remains at 1.0000
  • Demonstrates strong robustness

2. Defense Methods

  • Differential Privacy: Severe noise causes training failure
  • Diffusion Purification: Simultaneously corrupts watermarking and poisoning
  • Adversarial Denoising: Affects poisoning utility

Related Work

Data Poisoning Research

  • Backdoor Attacks: BadNets, Narcissus, etc.
  • Availability Attacks: Unlearnable samples, adversarial poisoning
  • Defense Methods: Detection algorithms, data purification

Watermarking Techniques

  • Model Watermarking: Neural network copyright protection
  • Data Watermarking: Dataset ownership verification
  • Text Watermarking: Generated content detection in large language models

Technical Distinctions

This paper is the first to systematically apply watermarking techniques to data poisoning scenarios, providing theoretical guarantees and practical solutions.

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contribution: Establishes theoretical framework for data poisoning watermarking
  2. Practical Solutions: Provides two deployable watermarking methods
  3. Performance Verification: Experiments confirm theoretical predictions
  4. Application Value: Provides transparency and accountability for "benign" poisoning

Limitations

  1. Unknown Necessary Conditions: Only provides sufficient conditions; necessary conditions require further research
  2. Defense Vulnerability: Performance degradation against strong defense methods
  3. Computational Overhead: Poisoning-concurrent watermarking requires additional computation time
  4. Limited Scope: Primarily targets imperceptible poisoning attacks

Future Directions

  1. Enhanced Robustness: Design watermarking schemes resistant to defenses
  2. Necessary Conditions: Explore necessary conditions for watermark detectability
  3. Efficiency Optimization: Reduce computational and storage overhead
  4. Application Extension: Extend to more poisoning types and domains

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses practical transparency needs in data poisoning
  2. Theoretical Rigor: Provides comprehensive mathematical analysis and proofs
  3. Methodological Innovation: First systematic combination of watermarking and poisoning techniques
  4. Comprehensive Experiments: Full validation across multiple datasets, models, and attacks
  5. Practical Value: Provides deployable solutions

Weaknesses

  1. Insufficient Defense Consideration: Limited robustness against strong defense methods
  2. Theoretical Completeness: Lacks necessary condition analysis
  3. Limited Applicability: Primarily applicable to imperceptible attacks
  4. Computational Efficiency: High overhead in certain scenarios

Impact

  1. Academic Contribution: Pioneering combination of two important security domains
  2. Practical Value: Provides new tools for AI safety and data protection
  3. Theoretical Significance: Establishes new theoretical analysis framework
  4. Industrial Application: Applicable to dataset copyright protection scenarios

Application Scenarios

  1. Dataset Release: Copyright protection for open-source datasets
  2. Artwork Protection: Preventing unauthorized use by generative AI
  3. Enterprise Data Sharing: Internal data usage tracking
  4. Academic Research: Source verification for research data

Technical Implementation Details

Algorithm Procedures

Post-Poisoning Watermarking Algorithm

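import numpy as np

def post_poisoning_watermark(poisoned_data, key, watermark_dims, budget):
    # Copy the flattened sample, then shift only the q watermark dimensions
    # by +/- budget in the direction of the key's sign.
    watermarked_data = np.array(poisoned_data, dtype=float)
    watermarked_data[watermark_dims] += budget * np.sign(np.asarray(key)[watermark_dims])
    return watermarked_data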

Detection Algorithm

def detect_watermark(suspect_data, key, threshold):
    # Project the flattened sample onto the key; 1 = watermark detected, 0 = not.
    detection_value = key @ suspect_data
    return 1 if detection_value > threshold else 0

Theoretical Guarantees

Based on McDiarmid's inequality, for post-poisoning watermarking:

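  • Condition: the watermark length q satisfies q > (2/ε_w)√(2d log(1/ω))
  • Guarantee: P(ζᵀ(x₁ + δ₁) > ζᵀx₂) > 1 - 2ω, i.e. under the key ζ the watermarked sample x₁ + δ₁ scores higher than the unwatermarked sample x₂ with probability greater than 1 - 2ω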
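
The condition on q can be evaluated numerically to see how it scales. The sketch below computes the right-hand side exactly as stated above, assuming a natural logarithm; the absolute values depend on how the paper normalises ζ and x, which this summary does not specify, so only the scaling behaviour should be read from it. `min_watermark_length` is a hypothetical helper name, and d = 3072 corresponds to a flattened 32×32×3 image.

```python
import math

def min_watermark_length(d, eps_w, omega):
    """Right-hand side of q > (2 / eps_w) * sqrt(2 * d * log(1 / omega))."""
    if not (0 < omega < 1) or eps_w <= 0 or d <= 0:
        raise ValueError("need d > 0, eps_w > 0 and 0 < omega < 1")
    return (2.0 / eps_w) * math.sqrt(2.0 * d * math.log(1.0 / omega))

# Doubling the budget halves the required length; quadrupling d doubles it.
base = min_watermark_length(d=3072, eps_w=8 / 255, omega=0.05)
half = min_watermark_length(d=3072, eps_w=16 / 255, omega=0.05)
wide = min_watermark_length(d=4 * 3072, eps_w=8 / 255, omega=0.05)
```

This matches the Θ(√d/ε_w) rate quoted for post-poisoning watermarking: the requirement grows with the square root of the dimension and shrinks linearly in the watermarking budget.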

Practical Deployment Considerations

  1. Key Management: Support key rotation and HMAC authentication
  2. Integrity Verification: SHA256 hashing ensures data integrity
  3. Access Control: HTTPS-based secure key distribution
  4. Scalability: Support large-scale dataset processing

Summary: This paper makes pioneering contributions at the intersection of data poisoning and watermarking techniques, providing not only rigorous theoretical analysis but also practical solutions. While there remains room for improvement in defense robustness and theoretical completeness, the problem it addresses has significant real-world importance, offering new research directions and tools for AI safety and data protection.