DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation

Ahn, Park, Han

The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.

Basic Information

Paper ID: 2510.10987
Title: DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
Authors: Hyeseon Ahn, Shinwoo Park, Yo-Sub Han (Yonsei University)
Classification: cs.CR (Cryptography and Security), cs.AI (Artificial Intelligence)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10987
Code Link: https://github.com/hsannn/ditto.git

Abstract

Large language model (LLM) watermarking rests on a critical assumption: that detecting a specific watermark proves authorship by a specific model. This paper demonstrates that the assumption is dangerously flawed. The authors introduce watermark spoofing, an attack that lets a malicious model generate text carrying the authentic-looking watermark of a trusted victim model. This allows harmful content (such as disinformation) to be seamlessly misattributed to trusted sources. The key to the attack is repurposing watermark radioactivity (the unintended inheritance of data patterns during fine-tuning) from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, the framework allows an attacker to steal and replicate the victim model's watermark signal.

Research Background and Motivation

Problem Background

As large language models spread through industry, education, and daily life, detecting and verifying LLM-generated text has become critical. Regulatory authorities in the United States and European Union require clearer source traceability for LLM-generated content. Major industry players (such as Meta, OpenAI, and Google DeepMind) have adopted watermarking as a practical tool for source verification.

Core Problem

Existing LLM watermarking techniques are based on a fundamental assumption: detecting a specific watermark proves authorship by a specific model. However, this assumption contains serious vulnerabilities that could be maliciously exploited to spread misinformation and attribute it to trusted sources.

Research Motivation

Security Threat Identification: Existing research primarily focuses on watermark removal attacks, with less attention to watermark forgery attacks
Practical Harm: Watermark spoofing is more dangerous than removal because it creates misleading certainty
Technical Flaw Exposure: Revealing fundamental security flaws in current watermark verification paradigms

Core Contributions

First Weaponization of Watermark Radioactivity: Transforming a phenomenon originally used for detection into a powerful misattribution tool
Highly Adaptive Attack Framework: Demonstrating attack effectiveness against n-gram and sampling-based watermarking schemes
Breaking Strength-Quality Trade-offs: Discovering that attack strength can be significantly increased without obvious text quality degradation
Systematic Security Assessment: First systematic evaluation of spoofing attack threats against LLM watermarks

Methodology Details

Task Definition

Given a watermarked model M_T as the target, an attacker wishes to train another model M that can generate text containing M_T's watermark signals, thereby deceiving the watermark detector. The attack operates in a black-box setting where the attacker cannot access the target model's logits or specific watermarking scheme information.

DITTO Framework Architecture

The DITTO framework comprises three main stages:

1. Watermark Inheritance

Transferring the target model's watermark patterns to an open-source student model through knowledge distillation:

\[
\theta_S = \arg\max_{\theta_O} \sum_{x \in D_T} \sum_{i=1}^{|x|} \log P(x_i \mid x_{1:i-1}; \theta_O)
\]

Where D_T is a dataset generated by the watermarked teacher model M_T, and θ_S and θ_O are parameters of the student and original models, respectively.
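
The objective above is the usual next-token log-likelihood, summed over every sequence in D_T. The following is a minimal sketch of how that quantity is evaluated, not the authors' training code. The toy models, vocabulary, and corpus are hypothetical stand-ins; a real run would fine-tune an LLM on teacher outputs rather than compare fixed distributions.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical toy "model": maps a context to a next-token distribution.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]

def sequence_log_likelihood(model: NextTokenDist, tokens: Sequence[str]) -> float:
    """Sum of log P(x_i | x_{1:i-1}) over one sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        probs = model(tokens[:i])  # condition on the tokens before position i
        total += math.log(probs[token])
    return total

def corpus_log_likelihood(model: NextTokenDist, corpus: List[Sequence[str]]) -> float:
    """Outer sum over all sequences x in the teacher-generated dataset D_T."""
    return sum(sequence_log_likelihood(model, x) for x in corpus)

def uniform_model(context: Sequence[str]) -> Dict[str, float]:
    """Stand-in for the original model: no preference among tokens."""
    return {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def biased_model(context: Sequence[str]) -> Dict[str, float]:
    """Stand-in for a student that absorbed the teacher's token preferences."""
    return {"a": 0.4, "b": 0.4, "c": 0.1, "d": 0.1}

# Hypothetical teacher outputs that over-use tokens "a" and "b".
teacher_corpus = [["a", "b", "a"], ["b", "a", "c"]]

ll_uniform = corpus_log_likelihood(uniform_model, teacher_corpus)
ll_biased = corpus_log_likelihood(biased_model, teacher_corpus)
print(f"uniform: {ll_uniform:.3f}  biased: {ll_biased:.3f}")
```

The biased model scores higher on the teacher corpus, which is the direction the arg max pushes the student: toward whatever token preferences the watermarked teacher exhibits.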

2. Watermark Extraction

Extracting watermark signals by analyzing logits differences before and after training:

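Global Bias:

\[
\delta_{\text{global}} = \mathbb{E}_{c \in D_T}\left[l_{M_S}(c)\right] - \mathbb{E}_{c \in D_T}\left[l_{M_O}(c)\right]
\]

Local Bias:

\[
\delta_p = \mathbb{E}_{c \in D_T \mid c \text{ ends with } p}\left[l_{M_S}(c)\right] - \mathbb{E}_{c \in D_T \mid c \text{ ends with } p}\left[l_{M_O}(c)\right]
\]

Final Extracted Signal:

\[
\mathrm{EWS}(c) = \delta_{\text{global}} + \sum_{p \in \text{prefixes}(c)} w(p) \cdot \delta_p
\]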
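
The extraction step can be sketched in plain Python as follows. This is an illustrative reading of the formulas above, not the released implementation. It assumes that prefixes(c) denotes the trailing token n-grams of the context c, and that logits are available as per-context vectors. The dictionaries, the weights w(p), and the three-token vocabulary are all hypothetical.

```python
from typing import Dict, List, Tuple

Context = Tuple[str, ...]
Logits = List[float]

def mean_logits(vectors: List[Logits]) -> Logits:
    """Element-wise mean of equally sized logit vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def bias(student: Dict[Context, Logits], original: Dict[Context, Logits],
         contexts: List[Context]) -> Logits:
    """Mean student logits minus mean original logits over the given contexts."""
    mean_s = mean_logits([student[c] for c in contexts])
    mean_o = mean_logits([original[c] for c in contexts])
    return [s - o for s, o in zip(mean_s, mean_o)]

def ends_with(context: Context, p: Context) -> bool:
    """True when the non-empty n-gram p is the tail of the context."""
    return len(p) > 0 and context[-len(p):] == p

def extracted_signal(c: Context, student: Dict[Context, Logits],
                     original: Dict[Context, Logits],
                     weights: Dict[Context, float]) -> Logits:
    """EWS(c) = delta_global + sum of w(p) * delta_p over n-grams p that end c."""
    all_contexts = list(student)
    signal = bias(student, original, all_contexts)  # delta_global
    for p, w in weights.items():
        if not ends_with(c, p):
            continue
        matching = [ctx for ctx in all_contexts if ends_with(ctx, p)]
        if not matching:
            continue
        delta_p = bias(student, original, matching)  # local bias for p
        signal = [s + w * d for s, d in zip(signal, delta_p)]
    return signal

# Hypothetical logits over a 3-token vocabulary for three contexts.
student = {("the", "cat"): [1.0, 0.0, 0.0],
           ("a", "cat"): [3.0, 0.0, 0.0],
           ("the", "dog"): [0.0, 2.0, 0.0]}
original = {ctx: [0.0, 0.0, 0.0] for ctx in student}
weights = {("cat",): 0.5, ("dog",): 0.5}  # assumed values of w(p)

ews = extracted_signal(("the", "cat"), student, original, weights)
print(ews)
```

Only the n-gram ("cat",) applies to the queried context, so the result is the global bias plus half of the "cat" local bias.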

3. Spoofing Attack

Injecting extracted watermark signals into the attacker's model during inference:

\[
l'_{M_O}(c) = l_{M_O}(c) + \alpha \cdot \mathrm{EWS}(c)
\]

Where α is a scaling parameter controlling injection strength.
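
A small sketch of the injection step shows how α shifts the next-token distribution. The base logits and signal vector below are made-up values over a three-token vocabulary; in practice the addition would presumably happen inside the decoding loop of the attacker's model.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inject_signal(base_logits: List[float], signal: List[float], alpha: float) -> List[float]:
    """l'(c) = l(c) + alpha * EWS(c), applied element-wise over the vocabulary."""
    return [l + alpha * s for l, s in zip(base_logits, signal)]

base = [0.0, 0.0, 0.0]     # hypothetical logits of the attacker's model
signal = [1.0, 0.0, -1.0]  # hypothetical extracted watermark signal

alphas = (0.0, 2.5, 5.0)
boosted_probs = [softmax(inject_signal(base, signal, a))[0] for a in alphas]
for a, p in zip(alphas, boosted_probs):
    print(f"alpha={a}: P(token 0) = {p:.4f}")
```

With α = 0 the distribution is unchanged; larger α moves probability mass toward the tokens the signal favors.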

Technical Innovations

Exploiting Watermark Radioactivity: Turning watermark radioactivity from a detection tool into an attack vector
Scheme Agnosticism: Not dependent on specific watermarking scheme implementation details
Real-time Injection Mechanism: Dynamically injecting watermark signals during the inference phase
Black-box Attack Setting: Attacking under realistic constraints

Experimental Setup

Datasets

Dolly-15k: Contains 15,000 human-generated prompt/response pairs for instruction fine-tuning
MarkMyWords (MMW) Bookreport: A benchmark specifically designed for systematic evaluation of watermarking techniques

Model Configuration

Teacher-Student Model Pairs:
- Llama3.1-8B → Llama3.2-3B
- Llama3.2-3B → Llama3.2-1B

Evaluation Metrics

TPR@FPR: True positive rate at fixed false positive rates (10%, 1%, 0.1%)
p-value: Statistical significance of watermark detection (median)
Perplexity: Text quality assessment metric
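
As a rough illustration of two of these metrics, the sketch below computes TPR at a fixed FPR from detector scores and perplexity from token log-probabilities. The score lists are invented, and the thresholding convention (flag a text when its score strictly exceeds the threshold) is an assumption; the summary does not specify how the paper calibrates thresholds or which model scores perplexity.

```python
import math
from typing import List

def tpr_at_fpr(positive_scores: List[float], negative_scores: List[float],
               fpr: float) -> float:
    """Fraction of positives flagged when at most `fpr` of negatives are flagged."""
    neg = sorted(negative_scores, reverse=True)
    allowed = int(fpr * len(neg))  # negatives permitted above the threshold
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    flagged = sum(1 for s in positive_scores if s > threshold)
    return flagged / len(positive_scores)

def perplexity(token_log_probs: List[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical detector scores: unwatermarked text vs. spoofed text.
negatives = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
positives = [8.5, 9.5, 3.0, 10.0]

tpr = tpr_at_fpr(positives, negatives, fpr=0.1)
ppl = perplexity([math.log(0.5)] * 4)
print(tpr, ppl)
```

For a spoofing attack, a high TPR means the detector frequently accepts the attacker's text as watermarked, so higher is better from the attacker's point of view.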

Baseline Methods

JSV (Jovanović et al., 2024)
De-Mark (Chen et al., 2025) - Gray-box and black-box settings
Original watermarked model as upper-bound baseline

Implementation Details

Watermarking parameters: δ=3, γ=0.5, z-threshold=4.0
Training: LoRA fine-tuning for 3 epochs
Attack strength: α ∈ {2.5, 3, 3.5, 4, 4.5, 5}
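
To make the detection parameters concrete, the sketch below applies the standard z-test for the KGW green-list watermark of Kirchenbauer et al. (2023), under the assumption that this is the n-gram scheme the parameters refer to. In that scheme γ is the green-list fraction of the vocabulary and δ is the logit boost added to green tokens at generation time; the detector only needs γ. The token counts are hypothetical.

```python
import math

def kgw_z_score(green_count: int, scored_tokens: int, gamma: float = 0.5) -> float:
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * scored_tokens
    std = math.sqrt(scored_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z: float) -> float:
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

Z_THRESHOLD = 4.0  # detection threshold listed above

# Hypothetical text: 130 of 200 scored tokens fall in the green list.
z = kgw_z_score(green_count=130, scored_tokens=200)
p = one_sided_p_value(z)
detected = z > Z_THRESHOLD
print(f"z = {z:.3f}, p = {p:.2e}, detected = {detected}")
```

A spoofed text succeeds when its green-token count is high enough to cross this threshold even though the victim model never produced it.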

Experimental Results

Main Results

On the MMW Bookreport dataset, DITTO attacking Llama3.1-8B:

TPR@FPR=10%: 0.81
TPR@FPR=1%: 0.70
TPR@FPR=0.1%: 0.51
Median p-value: 7.97E-04
Perplexity: 4.18

Stronger results on Llama3.2-3B:

TPR@FPR=10%: 0.99
TPR@FPR=1%: 0.99
TPR@FPR=0.1%: 0.97
Median p-value: 5.48E-17
Perplexity: 2.44

Key Findings

1. Non-traditional Relationship Between Attack Strength and Text Quality

Experiments reveal that as the scaling parameter α increases, perplexity does not increase monotonically but exhibits fluctuating patterns. This breaks the conventional assumption that "stronger attacks necessarily lead to quality degradation."

2. Cross-scheme Generalizability

DITTO also remains effective against SynthID (sampling-based watermarking):

Llama3.1-8B: TPR@10%=0.88, p-value=7.10E-10
Llama3.2-3B: TPR@10%=0.90, p-value=8.12E-12

3. Model Scale Impact

The attack performs better when a smaller model carries it, possibly because smaller models learn and replicate watermark patterns more easily.

Ablation Studies

Experiments varying the α parameter (2.5-5.0) demonstrate:

The p-value decreases steadily as α increases
Perplexity fluctuates without a clear trend toward quality degradation

Related Work

LLM Watermarking Techniques

Vocabulary Partition-based Methods: KGW scheme and its improved versions
Sampling-based Methods: SynthID, Tournament sampling, etc.
Multi-bit Schemes: Supporting user-traceable identifiers

Watermark Attack Research

Removal Attacks: Removing watermarks through paraphrasing, optimization, etc.
Theft Attacks: Reverse-engineering watermarking mechanisms
Spoofing Attacks: The focus of this paper, with relatively limited existing research

Watermark Radioactivity

Detection Applications: Used by Sander et al. for source auditing
Defense Research: Neutralization methods by Pan et al.
Attack Transformation: First weaponization in this paper

Conclusions and Discussion

Main Conclusions

Fundamental Security Flaw: The core assumption of current watermarking techniques contains serious vulnerabilities
Practical Attack Threat: DITTO can effectively attack even in black-box settings
Paradigm Shift Requirement: Verification must move from detecting that a watermark is present to verifying that it is authentic

Limitations

Dependence on Watermark Inheritance: Attack success depends on faithful watermark inheritance by student models
Lack of Defense Mechanism Research: Paper focuses on attacks without exploring corresponding defenses
Limited Scheme Coverage: Only two major watermarking types tested

Future Directions

Robust Watermark Design: Developing watermarking techniques resistant to spoofing
Authenticity Verification: Methods to distinguish genuine from imitated watermarks
Cryptographic Approaches: Mechanisms binding watermarks to model identity

In-depth Evaluation

Strengths

Important Security Discovery: Reveals fundamental security issues in watermarking techniques
Methodological Innovation: First systematic exploitation of watermark radioactivity for attacks
Comprehensive Experiments: Full evaluation across multiple models, datasets, and watermarking schemes
Practical Threat Value: Black-box attack setting under realistic constraints

Weaknesses

Ethical Risks: Publishes an attack method that could be exploited maliciously
Defense Absence: No corresponding defense or mitigation strategies provided
Insufficient Theoretical Analysis: Lacks theoretical analysis of attack success conditions
Limited Scheme Coverage: Only limited watermarking schemes tested

Impact

Academic Contribution: Opens new directions for watermarking security research
Practical Value: Alerts practitioners to the security risks of current watermarking techniques
Policy Impact: May influence development of relevant regulatory policies

Applicable Scenarios

Security Assessment: Evaluating security of existing watermarking systems
Red Team Testing: Offensive testing tool for AI security teams
Research Benchmark: Attack baseline for subsequent defense research

References

This paper cites important research in watermarking techniques, attack methods, and AI security, including:

Kirchenbauer et al. (2023) - KGW watermarking scheme
Dathathri et al. (2024) - SynthID sampling-based watermarking
Sander et al. (2024) - Watermark radioactivity concept
Multiple related works on watermark attacks and defenses

Overall Assessment: This is a paper of significant security importance that reveals fundamental vulnerabilities in current LLM watermarking techniques. While ethically controversial, its academic value and contribution to field development are undeniable. The paper provides clear direction for future development of more secure watermarking techniques.