
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation

Ahn, Park, Han
The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.

Basic Information

  • Paper ID: 2510.10987
  • Title: DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
  • Authors: Hyeseon Ahn, Shinwoo Park, Yo-Sub Han (Yonsei University)
  • Classification: cs.CR (Cryptography and Security), cs.AI (Artificial Intelligence)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10987
  • Code Link: https://github.com/hsannn/ditto.git

Abstract

Large language model (LLM) watermarking rests on a critical assumption: that a specific watermark proves authorship by a specific model. This paper demonstrates that the assumption is dangerously flawed. The authors introduce the threat of watermark spoofing, an attack that allows a malicious model to generate text carrying the authentic-looking watermark of a trusted victim model. This enables harmful content (such as misinformation) to be seamlessly misattributed to trusted sources. The key to the attack is repurposing watermark radioactivity (the unintended inheritance of data patterns during fine-tuning) from a detectable artifact into an attack vector. By distilling knowledge from a watermarked teacher model, the framework allows an attacker to steal and replicate the watermark signal of the victim model.

Research Background and Motivation

Problem Background

With large language models now widely used in industry, education, and daily life, detecting and verifying LLM-generated text has become critical. Regulatory authorities in the United States and European Union require clearer source traceability for LLM-generated content. Major industrial players (such as Meta, OpenAI, Google DeepMind) have adopted watermarking techniques as practical tools for source verification.

Core Problem

Existing LLM watermarking techniques are based on a fundamental assumption: detecting a specific watermark proves authorship by a specific model. However, this assumption contains serious vulnerabilities that could be maliciously exploited to spread misinformation and attribute it to trusted sources.

Research Motivation

  1. Security Threat Identification: Existing research primarily focuses on watermark removal attacks, with less attention to watermark forgery attacks
  2. Practical Harm: Watermark spoofing is more dangerous than removal because it creates misleading certainty
  3. Technical Flaw Exposure: Revealing fundamental security flaws in current watermark verification paradigms

Core Contributions

  1. First Weaponization of Watermark Radioactivity: Transforming an originally detectable phenomenon into a powerful misattribution tool
  2. Highly Adaptive Attack Framework: Demonstrating attack effectiveness against n-gram and sampling-based watermarking schemes
  3. Breaking Strength-Quality Trade-offs: Discovering that attack strength can be significantly increased without obvious text quality degradation
  4. Systematic Security Assessment: First systematic evaluation of spoofing attack threats against LLM watermarks

Methodology Details

Task Definition

Given a watermarked model M_T as the target, an attacker wishes to train another model M that can generate text containing M_T's watermark signals, thereby deceiving the watermark detector. The attack operates in a black-box setting where the attacker cannot access the target model's logits or specific watermarking scheme information.

DITTO Framework Architecture

The DITTO framework comprises three main stages:

1. Watermark Inheritance

Transferring the target model's watermark patterns to an open-source student model through knowledge distillation:

\[
\theta_S = \arg\max_{\theta_O} \sum_{x \in D_T} \sum_{i=1}^{|x|} \log P(x_i \mid x_{1:i-1}; \theta_O)
\]

Where D_T is a dataset generated by the watermarked teacher model M_T, and θ_S and θ_O are parameters of the student and original models, respectively.
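
The objective above is ordinary maximum-likelihood training on teacher-generated text. The following minimal sketch illustrates the idea with a toy bigram model standing in for the student; the summary states that the paper uses LoRA fine-tuning of Llama models, so the bigram model, the add-one smoothing, and the names `fit_bigram` and `sequence_log_likelihood` are all illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def sequence_log_likelihood(seq, cond_prob):
    """Sum of log P(x_i | x_{i-1}); a bigram stand-in for log P(x_i | x_{1:i-1})."""
    return sum(math.log(cond_prob[prev][cur]) for prev, cur in zip(seq, seq[1:]))

def fit_bigram(dataset, vocab, smoothing=1.0):
    """Fit bigram probabilities to the data (add-one smoothed, so a MAP-style fit)."""
    counts = defaultdict(Counter)
    for seq in dataset:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev in vocab:
        denom = sum(counts[prev].values()) + smoothing * len(vocab)
        model[prev] = {tok: (counts[prev][tok] + smoothing) / denom for tok in vocab}
    return model

vocab = ["a", "b", "c"]

# Hypothetical "original" model: uniform next-token distribution.
original = {prev: {tok: 1.0 / len(vocab) for tok in vocab} for prev in vocab}

# Hypothetical teacher outputs D_T, skewed toward alternating a/b.
teacher_data = [
    ["a", "b", "a", "b", "a", "b"],
    ["b", "a", "b", "a", "c", "b"],
]

student = fit_bigram(teacher_data, vocab)

ll_original = sum(sequence_log_likelihood(s, original) for s in teacher_data)
ll_student = sum(sequence_log_likelihood(s, student) for s in teacher_data)
print(f"log-likelihood of D_T: original={ll_original:.3f}, student={ll_student:.3f}")
```

The student assigns higher likelihood to the teacher data than the original model does, which means it has absorbed the teacher's token preferences. In the attack setting, those preferences include the watermark.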

2. Watermark Extraction

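Extracting the watermark signal by analyzing how the logits differ before and after training. Here l_{M_S}(c) and l_{M_O}(c) denote the next-token logits of the student model M_S and the original model M_O given context c: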

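Global Bias:

\[
\delta_{\text{global}} = \mathbb{E}_{c \in D_T}\left[ l_{M_S}(c) \right] - \mathbb{E}_{c \in D_T}\left[ l_{M_O}(c) \right]
\]

Local Bias:

\[
\delta_p = \mathbb{E}_{c \in D_T \,\mid\, c \text{ ends with } p}\left[ l_{M_S}(c) \right] - \mathbb{E}_{c \in D_T \,\mid\, c \text{ ends with } p}\left[ l_{M_O}(c) \right]
\]

Final Extracted Signal:

\[
\mathrm{EWS}(c) = \delta_{\text{global}} + \sum_{p \in \text{prefixes}(c)} w(p) \cdot \delta_p
\]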
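
The extraction step can be sketched on toy data. In the sketch below, the two logit functions, the three-token vocabulary, the choice of trailing n-grams up to length 2 as prefixes(c), and the constant weight w(p) = 1 are all assumptions made for illustration; the summary does not specify how the paper chooses prefixes or weights.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length logit vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def logit_shift(contexts, student_logits, original_logits):
    """E_c[l_MS(c)] - E_c[l_MO(c)], computed over the given contexts."""
    diffs = [
        [s - o for s, o in zip(student_logits(c), original_logits(c))]
        for c in contexts
    ]
    return mean_vector(diffs)

def extract_signal(context, contexts, student_logits, original_logits,
                   max_n=2, weight=lambda p: 1.0):
    """EWS(c) = delta_global + sum over trailing n-grams p of w(p) * delta_p."""
    signal = logit_shift(contexts, student_logits, original_logits)  # delta_global
    for n in range(1, min(max_n, len(context)) + 1):
        p = tuple(context[-n:])
        # Contexts from the dataset that end with the same n-gram p.
        matching = [c for c in contexts if len(c) >= n and tuple(c[-n:]) == p]
        if matching:
            delta_p = logit_shift(matching, student_logits, original_logits)
            signal = [s + weight(p) * d for s, d in zip(signal, delta_p)]
    return signal

# Hypothetical models over a 3-token vocabulary.
def original_logits(context):
    return [0.0, 0.0, 0.0]

def student_logits(context):
    # Pretend the inherited watermark favors token 0 after "x" and token 1 otherwise.
    return [2.0, 0.0, 0.0] if context[-1] == "x" else [0.0, 2.0, 0.0]

contexts = [("x",), ("y",), ("y", "x"), ("x", "y")]
signal = extract_signal(("y", "x"), contexts, student_logits, original_logits)
print(signal)
```

For a context ending in "x", the extracted signal concentrates on token 0, the token the toy student learned to prefer in that context. The global term contributes a smaller context-independent shift.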

3. Spoofing Attack

Injecting extracted watermark signals into the attacker's model during inference:

\[
l'_{M_O}(c) = l_{M_O}(c) + \alpha \cdot \mathrm{EWS}(c)
\]

Where α is a scaling parameter controlling injection strength.
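
As a minimal sketch of the injection step, the snippet below adds a scaled signal to a logit vector and shows how the next-token distribution shifts as α grows. The logit values and the signal vector are made up for illustration; only the update rule and the α range come from the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inject(original_logits, signal, alpha):
    """l'(c) = l(c) + alpha * EWS(c), applied element-wise."""
    return [l + alpha * s for l, s in zip(original_logits, signal)]

# Hypothetical values: the attacker's model slightly prefers token 0,
# while the extracted signal favors token 1.
logits = [1.0, 0.5, 0.0]
signal = [0.0, 1.0, 0.0]

alphas = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
favored_probs = [softmax(inject(logits, signal, a))[1] for a in alphas]
for a, p in zip(alphas, favored_probs):
    print(f"alpha={a}: P(token 1) = {p:.3f}")
```

Larger α pushes more probability mass onto the tokens the signal favors, which is why detection statistics strengthen as α increases.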

Technical Innovations

  1. Exploiting Watermark Radioactivity: Turning watermark radioactivity from a detection tool into an attack vector
  2. Scheme Agnosticism: Not dependent on specific watermarking scheme implementation details
  3. Real-time Injection Mechanism: Dynamically injecting watermark signals during the inference phase
  4. Black-box Attack Setting: Attacking under realistic constraints

Experimental Setup

Datasets

  1. Dolly-15k: Contains 15,000 human-generated prompt/response pairs for instruction fine-tuning
  2. MarkMyWords (MMW) Bookreport: A benchmark specifically designed for systematic evaluation of watermarking techniques

Model Configuration

  • Teacher-Student Model Pairs:
    • Llama3.1-8B → Llama3.2-3B
    • Llama3.2-3B → Llama3.2-1B

Evaluation Metrics

  1. TPR@FPR: True positive rate at fixed false positive rates (10%, 1%, 0.1%)
  2. p-value: Statistical significance of watermark detection (median)
  3. Perplexity: Text quality assessment metric
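
The first and third metrics can be computed as in the sketch below. It assumes that a higher detector score means "more watermark-like", that the threshold is calibrated on unwatermarked text, and that perplexity is the exponential of the mean negative token log-probability. The score lists are invented for illustration and are not results from the paper.

```python
import math

def tpr_at_fpr(positive_scores, negative_scores, fpr):
    """True positive rate at the strictest threshold that keeps the
    false positive rate on negative_scores at or below fpr."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives that may be flagged without exceeding fpr.
    allowed = int(math.floor(fpr * len(ranked) + 1e-9))
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for s in positive_scores if s > threshold)
    return flagged / len(positive_scores)

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical detector scores.
unwatermarked = [0.1, 0.3, -0.5, 1.2, 0.0, -1.0, 0.7, 2.1, -0.2, 0.4]
spoofed = [4.2, 3.1, 0.9, 5.0, 1.5]

print(tpr_at_fpr(spoofed, unwatermarked, fpr=0.1))
print(perplexity([math.log(0.5)] * 4))
```

For a spoofing attack, a high TPR means the detector frequently accepts the attacker's text as carrying the victim's watermark, so higher values indicate a more successful attack.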

Baseline Methods

  • JSV (Jovanović et al., 2024)
  • De-Mark (Chen et al., 2025) - Gray-box and black-box settings
  • Original watermarked model as upper-bound baseline

Implementation Details

  • Watermarking parameters: δ=3, γ=0.5, z-threshold=4.0
  • Training: LoRA fine-tuning for 3 epochs
  • Attack strength: α ∈ {2.5, 3, 3.5, 4, 4.5, 5}
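
To show how the watermarking parameters above are typically used, the sketch below implements the one-proportion z-test from the KGW scheme (Kirchenbauer et al., 2023), where γ is the green-list fraction and the z-threshold is the detection cutoff. Whether the paper's detector matches this test exactly is an assumption. The green-token count is passed in directly because the hashing step that decides which tokens are green is omitted here.

```python
import math

def kgw_z_score(green_count, total_tokens, gamma=0.5):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def is_watermarked(green_count, total_tokens, gamma=0.5, z_threshold=4.0):
    """Flag text whose green-token count exceeds chance by the z-threshold."""
    return kgw_z_score(green_count, total_tokens, gamma) > z_threshold

# Hypothetical counts: 130 green tokens out of 200 scored tokens.
z = kgw_z_score(130, 200)
print(f"z = {z:.3f}, p = {one_sided_p_value(z):.2e}, flagged = {is_watermarked(130, 200)}")
```

With γ=0.5, unwatermarked text is expected to land about half its tokens on the green list, so a spoofing attack succeeds when it pushes the green fraction far enough above one half to clear the threshold.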

Experimental Results

Main Results

On the MMW Bookreport dataset, DITTO attacking Llama3.1-8B:

  • TPR@FPR=10%: 0.81
  • TPR@FPR=1%: 0.70
  • TPR@FPR=0.1%: 0.51
  • Median p-value: 7.97E-04
  • Perplexity: 4.18

The attack is stronger against Llama3.2-3B:

  • TPR@FPR=10%: 0.99
  • TPR@FPR=1%: 0.99
  • TPR@FPR=0.1%: 0.97
  • Median p-value: 5.48E-17
  • Perplexity: 2.44

Key Findings

1. Non-traditional Relationship Between Attack Strength and Text Quality

Experiments reveal that as the scaling parameter α increases, perplexity does not increase monotonically but exhibits fluctuating patterns. This breaks the conventional assumption that "stronger attacks necessarily lead to quality degradation."

2. Cross-scheme Generalizability

DITTO is also effective against SynthID (sampling-based watermarking):

  • Llama3.1-8B: TPR@10%=0.88, p-value=7.10E-10
  • Llama3.2-3B: TPR@10%=0.90, p-value=8.12E-12

3. Model Scale Impact

Smaller models as attack carriers perform better, possibly because they more easily learn and replicate watermark patterns.

Ablation Studies

Experiments varying the α parameter (2.5-5.0) demonstrate:

  • p-value continuously decreases with increasing α
  • Perplexity changes irregularly without obvious quality degradation trends

Related Work

LLM Watermarking Techniques

  1. Vocabulary Partition-based Methods: KGW scheme and its improved versions
  2. Sampling-based Methods: SynthID, Tournament sampling, etc.
  3. Multi-bit Schemes: Supporting user-traceable identifiers

Watermark Attack Research

  1. Removal Attacks: Removing watermarks through paraphrasing, optimization, etc.
  2. Theft Attacks: Reverse-engineering watermarking mechanisms
  3. Spoofing Attacks: The focus of this paper, with relatively limited existing research

Watermark Radioactivity

  • Detection Applications: Used by Sander et al. for source auditing
  • Defense Research: Neutralization methods by Pan et al.
  • Attack Transformation: First weaponization in this paper

Conclusions and Discussion

Main Conclusions

  1. Fundamental Security Flaw: The core assumption of current watermarking techniques contains serious vulnerabilities
  2. Practical Attack Threat: DITTO can effectively attack even in black-box settings
  3. Paradigm Shift Requirement: Verification must move from detecting that a watermark is present to confirming that it is authentic

Limitations

  1. Dependence on Watermark Inheritance: Attack success depends on faithful watermark inheritance by student models
  2. Lack of Defense Mechanism Research: Paper focuses on attacks without exploring corresponding defenses
  3. Limited Scheme Coverage: Only two major watermarking types tested

Future Directions

  1. Robust Watermark Design: Developing watermarking techniques resistant to spoofing
  2. Authenticity Verification: Methods to distinguish genuine from imitated watermarks
  3. Cryptographic Approaches: Mechanisms binding watermarks to model identity

In-depth Evaluation

Strengths

  1. Important Security Discovery: Reveals fundamental security issues in watermarking techniques
  2. Methodological Innovation: First systematic exploitation of watermark radioactivity for attacks
  3. Comprehensive Experiments: Full evaluation across multiple models, datasets, and watermarking schemes
  4. Practical Threat Value: Black-box attack setting under realistic constraints

Weaknesses

  1. Ethical Risks: Provides attack methods potentially subject to malicious exploitation
  2. Defense Absence: No corresponding defense or mitigation strategies provided
  3. Insufficient Theoretical Analysis: Lacks theoretical analysis of attack success conditions
  4. Limited Scheme Coverage: Only limited watermarking schemes tested

Impact

  1. Academic Contribution: Opens new directions for watermarking security research
  2. Practical Value: Alerts practitioners to the security risks of current watermarking techniques
  3. Policy Impact: May influence development of relevant regulatory policies

Applicable Scenarios

  1. Security Assessment: Evaluating security of existing watermarking systems
  2. Red Team Testing: Offensive testing tool for AI security teams
  3. Research Benchmark: Attack baseline for subsequent defense research

References

This paper cites important research in watermarking techniques, attack methods, and AI security, including:

  • Kirchenbauer et al. (2023) - KGW watermarking scheme
  • Dathathri et al. (2024) - SynthID sampling-based watermarking
  • Sander et al. (2024) - Watermark radioactivity concept
  • Multiple related works on watermark attacks and defenses

Overall Assessment: This is a paper of significant security importance that reveals fundamental vulnerabilities in current LLM watermarking techniques. While ethically controversial, its academic value and contribution to field development are undeniable. The paper provides clear direction for future development of more secure watermarking techniques.