DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
Ahn, Park, Han
The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.git.
Large language model (LLM) watermarking rests on a critical assumption: that detecting a specific watermark proves authorship by a specific model. This paper demonstrates that the assumption is dangerously flawed. The authors introduce the threat of watermark spoofing, an attack that lets a malicious model generate text carrying the authentic-looking watermark of a trusted victim model. Harmful content, such as disinformation, can then be seamlessly misattributed to a reputable source. The key to the attack is repurposing watermark radioactivity (the unintended inheritance of data patterns during fine-tuning) from a detectable trait into an attack vector. By distilling knowledge from a watermarked teacher model, the framework allows an attacker to steal and replicate the victim model's watermark signal.
As large language models spread through industry, education, and daily life, detecting and verifying LLM-generated text has become critical. Regulatory authorities in the United States and the European Union require clearer source traceability for LLM-generated content. Major industry players (such as Meta, OpenAI, and Google DeepMind) have adopted watermarking as a practical tool for source verification.
Existing LLM watermarking techniques are based on a fundamental assumption: detecting a specific watermark proves authorship by a specific model. However, this assumption contains serious vulnerabilities that could be maliciously exploited to spread misinformation and attribute it to trusted sources.
Given a watermarked model M_T as the target, an attacker wishes to train another model M that can generate text containing M_T's watermark signals, thereby deceiving the watermark detector. The attack operates in a black-box setting where the attacker cannot access the target model's logits or specific watermarking scheme information.
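To make "deceiving the watermark detector" concrete, the sketch below shows one family of detector: the green-list z-test of the KGW scheme (Kirchenbauer et al., 2023), which counts how many tokens fall in a keyed "green" set and compares the count with the fraction γ expected by chance. This is an illustration only, not the paper's code. The hash-based `is_green` rule, the `key` string, and the helper `all_green_sequence` are simplifying assumptions, and the code is written from the detector's side; in the black-box setting described above the attacker never sees the key.

```python
import hashlib
import math

def is_green(prev_token, token, key, gamma):
    """Assumed rule: a (previous token, token) pair is 'green' with probability ~gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def z_score(tokens, key, gamma=0.5):
    """KGW-style statistic: z = (green_count - gamma*T) / sqrt(T*gamma*(1-gamma))."""
    T = len(tokens) - 1  # number of scored tokens (the first has no predecessor)
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))

def all_green_sequence(n, key, gamma=0.5, vocab=1000):
    """Hypothetical helper: build a maximally watermarked toy sequence of n tokens."""
    tokens = [0]
    while len(tokens) < n:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab) if is_green(prev, t, key, gamma)))
    return tokens

watermarked = all_green_sequence(51, key="victim-key")
print(z_score(watermarked, key="victim-key"))  # large positive score under the right key
```

A spoofing attack succeeds against a detector of this kind when text produced by the attacker's model M scores above the detection threshold under M_T's key, even though M_T never generated it.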
The attack transfers the target model's watermark patterns to an open-source student model through knowledge distillation:
$$\theta_S = \arg\max_{\theta_O} \sum_{x \in D_T} \sum_{i=1}^{|x|} \log P(x_i \mid x_{1:i-1}; \theta_O)$$
Here D_T is a dataset generated by the watermarked teacher model M_T, θ_O denotes the parameters of the original open-source model, and θ_S denotes the parameters of the student obtained after fine-tuning on D_T.
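The objective is ordinary maximum-likelihood fine-tuning on teacher outputs. The following is a minimal sketch of it, not the authors' implementation. It assumes a toy bigram model whose parameters are a table of logits, so each token is conditioned only on its predecessor instead of the full prefix x_{1:i-1}; the tiny dataset, the learning rate, and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def log_likelihood(theta, dataset):
    """Sum over x in D_T and positions i of log P(x_i | x_{i-1}; theta)."""
    total = 0.0
    for x in dataset:
        for prev, cur in zip(x, x[1:]):
            total += log_softmax(theta[prev])[cur]
    return total

def distill(theta_o, dataset, lr=0.5, epochs=50):
    """Gradient ascent on the log-likelihood, starting from the original parameters."""
    theta = [row[:] for row in theta_o]  # copy so theta_o is left untouched
    for _ in range(epochs):
        for x in dataset:
            for prev, cur in zip(x, x[1:]):
                probs = [math.exp(v) for v in log_softmax(theta[prev])]
                for k, p in enumerate(probs):
                    # d/d(logit_k) of log P(cur | prev) = 1[k == cur] - p_k
                    theta[prev][k] += lr * ((1.0 if k == cur else 0.0) - p)
    return theta

VOCAB = 3                                   # token 0 acts as a start symbol
D_T = [[0, 1, 2, 1, 2], [0, 1, 2, 2, 1]]    # stand-in for teacher-generated text
theta_O = [[0.0] * VOCAB for _ in range(VOCAB)]
theta_S = distill(theta_O, D_T)
print(log_likelihood(theta_O, D_T), log_likelihood(theta_S, D_T))
```

The student ends with a higher log-likelihood on D_T than the original model. With a real LLM the same loop would be a standard causal-language-modeling fine-tune, and any statistical bias the watermark leaves in D_T is what the student absorbs.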
Experiments show that as the scaling parameter α increases, perplexity does not rise monotonically but fluctuates. This contradicts the conventional assumption that "stronger attacks necessarily lead to quality degradation."
This paper cites important research in watermarking techniques, attack methods, and AI security, including:
Kirchenbauer et al. (2023) - KGW watermarking scheme
Dathathri et al. (2024) - SynthID sampling-based watermarking
Sander et al. (2024) - Watermark radioactivity concept
Multiple related works on watermark attacks and defenses
Overall Assessment: This paper has significant security implications: it reveals a fundamental vulnerability in current LLM watermarking techniques. While ethically controversial, its academic value and contribution to the field are clear. The paper also points toward the development of more secure watermarking techniques.