Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
Jang, Lee, Chung et al.
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.
This paper investigates the vulnerability of incomplete tokens in byte-level byte-pair encoding (BPE) tokenizers. The authors argue that these tokens, which contain stray bytes and cannot be decoded on their own, depend heavily on adjacent tokens and become fragile when paired with unfamiliar ones. By constructing "improbable bigrams" (out-of-distribution combinations of incomplete tokens), the authors show that this fragility leads to markedly more hallucinatory behavior. Experiments show that tokenizing the identical phrases differently substantially reduces hallucination rates (a 90% reduction in Llama3.1).
The core problem addressed in this paper is the fragility of incomplete tokens in byte-level BPE tokenizers, which can cause large language models to hallucinate.
The authors hypothesize that incomplete tokens, due to their structural characteristics, exhibit vulnerability when paired with unfamiliar adjacent tokens, even if these tokens are adequately trained.
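To make the notion of an incomplete token concrete, the following is a minimal sketch and not the authors' code. It assumes that "incomplete" means "not valid UTF-8 on its own", and the helper names is_incomplete and forms_valid_bigram, as well as the example byte strings, are hypothetical and not drawn from any real tokenizer vocabulary.

```python
def is_incomplete(token: bytes) -> bool:
    """Return True if the token's bytes are not valid UTF-8 on their own."""
    try:
        token.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def forms_valid_bigram(left: bytes, right: bytes) -> bool:
    """Return True if both tokens are incomplete but their concatenation decodes."""
    return (
        is_incomplete(left)
        and is_incomplete(right)
        and not is_incomplete(left + right)
    )


# Hypothetical tokens built from two characters in different scripts.
# Each character is three bytes long in UTF-8.
hiragana = "あ".encode("utf-8")
hanja = "韓".encode("utf-8")

left = hiragana[:1]   # lead byte only: the token ends mid-character
right = hanja[1:]     # continuation bytes only: the token starts mid-character

# Neither token decodes alone, yet together they form one valid character
# that differs from both source characters, so the pairing is unusual.
bigram_is_valid = forms_valid_bigram(left, right)
```

The point of the sketch is that each half is meaningless by itself, so whatever the model learned about it is tied to the neighbors it usually appears with.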
Input: Text phrases containing incomplete tokens
Output: Model responses to repetition tasks
Objective: Identify token combinations that prevent models from correctly repeating input phrases
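The repetition task above can be sketched as a simple scoring loop. This is an illustrative sketch only: the summary does not give the paper's exact prompt or failure criterion, so the containment check below is an assumption, and repeat_fn is a hypothetical stand-in for a call to the model under test.

```python
from typing import Callable, Iterable


def repetition_failure_rate(
    phrases: Iterable[str],
    repeat_fn: Callable[[str], str],
) -> float:
    """Fraction of phrases the model fails to reproduce in its response.

    Assumption: a response counts as a failure when it does not contain
    the target phrase verbatim.
    """
    phrase_list = list(phrases)
    if not phrase_list:
        raise ValueError("need at least one phrase")
    failures = sum(1 for p in phrase_list if p not in repeat_fn(p))
    return failures / len(phrase_list)


def stub_model(phrase: str) -> str:
    """Hypothetical model: echoes every phrase except one it garbles."""
    if phrase == "b":
        return "Sure: ???"
    return f"Sure: {phrase}"


rate = repetition_failure_rate(["a", "b", "c", "d"], stub_model)
```

Running the same loop over the same phrases under two tokenizations gives the kind of paired comparison the findings below rely on.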
Important Finding: Except for Command-R, all models showed significantly reduced hallucination rates with alternative tokenization, confirming that the problem originates from incomplete tokens.
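What "alternative tokenization" means at the byte level can be illustrated as follows. This is a hedged sketch with made-up segmentations, not real vocabulary entries from any of the evaluated models: the same byte string is split in two ways, one cutting through a character and one respecting character boundaries.

```python
def count_incomplete(tokens: list[bytes]) -> int:
    """Count tokens whose bytes are not valid UTF-8 on their own."""
    count = 0
    for token in tokens:
        try:
            token.decode("utf-8")
        except UnicodeDecodeError:
            count += 1
    return count


phrase = "あ韓"
data = phrase.encode("utf-8")  # six bytes: three per character

# Hypothetical segmentation whose boundary falls inside the second character.
with_incomplete = [data[:4], data[4:]]

# Hypothetical alternative whose boundary falls between the two characters.
alternative = [data[:3], data[3:]]

# Both segmentations encode exactly the same text.
same_text = b"".join(with_incomplete) == b"".join(alternative) == data

n_before = count_incomplete(with_incomplete)
n_after = count_incomplete(alternative)
```

Because the text is identical in both cases, any difference in model behavior between the two inputs can be attributed to how the bytes were grouped into tokens.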
Incomplete Tokens Have Systematic Vulnerabilities: Even when adequately trained, incomplete tokens readily cause hallucinations in specific combinations
Problem Originates from Tokenization, Not Training: Alternative tokenization substantially reduces hallucination on the same phrases, which indicates that the root cause lies in token structure rather than in insufficient training
Widespread Impact: This problem is prevalent across multiple mainstream models
Overall Assessment: This is a high-quality research paper identifying important security vulnerabilities in byte-level BPE tokenizers. Despite some limitations, its originality, experimental rigor, and practical value make it a significant contribution to tokenizer security research. This work is important for enhancing the safety and robustness of large language models.