Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

Jang, Lee, Chung et al.

Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.

Basic Information

Paper ID: 2410.23684
Title: Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
Authors: Eugene Jang (Northeastern University), Kimin Lee (KAIST), Jin-Woo Chung (S2W Inc.), Keuntae Park (S2W Inc.), Seungwon Shin (KAIST)
Classification: cs.CL (Computation and Language)
Publication Date: October 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2410.23684

Abstract

This paper investigates the vulnerability of incomplete tokens in byte-level Byte Pair Encoding (BPE) tokenizers. Incomplete tokens are undecodable tokens that contain orphaned bytes, that is, fragments of a multi-byte UTF-8 character. The authors find that these tokens depend heavily on their adjacent tokens and become fragile when paired with unfamiliar ones. By constructing "improbable bigrams" (out-of-distribution combinations of incomplete tokens), the authors show that this fragility leads to frequent hallucination. When the identical phrases are encoded with an alternative tokenization, hallucination rates drop substantially (a 90% reduction in Llama3.1).

Research Background and Motivation

Core Problem

The core problem addressed in this paper is the vulnerability of incomplete tokens in byte-level BPE tokenizers, which causes large language models to produce hallucination behavior.

Problem Significance

Critical Role of Tokenization: Tokenization is a crucial step connecting human-readable text to discrete tokens processable by models
Existing Security Risks: Recent research demonstrates that tokenizers can be maliciously exploited to induce improper model behavior
Practical Harm: Tokenization issues may lead to data integrity loss, adversarial attacks, model fingerprinting, and other security risks

Limitations of Existing Approaches

Existing research primarily focuses on undertrained "glitch tokens"
Lacks systematic analysis of structural tokenization issues
The character-boundary-agnostic nature of byte-level BPE may produce structurally vulnerable tokens

Research Motivation

The authors hypothesize that incomplete tokens, due to their structural characteristics, exhibit vulnerability when paired with unfamiliar adjacent tokens, even if these tokens are adequately trained.

Core Contributions

Identified Incomplete Token Vulnerabilities: Systematically analyzed structural characteristics and potential issues of incomplete tokens in byte-level BPE tokenizers
Proposed "Improbable Bigrams" Concept: Designed a novel attack method to expose incomplete token vulnerabilities
Cross-Model Verification: Verified the universal presence of this vulnerability across five mainstream large language models
Provided Mitigation Strategies: Demonstrated problem solvability through alternative tokenization methods and proposed preventive measures

Methodology Details

Task Definition

Input: Text phrases containing incomplete tokens
Output: Model responses to repetition tasks
Objective: Identify token combinations that prevent models from correctly repeating input phrases

Incomplete Token Analysis Method

1. Structural Analysis

UTF-8 Encoding Analysis: Based on the structure of start bytes and continuation bytes in UTF-8 multi-byte characters
Prefix/Suffix Classification:
- Prefix tokens: End with orphaned bytes, requiring additional bytes to complete characters
- Suffix tokens: Begin with orphaned bytes, providing bytes needed to complete characters
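
The prefix/suffix distinction above follows directly from UTF-8 byte ranges, and a short sketch can make it concrete. The code below is an illustration under stated assumptions, not the authors' implementation: it treats a token as a raw byte string, and the function names (`bytes_provided`, `bytes_required`, `classify`) are hypothetical.

```python
def expected_length(lead: int) -> int:
    """Total bytes in the character announced by a UTF-8 start byte (0 if not a start byte)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never appears in valid UTF-8


def is_continuation(byte: int) -> bool:
    return 0x80 <= byte <= 0xBF


def bytes_provided(token: bytes) -> int:
    """Count orphaned continuation bytes at the start of the token."""
    count = 0
    for byte in token:
        if not is_continuation(byte):
            break
        count += 1
    return count


def bytes_required(token: bytes) -> int:
    """Count continuation bytes still missing after the token's last start byte."""
    trailing = 0
    i = len(token) - 1
    while i >= 0 and is_continuation(token[i]):
        trailing += 1
        i -= 1
    if i < 0:
        return 0  # only continuation bytes: nothing is left open at the end
    return max(expected_length(token[i]) - 1 - trailing, 0)


def classify(token: bytes) -> str:
    provided, required = bytes_provided(token), bytes_required(token)
    if provided and required:
        return "prefix+suffix"
    if required:
        return "prefix"
    if provided:
        return "suffix"
    return "complete"


zhong = "中".encode("utf-8")      # three bytes: E4 B8 AD
print(classify(zhong[:2]))        # prefix   (needs one more byte)
print(classify(zhong[2:]))        # suffix   (provides one byte)
print(classify(zhong))            # complete
```

A token can in principle be both a prefix and a suffix at once; the sketch reports that case separately rather than forcing it into one class.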

2. Bigram Construction Workflow

Step 1: Structural Analysis
- Identify start bytes and continuation bytes in tokens
- Determine bytes required or provided by tokens

Step 2: Compatibility Matching
- Find structurally complementary token pairs
- Ensure combinations form valid Unicode characters

Step 3: Feasibility Verification
- Execute decode-encode tests
- Verify generated strings tokenize as expected
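
The three steps above can be sketched roughly as follows. This is a minimal sketch under several assumptions: tokens are represented as raw byte strings, structural compatibility is checked simply by attempting a UTF-8 decode of the concatenation, and `toy_encode` is a hypothetical greedy longest-match stand-in for a real byte-level BPE encoder (real BPE applies learned merges instead).

```python
from itertools import product
from typing import Callable, Iterable


def joins_cleanly(prefix: bytes, suffix: bytes) -> bool:
    """Step 2: the pair is compatible if the concatenated bytes are valid UTF-8."""
    try:
        (prefix + suffix).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def candidate_bigrams(prefixes: Iterable[bytes], suffixes: Iterable[bytes]):
    """Pair every prefix token with every suffix token and keep the valid ones."""
    return [(p, s) for p, s in product(prefixes, suffixes) if joins_cleanly(p, s)]


def survives_round_trip(pair, encode: Callable[[str], list]) -> bool:
    """Step 3: decode the pair to text, re-encode, and expect the same two tokens."""
    text = (pair[0] + pair[1]).decode("utf-8")
    return encode(text) == [pair[0], pair[1]]


def toy_encode(text: str, vocab: set) -> list:
    """Hypothetical tokenizer: greedy longest match over a byte-string vocabulary."""
    data = text.encode("utf-8")
    tokens, i = [], 0
    while i < len(data):
        match = max((t for t in vocab if data.startswith(t, i)),
                    key=len, default=data[i:i + 1])
        tokens.append(match)
        i += len(match)
    return tokens


prefixes = [b"\xe4\xb8", b"\xd0"]   # each ends mid-character
suffixes = [b"\xad", b"\xb8\xad"]   # each starts with continuation bytes
pairs = candidate_bigrams(prefixes, suffixes)

vocab = {b"\xe4\xb8", b"\xd0", b"\xad"}
for pair in pairs:
    ok = survives_round_trip(pair, lambda t: toy_encode(t, vocab))
    print(pair, ok)
```

The round-trip check matters because a tokenizer may prefer a different segmentation of the same text. If the toy vocabulary also contained the full three-byte character, the first pair would no longer be recovered and would be discarded.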

Characteristics of Improbable Bigrams

Multilingual Nature: Combined characters originate from different Unicode script systems
Out-of-Distribution Properties: Such cross-script combinations are extremely unlikely in training data
Structural Dependency: Two tokens must cooperate to form valid characters

Technical Innovations

Systematic Vulnerability Discovery: First systematic identification of structural vulnerabilities in byte-level BPE
Precise Attack Construction: Exact attack sample construction based on UTF-8 encoding rules
Training Quality Independence: Demonstrates that even adequately trained tokens may exhibit vulnerabilities

Experimental Setup

Model Selection

Tested five instruction-tuned models using byte-level BPE:

Meta-Llama-3.1-8B-Instruct (vocabulary 128k, 1224 incomplete tokens)
EXAONE-3.0-7.8B-Instruct (vocabulary 102k, 1222 incomplete tokens)
Qwen2.5-32B-Instruct (vocabulary 151k, 1320 incomplete tokens)
Mistral-Nemo-Instruct-2407 (vocabulary 131k, 1307 incomplete tokens)
C4AI-Command-R-v01 (vocabulary 255k, 2956 incomplete tokens)

Evaluation Task Design

Used four prompt templates to test model ability to repeat target phrases:

Task Type	Prompt Template
Direct Repetition	"Repeat this phrase exactly: '{Phrase}'"
Definition Query	"What does '{Phrase}' mean?"
Knowledge Query	"Today I heard about '{Phrase}'. Do you know what this means?"
Code Scenario	Python code with username list output
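
As a rough illustration of how these templates might be applied, the sketch below fills the three templates whose wording is given in the table; the code-scenario template is omitted because its exact text is not reproduced here. The `repeats_phrase` check is an assumption about how a failure could be detected (the phrase is missing from the response) and is not necessarily the authors' criterion.

```python
TEMPLATES = {
    "direct_repetition": "Repeat this phrase exactly: '{phrase}'",
    "definition_query": "What does '{phrase}' mean?",
    "knowledge_query": "Today I heard about '{phrase}'. Do you know what this means?",
}


def build_prompts(phrase: str) -> dict:
    """Fill every template with the target phrase."""
    return {task: template.format(phrase=phrase) for task, template in TEMPLATES.items()}


def repeats_phrase(response: str, phrase: str) -> bool:
    """Assumed success criterion: the response reproduces the phrase verbatim."""
    return phrase in response


prompts = build_prompts("example")
for task, prompt in prompts.items():
    print(task, "->", prompt)
```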

Token Selection Strategy

Training Quality Filtering: Used embedding heuristic method from Land and Bartolo (2024) to exclude undertrained tokens
Focus on Adequately Trained Tokens: Used only tokens in top 50% training quality ranking from vocabulary
Construct Improbable Bigrams: Constructed up to 100 improbable bigrams per model
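
The filtering step can be pictured with a small sketch. It assumes each token already has a numeric training-quality score where higher means better trained; how that score is computed (the embedding heuristic of Land and Bartolo) is outside the scope of this example, and the function name and sample scores are hypothetical.

```python
def well_trained_tokens(scores: dict, keep_fraction: float = 0.5) -> set:
    """Keep the tokens ranked in the top `keep_fraction` by score (higher is better)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = int(len(ranked) * keep_fraction)
    return set(ranked[:cutoff])


# Made-up scores for four placeholder tokens
scores = {"tok_a": 0.9, "tok_b": 0.1, "tok_c": 0.5, "tok_d": 0.7}
print(sorted(well_trained_tokens(scores)))  # ['tok_a', 'tok_d']
```

Only tokens that pass this filter would then be eligible for bigram construction, which is what separates this setting from glitch-token studies.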

Baseline Comparison

Constructed control groups with complete tokens for each improbable bigram:

Selected alternatives with similar training levels but complete tokens
Ensured fairness of comparative experiments

Experimental Results

Main Results

Model	Improbable Bigram Hallucination Rate	Baseline Bigram Hallucination Rate
Llama 3.1	48/100 (48%)	0/100 (0%)
Exaone	77/100 (77%)	20/100 (20%)
Qwen2.5	33/100 (33%)	0/100 (0%)
Mistral-Nemo	52/71 (73%)	1/71 (1%)
Command-R	49/100 (49%)	8/100 (8%)

Key Finding: Improbable bigrams composed of incomplete tokens demonstrate significantly higher hallucination rates across all models.

Alternative Tokenization Experiment Results

Model	Original Tokenization Hallucination Rate	Alternative Tokenization Hallucination Rate	Improvement
Llama 3.1	0.48	0.05	↓90%
Exaone	0.77	0.50	↓35%
Qwen2.5	0.33	0.12	↓64%
Mistral-Nemo	0.73	0.01	↓98%
Command-R	0.49	0.55	No improvement

Important Finding: Four of the five models showed substantially lower hallucination rates with alternative tokenization, which supports the view that the problem originates from incomplete tokens. Command-R is the exception: its rate rose slightly, from 0.49 to 0.55.
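
The improvement column is a relative reduction, which can be recomputed from the two rate columns. The sketch below uses the rounded rates from the table, so small differences from the reported percentages are possible.

```python
def relative_reduction(original: float, alternative: float) -> float:
    """Fraction by which the hallucination rate drops (negative means it rose)."""
    return (original - alternative) / original


rates = {
    "Llama 3.1": (0.48, 0.05),
    "Exaone": (0.77, 0.50),
    "Qwen2.5": (0.33, 0.12),
    "Mistral-Nemo": (0.73, 0.01),
    "Command-R": (0.49, 0.55),
}

for model, (original, alternative) in rates.items():
    print(f"{model}: {relative_reduction(original, alternative) * 100:.1f}%")
```

For Llama 3.1 this gives (0.48 - 0.05) / 0.48, about 89.6%, matching the reported 90%. For Mistral-Nemo the rounded rates give about 98.6%, while the table lists 98%, presumably because that figure was computed from unrounded values. Command-R comes out negative, reflecting an increase and not a reduction.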

Language Distribution Analysis

Improbable bigrams span multiple language pair combinations
High-resource multi-byte scripts (Chinese, Korean, Russian) appear most frequently
Language pair distributions vary significantly across models (Exaone has 17 language pairs, Command-R only 3)

Related Work

Tokenizer Vulnerability Research

Glitch Token Research: Land and Bartolo (2024) proposed embedding layer heuristic methods to identify undertrained tokens
Adversarial Tokenization: Wang et al. (2024) created adversarial problems inducing incorrect tokenization
Tokenization Fairness: Petrov et al. (2023) and Ovalle et al. (2024) studied unfairness and bias introduced by tokenizers

BPE Tokenizer Research

Compression Effectiveness Questioning: Schmidt et al. (2024) challenged assumptions about BPE effectiveness deriving from compression
Greedy Compression Issues: Bostrom and Durrett (2020) noted that greedy compression prioritizes frequency over linguistic significance
Morphological Improvements: Limisiewicz et al. (2024) and Bauwens et al. (2024) proposed morphology-driven BPE improvements

Uniqueness of This Paper's Contribution

Unlike existing research, this paper:

Focuses on structural rather than training quality issues
Demonstrates that adequately trained tokens may still be vulnerable
Provides systematic attack construction methodology

Conclusions and Discussion

Main Conclusions

Incomplete Tokens Have Systematic Vulnerabilities: Even when adequately trained, incomplete tokens readily cause hallucinations in specific combinations
Problem Originates from Tokenization, Not Training: Alternative tokenization substantially reduces hallucinations in four of the five models, indicating that the root cause lies in token structure
Widespread Impact: This problem is prevalent across multiple mainstream models

Practical Risks

Code and Data Processing: May compromise integrity of variable names or fixed values
Adversarial Unrepeatable Phrases: Attackers can exploit phrases that a model cannot reproduce to evade intervention by LLM agents
Model Fingerprinting: Can be used to identify architecture behind anonymous LLM services

Mitigation Strategies

Vocabulary Pruning: Remove incomplete tokens before model training
Constrained BPE Merging: Respect character boundaries during tokenizer training
Character-Level Tokenization: For models not requiring complete Unicode coverage, character-level tokenization is an option
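
Vocabulary pruning, the first strategy above, could look roughly like the sketch below. It assumes the vocabulary is available as a list of raw byte strings and that a token counts as incomplete when its bytes are not valid UTF-8 on their own. Keeping single-byte tokens is a design choice of this sketch, not something stated in the summary: a byte-level tokenizer generally needs the individual bytes to be able to encode arbitrary input.

```python
def is_incomplete(token: bytes) -> bool:
    """A token is incomplete if its bytes do not decode as UTF-8 on their own."""
    try:
        token.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def prune_vocabulary(vocab: list) -> list:
    """Drop multi-byte incomplete tokens; keep single bytes as a fallback alphabet."""
    return [tok for tok in vocab if len(tok) == 1 or not is_incomplete(tok)]


vocab = [b"abc", b"\xe4\xb8", b"\xe4\xb8\xad", b"\xad"]
print(prune_vocabulary(vocab))
# [b'abc', b'\xe4\xb8\xad', b'\xad']
```

After pruning, text that previously used a removed token would be encoded with shorter tokens, so sequence lengths may grow slightly for affected scripts.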

Limitations

Evaluation Scope: Limited to phrase-level hallucinations, without systematic evaluation of factual hallucinations
Language Expertise: Test phrases span multiple languages, exceeding authors' expertise
Model Specificity: Anomalous results for Command-R model require further investigation

Future Directions

Safer Tokenizer Design: Develop tokenization methods avoiding incomplete tokens
Robustness Evaluation: Establish comprehensive tokenization vulnerability assessment frameworks
Defense Mechanism Research: Explore runtime detection and mitigation strategies

In-Depth Evaluation

Strengths

Originality of Problem Identification: First systematic identification of structural vulnerabilities in byte-level BPE
Methodological Rigor: Precise attack construction based on UTF-8 encoding rules with well-designed experiments
Experimental Comprehensiveness: Verification across multiple models and languages with convincing results
Practical Value: Provides concrete mitigation strategies and security recommendations

Weaknesses

Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why incomplete tokens are more vulnerable
Unexplained Command-R Anomaly: Analysis of anomalous results for this model lacks depth
Limited Evaluation Metrics: Uses only repetition tasks, potentially not fully reflecting actual harm
Unknown Long-Term Impact: Does not evaluate vulnerability's effects on other model capabilities

Impact

Academic Contribution: Opens new directions for tokenizer security research
Practical Value: Provides important security considerations for model developers
Reproducibility: Clear methodology description enables experiment reproduction
Policy Implications: May influence future tokenizer design standards

Applicable Scenarios

Model Security Assessment: Evaluate existing models' tokenization vulnerabilities
Tokenizer Design: Guide development of safer tokenizers
Adversarial Testing: Part of model robustness testing
Security Audits: Pre-deployment security checks for LLMs

References

Key References:

Land, S. & Bartolo, M. (2024). Fishing for magikarp: Automatically detecting under-trained tokens in large language models.
Bostrom, K. & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units.
Limisiewicz, T. et al. (2024). MYTE: Morphology-driven byte encoding for better and fairer multilingual language modeling.

Overall Assessment: This is a high-quality research paper identifying important security vulnerabilities in byte-level BPE tokenizers. Despite some limitations, its originality, experimental rigor, and practical value make it a significant contribution to tokenizer security research. This work is important for enhancing the safety and robustness of large language models.