Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

Jang, Lee, Chung et al.
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.

Basic Information

  • Paper ID: 2410.23684
  • Title: Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
  • Authors: Eugene Jang (Northeastern University), Kimin Lee (KAIST), Jin-Woo Chung (S2W Inc.), Keuntae Park (S2W Inc.), Seungwon Shin (KAIST)
  • Classification: cs.CL (Computation and Language)
  • Publication Date: October 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2410.23684

Abstract

This paper investigates the vulnerability of incomplete tokens in byte-level Byte Pair Encoding (BPE) tokenizers. The authors hypothesize that these incomplete tokens, which contain orphaned bytes, depend heavily on adjacent tokens and become fragile when paired with unfamiliar ones. By constructing "improbable bigrams" (out-of-distribution combinations of incomplete tokens), the authors show that this vulnerability leads to markedly more hallucinations. For the same phrases, an alternative tokenization substantially reduces hallucination rates (a 90% reduction in Llama3.1).

Research Background and Motivation

Core Problem

The core problem addressed in this paper is the vulnerability of incomplete tokens in byte-level BPE tokenizers, which can cause large language models to hallucinate.

Problem Significance

  1. Critical Role of Tokenization: Tokenization is a crucial step connecting human-readable text to discrete tokens processable by models
  2. Existing Security Risks: Recent research demonstrates that tokenizers can be maliciously exploited to induce improper model behavior
  3. Practical Harm: Tokenization issues may lead to data integrity loss, adversarial attacks, model fingerprinting, and other security risks

Limitations of Existing Approaches

  • Existing research primarily focuses on undertrained "glitch tokens"
  • Existing work lacks a systematic analysis of structural tokenization issues
  • Because byte-level BPE is agnostic to character boundaries, it may produce structurally vulnerable tokens

Research Motivation

The authors hypothesize that incomplete tokens, due to their structural characteristics, exhibit vulnerability when paired with unfamiliar adjacent tokens, even if these tokens are adequately trained.

Core Contributions

  1. Identified Incomplete Token Vulnerabilities: Systematically analyzed structural characteristics and potential issues of incomplete tokens in byte-level BPE tokenizers
  2. Proposed "Improbable Bigrams" Concept: Designed a novel attack method to expose incomplete token vulnerabilities
  3. Cross-Model Verification: Verified the universal presence of this vulnerability across five mainstream large language models
  4. Provided Mitigation Strategies: Demonstrated problem solvability through alternative tokenization methods and proposed preventive measures

Methodology Details

Task Definition

Input: Text phrases containing incomplete tokens
Output: Model responses to repetition tasks
Objective: Identify token combinations that prevent models from correctly repeating input phrases
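
This summary does not spell out how a failed repetition is judged. The following is a minimal sketch, assuming a plain verbatim substring check; the function name `fails_to_repeat` is hypothetical and not taken from the paper.

```python
def fails_to_repeat(phrase: str, response: str) -> bool:
    """Return True when the response does not contain the target phrase verbatim.

    Assumption: a verbatim substring match is a reasonable proxy for
    "the model repeated the phrase"; the paper's own criterion may differ.
    """
    return phrase not in response


# A response that reproduces the phrase passes; one that alters it does not.
print(fails_to_repeat("abc", "You said: 'abc'"))
print(fails_to_repeat("abc", "You said: 'abd'"))
```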

Incomplete Token Analysis Method

1. Structural Analysis

  • UTF-8 Encoding Analysis: Based on the structure of start bytes and continuation bytes in UTF-8 multi-byte characters
  • Prefix/Suffix Classification:
    • Prefix tokens: End with orphaned bytes, requiring additional bytes to complete characters
    • Suffix tokens: Begin with orphaned bytes, providing bytes needed to complete characters
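
The prefix/suffix distinction can be illustrated on raw bytes. The sketch below relies only on the standard UTF-8 layout (start bytes announce the character length, continuation bytes fall in 0x80-0xBF); the helper names and the extra "both" label are illustrative choices, not the authors' code.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return 0x80 <= byte <= 0xBF


def char_length(start_byte: int) -> int:
    """Byte length announced by a UTF-8 start byte (0 if not a valid start byte)."""
    if start_byte < 0x80:
        return 1
    if 0xC2 <= start_byte <= 0xDF:
        return 2
    if 0xE0 <= start_byte <= 0xEF:
        return 3
    if 0xF0 <= start_byte <= 0xF4:
        return 4
    return 0


def leading_orphans(token: bytes) -> int:
    """Continuation bytes at the start of the token (bytes it provides)."""
    count = 0
    for b in token:
        if not is_continuation(b):
            break
        count += 1
    return count


def trailing_missing(token: bytes) -> int:
    """Continuation bytes still needed to finish the token's last character."""
    i = len(token) - 1
    while i >= 0 and is_continuation(token[i]):
        i -= 1
    if i < 0:
        return 0  # only continuation bytes, nothing left to complete
    return max(char_length(token[i]) - (len(token) - i), 0)


def classify(token: bytes) -> str:
    provides, needs = leading_orphans(token), trailing_missing(token)
    if provides and needs:
        return "both"
    if needs:
        return "prefix"
    if provides:
        return "suffix"
    return "complete"


# U+AC00 is encoded as EA B0 80; splitting it yields a prefix and a suffix.
print(classify(b"\xea\xb0"), classify(b"\x80"), classify(b"\xea\xb0\x80"))
```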

2. Bigram Construction Workflow

Step 1: Structural Analysis
- Identify start bytes and continuation bytes in tokens
- Determine bytes required or provided by tokens

Step 2: Compatibility Matching
- Find structurally complementary token pairs
- Ensure combinations form valid Unicode characters

Step 3: Feasibility Verification
- Execute decode-encode tests
- Verify generated strings tokenize as expected
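
The three steps above can be sketched as follows. This is a hedged illustration, not the authors' implementation: tokens are treated as raw byte strings, and `encode` is a hypothetical callable standing in for a real tokenizer that returns the byte content of each token.

```python
from typing import Callable, List


def is_continuation(byte: int) -> bool:
    return 0x80 <= byte <= 0xBF


def bytes_provided(token: bytes) -> int:
    """Step 1: continuation bytes the token supplies at its start."""
    count = 0
    for b in token:
        if not is_continuation(b):
            break
        count += 1
    return count


def bytes_needed(token: bytes) -> int:
    """Step 1: continuation bytes missing after the token's last start byte."""
    i = len(token) - 1
    while i >= 0 and is_continuation(token[i]):
        i -= 1
    if i < 0:
        return 0
    first = token[i]
    if first < 0x80:
        length = 1
    elif first < 0xE0:
        length = 2
    elif first < 0xF0:
        length = 3
    else:
        length = 4
    return max(length - (len(token) - i), 0)


def is_compatible(prefix: bytes, suffix: bytes) -> bool:
    """Step 2: the suffix supplies exactly what the prefix lacks, and the
    concatenation is valid UTF-8 (strict decoding rejects anything else)."""
    needed = bytes_needed(prefix)
    if needed == 0 or bytes_provided(suffix) != needed:
        return False
    try:
        (prefix + suffix).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def survives_round_trip(prefix: bytes, suffix: bytes,
                        encode: Callable[[str], List[bytes]]) -> bool:
    """Step 3: decode the pair to text, re-encode it, and check that the
    tokenizer yields the same two tokens rather than another segmentation."""
    text = (prefix + suffix).decode("utf-8")
    return encode(text) == [prefix, suffix]


# E4 B8 (start of a CJK character) plus 80 forms the valid character U+4E00.
print(is_compatible(b"\xe4\xb8", b"\x80"))
```

The round-trip check matters because a tokenizer may prefer a different segmentation of the same string, in which case the intended bigram never reaches the model.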

Characteristics of Improbable Bigrams

  1. Multilingual Nature: Combined characters originate from different Unicode script systems
  2. Out-of-Distribution Properties: Such cross-script combinations are extremely unlikely in training data
  3. Structural Dependency: Two tokens must cooperate to form valid characters
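
One rough way to check the multilingual property is to compare script labels of the characters in a decoded phrase. Python's standard library exposes no script property, so the sketch below uses the first word of each character's Unicode name as a proxy; this heuristic and the function names are assumptions made for illustration only.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """First word of the Unicode character name, e.g. 'HANGUL' or 'CYRILLIC'.

    This is only a proxy for the script property, which the standard
    library does not provide directly.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def spans_multiple_scripts(phrase: str) -> bool:
    """True when the alphabetic characters of the phrase carry more than one label."""
    labels = {rough_script(ch) for ch in phrase if ch.isalpha()}
    return len(labels) > 1


print(rough_script("\uac00"), rough_script("\u4e2d"), rough_script("\u044f"))
print(spans_multiple_scripts("\u4e2d\uac00"))
```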

Technical Innovations

  1. Systematic Vulnerability Discovery: First systematic identification of structural vulnerabilities in byte-level BPE
  2. Precise Attack Construction: Exact attack sample construction based on UTF-8 encoding rules
  3. Training Quality Independence: Demonstrates that even adequately trained tokens may exhibit vulnerabilities

Experimental Setup

Model Selection

Tested five instruction-tuned models using byte-level BPE:

  • Meta-Llama-3.1-8B-Instruct (vocabulary 128k, 1224 incomplete tokens)
  • EXAONE-3.0-7.8B-Instruct (vocabulary 102k, 1222 incomplete tokens)
  • Qwen2.5-32B-Instruct (vocabulary 151k, 1320 incomplete tokens)
  • Mistral-Nemo-Instruct-2407 (vocabulary 131k, 1307 incomplete tokens)
  • C4AI-Command-R-v01 (vocabulary 255k, 2956 incomplete tokens)

Evaluation Task Design

Used four prompt templates to test model ability to repeat target phrases:

Task Type | Prompt Template
Direct Repetition | "Repeat this phrase exactly: '{Phrase}'"
Definition Query | "What does '{Phrase}' mean?"
Knowledge Query | "Today I heard about '{Phrase}'. Do you know what this means?"
Code Scenario | Python code with username list output
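
As an illustration, the three templates whose wording is given can be filled in programmatically. The dictionary keys and the `build_prompts` helper are hypothetical names; the code-scenario template is omitted because its exact text is not reproduced in this summary.

```python
from typing import Dict

# Template wording copied from the table; keys are illustrative labels.
PROMPT_TEMPLATES: Dict[str, str] = {
    "direct_repetition": "Repeat this phrase exactly: '{phrase}'",
    "definition_query": "What does '{phrase}' mean?",
    "knowledge_query": "Today I heard about '{phrase}'. Do you know what this means?",
}


def build_prompts(phrase: str) -> Dict[str, str]:
    """Fill every template with the target phrase."""
    return {name: tpl.format(phrase=phrase) for name, tpl in PROMPT_TEMPLATES.items()}


for name, prompt in build_prompts("\u4e00\uac00").items():
    print(f"{name}: {prompt}")
```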

Token Selection Strategy

  1. Training Quality Filtering: Used embedding heuristic method from Land and Bartolo (2024) to exclude undertrained tokens
  2. Focus on Adequately Trained Tokens: Used only tokens ranked in the top 50% of the vocabulary by training quality
  3. Construct Improbable Bigrams: Constructed up to 100 improbable bigrams per model
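
The top-50% filter can be sketched as a simple ranking step. The scores here are hypothetical placeholders for whatever the embedding heuristic produces, and the sketch assumes that a higher score means a better-trained token; the actual indicator and its direction are defined by Land and Bartolo (2024), not by this example.

```python
from typing import Dict, List


def top_fraction(scores: Dict[int, float], fraction: float = 0.5) -> List[int]:
    """Return the token ids ranked in the top `fraction` by score.

    Assumption: higher score = better trained. Flip `reverse` if the
    indicator in use works the other way round.
    """
    ranked = sorted(scores, key=lambda token_id: scores[token_id], reverse=True)
    return ranked[: int(len(ranked) * fraction)]


# Hypothetical scores for four token ids.
example_scores = {101: 0.9, 102: 0.1, 103: 0.5, 104: 0.7}
print(top_fraction(example_scores))
```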

Baseline Comparison

Constructed control groups with complete tokens for each improbable bigram:

  • Selected alternatives with similar training levels but complete tokens
  • Ensured fairness of comparative experiments

Experimental Results

Main Results

Model | Improbable Bigram Hallucination Rate | Baseline Bigram Hallucination Rate
Llama 3.1 | 48/100 (48%) | 0/100 (0%)
Exaone | 77/100 (77%) | 20/100 (20%)
Qwen2.5 | 33/100 (33%) | 0/100 (0%)
Mistral-Nemo | 52/71 (73%) | 1/71 (1%)
Command-R | 49/100 (49%) | 8/100 (8%)

Key Finding: Improbable bigrams composed of incomplete tokens demonstrate significantly higher hallucination rates across all models.
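
The percentages in the table follow directly from the counts. The short sketch below recomputes them; the dictionary layout is an illustrative choice, and the counts are copied from the table.

```python
from typing import Dict, Tuple

# (hallucinated, total) per model, copied from the results table.
COUNTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Llama 3.1": {"improbable": (48, 100), "baseline": (0, 100)},
    "Exaone": {"improbable": (77, 100), "baseline": (20, 100)},
    "Qwen2.5": {"improbable": (33, 100), "baseline": (0, 100)},
    "Mistral-Nemo": {"improbable": (52, 71), "baseline": (1, 71)},
    "Command-R": {"improbable": (49, 100), "baseline": (8, 100)},
}


def percent(hits: int, total: int) -> int:
    """Hallucination rate as a whole-number percentage."""
    return round(100 * hits / total)


for model, groups in COUNTS.items():
    improbable = percent(*groups["improbable"])
    baseline = percent(*groups["baseline"])
    print(f"{model}: improbable {improbable}%, baseline {baseline}%")
```

Mistral-Nemo has only 71 bigrams, so its percentages are computed against 71 rather than 100.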

Alternative Tokenization Experiment Results

Model | Original Tokenization Hallucination Rate | Alternative Tokenization Hallucination Rate | Improvement
Llama 3.1 | 0.48 | 0.05 | ↓90%
Exaone | 0.77 | 0.50 | ↓35%
Qwen2.5 | 0.33 | 0.12 | ↓64%
Mistral-Nemo | 0.73 | 0.01 | ↓98%
Command-R | 0.49 | 0.55 | No improvement

Important Finding: For every model except Command-R, alternative tokenization substantially reduced hallucination rates, which supports the view that the problem originates from incomplete tokens. For Command-R the rate rose slightly, from 0.49 to 0.55.
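
The improvement figures are relative reductions, (original - alternative) / original. A minimal sketch recomputing them from the reported rates is given below; because those rates are already rounded to two decimals, the recomputed values can differ from the reported ones by about one percentage point.

```python
from typing import Dict, Tuple

# (original rate, alternative rate) as reported for each model.
RATES: Dict[str, Tuple[float, float]] = {
    "Llama 3.1": (0.48, 0.05),
    "Exaone": (0.77, 0.50),
    "Qwen2.5": (0.33, 0.12),
    "Mistral-Nemo": (0.73, 0.01),
    "Command-R": (0.49, 0.55),
}


def relative_reduction(original: float, alternative: float) -> float:
    """Fraction of the original rate removed; negative means the rate went up."""
    return (original - alternative) / original


for model, (original, alternative) in RATES.items():
    change = 100 * relative_reduction(original, alternative)
    print(f"{model}: {change:+.1f}% relative reduction")
```

A negative value, as for Command-R, indicates that the alternative tokenization produced more hallucinations than the original one.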

Language Distribution Analysis

  • Improbable bigrams span multiple language pair combinations
  • High-resource multi-byte scripts (Chinese, Korean, Russian) appear most frequently
  • Language pair distributions vary significantly across models (Exaone has 17 language pairs, Command-R only 3)

Related Work

Tokenizer Vulnerability Research

  1. Glitch Token Research: Land and Bartolo (2024) proposed embedding layer heuristic methods to identify undertrained tokens
  2. Adversarial Tokenization: Wang et al. (2024) constructed adversarial questions that induce incorrect tokenization
  3. Tokenization Fairness: Petrov et al. (2023) and Ovalle et al. (2024) studied unfairness and bias introduced by tokenizers

BPE Tokenizer Research

  1. Compression Effectiveness Questioning: Schmidt et al. (2024) challenged assumptions about BPE effectiveness deriving from compression
  2. Greedy Compression Issues: Bostrom and Durrett (2020) noted that greedy compression prioritizes frequency over linguistic significance
  3. Morphological Improvements: Limisiewicz et al. (2024) and Bauwens et al. (2024) proposed morphology-driven BPE improvements

Uniqueness of This Paper's Contribution

Unlike existing research, this paper:

  • Focuses on structural rather than training quality issues
  • Demonstrates that adequately trained tokens may still be vulnerable
  • Provides systematic attack construction methodology

Conclusions and Discussion

Main Conclusions

  1. Incomplete Tokens Have Systematic Vulnerabilities: Even when adequately trained, incomplete tokens readily cause hallucinations in specific combinations
  2. Problem Originates from Tokenization, Not Training: For four of the five models, an alternative tokenization of the same phrases sharply reduces hallucinations, indicating that the root cause lies in token structure
  3. Widespread Impact: This problem is prevalent across multiple mainstream models

Practical Risks

  1. Code and Data Processing: May compromise integrity of variable names or fixed values
  2. Adversarial Non-Reproducibility: Attackers can exploit non-repeatable phrases to evade LLM agent intervention
  3. Model Fingerprinting: Can be used to identify architecture behind anonymous LLM services

Mitigation Strategies

  1. Vocabulary Pruning: Remove incomplete tokens before model training
  2. Constrained BPE Merging: Respect character boundaries during tokenizer training
  3. Character-Level Tokenization: For models not requiring complete Unicode coverage, character-level tokenization is an option
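
The first two strategies can be sketched on a byte-string vocabulary. This is one possible reading, not the authors' procedure: it assumes single-byte tokens are kept so that every input stays encodable, and it interprets "respecting character boundaries" as allowing a merge only when the result is either valid UTF-8 or the leading fragment of a single character. All function names are hypothetical.

```python
from typing import List


def is_complete(token: bytes) -> bool:
    """A token is complete when it decodes as valid UTF-8 on its own."""
    try:
        token.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def prune_vocab(vocab: List[bytes]) -> List[bytes]:
    """Vocabulary pruning: drop multi-byte incomplete tokens.

    Assumption: single-byte tokens are retained as a fallback alphabet.
    """
    return [t for t in vocab if len(t) == 1 or is_complete(t)]


def is_single_char_fragment(data: bytes) -> bool:
    """True if data is a proper leading fragment of one multi-byte character."""
    if not data:
        return False
    first = data[0]
    if 0xC2 <= first <= 0xDF:
        length = 2
    elif 0xE0 <= first <= 0xEF:
        length = 3
    elif 0xF0 <= first <= 0xF4:
        length = 4
    else:
        return False
    return len(data) < length and all(0x80 <= b <= 0xBF for b in data[1:])


def merge_allowed(left: bytes, right: bytes) -> bool:
    """Constrained merging: reject merges that straddle a character boundary."""
    merged = left + right
    return is_complete(merged) or is_single_char_fragment(merged)


vocab = [b"a", b"\xe4\xb8", b"\xe4\xb8\xad", b"\x80"]
print(prune_vocab(vocab))
print(merge_allowed(b"\xe4\xb8", b"\xad"), merge_allowed(b"a", b"\xe4"))
```

Under this rule a merge such as `b"a" + b"\xe4"` is rejected because the result mixes a complete character with the start of another, which is the kind of token the paper flags as incomplete.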

Limitations

  1. Evaluation Scope: Limited to phrase-level hallucinations, without systematic evaluation of factual hallucinations
  2. Language Expertise: Test phrases span multiple languages, exceeding authors' expertise
  3. Model Specificity: Anomalous results for Command-R model require further investigation

Future Directions

  1. Safer Tokenizer Design: Develop tokenization methods avoiding incomplete tokens
  2. Robustness Evaluation: Establish comprehensive tokenization vulnerability assessment frameworks
  3. Defense Mechanism Research: Explore runtime detection and mitigation strategies

In-Depth Evaluation

Strengths

  1. Originality of Problem Identification: First systematic identification of structural vulnerabilities in byte-level BPE
  2. Methodological Rigor: Precise attack construction based on UTF-8 encoding rules with well-designed experiments
  3. Experimental Comprehensiveness: Verification across multiple models and languages with convincing results
  4. Practical Value: Provides concrete mitigation strategies and security recommendations

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why incomplete tokens are more vulnerable
  2. Unexplained Command-R Anomaly: Analysis of anomalous results for this model lacks depth
  3. Limited Evaluation Metrics: Uses only repetition tasks, potentially not fully reflecting actual harm
  4. Unknown Long-Term Impact: Does not evaluate vulnerability's effects on other model capabilities

Impact

  1. Academic Contribution: Opens new directions for tokenizer security research
  2. Practical Value: Provides important security considerations for model developers
  3. Reproducibility: Clear methodology description enables experiment reproduction
  4. Policy Implications: May influence future tokenizer design standards

Applicable Scenarios

  1. Model Security Assessment: Evaluate existing models' tokenization vulnerabilities
  2. Tokenizer Design: Guide development of safer tokenizers
  3. Adversarial Testing: Part of model robustness testing
  4. Security Audits: Pre-deployment security checks for LLMs

References

Key References:

  • Land, S. & Bartolo, M. (2024). Fishing for magikarp: Automatically detecting under-trained tokens in large language models.
  • Bostrom, K. & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining.
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units.
  • Limisiewicz, T. et al. (2024). MYTE: Morphology-driven byte encoding for better and fairer multilingual language modeling.

Overall Assessment: This is a high-quality research paper identifying important security vulnerabilities in byte-level BPE tokenizers. Despite some limitations, its originality, experimental rigor, and practical value make it a significant contribution to tokenizer security research. This work is important for enhancing the safety and robustness of large language models.