2025-11-15T06:04:11.942321

Smallest Suffixient Sets as a Repetitiveness Measure

Navarro, Romana, Urbina

A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $Ï$ of the smallest suffixient set as a repetitiveness measure: we place it between known measures and study its sensitivity to various string operations. As a corollary of our results, we give a simple online algorithm to compute smallest suffixient sets.

academic

Smallest Suffixient Sets as a Repetitiveness Measure

Basic Information

Paper ID: 2506.05638
Title: Smallest Suffixient Sets as a Repetitiveness Measure
Authors: Gonzalo Navarro (University of Chile), Giuseppe Romana (University of Palermo), Cristian Urbina (University of Chile)
Classification: cs.FL (Formal Languages), cs.DS (Data Structures), math.CO (Combinatorics)
Publication Date: October 29, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2506.05638

Abstract

This paper investigates suffixient sets as a novel combinatorial object for measuring string repetitiveness. Suffixient sets capture the essential information of repetitive strings and support various pattern matching mechanisms while providing random access. The paper deeply studies the size χ of the minimum suffixient set as a repetitiveness measure: positioning it among known measures, investigating its sensitivity to various string operations. As a byproduct of the research, the paper provides a simple online algorithm for computing the minimum suffixient set.

Research Background and Motivation

1. Problem Statement

With the emergence of large-scale collections of similar strings (such as human genome data), efficiently representing and querying highly repetitive strings has become a critical challenge. For example, the European "1+ Million Genomes" initiative aims to sequence over one million human genomes, requiring approximately 750TB of raw storage space. However, by exploiting the high similarity between genomes, storage space can be compressed by two orders of magnitude.

2. Problem Importance

Understanding how to measure repetitiveness is crucial for:

Designing compressed representations while maintaining access and search functionality
Evaluating the space efficiency of different compression schemes
Understanding relationships between different measures, thereby clarifying what search functionality can be obtained at what space cost

3. Limitations of Existing Approaches

Multiple repetitiveness measures have been proposed, ranging from abstract lower bounds to measures related to specific text compressors. For the recently proposed suffixient set size χ, existing research only knows:

γ = O(χ) (γ is the size of the minimum string attractor)
χ = O(r̄) (r̄ is the number of equal-letter runs in the reverse BWT)

However, deeper understanding of χ as a repetitiveness measure remains lacking, particularly:

Precise relationships between χ and other measures
Sensitivity of χ to string operations
Whether χ is reachable

4. Research Motivation

This paper aims to better characterize χ as a repetitiveness measure, particularly focusing on:

Positioning χ within the known measure hierarchy
Studying the impact of string update operations on χ
Exploring comparability between χ and copy-paste class measures
Providing efficient algorithms for computing χ

Core Contributions

Establishing direct relationship between χ and BWT run count r: Proves χ = O(r) (rather than the previous χ = O(r̄)), and proves χ = o(v) (v is the minimum dictionary parsing size) on certain string families, thereby establishing that χ is strictly smaller than r
Sensitivity analysis for string operations:
- Proves χ grows by O(1) when appending/prepending single characters
- Proves arbitrary edit operations or rotations can cause χ to grow by Ω(log n)
- Proves string reversal can cause χ to grow by Ω(√n)
Complete characterization of Fibonacci strings: Completely characterizes the unique 2 minimum suffixient sets of size 4 for Fibonacci strings, and proves all substrings of episturmian words satisfy χ ≤ σ + 2
Incomparability with copy-paste measures: Proves χ is incomparable with most copy-paste class measures (z, z_no, z_e, z_end, v, g, g_rl, c)—there exist string families where χ is strictly smaller, and also families where χ is strictly larger
Simple online algorithm: Proposes an extremely simple online algorithm that, by modifying Ukkonen's suffix tree construction algorithm, computes the minimum suffixient set in O(n) space and O(n) worst-case time

Methodology Details

Task Definition

Core Definitions:

Right-maximal substring: A substring x is right-maximal if there exist at least two distinct symbols a, b ∈ Σ such that both xa and xb are substrings of w
Right-extension: For each right-maximal substring x, its right-extensions are substrings of the form xa, denoted E_r(w)
Super-maximal extension: A right-extension that is not a suffix of any other right-extension, denoted S_r(w), with size denoted sre(w)
Suffixient set: A set S ⊆ 1..n is a suffixient set of w if for each right-extension x ∈ E_r(w), there exists j ∈ S such that x is a suffix of w1..j
Minimum suffixient set: A suffixient set S is minimal if there exists a bijection pos: S_r(w) → S such that each x ∈ S_r(w) is a suffix of w1..pos(x)
Measure χ: For w ∈ Σ* and $∉ F_w, define χ(w) = |S|, where S is the minimum suffixient set of w$

Theoretical Analysis Framework

1. Impact of Appending Characters (Core Lemma)

Lemma 2: For w ∈ Σ* and c ∈ Σ:

sre(w) ≤ sre(wc) ≤ sre(w) + 2

Proof Sketch: Analyze possible new right-extensions after appending c:

Case 1: xc already appears in w, or xa doesn't appear in w for any a≠c → no new right-extensions
Case 2: There exist a≠b such that both xa and xb appear in w, but xc doesn't → xc is a new right-extension
Case 3: x is always followed by a≠c in w (thus xa is not a right-extension) → both xa and xc are new right-extensions

Key observations:

Cases 1 and 2 together produce at most 1 new super-maximal right-extension
In Case 3, for fixed a, all new right-extensions x₁a, x₂a, ..., x_ta form a suffix chain, with only x_ta possibly being super-maximal

Therefore, at most 2 super-maximal right-extensions are added.

2. Relationship with BWT Run Count r

Lemma 9: χ ≤ 2r

Proof Sketch:

Let x_i be the i-th lexicographically smallest rotation of w$
Let u_i be the longest common prefix of x_i and x_{i+1}
Define s(i) as the shift needed to cyclically right-shift x_i to obtain w$

Construct suffixient set:

S = ∪_{i∈[1..|w|]} {s(i) + |u_i| + 1, s(i+1) + |u_i| + 1}

Using BWT matrix properties: if BWTi = BWTi+1 = c, then there exists j where corresponding positions coincide. Therefore:

S = {s(i) + |u_i| + 1, s(i+1) + |u_i| + 1 | BWT[i] ≠ BWT[i+1]}

|S| ≤ 2r (r is the number of equal-letter runs in BWT).

3. Sensitivity Analysis

De Bruijn Sequences (for lower bounds):

k-order binary de Bruijn sequence contains all length-k binary strings exactly once
Length n = 2^k + (k-1)
Lemma 5: sre(w) = 2^k = Ω(n) for any w ∈ dB(k)

Ω(log n) lower bound for edit operations (Lemma 6): Using the lexicographically smallest de Bruijn sequence w = a^k b a^{k-2} b x a b^k a^{k-1}:

Insertion: sre(w) - sre(w') = 2^k - 2
Substitution: sre(w) - sre(w') = 2^k - 3
Deletion: sre(w) - sre(w') = 2^k - 4
Rotation: sre(w) - sre(w') = 2^k - 2

Ω(√n) lower bound for reversal (Lemma 7): Construct w_k = ∏_^{k-1} c a^i b a^{k-i-1} #_i a^i b a^{k-i-1} $_i:

sre(w_k) - sre(w_k^R) = k - 1
|w_k| = Θ(k²)
Therefore growth is Ω(√n)

4. Properties of Episturmian Words

Lemma 10: For an episturmian word w over alphabet of size σ, any substring wi..j satisfies:

sre(w[i..j]) ≤ σ

Proof Sketch:

Each episturmian word has at most one right-maximal substring of each length
Right-extensions ending in a ∈ Σ form a suffix chain
There are σ such suffix chains
Each chain contributes at most 1 super-maximal right-extension in the substring

Corollary 3: For any episturmian word w, χ(wi..j) ≤ σ + 2

Exact characterization of Fibonacci strings (Lemma 11):

F_1 = b, F_2 = a, F_k = F_F_

For k ≥ 6, the unique minimum suffixient sets of F_k$ are:

{f_{k+1}, f_{k-1}, f_{k-1}-1, p}, where p ∈ {f_{k-2}+1, 2f_{k-2}+1}

Online Algorithm Design

Core Idea

Modify Ukkonen's suffix tree online construction algorithm to simultaneously maintain the minimum suffixient set while processing each character.

Algorithm Key Points

Suffix Tree Enhancement: Add markers on suffix tree nodes indicating positions of super-maximal right-extensions.

Handling appended character c with three cases:

s has child labeled c (Case 1):
- Only descend to new s
- No marker updates needed
s has ≥2 children, no child labeled c (Case 2):
- Create new leaf child of s (labeled c)
- Mark s's child c
- Unmark s''s child c (if marked)
s has only one child (labeled a≠c) (Case 3):
- Mark both children of s (a and c)
- Unmark s''s child c (if marked)
- Unmark s'''s child a (if marked), where s''' is the first node in the suffix chain with another child b≠a

Complexity:

Space: O(n)
Time: O(n) (on transdichotomous RAM with polynomial-size alphabet)

Theorem 1: There exists an algorithm that processes text T from left to right, determining the minimum suffixient set of T1..n after processing each prefix, using O(n) space and O(n) worst-case time.

Experimental Setup

Note: This is a theoretical paper whose main contributions are theoretical results and algorithms, without traditional experimental evaluation sections. The paper's "experiments" are conducted through mathematical proofs and constructive examples to verify theoretical results.

Theoretical Verification Methods

Constructive proofs: Prove tightness of bounds by constructing specific string families (such as de Bruijn sequences, Fibonacci strings)
Counterexample construction: Demonstrate correctness of theoretical results through concrete examples (such as Example 1 with w_3)
Code verification: Authors acknowledge using code from Cenzato et al. to compute minimum suffixient sets for proposing and verifying hypotheses

Experimental Results

Main Theoretical Results

1. Relationship between χ and Known Measures

Upper bound relationships:

χ ≤ 2r (Lemma 9)
χ = O(r)

Lower bound relationships:

γ = O(χ) (known result 4)
δ ≤ χ (known result 4)

Separation results:

String families exist where χ = o(v) (Corollary 4, Fibonacci strings)
Since v = O(r), this means χ is strictly smaller than r

2. Summary of Sensitivity Results

Operation	Additive Sensitivity	Multiplicative Sensitivity
Append character	O(1)	May be non-monotonic
Prepend character	O(1)	-
Insertion	Ω(log n)	O(max(1, log(n/δ log δ)) log δ)
Deletion	Ω(log n)	O(max(1, log(n/δ log δ)) log δ)
Substitution	Ω(log n)	O(max(1, log(n/δ log δ)) log δ)
Rotation	Ω(log n)	O(max(1, log(n/δ log δ)) log δ)
Reversal	Ω(√n)	O(max(1, log(n/δ log δ)) log δ)

3. Exact Values for Specific String Families

Episturmian words:

χ(wi..j) ≤ σ + 2 (Corollary 3)

Fibonacci strings (k ≥ 6):

χ(F_k) = 4
Exact characterization of minimum suffixient sets (Lemma 11)

De Bruijn sequences:

sre(w) = 2^k = Ω(n) (Lemma 5)
χ = Ω(n)

4. Comparison with Copy-Paste Measures

Cases where χ is strictly smaller (Fibonacci strings):

χ = O(1)
c = Ω(log n)
Therefore χ = o(µ), for all µ ∈ {z, z_no, z_e, z_end, v, g, g_rl, c}

Cases where χ is strictly larger (de Bruijn sequences):

χ = Ω(n)
g = O(n/log n)
Therefore χ = Ω(g log n) (Corollary 5)
χ = Ω(z_e · log n log log log n / (log log n)²) (Lemma 12)

Conclusion (Corollary 6): χ is incomparable with µ ∈ {z, z_no, z_e, z_end, v, g, g_rl, c}

Case Analysis

Example 1 (Concrete example for Lemma 7):

String w_3 = cbaa#₀baa $₀caba#₁aba$ ₁caab#₂aab$₂

Super-maximal right-extensions:

baa and c
baa#₀ and baa $₀; aba#₁ and aba$ ₁; aab#₂ and aab$₂
ca and cb; caa and cab
aba and aab

Total: sre(w_3) = 14

Reversed string w_3^R = $₂baa#₂baac$ ₁aba#₁abac$₀aab#₀aabc

Super-maximal right-extensions:

baa and $₂
baa#₂ and baac; aba#₁ and abac; aab#₀ and aabc
ac $₀; ac$ ₁
aba and aab

Total: sre(w_3^R) = 12

Verification: sre(w_3) - sre(w_3^R) = 2 = k - 1 ✓

1. Repetitiveness Measure Research

Abstract lower bound measures:

δ: Substring complexity measure, maxₖ(|F_w(k)|/k)
γ: Minimum string attractor size 18
- Known: γ = O(χ)
- Whether γ is reachable remains open

Copy-paste class measures:

z: Lempel-Ziv parsing size 20
z_no: LZ parsing without overlapping phrase sources
z_e: Greedy LZ-End parsing size
z_end: Minimum LZ-End parsing size
v: Minimum dictionary parsing size 28
g: Minimum context-free grammar size
g_rl: Grammar size allowing run-length rules
c: Minimum tiling system size
b: Minimum bidirectional macro scheme size 32

BWT-related measures:

r: BWT equal-letter run count 3
- Only known reachable and searchable measure without known access capability
- This paper proves χ < r

2. Sensitivity Research

Previous work has examined sensitivity of various measures to string operations:

Akagi et al. 1: Sensitivity of b, z, g to edit operations
Giuliani et al. 14,15: Sensitivity of r to reversals and bit changes
Fici et al. 9,10: Sensitivity of BWT run count to morphisms
Navarro et al. 29,30: Morphism-based repetitiveness measures

3. Suffixient Sets

Original work 4,6:

Define suffixient sets and related concepts
Prove γ = O(χ) and χ = O(r̄)
Show suffixient sets support pattern matching

Algorithmic work 5:

Efficient algorithms for computing minimum suffixient sets
Starting from suffix array and suffix tree components
Non-online algorithms

This paper's contributions:

Deeper theoretical characterization
Simpler online algorithm
Direct construction from text while building suffix tree

4. Episturmian Words and Fibonacci Strings

Combinatorial background 8,16,21:

Episturmian words: at most one right-maximal substring per length
Research on right-special factors (i.e., right-maximal substrings)
Fibonacci strings as special cases of epistandard words

This paper's contributions:

Connect suffixient sets with combinatorial word theory
Complete characterization of minimum suffixient sets for Fibonacci strings
Prove χ upper bounds for episturmian word substrings

Conclusions and Discussion

Main Conclusions

Measure positioning: χ is established as a measure strictly smaller than r, satisfying:
- γ = O(χ) = O(r)
- String families exist where χ = o(r)
Sensitivity characteristics:
- O(1) additive sensitivity to appending/prepending characters (ideal property)
- Ω(log n) additive sensitivity to arbitrary edit operations
- Ω(√n) additive sensitivity to reversal
Incomparability with copy-paste measures: χ is neither always larger nor always smaller, depending on string families
Efficient online computation: Can be computed online in O(n) time and space

Limitations

Reachability unknown:
- Whether χ is reachable (i.e., can represent strings in O(χ) space) remains open
- Relationship with minimum known reachable measure b is unknown
Access mechanism dependency:
- Suffixient set support for searching requires additional random access mechanism
- Unlike r, cannot directly support searching
Tightness of theoretical bounds:
- χ = O(r) constant factor of 2 may not be tight
- Exact multiplicative sensitivity bounds remain unclear
Practical performance:
- Paper focuses mainly on theoretical properties
- Performance on real data requires further experimental verification

Future Directions

Open problems explicitly proposed in the paper:

Reachability problem:
- Proving b = O(χ) would establish χ reachability
- Or proving χ unreachable, which would simultaneously prove γ unreachable
Relationship with δ:
- Is χ = O(δ log n)?
- Is the Θ(log n) factor on de Bruijn sequences tight?
Multiplicative sensitivity:
- Is sre(w')/sre(w) = O(1) for all considered string operations?
- Do constant bounds exist for insertion operations?
Precise relationship with r:
- Is r = O(χ log χ)?
- If true and χ has O(1) multiplicative sensitivity to string operations, would tighten known r bounds
Other measure relationships:
- Relationship between χ and b (key reachability problem)
- Relationship with other newly proposed measures

In-Depth Evaluation

Strengths

1. Solid Theoretical Contributions

Complete measure characterization: Through upper and lower bound analysis and separation results, precisely positions χ in the measure hierarchy
Tight bounds: Constructive proofs (e.g., de Bruijn sequences, Fibonacci strings) demonstrate bound tightness
Multi-faceted analysis: Comprehensively studies χ from multiple perspectives including sensitivity, special string families, and relationships with other measures

2. Technical Innovation

Direct relationship establishment: Proves χ = O(r) rather than previous χ = O(r̄), a more natural characterization
Fine-grained analysis: Complete characterization of Fibonacci string minimum suffixient sets (Lemma 11) demonstrates deep combinatorial analysis capability
Elegant algorithm: Simplifies complex suffixient set computation to elegant modification of Ukkonen's algorithm

3. Clear Writing

Rigorous definitions: All concepts have precise mathematical definitions
Detailed proofs: Proof sketches of key lemmas are clear and easy to verify
Rich examples: Concrete examples (e.g., Example 1) aid understanding of abstract concepts
Intuitive figures: Figure 1 clearly displays relationships between measures

4. Practical Value

Online algorithm: O(n) time and space online algorithm has practical application value
Theoretical guidance: Deep understanding of χ aids design of compression and indexing structures based on suffixient sets
Measure selection: Provides theoretical basis for selecting appropriate repetitiveness measures in practical applications

Weaknesses

1. Missing Experimental Verification

No real data testing: Paper is entirely theoretical without experiments on real data (e.g., genome data)
Algorithm performance unknown: While O(n) algorithm is provided, actual runtime, space constants are unknown
No comparison with existing tools: Lacks detailed performance comparison with Cenzato et al.'s implementation 5

2. Unresolved Reachability Problem

Key problem remains open: Whether χ is reachable remains unresolved
Unknown relationship with b: Cannot determine relationship between χ and minimum known reachable measure
Limited practical utility: If χ is unreachable, its practical value as a measure is limited

3. Potentially Non-Tight Bounds

Constant factors: χ ≤ 2r constant of 2 may not be optimal
Sensitivity upper bounds: Lemma 8's upper bounds are relatively complex, possibly non-tight
Multiplicative sensitivity: Exact multiplicative sensitivity bounds not provided

4. Limited Analysis Scope

Special string families: Analysis mainly concentrates on de Bruijn sequences, Fibonacci strings and other special cases
General results: Limited understanding of properties for general string families
Average case: Focuses on worst case, average case analysis lacking

Impact

1. Contribution to Field

Theory completion: Fills gaps in suffixient set theoretical research
Measure system: Enriches theoretical framework of repetitiveness measures
Open problems: Proposed problems will guide future research directions

2. Potential Applications

Compression algorithms: Provides theoretical foundation for designing new compression algorithms
Index structures: Suffixient sets can be used to construct space-efficient indexes
Bioinformatics: Potential applications in genome data compression and querying

3. Methodological Contributions

Online construction paradigm: Demonstrates how to adapt classical suffix tree algorithms to new problems
Sensitivity analysis framework: Provides methodological reference for analyzing sensitivity of other measures

4. Limitations

Good reproducibility: Theoretical results easy to verify, algorithm descriptions clear
But implementation details: Some implementation details (e.g., specific marker maintenance) need further clarification
Model assumptions: Some results depend on transdichotomous RAM model

Applicable Scenarios

1. Ideal Application Scenarios

Highly repetitive data: Such as genome sequences, version control systems, log files
Pattern matching needed: Suffixient sets naturally support certain pattern matching queries
Online processing: Data arrives in streams, requiring incremental index updates

2. Inapplicable Scenarios

Random data: χ on low-repetitiveness data may approach n, losing advantage
Exact space guarantees needed: χ reachability unknown, cannot guarantee O(χ) space implementation
Complex queries: Limited query types supported by suffixient sets

3. Comparison with Other Methods

vs. BWT (r): χ smaller but requires additional access mechanism
vs. LZ (z): χ smaller in some cases (Fibonacci), larger in others (de Bruijn)
vs. Grammar (g): Similarly incomparable

References

Key Citations

Original suffixient sets papers:
- 6 Depuydt et al., "Suffixient sets", 2023
- 4 Cenzato et al., "Suffixient arrays", 2025
Computational algorithms:
- 5 Cenzato et al., "On computing the smallest suffixient set", SPIRE 2024
- 33 Ukkonen, "On-line construction of suffix trees", 1995
Repetitiveness measure surveys:
- 25,26 Navarro, "Indexing highly repetitive string collections", ACM Computing Surveys 2021
- 27 Navarro, "Indexing highly repetitive string collections", arXiv 2022
Related measures:
- 18 Kempa & Prezza, "String attractors", STOC 2018
- 3 Burrows & Wheeler, "BWT", 1994
- 20 Lempel & Ziv, "LZ complexity", 1976
- 28 Navarro et al., "Ordered parsings", IEEE TIT 2021
Sensitivity research:
- 1 Akagi et al., "Sensitivity of string compressors", Information and Computation 2023
- 15 Giuliani et al., "Bit catastrophes for BWT", Theory of Computing Systems 2025

Overall Assessment: This is a high-quality theoretical paper providing deep and comprehensive analysis of suffixient sets as a repetitiveness measure. Main contributions include establishing direct relationship between χ and r, sensitivity analysis, exact characterization of special string families, and elegant online algorithm. The paper's theoretical analysis is rigorous, proofs detailed, and writing clear. Main weaknesses are lack of experimental verification and unresolved reachability problem. The paper establishes solid theoretical foundation for suffixient set research, with proposed open problems guiding future directions. Recommended future work includes performance evaluation on real data and resolution of the reachability problem.