2025-11-15T06:04:11.942321

Smallest Suffixient Sets as a Repetitiveness Measure

Navarro, Romana, Urbina
A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $χ$ of the smallest suffixient set as a repetitiveness measure: we place it between known measures and study its sensitivity to various string operations. As a corollary of our results, we give a simple online algorithm to compute smallest suffixient sets.
academic

Smallest Suffixient Sets as a Repetitiveness Measure

Basic Information

  • Paper ID: 2506.05638
  • Title: Smallest Suffixient Sets as a Repetitiveness Measure
  • Authors: Gonzalo Navarro (University of Chile), Giuseppe Romana (University of Palermo), Cristian Urbina (University of Chile)
  • Classification: cs.FL (Formal Languages), cs.DS (Data Structures), math.CO (Combinatorics)
  • Publication Date: October 29, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2506.05638

Abstract

This paper investigates suffixient sets as a novel combinatorial object for measuring string repetitiveness. Suffixient sets capture the essential information of repetitive strings and support various pattern matching mechanisms while providing random access. The paper deeply studies the size χ of the minimum suffixient set as a repetitiveness measure: positioning it among known measures, investigating its sensitivity to various string operations. As a byproduct of the research, the paper provides a simple online algorithm for computing the minimum suffixient set.

Research Background and Motivation

1. Problem Statement

With the emergence of large-scale collections of similar strings (such as human genome data), efficiently representing and querying highly repetitive strings has become a critical challenge. For example, the European "1+ Million Genomes" initiative aims to sequence over one million human genomes, requiring approximately 750TB of raw storage space. However, by exploiting the high similarity between genomes, storage space can be compressed by two orders of magnitude.

2. Problem Importance

Understanding how to measure repetitiveness is crucial for:

  • Designing compressed representations while maintaining access and search functionality
  • Evaluating the space efficiency of different compression schemes
  • Understanding relationships between different measures, thereby clarifying what search functionality can be obtained at what space cost

3. Limitations of Existing Approaches

Multiple repetitiveness measures have been proposed, ranging from abstract lower bounds to measures related to specific text compressors. For the recently proposed suffixient set size χ, existing research only knows:

  • γ = O(χ) (γ is the size of the minimum string attractor)
  • χ = O(r̄) (r̄ is the number of equal-letter runs in the reverse BWT)

However, deeper understanding of χ as a repetitiveness measure remains lacking, particularly:

  • Precise relationships between χ and other measures
  • Sensitivity of χ to string operations
  • Whether χ is reachable

4. Research Motivation

This paper aims to better characterize χ as a repetitiveness measure, particularly focusing on:

  • Positioning χ within the known measure hierarchy
  • Studying the impact of string update operations on χ
  • Exploring comparability between χ and copy-paste class measures
  • Providing efficient algorithms for computing χ

Core Contributions

  1. Establishing direct relationship between χ and BWT run count r: Proves χ = O(r) (rather than the previous χ = O(r̄)), and proves χ = o(v) (v is the minimum dictionary parsing size) on certain string families, thereby establishing that χ is strictly smaller than r
  2. Sensitivity analysis for string operations:
    • Proves χ grows by O(1) when appending/prepending single characters
    • Proves arbitrary edit operations or rotations can cause χ to grow by Ω(log n)
    • Proves string reversal can cause χ to grow by Ω(√n)
  3. Complete characterization of Fibonacci strings: Completely characterizes the unique 2 minimum suffixient sets of size 4 for Fibonacci strings, and proves all substrings of episturmian words satisfy χ ≤ σ + 2
  4. Incomparability with copy-paste measures: Proves χ is incomparable with most copy-paste class measures (z, z_no, z_e, z_end, v, g, g_rl, c)—there exist string families where χ is strictly smaller, and also families where χ is strictly larger
  5. Simple online algorithm: Proposes an extremely simple online algorithm that, by modifying Ukkonen's suffix tree construction algorithm, computes the minimum suffixient set in O(n) space and O(n) worst-case time

Methodology Details

Task Definition

Core Definitions:

  1. Right-maximal substring: A substring x is right-maximal if there exist at least two distinct symbols a, b ∈ Σ such that both xa and xb are substrings of w
  2. Right-extension: For each right-maximal substring x, its right-extensions are substrings of the form xa, denoted E_r(w)
  3. Super-maximal extension: A right-extension that is not a suffix of any other right-extension, denoted S_r(w), with size denoted sre(w)
  4. Suffixient set: A set S ⊆ 1..n is a suffixient set of w if for each right-extension x ∈ E_r(w), there exists j ∈ S such that x is a suffix of w1..j
  5. Minimum suffixient set: A suffixient set S is minimal if there exists a bijection pos: S_r(w) → S such that each x ∈ S_r(w) is a suffix of w1..pos(x)
  6. Measure χ: For w ∈ Σ* and Fw,defineχ(w)=S,whereSistheminimumsuffixientsetofw ∉ F_w, define χ(w) = |S|, where S is the minimum suffixient set of w

Theoretical Analysis Framework

1. Impact of Appending Characters (Core Lemma)

Lemma 2: For w ∈ Σ* and c ∈ Σ:

sre(w) ≤ sre(wc) ≤ sre(w) + 2

Proof Sketch: Analyze possible new right-extensions after appending c:

  • Case 1: xc already appears in w, or xa doesn't appear in w for any a≠c → no new right-extensions
  • Case 2: There exist a≠b such that both xa and xb appear in w, but xc doesn't → xc is a new right-extension
  • Case 3: x is always followed by a≠c in w (thus xa is not a right-extension) → both xa and xc are new right-extensions

Key observations:

  • Cases 1 and 2 together produce at most 1 new super-maximal right-extension
  • In Case 3, for fixed a, all new right-extensions x₁a, x₂a, ..., x_ta form a suffix chain, with only x_ta possibly being super-maximal

Therefore, at most 2 super-maximal right-extensions are added.

2. Relationship with BWT Run Count r

Lemma 9: χ ≤ 2r

Proof Sketch:

  • Let x_i be the i-th lexicographically smallest rotation of w$
  • Let u_i be the longest common prefix of x_i and x_{i+1}
  • Define s(i) as the shift needed to cyclically right-shift x_i to obtain w$

Construct suffixient set:

S = ∪_{i∈[1..|w|]} {s(i) + |u_i| + 1, s(i+1) + |u_i| + 1}

Using BWT matrix properties: if BWTi = BWTi+1 = c, then there exists j where corresponding positions coincide. Therefore:

S = {s(i) + |u_i| + 1, s(i+1) + |u_i| + 1 | BWT[i] ≠ BWT[i+1]}

|S| ≤ 2r (r is the number of equal-letter runs in BWT).

3. Sensitivity Analysis

De Bruijn Sequences (for lower bounds):

  • k-order binary de Bruijn sequence contains all length-k binary strings exactly once
  • Length n = 2^k + (k-1)
  • Lemma 5: sre(w) = 2^k = Ω(n) for any w ∈ dB(k)

Ω(log n) lower bound for edit operations (Lemma 6): Using the lexicographically smallest de Bruijn sequence w = a^k b a^{k-2} b x a b^k a^{k-1}:

  • Insertion: sre(w) - sre(w') = 2^k - 2
  • Substitution: sre(w) - sre(w') = 2^k - 3
  • Deletion: sre(w) - sre(w') = 2^k - 4
  • Rotation: sre(w) - sre(w') = 2^k - 2

Ω(√n) lower bound for reversal (Lemma 7): Construct w_k = ∏_^{k-1} c a^i b a^{k-i-1} #_i a^i b a^{k-i-1} $_i:

  • sre(w_k) - sre(w_k^R) = k - 1
  • |w_k| = Θ(k²)
  • Therefore growth is Ω(√n)

4. Properties of Episturmian Words

Lemma 10: For an episturmian word w over alphabet of size σ, any substring wi..j satisfies:

sre(w[i..j]) ≤ σ

Proof Sketch:

  • Each episturmian word has at most one right-maximal substring of each length
  • Right-extensions ending in a ∈ Σ form a suffix chain
  • There are σ such suffix chains
  • Each chain contributes at most 1 super-maximal right-extension in the substring

Corollary 3: For any episturmian word w, χ(wi..j) ≤ σ + 2

Exact characterization of Fibonacci strings (Lemma 11):

  • F_1 = b, F_2 = a, F_k = F_F_
  • For k ≥ 6, the unique minimum suffixient sets of F_k$ are:
    {f_{k+1}, f_{k-1}, f_{k-1}-1, p}, where p ∈ {f_{k-2}+1, 2f_{k-2}+1}
    

Online Algorithm Design

Core Idea

Modify Ukkonen's suffix tree online construction algorithm to simultaneously maintain the minimum suffixient set while processing each character.

Algorithm Key Points

Suffix Tree Enhancement: Add markers on suffix tree nodes indicating positions of super-maximal right-extensions.

Handling appended character c with three cases:

  1. s has child labeled c (Case 1):
    • Only descend to new s
    • No marker updates needed
  2. s has ≥2 children, no child labeled c (Case 2):
    • Create new leaf child of s (labeled c)
    • Mark s's child c
    • Unmark s''s child c (if marked)
  3. s has only one child (labeled a≠c) (Case 3):
    • Mark both children of s (a and c)
    • Unmark s''s child c (if marked)
    • Unmark s'''s child a (if marked), where s''' is the first node in the suffix chain with another child b≠a

Complexity:

  • Space: O(n)
  • Time: O(n) (on transdichotomous RAM with polynomial-size alphabet)

Theorem 1: There exists an algorithm that processes text T from left to right, determining the minimum suffixient set of T1..n after processing each prefix, using O(n) space and O(n) worst-case time.

Experimental Setup

Note: This is a theoretical paper whose main contributions are theoretical results and algorithms, without traditional experimental evaluation sections. The paper's "experiments" are conducted through mathematical proofs and constructive examples to verify theoretical results.

Theoretical Verification Methods

  1. Constructive proofs: Prove tightness of bounds by constructing specific string families (such as de Bruijn sequences, Fibonacci strings)
  2. Counterexample construction: Demonstrate correctness of theoretical results through concrete examples (such as Example 1 with w_3)
  3. Code verification: Authors acknowledge using code from Cenzato et al. to compute minimum suffixient sets for proposing and verifying hypotheses

Experimental Results

Main Theoretical Results

1. Relationship between χ and Known Measures

Upper bound relationships:

  • χ ≤ 2r (Lemma 9)
  • χ = O(r)

Lower bound relationships:

  • γ = O(χ) (known result 4)
  • δ ≤ χ (known result 4)

Separation results:

  • String families exist where χ = o(v) (Corollary 4, Fibonacci strings)
  • Since v = O(r), this means χ is strictly smaller than r

2. Summary of Sensitivity Results

OperationAdditive SensitivityMultiplicative Sensitivity
Append characterO(1)May be non-monotonic
Prepend characterO(1)-
InsertionΩ(log n)O(max(1, log(n/δ log δ)) log δ)
DeletionΩ(log n)O(max(1, log(n/δ log δ)) log δ)
SubstitutionΩ(log n)O(max(1, log(n/δ log δ)) log δ)
RotationΩ(log n)O(max(1, log(n/δ log δ)) log δ)
ReversalΩ(√n)O(max(1, log(n/δ log δ)) log δ)

3. Exact Values for Specific String Families

Episturmian words:

  • χ(wi..j) ≤ σ + 2 (Corollary 3)

Fibonacci strings (k ≥ 6):

  • χ(F_k) = 4
  • Exact characterization of minimum suffixient sets (Lemma 11)

De Bruijn sequences:

  • sre(w) = 2^k = Ω(n) (Lemma 5)
  • χ = Ω(n)

4. Comparison with Copy-Paste Measures

Cases where χ is strictly smaller (Fibonacci strings):

  • χ = O(1)
  • c = Ω(log n)
  • Therefore χ = o(µ), for all µ ∈ {z, z_no, z_e, z_end, v, g, g_rl, c}

Cases where χ is strictly larger (de Bruijn sequences):

  • χ = Ω(n)
  • g = O(n/log n)
  • Therefore χ = Ω(g log n) (Corollary 5)
  • χ = Ω(z_e · log n log log log n / (log log n)²) (Lemma 12)

Conclusion (Corollary 6): χ is incomparable with µ ∈ {z, z_no, z_e, z_end, v, g, g_rl, c}

Case Analysis

Example 1 (Concrete example for Lemma 7):

String w_3 = cbaa#₀baa₀caba#₁aba₁caab#₂aab$₂

Super-maximal right-extensions:

  1. baa and c
  2. baa#₀ and baa₀; aba#₁ and aba₁; aab#₂ and aab$₂
  3. ca and cb; caa and cab
  4. aba and aab

Total: sre(w_3) = 14

Reversed string w_3^R = ₂baa#₂baac₁aba#₁abac$₀aab#₀aabc

Super-maximal right-extensions:

  1. baa and $₂
  2. baa#₂ and baac; aba#₁ and abac; aab#₀ and aabc
  3. ac;ac₀; ac
  4. aba and aab

Total: sre(w_3^R) = 12

Verification: sre(w_3) - sre(w_3^R) = 2 = k - 1 ✓

1. Repetitiveness Measure Research

Abstract lower bound measures:

  • δ: Substring complexity measure, maxₖ(|F_w(k)|/k)
  • γ: Minimum string attractor size 18
    • Known: γ = O(χ)
    • Whether γ is reachable remains open

Copy-paste class measures:

  • z: Lempel-Ziv parsing size 20
  • z_no: LZ parsing without overlapping phrase sources
  • z_e: Greedy LZ-End parsing size
  • z_end: Minimum LZ-End parsing size
  • v: Minimum dictionary parsing size 28
  • g: Minimum context-free grammar size
  • g_rl: Grammar size allowing run-length rules
  • c: Minimum tiling system size
  • b: Minimum bidirectional macro scheme size 32

BWT-related measures:

  • r: BWT equal-letter run count 3
    • Only known reachable and searchable measure without known access capability
    • This paper proves χ < r

2. Sensitivity Research

Previous work has examined sensitivity of various measures to string operations:

  • Akagi et al. 1: Sensitivity of b, z, g to edit operations
  • Giuliani et al. 14,15: Sensitivity of r to reversals and bit changes
  • Fici et al. 9,10: Sensitivity of BWT run count to morphisms
  • Navarro et al. 29,30: Morphism-based repetitiveness measures

3. Suffixient Sets

Original work 4,6:

  • Define suffixient sets and related concepts
  • Prove γ = O(χ) and χ = O(r̄)
  • Show suffixient sets support pattern matching

Algorithmic work 5:

  • Efficient algorithms for computing minimum suffixient sets
  • Starting from suffix array and suffix tree components
  • Non-online algorithms

This paper's contributions:

  • Deeper theoretical characterization
  • Simpler online algorithm
  • Direct construction from text while building suffix tree

4. Episturmian Words and Fibonacci Strings

Combinatorial background 8,16,21:

  • Episturmian words: at most one right-maximal substring per length
  • Research on right-special factors (i.e., right-maximal substrings)
  • Fibonacci strings as special cases of epistandard words

This paper's contributions:

  • Connect suffixient sets with combinatorial word theory
  • Complete characterization of minimum suffixient sets for Fibonacci strings
  • Prove χ upper bounds for episturmian word substrings

Conclusions and Discussion

Main Conclusions

  1. Measure positioning: χ is established as a measure strictly smaller than r, satisfying:
    • γ = O(χ) = O(r)
    • String families exist where χ = o(r)
  2. Sensitivity characteristics:
    • O(1) additive sensitivity to appending/prepending characters (ideal property)
    • Ω(log n) additive sensitivity to arbitrary edit operations
    • Ω(√n) additive sensitivity to reversal
  3. Incomparability with copy-paste measures: χ is neither always larger nor always smaller, depending on string families
  4. Efficient online computation: Can be computed online in O(n) time and space

Limitations

  1. Reachability unknown:
    • Whether χ is reachable (i.e., can represent strings in O(χ) space) remains open
    • Relationship with minimum known reachable measure b is unknown
  2. Access mechanism dependency:
    • Suffixient set support for searching requires additional random access mechanism
    • Unlike r, cannot directly support searching
  3. Tightness of theoretical bounds:
    • χ = O(r) constant factor of 2 may not be tight
    • Exact multiplicative sensitivity bounds remain unclear
  4. Practical performance:
    • Paper focuses mainly on theoretical properties
    • Performance on real data requires further experimental verification

Future Directions

Open problems explicitly proposed in the paper:

  1. Reachability problem:
    • Proving b = O(χ) would establish χ reachability
    • Or proving χ unreachable, which would simultaneously prove γ unreachable
  2. Relationship with δ:
    • Is χ = O(δ log n)?
    • Is the Θ(log n) factor on de Bruijn sequences tight?
  3. Multiplicative sensitivity:
    • Is sre(w')/sre(w) = O(1) for all considered string operations?
    • Do constant bounds exist for insertion operations?
  4. Precise relationship with r:
    • Is r = O(χ log χ)?
    • If true and χ has O(1) multiplicative sensitivity to string operations, would tighten known r bounds
  5. Other measure relationships:
    • Relationship between χ and b (key reachability problem)
    • Relationship with other newly proposed measures

In-Depth Evaluation

Strengths

1. Solid Theoretical Contributions

  • Complete measure characterization: Through upper and lower bound analysis and separation results, precisely positions χ in the measure hierarchy
  • Tight bounds: Constructive proofs (e.g., de Bruijn sequences, Fibonacci strings) demonstrate bound tightness
  • Multi-faceted analysis: Comprehensively studies χ from multiple perspectives including sensitivity, special string families, and relationships with other measures

2. Technical Innovation

  • Direct relationship establishment: Proves χ = O(r) rather than previous χ = O(r̄), a more natural characterization
  • Fine-grained analysis: Complete characterization of Fibonacci string minimum suffixient sets (Lemma 11) demonstrates deep combinatorial analysis capability
  • Elegant algorithm: Simplifies complex suffixient set computation to elegant modification of Ukkonen's algorithm

3. Clear Writing

  • Rigorous definitions: All concepts have precise mathematical definitions
  • Detailed proofs: Proof sketches of key lemmas are clear and easy to verify
  • Rich examples: Concrete examples (e.g., Example 1) aid understanding of abstract concepts
  • Intuitive figures: Figure 1 clearly displays relationships between measures

4. Practical Value

  • Online algorithm: O(n) time and space online algorithm has practical application value
  • Theoretical guidance: Deep understanding of χ aids design of compression and indexing structures based on suffixient sets
  • Measure selection: Provides theoretical basis for selecting appropriate repetitiveness measures in practical applications

Weaknesses

1. Missing Experimental Verification

  • No real data testing: Paper is entirely theoretical without experiments on real data (e.g., genome data)
  • Algorithm performance unknown: While O(n) algorithm is provided, actual runtime, space constants are unknown
  • No comparison with existing tools: Lacks detailed performance comparison with Cenzato et al.'s implementation 5

2. Unresolved Reachability Problem

  • Key problem remains open: Whether χ is reachable remains unresolved
  • Unknown relationship with b: Cannot determine relationship between χ and minimum known reachable measure
  • Limited practical utility: If χ is unreachable, its practical value as a measure is limited

3. Potentially Non-Tight Bounds

  • Constant factors: χ ≤ 2r constant of 2 may not be optimal
  • Sensitivity upper bounds: Lemma 8's upper bounds are relatively complex, possibly non-tight
  • Multiplicative sensitivity: Exact multiplicative sensitivity bounds not provided

4. Limited Analysis Scope

  • Special string families: Analysis mainly concentrates on de Bruijn sequences, Fibonacci strings and other special cases
  • General results: Limited understanding of properties for general string families
  • Average case: Focuses on worst case, average case analysis lacking

Impact

1. Contribution to Field

  • Theory completion: Fills gaps in suffixient set theoretical research
  • Measure system: Enriches theoretical framework of repetitiveness measures
  • Open problems: Proposed problems will guide future research directions

2. Potential Applications

  • Compression algorithms: Provides theoretical foundation for designing new compression algorithms
  • Index structures: Suffixient sets can be used to construct space-efficient indexes
  • Bioinformatics: Potential applications in genome data compression and querying

3. Methodological Contributions

  • Online construction paradigm: Demonstrates how to adapt classical suffix tree algorithms to new problems
  • Sensitivity analysis framework: Provides methodological reference for analyzing sensitivity of other measures

4. Limitations

  • Good reproducibility: Theoretical results easy to verify, algorithm descriptions clear
  • But implementation details: Some implementation details (e.g., specific marker maintenance) need further clarification
  • Model assumptions: Some results depend on transdichotomous RAM model

Applicable Scenarios

1. Ideal Application Scenarios

  • Highly repetitive data: Such as genome sequences, version control systems, log files
  • Pattern matching needed: Suffixient sets naturally support certain pattern matching queries
  • Online processing: Data arrives in streams, requiring incremental index updates

2. Inapplicable Scenarios

  • Random data: χ on low-repetitiveness data may approach n, losing advantage
  • Exact space guarantees needed: χ reachability unknown, cannot guarantee O(χ) space implementation
  • Complex queries: Limited query types supported by suffixient sets

3. Comparison with Other Methods

  • vs. BWT (r): χ smaller but requires additional access mechanism
  • vs. LZ (z): χ smaller in some cases (Fibonacci), larger in others (de Bruijn)
  • vs. Grammar (g): Similarly incomparable

References

Key Citations

  1. Original suffixient sets papers:
    • 6 Depuydt et al., "Suffixient sets", 2023
    • 4 Cenzato et al., "Suffixient arrays", 2025
  2. Computational algorithms:
    • 5 Cenzato et al., "On computing the smallest suffixient set", SPIRE 2024
    • 33 Ukkonen, "On-line construction of suffix trees", 1995
  3. Repetitiveness measure surveys:
    • 25,26 Navarro, "Indexing highly repetitive string collections", ACM Computing Surveys 2021
    • 27 Navarro, "Indexing highly repetitive string collections", arXiv 2022
  4. Related measures:
    • 18 Kempa & Prezza, "String attractors", STOC 2018
    • 3 Burrows & Wheeler, "BWT", 1994
    • 20 Lempel & Ziv, "LZ complexity", 1976
    • 28 Navarro et al., "Ordered parsings", IEEE TIT 2021
  5. Sensitivity research:
    • 1 Akagi et al., "Sensitivity of string compressors", Information and Computation 2023
    • 15 Giuliani et al., "Bit catastrophes for BWT", Theory of Computing Systems 2025

Overall Assessment: This is a high-quality theoretical paper providing deep and comprehensive analysis of suffixient sets as a repetitiveness measure. Main contributions include establishing direct relationship between χ and r, sensitivity analysis, exact characterization of special string families, and elegant online algorithm. The paper's theoretical analysis is rigorous, proofs detailed, and writing clear. Main weaknesses are lack of experimental verification and unresolved reachability problem. The paper establishes solid theoretical foundation for suffixient set research, with proposed open problems guiding future directions. Recommended future work includes performance evaluation on real data and resolution of the reachability problem.