Smallest Suffixient Sets as a Repetitiveness Measure
Navarro, Romana, Urbina
A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $Ï$ of the smallest suffixient set as a repetitiveness measure: we place it between known measures and study its sensitivity to various string operations. As a corollary of our results, we give a simple online algorithm to compute smallest suffixient sets.
academic
Smallest Suffixient Sets as a Repetitiveness Measure
This paper investigates suffixient sets as a novel combinatorial object for measuring string repetitiveness. Suffixient sets capture the essential information of repetitive strings and support various pattern matching mechanisms while providing random access. The paper deeply studies the size χ of the minimum suffixient set as a repetitiveness measure: positioning it among known measures, investigating its sensitivity to various string operations. As a byproduct of the research, the paper provides a simple online algorithm for computing the minimum suffixient set.
With the emergence of large-scale collections of similar strings (such as human genome data), efficiently representing and querying highly repetitive strings has become a critical challenge. For example, the European "1+ Million Genomes" initiative aims to sequence over one million human genomes, requiring approximately 750TB of raw storage space. However, by exploiting the high similarity between genomes, storage space can be compressed by two orders of magnitude.
Multiple repetitiveness measures have been proposed, ranging from abstract lower bounds to measures related to specific text compressors. For the recently proposed suffixient set size χ, existing research only knows:
γ = O(χ) (γ is the size of the minimum string attractor)
χ = O(r̄) (r̄ is the number of equal-letter runs in the reverse BWT)
However, deeper understanding of χ as a repetitiveness measure remains lacking, particularly:
Precise relationships between χ and other measures
Establishing direct relationship between χ and BWT run count r: Proves χ = O(r) (rather than the previous χ = O(r̄)), and proves χ = o(v) (v is the minimum dictionary parsing size) on certain string families, thereby establishing that χ is strictly smaller than r
Sensitivity analysis for string operations:
Proves χ grows by O(1) when appending/prepending single characters
Proves arbitrary edit operations or rotations can cause χ to grow by Ω(log n)
Proves string reversal can cause χ to grow by Ω(√n)
Complete characterization of Fibonacci strings: Completely characterizes the unique 2 minimum suffixient sets of size 4 for Fibonacci strings, and proves all substrings of episturmian words satisfy χ ≤ σ + 2
Incomparability with copy-paste measures: Proves χ is incomparable with most copy-paste class measures (z, z_no, z_e, z_end, v, g, g_rl, c)—there exist string families where χ is strictly smaller, and also families where χ is strictly larger
Simple online algorithm: Proposes an extremely simple online algorithm that, by modifying Ukkonen's suffix tree construction algorithm, computes the minimum suffixient set in O(n) space and O(n) worst-case time
Right-maximal substring: A substring x is right-maximal if there exist at least two distinct symbols a, b ∈ Σ such that both xa and xb are substrings of w
Right-extension: For each right-maximal substring x, its right-extensions are substrings of the form xa, denoted E_r(w)
Super-maximal extension: A right-extension that is not a suffix of any other right-extension, denoted S_r(w), with size denoted sre(w)
Suffixient set: A set S ⊆ 1..n is a suffixient set of w if for each right-extension x ∈ E_r(w), there exists j ∈ S such that x is a suffix of w1..j
Minimum suffixient set: A suffixient set S is minimal if there exists a bijection pos: S_r(w) → S such that each x ∈ S_r(w) is a suffix of w1..pos(x)
Measure χ: For w ∈ Σ* and ∈/Fw,defineχ(w)=∣S∣,whereSistheminimumsuffixientsetofw
Suffix Tree Enhancement: Add markers on suffix tree nodes indicating positions of super-maximal right-extensions.
Handling appended character c with three cases:
s has child labeled c (Case 1):
Only descend to new s
No marker updates needed
s has ≥2 children, no child labeled c (Case 2):
Create new leaf child of s (labeled c)
Mark s's child c
Unmark s''s child c (if marked)
s has only one child (labeled a≠c) (Case 3):
Mark both children of s (a and c)
Unmark s''s child c (if marked)
Unmark s'''s child a (if marked), where s''' is the first node in the suffix chain with another child b≠a
Complexity:
Space: O(n)
Time: O(n) (on transdichotomous RAM with polynomial-size alphabet)
Theorem 1: There exists an algorithm that processes text T from left to right, determining the minimum suffixient set of T1..n after processing each prefix, using O(n) space and O(n) worst-case time.
Note: This is a theoretical paper whose main contributions are theoretical results and algorithms, without traditional experimental evaluation sections. The paper's "experiments" are conducted through mathematical proofs and constructive examples to verify theoretical results.
Multi-faceted analysis: Comprehensively studies χ from multiple perspectives including sensitivity, special string families, and relationships with other measures
28 Navarro et al., "Ordered parsings", IEEE TIT 2021
Sensitivity research:
1 Akagi et al., "Sensitivity of string compressors", Information and Computation 2023
15 Giuliani et al., "Bit catastrophes for BWT", Theory of Computing Systems 2025
Overall Assessment: This is a high-quality theoretical paper providing deep and comprehensive analysis of suffixient sets as a repetitiveness measure. Main contributions include establishing direct relationship between χ and r, sensitivity analysis, exact characterization of special string families, and elegant online algorithm. The paper's theoretical analysis is rigorous, proofs detailed, and writing clear. Main weaknesses are lack of experimental verification and unresolved reachability problem. The paper establishes solid theoretical foundation for suffixient set research, with proposed open problems guiding future directions. Recommended future work includes performance evaluation on real data and resolution of the reachability problem.