The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $Σ$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity.
The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al.~[CABIOS 1996] formalized this problem and showed that it is NP-complete.
We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.
- Paper ID: 2508.03433
- Title: When is String Reconstruction using de Bruijn Graphs Hard?
- Authors: Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
- Classification: cs.DS (Data Structures and Algorithms)
- Publication Date: August 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2508.03433
This paper investigates the computational complexity of string reconstruction using de Bruijn graphs. The problem originates from fragment assembly in genome assembly and has made significant progress through reduction to the classical Eulerian path problem. The authors focus on the core question: given a function modeling domain knowledge (mapping each k-length string to an interval of possible positions in the reconstructed string), how can one efficiently reconstruct an optimal string from a de Bruijn graph? This translates to finding Eulerian paths satisfying interval constraints on a graph. The paper analyzes the problem within the parameterized complexity framework and proposes algorithms with exponential improvements over existing techniques.
- Genome Assembly Problem: Recombining large numbers of short DNA fragments to reconstruct the original chromosome representation, one of the most important algorithmic tasks in bioinformatics
- de Bruijn Graph Method: Pevzner et al. reduced the fragment assembly problem to an Eulerian path problem using k-order de Bruijn graphs, where a single Eulerian path represents a candidate genome reconstruction
- Data Privacy Applications: Bernardini et al. introduced a complementary data privacy framework based on z-anonymity by releasing strings obtained from random Eulerian paths to protect original strings
- Core Problem: Given a function c modeling domain knowledge (mapping each edge to an interval of possible positions in the Eulerian path), how can one efficiently compute Eulerian paths satisfying c?
- Practical Needs: In genome assembly and data privacy applications, it is often necessary to incorporate domain knowledge to constrain the reconstruction process
- Existing Limitations: While Hannenhalli et al. proved the problem is NP-complete, there lacks in-depth analysis of parameterized complexity
- Hardness Results: Proves that finding Eulerian paths satisfying interval constraints in de Bruijn graphs over binary alphabets is NP-complete (Theorem 3.1)
- Inapproximability: Proves that the optimization version admits no constant-factor polynomial-time approximation algorithm (Corollary 3.5)
- Improved Algorithm: Proposes an FPT algorithm with parameter w(log w+1)/(k-1) for de Bruijn graphs, with running time O(m·λ^(w/(k-1)+1)), achieving exponential improvement over existing algorithms
- Extended Results: Extends the algorithm to counting and enumerating minimum-cost Eulerian paths, and proves related counting algorithms
diET Problem (Eulerian Path with Interval Function on Directed Graphs):
- Input: Directed graph G=(V,E), interval function c: E → I_m
- Output: Determine whether there exists an Eulerian path P = e₁...e_m such that for all t ∈ m, t ∈ c(e_t)
dicET Problem (Version with Interval Cost Function):
- Input: Directed graph G=(V,E), interval cost function c: E × m → Z∪{∞}, budget c_budget ∈ N
- Output: Determine whether there exists an Eulerian path P with total cost not exceeding c_budget
The authors prove NP-completeness through reduction from the directed Hamiltonian path problem:
- Construct a complete k-order de Bruijn graph where k-1 = 4ℓ+10, ℓ = ⌈log₂(|V|)⌉
- Associate two nodes v'₁ and v'₂ with each node v in the original graph, and nodes e'₁ and e'₂ with each edge e
- Carefully design the interval function such that Eulerian paths satisfying the constraints correspond to Hamiltonian paths in the original graph
State Space Construction: Construct an auxiliary directed graph H=(V',E') of size O(m·λ^(w/(k-1)+1)), where:
- λ := min(|Σ|^(k-1), 2w-1)
- Nodes have the form (t,α), where t is a time step and α is a string of length min(w,t)+k-1
Key Insight:
- Any path of length r in a de Bruijn graph is completely determined by the string of length r+k-1 it generates
- This string can be completely determined by checking every (k-1)-th node on the path
Compared to the existing O(m·w^1.54w) algorithm, the new algorithm achieves improvement through:
- Exploiting the special structure of de Bruijn graphs
- Transforming state representation from general graph edge sets to de Bruijn graph-specific string representations
- Significantly reducing the number of states to consider through combinatorial insights
- Structured State Encoding: Uses string α to encode path states in de Bruijn graphs more compactly than generic methods
- Parameter Improvement: Improves from parameter w to w(log w+1)/(k-1), achieving significant improvement when k is large
- Unified Framework: A single framework handles decision, optimization, counting, and enumeration problems
This paper primarily conducts theoretical analysis, focusing on:
- Complexity Analysis: Proving computational complexity lower bounds through reductions
- Algorithm Analysis: Analyzing time complexity and correctness of new algorithms
- Parameterized Analysis: Studying algorithm performance under different parameter settings
- Existing Algorithm: O(m·w^1.54w)
- New Algorithm: O(m·λ^(w/(k-1)+1)), where λ = min(|Σ|^(k-1), 2w-1)
- Improvement Magnitude: Exponential improvement with exponent (log w+1)/(k-1)
- Theorem 3.1: The diET problem is NP-complete on de Bruijn graphs over binary alphabets
- Theorem 4.4: New FPT algorithm with running time O(m·λ^(w/(k-1)+1))
- Theorem 5.1: Counting algorithm that counts minimum-cost Eulerian paths in the same time complexity
Practical Improvement Effects:
- When k=31 (bioinformatics standard) and |Σ|=4, achieves 30-fold exponential speedup
- When w·log w/(k-1) = O(1), algorithm runs in O(mw) time
- When w = O(1), algorithm runs in O(m) time
- Multigraph Extension: Algorithm extends to de Bruijn multigraphs
- Undirected Graphs: Proves the undirected version uicET is also NP-complete
- Counting and Enumeration: Provides algorithms for counting and enumerating minimum-cost solutions
- Genome Assembly: Pioneering work by Pevzner et al., secure complete assembly by Cairo et al.
- Temporal Graphs: Related algorithms by Bumpus and Meeks on temporal graphs
- Chinese Postman Problem: Hierarchical Chinese Postman Problem (HCP) and Time-Constrained Chinese Postman Problem (TCCP)
- First for Binary Alphabets: Previous work required |Σ|≥4
- de Bruijn Graph Specialization: Fully exploits the structural properties of de Bruijn graphs
- Parameterized Improvement: Improves from w to w(log w+1)/(k-1) parameterization
- Theoretical Lower Bounds: String reconstruction remains hard even over binary alphabets
- Algorithmic Upper Bounds: The problem becomes tractable when interval width is relatively small compared to k
- Practical Significance: Provides theoretical foundations and practical algorithms for genome assembly and data privacy applications
- Alphabet Size: Improvements are mainly significant for fixed-size alphabets
- Parameter Dependence: Algorithm efficiency still depends on interval width w
- Practical Implementation: Paper focuses on theoretical analysis, lacking practical implementation and experimental validation
- Other Graph Classes: Investigate applications of similar techniques to other structured graphs
- Approximation Algorithms: Develop approximation algorithms with theoretical guarantees
- Practical Applications: Validate algorithm performance on real genome data
- Deep Theoretical Contribution: Completes the complexity landscape of the problem, reducing from |Σ|=4 to |Σ|=2
- Significant Algorithm Improvement: Achieves exponential improvement through structural insights
- Strong Technical Innovation: Cleverly exploits the string representation properties of de Bruijn graphs
- High Application Value: Directly applies to two important domains: genome assembly and data privacy
- Lack of Experimental Validation: Pure theoretical work without verification on real data
- Constant Factors: Constant factors in theoretical analysis may be large, affecting practical performance
- Parameter Restrictions: Improvements are mainly significant within specific parameter ranges
- Theoretical Significance: Provides new case studies for parameterized complexity theory
- Practical Value: Provides theoretical guidance for genome assembly software development
- Methodological Contribution: Demonstrates how to exploit graph structural properties to improve generic algorithms
- Genome Assembly: de Bruijn graph assembly with large k values
- Data Privacy: String publication requiring k-anonymity guarantees
- Sequence Analysis: Other bioinformatics applications involving de Bruijn graphs
The paper cites 17 important references, primarily including:
- Pevzner et al. (PNAS 2001): Foundational work on de Bruijn graph methods
- Hannenhalli et al. (CABIOS 1996): Initial problem formulation
- Ben-Dor et al. (J. Comput. Biol. 2002): State-of-the-art algorithms
- Bernardini et al. (ALENEX 2020): Data privacy applications
- Bumpus and Meeks (Algorithmica 2023): Related work on temporal graphs
This paper makes important contributions at the intersection of theoretical computer science and bioinformatics. Through in-depth complexity analysis and clever algorithm design, it provides new theoretical insights and practical algorithms for a fundamental and important problem.