2025-11-10T02:48:52.266770

When is String Reconstruction using de Bruijn Graphs Hard?

Bals, van Krieken, Pissis et al.

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $Î£$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al.~[CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.

academic

When is String Reconstruction using de Bruijn Graphs Hard?

Basic Information

Paper ID: 2508.03433
Title: When is String Reconstruction using de Bruijn Graphs Hard?
Authors: Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
Classification: cs.DS (Data Structures and Algorithms)
Publication Date: August 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2508.03433

Abstract

This paper investigates the computational complexity of string reconstruction using de Bruijn graphs. The problem originates from fragment assembly in genome assembly and has made significant progress through reduction to the classical Eulerian path problem. The authors focus on the core question: given a function modeling domain knowledge (mapping each k-length string to an interval of possible positions in the reconstructed string), how can one efficiently reconstruct an optimal string from a de Bruijn graph? This translates to finding Eulerian paths satisfying interval constraints on a graph. The paper analyzes the problem within the parameterized complexity framework and proposes algorithms with exponential improvements over existing techniques.

Research Background and Motivation

Problem Background

Genome Assembly Problem: Recombining large numbers of short DNA fragments to reconstruct the original chromosome representation, one of the most important algorithmic tasks in bioinformatics
de Bruijn Graph Method: Pevzner et al. reduced the fragment assembly problem to an Eulerian path problem using k-order de Bruijn graphs, where a single Eulerian path represents a candidate genome reconstruction
Data Privacy Applications: Bernardini et al. introduced a complementary data privacy framework based on z-anonymity by releasing strings obtained from random Eulerian paths to protect original strings

Research Motivation

Core Problem: Given a function c modeling domain knowledge (mapping each edge to an interval of possible positions in the Eulerian path), how can one efficiently compute Eulerian paths satisfying c?
Practical Needs: In genome assembly and data privacy applications, it is often necessary to incorporate domain knowledge to constrain the reconstruction process
Existing Limitations: While Hannenhalli et al. proved the problem is NP-complete, there lacks in-depth analysis of parameterized complexity

Core Contributions

Hardness Results: Proves that finding Eulerian paths satisfying interval constraints in de Bruijn graphs over binary alphabets is NP-complete (Theorem 3.1)
Inapproximability: Proves that the optimization version admits no constant-factor polynomial-time approximation algorithm (Corollary 3.5)
Improved Algorithm: Proposes an FPT algorithm with parameter w(log w+1)/(k-1) for de Bruijn graphs, with running time O(m·λ^(w/(k-1)+1)), achieving exponential improvement over existing algorithms
Extended Results: Extends the algorithm to counting and enumerating minimum-cost Eulerian paths, and proves related counting algorithms

Methodology Details

Task Definition

diET Problem (Eulerian Path with Interval Function on Directed Graphs):

Input: Directed graph G=(V,E), interval function c: E → I_m
Output: Determine whether there exists an Eulerian path P = e₁...e_m such that for all t ∈ m, t ∈ c(e_t)

dicET Problem (Version with Interval Cost Function):

Input: Directed graph G=(V,E), interval cost function c: E × m → Z∪{∞}, budget c_budget ∈ N
Output: Determine whether there exists an Eulerian path P with total cost not exceeding c_budget

Core Technical Framework

1. Hardness Proof Technique

The authors prove NP-completeness through reduction from the directed Hamiltonian path problem:

Construct a complete k-order de Bruijn graph where k-1 = 4ℓ+10, ℓ = ⌈log₂(|V|)⌉
Associate two nodes v'₁ and v'₂ with each node v in the original graph, and nodes e'₁ and e'₂ with each edge e
Carefully design the interval function such that Eulerian paths satisfying the constraints correspond to Hamiltonian paths in the original graph

2. Parameterized Algorithm Design

State Space Construction: Construct an auxiliary directed graph H=(V',E') of size O(m·λ^(w/(k-1)+1)), where:

λ := min(|Σ|^(k-1), 2w-1)
Nodes have the form (t,α), where t is a time step and α is a string of length min(w,t)+k-1

Key Insight:

Any path of length r in a de Bruijn graph is completely determined by the string of length r+k-1 it generates
This string can be completely determined by checking every (k-1)-th node on the path

3. Algorithm Improvement Strategy

Compared to the existing O(m·w^1.54w) algorithm, the new algorithm achieves improvement through:

Exploiting the special structure of de Bruijn graphs
Transforming state representation from general graph edge sets to de Bruijn graph-specific string representations
Significantly reducing the number of states to consider through combinatorial insights

Technical Innovations

Structured State Encoding: Uses string α to encode path states in de Bruijn graphs more compactly than generic methods
Parameter Improvement: Improves from parameter w to w(log w+1)/(k-1), achieving significant improvement when k is large
Unified Framework: A single framework handles decision, optimization, counting, and enumeration problems

Experimental Setup

Theoretical Analysis Framework

This paper primarily conducts theoretical analysis, focusing on:

Complexity Analysis: Proving computational complexity lower bounds through reductions
Algorithm Analysis: Analyzing time complexity and correctness of new algorithms
Parameterized Analysis: Studying algorithm performance under different parameter settings

Complexity Comparison

Existing Algorithm: O(m·w^1.54w)
New Algorithm: O(m·λ^(w/(k-1)+1)), where λ = min(|Σ|^(k-1), 2w-1)
Improvement Magnitude: Exponential improvement with exponent (log w+1)/(k-1)

Experimental Results

Main Theoretical Results

Theorem 3.1: The diET problem is NP-complete on de Bruijn graphs over binary alphabets
Theorem 4.4: New FPT algorithm with running time O(m·λ^(w/(k-1)+1))
Theorem 5.1: Counting algorithm that counts minimum-cost Eulerian paths in the same time complexity

Algorithm Performance Analysis

Practical Improvement Effects:

When k=31 (bioinformatics standard) and |Σ|=4, achieves 30-fold exponential speedup
When w·log w/(k-1) = O(1), algorithm runs in O(mw) time
When w = O(1), algorithm runs in O(m) time

Extended Results

Multigraph Extension: Algorithm extends to de Bruijn multigraphs
Undirected Graphs: Proves the undirected version uicET is also NP-complete
Counting and Enumeration: Provides algorithms for counting and enumerating minimum-cost solutions

Main Research Directions

Genome Assembly: Pioneering work by Pevzner et al., secure complete assembly by Cairo et al.
Temporal Graphs: Related algorithms by Bumpus and Meeks on temporal graphs
Chinese Postman Problem: Hierarchical Chinese Postman Problem (HCP) and Time-Constrained Chinese Postman Problem (TCCP)

Uniqueness of This Paper's Contribution

First for Binary Alphabets: Previous work required |Σ|≥4
de Bruijn Graph Specialization: Fully exploits the structural properties of de Bruijn graphs
Parameterized Improvement: Improves from w to w(log w+1)/(k-1) parameterization

Conclusions and Discussion

Main Conclusions

Theoretical Lower Bounds: String reconstruction remains hard even over binary alphabets
Algorithmic Upper Bounds: The problem becomes tractable when interval width is relatively small compared to k
Practical Significance: Provides theoretical foundations and practical algorithms for genome assembly and data privacy applications

Limitations

Alphabet Size: Improvements are mainly significant for fixed-size alphabets
Parameter Dependence: Algorithm efficiency still depends on interval width w
Practical Implementation: Paper focuses on theoretical analysis, lacking practical implementation and experimental validation

Future Directions

Other Graph Classes: Investigate applications of similar techniques to other structured graphs
Approximation Algorithms: Develop approximation algorithms with theoretical guarantees
Practical Applications: Validate algorithm performance on real genome data

In-Depth Evaluation

Strengths

Deep Theoretical Contribution: Completes the complexity landscape of the problem, reducing from |Σ|=4 to |Σ|=2
Significant Algorithm Improvement: Achieves exponential improvement through structural insights
Strong Technical Innovation: Cleverly exploits the string representation properties of de Bruijn graphs
High Application Value: Directly applies to two important domains: genome assembly and data privacy

Weaknesses

Lack of Experimental Validation: Pure theoretical work without verification on real data
Constant Factors: Constant factors in theoretical analysis may be large, affecting practical performance
Parameter Restrictions: Improvements are mainly significant within specific parameter ranges

Impact

Theoretical Significance: Provides new case studies for parameterized complexity theory
Practical Value: Provides theoretical guidance for genome assembly software development
Methodological Contribution: Demonstrates how to exploit graph structural properties to improve generic algorithms

Applicable Scenarios

Genome Assembly: de Bruijn graph assembly with large k values
Data Privacy: String publication requiring k-anonymity guarantees
Sequence Analysis: Other bioinformatics applications involving de Bruijn graphs

References

The paper cites 17 important references, primarily including:

Pevzner et al. (PNAS 2001): Foundational work on de Bruijn graph methods
Hannenhalli et al. (CABIOS 1996): Initial problem formulation
Ben-Dor et al. (J. Comput. Biol. 2002): State-of-the-art algorithms
Bernardini et al. (ALENEX 2020): Data privacy applications
Bumpus and Meeks (Algorithmica 2023): Related work on temporal graphs

This paper makes important contributions at the intersection of theoretical computer science and bioinformatics. Through in-depth complexity analysis and clever algorithm design, it provides new theoretical insights and practical algorithms for a fundamental and important problem.