
When is String Reconstruction using de Bruijn Graphs Hard?

Bals, van Krieken, Pissis et al.

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $\Sigma$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al. [CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.



Basic Information

Paper ID: 2508.03433
Title: When is String Reconstruction using de Bruijn Graphs Hard?
Authors: Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
Category: cs.DS (Data Structures and Algorithms)
Published: August 10, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2508.03433

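Genome assembly problem: reassembling a large number of short DNA fragments to reconstruct a representation of the original chromosome; this is one of the most important algorithmic tasks in bioinformatics.
De Bruijn graph approach: Pevzner et al. reduced the fragment assembly problem to the Eulerian trail problem using a de Bruijn graph of order $k$, in which a single Eulerian trail represents a candidate genome reconstruction.
Data privacy application: Bernardini et al. introduced a complementary data privacy framework based on $z$-anonymity, which protects the original string by releasing a string obtained from a random Eulerian trail.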
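
The reduction described above can be illustrated with a minimal Python sketch. It assumes the standard construction in which nodes are the length-$(k-1)$ substrings and each length-$k$ substring contributes one edge, and it uses Hierholzer's algorithm to find an unconstrained Eulerian trail. The function names `de_bruijn_edges`, `eulerian_trail`, and `spell` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def de_bruijn_edges(text, k):
    """One edge per k-mer: from its (k-1)-prefix to its (k-1)-suffix."""
    return [(text[i:i + k - 1], text[i + 1:i + k])
            for i in range(len(text) - k + 1)]


def eulerian_trail(edges):
    """Hierholzer's algorithm on a directed multigraph.

    Returns the trail as a list of nodes, or None if no Eulerian trail exists.
    """
    if not edges:
        return None
    out = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()):
        return None
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    stack = [starts[0] if starts else edges[0][0]]
    trail = []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())  # follow an unused edge
        else:
            trail.append(stack.pop())   # dead end: emit node
    trail.reverse()
    # If some edge was never reached, the graph is not connected enough.
    return trail if len(trail) == len(edges) + 1 else None


def spell(trail):
    """Spell the string read along a trail of (k-1)-mers."""
    return trail[0] + "".join(node[-1] for node in trail[1:])


edges = de_bruijn_edges("ACGTACGA", 3)
print(spell(eulerian_trail(edges)))
```

Any string produced this way has the same multiset of length-$k$ substrings as the input, which is why a single Eulerian trail is only a candidate reconstruction and additional domain knowledge can help select among candidates.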

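Research Motivation

Core problem: given a function $c$ that models domain knowledge (mapping every edge to an interval of positions where it may occur in an Eulerian trail), how can we efficiently compute an Eulerian trail respecting $c$?
Practical need: in genome assembly and data privacy applications, domain knowledge often has to be incorporated to constrain the reconstruction.
Existing limitations: although Hannenhalli et al. showed that the problem is NP-complete, an in-depth analysis of its parametrized complexity has been lacking.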
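
To make the core problem concrete, here is a naive backtracking sketch in Python. It is not the algorithm from the paper or from Ben-Dor et al.; it simply tries every way of placing edges at positions and therefore takes exponential time in the worst case. The function name `constrained_eulerian_trail`, the edge-list input format, and the use of 1-based inclusive intervals are assumptions made for illustration.

```python
def constrained_eulerian_trail(edges, intervals):
    """Brute-force search for an Eulerian trail respecting position intervals.

    edges:     list of (u, v) directed edges (a multigraph is allowed)
    intervals: intervals[i] = (lo, hi), the 1-based inclusive positions at
               which edges[i] may appear in the trail (assumed convention)
    Returns the edge indices in trail order, or None if no such trail exists.
    """
    m = len(edges)
    used = [False] * m
    order = []

    def extend(node, pos):
        if pos > m:  # every edge has been placed
            return True
        for i, (u, v) in enumerate(edges):
            lo, hi = intervals[i]
            if used[i] or not (lo <= pos <= hi):
                continue
            if node is not None and u != node:
                continue  # edge does not continue the trail
            used[i] = True
            order.append(i)
            if extend(v, pos + 1):
                return True
            used[i] = False  # undo and try the next candidate
            order.pop()
        return False

    return order if extend(None, 1) else None


# Hypothetical instance: two 2-cycles sharing node "a".
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
free = (1, 4)
print(constrained_eulerian_trail(edges, [free, free, (1, 1), free]))
print(constrained_eulerian_trail(edges, [(1, 1), free, (1, 1), free]))
```

In the first call the edge `("a", "c")` is forced to position 1, so the trail must start with it; in the second call two edges compete for position 1 and no valid trail exists.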

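Core Contributions

Hardness result: proves that finding an Eulerian trail satisfying the interval constraints is NP-complete on de Bruijn graphs over a binary alphabet (Theorem 3.1).
Inapproximability: proves that the optimization version of the problem admits no constant-factor polynomial-time approximation algorithm (Corollary 3.5).
Improved algorithm: for de Bruijn graphs, gives an FPT algorithm parametrized by $w(\log w+1)/(k-1)$ with running time $O(m \cdot \lambda^{w/(k-1)+1})$, an exponential improvement over existing algorithms.
Extensions: extends the algorithm to counting and enumerating minimum-cost Eulerian trails, and establishes the corresponding counting results.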
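
As a rough numeric illustration of why the new parameter is small when $w$ is small relative to $k$, the Python sketch below evaluates $w(\log w+1)/(k-1)$ next to the $w^{1.5} 4^{w}$ factor from the Ben-Dor et al. bound. The base-2 logarithm is an assumption (the summary does not state the base), the function names are hypothetical, and the two quantities are not directly comparable running times: one is a parameter and the other is a multiplicative factor.

```python
import math


def de_bruijn_parameter(w, k):
    """Evaluate w * (log w + 1) / (k - 1), assuming a base-2 logarithm."""
    return w * (math.log2(w) + 1) / (k - 1)


def ben_dor_factor(w):
    """The w^1.5 * 4^w factor in the O(m * w^1.5 * 4^w) bound."""
    return w ** 1.5 * 4 ** w


# Hypothetical values: maximum interval length w = 8, order k = 33.
print(de_bruijn_parameter(8, 33))
print(ben_dor_factor(8))
```

For fixed $w$, the parameter shrinks as $k$ grows, whereas the $4^{w}$ factor does not depend on $k$ at all.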