When is String Reconstruction using de Bruijn Graphs Hard?

Bals, van Krieken, Pissis et al.

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $\Sigma$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al. [CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.

When Is String Reconstruction Using de Bruijn Graphs Hard?

Basic Information

Paper ID: 2508.03433
Title: When is String Reconstruction using de Bruijn Graphs Hard?
Authors: Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
Category: cs.DS (Data Structures and Algorithms)
Published: August 10, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2508.03433

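Summary

This paper studies the computational complexity of string reconstruction based on de Bruijn graphs. The problem originates in the fragment assembly problem of genome assembly, where a reduction to the classical Eulerian trail problem has led to major progress. The central question the authors address is this: given a function that models domain knowledge (mapping each length-k string to an interval of positions where it may occur in the reconstructed string), how can a best string be reconstructed efficiently from a de Bruijn graph? This translates into the problem of finding an Eulerian trail in the graph that satisfies interval constraints. The paper analyzes this problem within the framework of parameterized complexity and proposes algorithms that achieve exponential improvements over the state of the art.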

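Research Background and Motivation

Problem Background

Genome assembly: the problem of recombining a large number of short DNA fragments to reconstruct a representation of the original chromosome. It is one of the most important algorithmic challenges in bioinformatics.
de Bruijn graph approach: Pevzner et al. reduced the fragment assembly problem to the Eulerian trail problem, using a de Bruijn graph of order k. A single Eulerian trail represents a candidate genome reconstruction.
Data privacy application: Bernardini et al. introduced a complementary data privacy framework based on z-anonymity. The original string is protected by releasing a string obtained from a random Eulerian trail.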
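
The de Bruijn graph approach described above can be illustrated with a short Python sketch. It is not taken from the paper: the function names `de_bruijn_edges`, `eulerian_trail`, and `spell` are hypothetical, the trail is found with the textbook Hierholzer algorithm, and the sketch assumes k >= 2 and the common convention that nodes are (k-1)-mers and every occurrence of a length-k substring contributes one edge.

```python
from collections import defaultdict


def de_bruijn_edges(s, k):
    """Edges of the order-k de Bruijn multigraph of s (assumes k >= 2).

    Each length-k substring yields one edge from its (k-1)-prefix
    to its (k-1)-suffix.
    """
    return [(s[i:i + k - 1], s[i + 1:i + k]) for i in range(len(s) - k + 1)]


def eulerian_trail(edges):
    """Return the node sequence of an Eulerian trail, or None if none exists.

    Textbook Hierholzer algorithm on a directed multigraph.
    """
    if not edges:
        return None
    adj = defaultdict(list)
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for u, v in edges:
        adj[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    ends = [n for n in nodes if in_deg[n] - out_deg[n] == 1]
    balanced = all(abs(out_deg[n] - in_deg[n]) <= 1 for n in nodes)
    if not balanced or len(starts) > 1 or len(starts) != len(ends):
        return None
    stack = [starts[0] if starts else edges[0][0]]
    trail = []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())  # follow an unused edge
        else:
            trail.append(stack.pop())   # dead end: emit node
    trail.reverse()
    # If some edge was never reached, the graph is not connected.
    return trail if len(trail) == len(edges) + 1 else None


def spell(trail):
    """String spelled by a node sequence of (k-1)-mers."""
    return trail[0] + "".join(v[-1] for v in trail[1:])


edges = de_bruijn_edges("ACGTACGA", 3)
candidate = spell(eulerian_trail(edges))
```

Any string spelled this way has the same multiset of length-k substrings as the input, which is why each Eulerian trail counts as a candidate reconstruction.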

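Research Motivation

Core question: given a function c that models domain knowledge (mapping each edge to an interval in which it may occur in an Eulerian trail), how can an Eulerian trail respecting c be computed efficiently?
Practical need: in genome assembly and data privacy applications, domain knowledge often has to be incorporated to constrain the reconstruction process.
Existing limitations: Hannenhalli et al. proved that this problem is NP-complete, but an in-depth analysis of its parameterized complexity has been lacking.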
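
To make the core question concrete, the sketch below searches for an Eulerian trail in which every edge is used at a position inside its interval. It is a plain exponential-time backtracking search written for illustration only, not the algorithm of the paper or of Ben-Dor et al. The name `constrained_eulerian_trail`, the representation of c as a list of inclusive 1-based `(lo, hi)` pairs indexed like `edges`, and the example intervals are all assumptions.

```python
def constrained_eulerian_trail(edges, c):
    """Brute-force search for an Eulerian trail respecting c.

    edges: list of (u, v) pairs of a directed multigraph.
    c:     list of (lo, hi) pairs; edge i may only be the p-th edge
           of the trail if lo <= p <= hi (1-based, inclusive).
    Returns the edge indices in trail order, or None if no trail exists.
    """
    m = len(edges)
    if m == 0:
        return []
    out = {}
    for i, (u, _) in enumerate(edges):
        out.setdefault(u, []).append(i)
    used = [False] * m
    order = []

    def extend(node):
        pos = len(order) + 1
        if pos > m:
            return True  # every edge has been placed
        # Prune: an unused edge whose interval already closed can never fit.
        if any(not used[i] and c[i][1] < pos for i in range(m)):
            return False
        for i in out.get(node, []):
            lo, hi = c[i]
            if not used[i] and lo <= pos <= hi:
                used[i] = True
                order.append(i)
                if extend(edges[i][1]):
                    return True
                used[i] = False
                order.pop()
        return False

    for start in out:
        if extend(start):
            return list(order)
    return None


# Hypothetical instance: the order-3 de Bruijn graph of "ACGTACGA".
edges = [("AC", "CG"), ("CG", "GT"), ("GT", "TA"),
         ("TA", "AC"), ("AC", "CG"), ("CG", "GA")]
c = [(5, 5), (1, 6), (1, 6), (1, 6), (1, 1), (1, 6)]
print(constrained_eulerian_trail(edges, c))
```

In this instance the two parallel edges from AC to CG are pinned to positions 5 and 1, so the intervals select one specific trail among those that spell the same string.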

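Main Contributions

Hardness result: proves that finding an Eulerian trail satisfying the interval constraints is NP-complete on de Bruijn graphs over a binary alphabet (Theorem 3.1).
Inapproximability: proves that the optimization version of the problem admits no constant-factor polynomial-time approximation algorithm (Corollary 3.5).
Improved algorithm: proposes an FPT algorithm for de Bruijn graphs parameterized by w(log w+1)/(k-1). Its running time is O(m·λ^(w/(k-1)+1)), an exponential improvement over existing algorithms.
Extended results: extends the algorithm to counting and enumerating minimum-cost Eulerian trails, and establishes the associated counting algorithms.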
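
As a rough arithmetic illustration of why the new parameter is small when w is small relative to k, the sketch below evaluates the parameter w(log w+1)/(k-1) next to the w-dependent factor w^1.5 · 4^w of the earlier running time. The function names and sample values are hypothetical, and the logarithm is assumed to be base 2, which the text above does not specify. The numbers are values of the two expressions only, not measured or claimed running times.

```python
import math


def new_parameter(w, k):
    """Value of w(log w + 1)/(k - 1); base-2 logarithm is an assumption."""
    return w * (math.log2(w) + 1) / (k - 1)


def baseline_factor(w):
    """The w-dependent factor w^1.5 * 4^w of the O(m * w^1.5 * 4^w) bound."""
    return w ** 1.5 * 4 ** w


# Hypothetical sample values of the interval length w and the order k.
for w, k in [(4, 4), (16, 31), (32, 101)]:
    print(w, k, round(new_parameter(w, k), 2), f"{baseline_factor(w):.3g}")
```

For fixed w, the parameter shrinks as k grows, while the baseline factor depends on w alone.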