
When is String Reconstruction using de Bruijn Graphs Hard?

Bals, van Krieken, Pissis et al.

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $\Sigma$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al. [CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.
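To make the objects in the abstract concrete, here is a minimal Python sketch of the reduction. It assumes the common convention that nodes are length-(k-1) strings and every length-k substring contributes one edge; the helper names (`de_bruijn_edges`, `spell`, `respects`) and the 1-indexed inclusive intervals are illustrative choices, not taken from the paper.

```python
def de_bruijn_edges(s, k):
    """Edges of the order-k de Bruijn graph of s (assumed convention).

    Each length-k substring x yields one edge (x[:-1], x[1:], x),
    so repeated k-mers give parallel edges.
    """
    return [(s[i:i + k - 1], s[i + 1:i + k], s[i:i + k])
            for i in range(len(s) - k + 1)]


def spell(trail):
    """Spell the string represented by a trail, given as a list of k-mers."""
    return trail[0] + "".join(x[-1] for x in trail[1:])


def respects(trail, c):
    """Check that the i-th k-mer of the trail (1-indexed) lies in c[kmer].

    c maps each length-k string to an inclusive interval (lo, hi);
    this encoding of the domain knowledge is an assumption of the sketch.
    """
    return all(c[x][0] <= i <= c[x][1] for i, x in enumerate(trail, start=1))


edges = de_bruijn_edges("ACGTACG", 3)
trail = [kmer for _, _, kmer in edges]  # the trail that follows s itself
c = {"ACG": (1, 5), "CGT": (2, 3), "GTA": (3, 3), "TAC": (3, 4)}
print(spell(trail), respects(trail, c))
```

Here the interval table `c` plays the role of the function $c$ on edges: tightening the interval of `TAC` to `(1, 3)` would make this particular trail infeasible.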


When is String Reconstruction using de Bruijn Graphs Hard?

Basic Information

Paper ID: 2508.03433
Title: When is String Reconstruction using de Bruijn Graphs Hard?
Authors: Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
Category: cs.DS (Data Structures and Algorithms)
Published: August 10, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2508.03433

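Abstract

This paper studies the computational complexity of string reconstruction based on de Bruijn graphs. The problem originates from the fragment assembly problem in genome assembly, where a reduction to the classical Eulerian trail problem has led to considerable progress. The central question the authors address is: given a function that maps every length-k string to an interval of positions where it may occur in the reconstructed string, how can a best string be reconstructed efficiently from a de Bruijn graph? This translates into the problem of finding an Eulerian trail in the graph that satisfies the interval constraints. The paper analyzes the problem within the framework of parameterized complexity and presents algorithms that achieve exponential improvements over the state of the art.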

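Research Background and Motivation

Problem Background

Genome assembly: Reassembling a large number of short DNA fragments to reconstruct a representation of the original chromosome, one of the most important algorithmic tasks in bioinformatics.
de Bruijn graph approach: Pevzner et al. reduced the fragment assembly problem to the Eulerian trail problem; using a de Bruijn graph of order k, a single Eulerian trail represents a candidate genome reconstruction.
Data privacy application: Bernardini et al. introduced a complementary data privacy framework based on z-anonymity, which protects the original string by releasing a string obtained from a random Eulerian trail.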

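Research Motivation

Core problem: Given a function c that models domain knowledge (mapping every edge to an interval where it may occur in an Eulerian trail), how can an Eulerian trail respecting c be computed efficiently?
Practical need: Genome assembly and data privacy applications often need to incorporate domain knowledge to constrain the reconstruction process.
Existing limitation: Hannenhalli et al. showed that this problem is NP-complete, but an in-depth analysis of its parameterized complexity has been lacking.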
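The core problem above can be illustrated with a naive backtracking search. This is only a sketch of a brute-force baseline with exponential worst-case running time, not the algorithm of the paper or of Ben-Dor et al.; the function name `constrained_eulerian_trail`, the edge-list input format, and the 1-indexed inclusive intervals are assumptions made for illustration.

```python
def constrained_eulerian_trail(edges, c):
    """Naive search for an Eulerian trail respecting interval constraints.

    edges: list of directed edges (u, v); parallel edges are allowed.
    c:     list of inclusive intervals (lo, hi); edge i may only be used
           as the p-th edge of the trail if lo <= p <= hi (1-indexed).
    Returns a list of edge indices in trail order, or None if none exists.
    """
    m = len(edges)
    if m == 0:
        return []
    out = {}  # node -> indices of outgoing edges
    for idx, (u, _) in enumerate(edges):
        out.setdefault(u, []).append(idx)
    used = [False] * m
    trail = []

    def extend(node):
        pos = len(trail) + 1  # position the next edge would take
        if pos > m:
            return True  # every edge has been placed
        # Prune: some unused edge can no longer be placed in its interval.
        if any(not used[i] and c[i][1] < pos for i in range(m)):
            return False
        for i in out.get(node, []):
            lo, hi = c[i]
            if not used[i] and lo <= pos <= hi:
                used[i] = True
                trail.append(i)
                if extend(edges[i][1]):
                    return True
                trail.pop()
                used[i] = False
        return False

    for start in out:
        if extend(start):
            return list(trail)
    return None


# Two cycles through node "a"; forcing edge 2 into position 1 fixes the order.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
print(constrained_eulerian_trail(edges, [(1, 4), (1, 4), (1, 1), (1, 4)]))
```

Narrow intervals prune the search quickly, which is the intuition behind parameterizing by the maximum interval length w: the less slack the domain knowledge leaves, the fewer candidate edges compete for each position.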

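Key Contributions

Hardness result: Proves that finding an Eulerian trail satisfying the interval constraints is NP-complete on de Bruijn graphs over a binary alphabet (Theorem 3.1).
Inapproximability: Proves that the optimization version of the problem admits no constant-factor polynomial-time approximation algorithm (Corollary 3.5).
Improved algorithm: Presents an FPT algorithm for de Bruijn graphs parameterized by w(log w+1)/(k-1), with running time O(m·λ^(w/(k-1)+1)), an exponential improvement over existing algorithms.
Extensions: Extends the algorithm to computing and enumerating minimum-cost Eulerian trails, and establishes the related computational algorithms.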