SAT-sampling for statistical significance testing in sparse contingency tables

Scharpfenecker, Windisch

Exact conditional tests for contingency tables require sampling from fibers with fixed margins. Classical Markov basis MCMC is general but often impractical: computing full Markov bases that connect all fibers of a given constraint matrix can be infeasible, and the resulting chains may converge slowly, especially in sparse settings or in the presence of structural zeros. We introduce a SAT-based alternative that encodes fibers as Boolean circuits, which allows modern SAT samplers to generate tables randomly. We analyze the sampling bias that SAT samplers may introduce, provide diagnostics, and propose practical mitigation. We propose hybrid MCMC schemes that combine SAT proposals with local moves to ensure correct stationary distributions. These schemes do not necessarily require connectivity via local moves, which is particularly beneficial in the presence of structural zeros. Across benchmarks, including small and involved tables with many structural zeros where pure Markov-basis methods underperform, our methods deliver reliable conditional p-values and often outperform samplers that rely on precomputed Markov bases.

Basic information

Paper ID: 2511.05709
Title: SAT-sampling for statistical significance testing in sparse contingency tables
Authors: Patrick Scharpfenecker, Tobias Windisch (University of Applied Sciences Kempten, Germany)
Categories: stat.ME (Statistics - Methodology), math.CO (Mathematics - Combinatorics), stat.CO (Statistics - Computation)
Published: November 7, 2025
Paper link: https://arxiv.org/abs/2511.05709

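Summary

This paper addresses exact conditional testing for contingency tables and proposes a SAT-solver-based method as an alternative to traditional Markov basis MCMC. The traditional approach requires computing a full Markov basis that connects all fibers, which is often infeasible, and the resulting chains can converge slowly, especially in sparse settings or in the presence of structural zeros. The authors encode fibers as Boolean circuits, use modern SAT samplers to generate tables at random, analyze the sampling bias that SAT samplers may introduce, and propose practical mitigation strategies. Hybrid MCMC schemes that combine SAT proposals with local moves ensure the correct stationary distribution and are particularly suited to tables with structural zeros.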

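Research background and motivation

Problem definition

Exact conditional inference for contingency tables is an important problem in statistics, particularly for independence testing. It requires sampling from fibers under fixed margin constraints, that is, from the set of non-negative integer tables $u$ that satisfy the linear constraints $Au = b$.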

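Limitations of existing methods

Traditional Markov basis MCMC faces three main bottlenecks:

Computational complexity: computing a full Markov basis for realistic models and table sizes is usually computationally prohibitive or outright infeasible
Convergence: even when a basis is available, the induced moves may mix slowly and require substantial tuning
Structural zeros: structural zeros and other constraints increase the size of the Markov basis and complicate connectivity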

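Motivation

The authors observe that modern SAT solvers, in particular conflict-driven clause learning (CDCL) solvers, handle large structured instances well. Recent advances in SAT sampling (such as UniGen3 and CMSGen) open up new possibilities for fiber sampling.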

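Core contributions

SAT encoding: an efficient encoding of fiber constraints as Boolean circuits, converted to CNF via the Tseitin transformation, which preserves sparsity and enables strong unit propagation in CDCL solvers
Sampling bias analysis: a quantification of the extent and structure of sampling bias in state-of-the-art SAT samplers, together with practical mitigation techniques that improve the accuracy of conditional p-values
Hybrid MCMC schemes: two hybrid schemes, $A_n(M)$ and $P_{n,k}(M)$, that combine SAT proposals with local moves and ensure the correct stationary distribution
Performance: on benchmarks that include small but involved tables with structural zeros, the methods show a performance advantage over traditional Markov basis methods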

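Method details

Task definition

Given a constraint matrix $A \in \mathbb{N}^{k \times d}$ and a margin vector $b \in \mathbb{Z}^k$, the goal is to sample from the fiber $F_{A,b} = \{u \in \mathbb{N}^d : Au = b\}$ in order to approximate the conditional p-value:

$E_\rho[f] = \sum_{u \in F_{A,b}} f(u)\rho(u)$

where $\rho(v) \propto \frac{1}{v_1! \cdots v_d!}$ and $f(v) = \mathbf{1}_{X(v) \geq X(u^{obs})}$, with $X$ the test statistic and $u^{obs}$ the observed table.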
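
To make the definition concrete, the following minimal sketch computes $E_\rho[f]$ exactly by brute-force enumeration of a tiny fiber. It is an illustration only, not the paper's method: the function names (`fiber`, `rho_unnormalized`, `conditional_p_value`) are hypothetical, the enumeration bound is an assumption that holds for 0/1 constraint matrices, and the chi-square statistic in the usage note is one possible choice of $X$.

```python
from itertools import product
from math import factorial, prod

def fiber(A, b, upper):
    """Enumerate all u in {0..upper}^d with A u = b (brute force, tiny cases only)."""
    d = len(A[0])
    return [
        u for u in product(range(upper + 1), repeat=d)
        if all(sum(a * x for a, x in zip(row, u)) == bi for row, bi in zip(A, b))
    ]

def rho_unnormalized(u):
    """Unnormalized weight 1 / (u_1! * ... * u_d!)."""
    return 1.0 / prod(factorial(x) for x in u)

def conditional_p_value(A, u_obs, statistic):
    """Exact E_rho[f] with f(u) = 1 if statistic(u) >= statistic(u_obs)."""
    b = [sum(a * x for a, x in zip(row, u_obs)) for row in A]
    # Assumption: A is a 0/1 matrix with no zero column, so entries are at most max(b).
    tables = fiber(A, b, max(b))
    weights = [rho_unnormalized(u) for u in tables]
    t_obs = statistic(u_obs)
    # Small tolerance so tables tied with the observed statistic are counted.
    hit = sum(w for u, w in zip(tables, weights) if statistic(u) >= t_obs - 1e-12)
    return hit / sum(weights)
```

For a 2x2 table with cells ordered (u11, u12, u21, u22), rows of `A` can hold the two row sums and the two column sums. With `u_obs = (3, 0, 0, 3)` every expected count under independence is 1.5, and the chi-square statistic `lambda u: sum((x - 1.5) ** 2 / 1.5 for x in u)` gives a p-value of 0.1, which matches the two-sided Fisher exact test for this table. Enumeration is exponential in $d$; sampling is what replaces it on realistic tables.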

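SAT encoding architecture

Boolean circuit encoding

Constraint representation: the linear constraints $Au = b$ are rewritten as a sequence of additions, multiplications, and equality checks
Bit representation: each entry is represented with $l$ bits, where $l = \lceil \log_2(\max_{i,j,A_{i,j}>0} \frac{b_i}{A_{i,j}}) \rceil$
Circuit construction: a Boolean circuit $C$ of size $\text{poly}(k,d,l)$ is built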
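
The three steps above can be sketched as a gate-level evaluator. This is a hedged illustration under stated assumptions, not the authors' encoder: all function names are hypothetical, entries are little-endian bit lists, additions use a ripple-carry adder, and multiplication by a constant $A_{i,j}$ uses shift-and-add. The bit width follows the formula as written in this summary, so the circuit only ranges over entries in $\{0, \ldots, 2^l - 1\}$.

```python
from itertools import product
from math import ceil, log2

def bit_width(A, b):
    """l = ceil(log2(max b_i / A_ij over A_ij > 0)), clamped to at least 1."""
    m = max(bi / a for row, bi in zip(A, b) for a in row if a > 0)
    return max(1, ceil(log2(m))) if m >= 1 else 1

def to_bits(value, l):
    """Little-endian l-bit representation of a non-negative integer."""
    return [(value >> i) & 1 for i in range(l)]

def add_bits(x, y):
    """Ripple-carry adder built from XOR/AND/OR gates; output has one extra bit."""
    n = max(len(x), len(y))
    x, y = x + [0] * (n - len(x)), y + [0] * (n - len(y))
    out, carry = [], 0
    for xi, yi in zip(x, y):
        out.append(xi ^ yi ^ carry)
        carry = (xi & yi) | (carry & (xi ^ yi))
    return out + [carry]

def times_const(x, c):
    """Multiply a bit list by a non-negative constant via shift-and-add."""
    acc, shift = [0], 0
    while c:
        if c & 1:
            acc = add_bits(acc, [0] * shift + x)
        c >>= 1
        shift += 1
    return acc

def equals_const(x, c):
    """Equality check: 1 iff the bit list x encodes the integer c."""
    if c < 0 or c >= 2 ** len(x):
        return 0
    return int(all(xi == (c >> i) & 1 for i, xi in enumerate(x)))

def circuit_accepts(A, b, entry_bits):
    """Evaluate C: one weighted sum and one equality check per row of A."""
    ok = 1
    for row, bi in zip(A, b):
        acc = [0]
        for a, x in zip(row, entry_bits):
            if a:
                acc = add_bits(acc, times_const(x, a))
        ok &= equals_const(acc, bi)
    return ok

def accepted_tables(A, b):
    """All tables with l-bit entries that the circuit accepts (tiny cases only)."""
    l = bit_width(A, b)
    return [u for u in product(range(2 ** l), repeat=len(A[0]))
            if circuit_accepts(A, b, [to_bits(x, l) for x in u])]
```

For the 2x2 independence model with all margins equal to 3, `bit_width` returns 2 and `accepted_tables` recovers the four tables of the fiber. Each gate here maps to a constant number of clauses under a Tseitin-style translation, which is what keeps the resulting CNF polynomial in $k$, $d$, and $l$.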

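Tseitin transformation

The classical Tseitin encoding converts the circuit $C$ into a CNF formula $F$ such that:

$C(u_1, \ldots, u_d) = 1$ if and only if there exist $y_1, \ldots, y_m$ such that $F(u_1, \ldots, u_d, y_1, \ldots, y_m) = 1$
there is a bijection between $F_{A,b} \cap [2^l-1]^d$ and the satisfying assignments of $F$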
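
A minimal sketch of the Tseitin encoding for two-input gates may help here. The gate-list format and the function names (`tseitin`, `models`) are assumptions made for illustration; clauses use DIMACS-style signed integers, and satisfiability is checked by brute force rather than with a SAT solver.

```python
from itertools import product

def tseitin(gates, output):
    """Translate a gate list into CNF.

    gates: list of (out_var, op, in_vars) with positive integer variables.
    Each gate contributes clauses equivalent to out_var <-> op(in_vars);
    a final unit clause asserts that the circuit output is true.
    """
    cnf = []
    for z, op, ins in gates:
        if op == "AND":
            x, y = ins
            cnf += [[-z, x], [-z, y], [z, -x, -y]]
        elif op == "OR":
            x, y = ins
            cnf += [[z, -x], [z, -y], [-z, x, y]]
        elif op == "XOR":
            x, y = ins
            cnf += [[-z, x, y], [-z, -x, -y], [z, -x, y], [z, x, -y]]
        elif op == "NOT":
            (x,) = ins
            cnf += [[z, x], [-z, -x]]
        else:
            raise ValueError(f"unknown gate: {op}")
    cnf.append([output])
    return cnf

def models(cnf, n_vars):
    """All satisfying assignments, by exhaustive search (tiny formulas only)."""
    return [
        bits for bits in product([False, True], repeat=n_vars)
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in cnf)
    ]

# Toy circuit for "u1 + u2 = 1" with one-bit entries:
# variables 1, 2 are the inputs; 3 = sum bit, 4 = carry, 5 = NOT carry, 6 = output.
GATES = [(3, "XOR", (1, 2)), (4, "AND", (1, 2)), (5, "NOT", (4,)), (6, "AND", (3, 5))]
CNF = tseitin(GATES, output=6)
```

Because every auxiliary variable is functionally determined by the inputs, each accepted input extends to exactly one satisfying assignment. In the toy circuit, `models(CNF, 6)` contains two assignments, one for each of the inputs (0, 1) and (1, 0), which is the bijection property on a small scale.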

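Hybrid MCMC schemes

$A_n(M)$ scheme

One in every $n$ steps uses the SAT sampler; the remaining steps use a predefined set of moves $M$:

Alternates between SAT steps and Markov basis moves
Keeps the proportion of SAT steps low to mitigate structural bias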
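
The alternation can be sketched as a Metropolis-Hastings chain. This is a sketch under explicit assumptions, not the paper's implementation: `sat_sample` is a hypothetical stand-in for a SAT sampler and is assumed to draw uniformly from the fiber, so the acceptance ratio for SAT proposals reduces to $\rho(v)/\rho(u)$. A biased sampler would need its proposal probabilities in that ratio. All function names are illustrative.

```python
import random
from math import factorial, prod

def rho(u):
    """Unnormalized target weight 1 / (u_1! * ... * u_d!)."""
    return 1.0 / prod(factorial(x) for x in u)

def local_step(u, moves, rng):
    """Symmetric proposal u +/- m with m in M, followed by a Metropolis accept."""
    m = rng.choice(moves)
    sign = rng.choice((1, -1))
    v = tuple(x + sign * y for x, y in zip(u, m))
    if min(v) < 0:
        return u  # proposal leaves the fiber: stay put
    return v if rng.random() < min(1.0, rho(v) / rho(u)) else u

def sat_step(u, sat_sample, rng):
    """Independence proposal from the sampler (assumed uniform on the fiber)."""
    v = sat_sample(rng)
    return v if rng.random() < min(1.0, rho(v) / rho(u)) else u

def hybrid_chain(u0, moves, sat_sample, n, steps, seed=0):
    """One SAT step in every n steps; all other steps use local moves from M."""
    rng = random.Random(seed)
    u, trace = u0, []
    for t in range(steps):
        if t % n == 0:
            u = sat_step(u, sat_sample, rng)
        else:
            u = local_step(u, moves, rng)
        trace.append(u)
    return trace
```

As a usage example, for a 2x2 table with all margins equal to 3 the fiber is `[(t, 3 - t, 3 - t, t) for t in range(4)]`, the single move `(1, -1, -1, 1)` plays the role of $M$, and `lambda rng: rng.choice(fiber_tables)` stands in for the sampler. Because the SAT step can jump anywhere in the fiber, the chain stays irreducible even when the moves in $M$ alone would not connect it.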

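$P_{n,k}(M)$ scheme

Manages $k$ random walks in parallel:

First, the SAT sampler draws $k$ independent starting points from the fiber
Then $k$ random walks using $M$ are run
Every $n$ steps, one walk is chosen at random and continued for $n$ steps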
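
One possible reading of this scheme is sketched below. It assumes one SAT-sampled starting point per walk (so $k$ starts in total), a hypothetical `sat_sample` callable standing in for the SAT sampler, and Metropolis-accepted local moves; none of these names come from the paper.

```python
import random
from math import factorial, prod

def rho(u):
    """Unnormalized target weight 1 / (u_1! * ... * u_d!)."""
    return 1.0 / prod(factorial(x) for x in u)

def local_step(u, moves, rng):
    """Symmetric proposal u +/- m with m in M, followed by a Metropolis accept."""
    m = rng.choice(moves)
    sign = rng.choice((1, -1))
    v = tuple(x + sign * y for x, y in zip(u, m))
    if min(v) < 0:
        return u  # proposal leaves the fiber: stay put
    return v if rng.random() < min(1.0, rho(v) / rho(u)) else u

def parallel_walks(sat_sample, moves, n, k, rounds, seed=0):
    """Keep k walks; in each round pick one at random and advance it n steps."""
    rng = random.Random(seed)
    # Assumption: the SAT sampler is called only here, once per walk.
    walks = [sat_sample(rng) for _ in range(k)]
    trace = []
    for _ in range(rounds):
        i = rng.randrange(k)
        for _ in range(n):
            walks[i] = local_step(walks[i], moves, rng)
            trace.append(walks[i])
    return trace
```

In this reading the SAT sampler only seeds the walks, so any bias it carries affects the starting points rather than every proposal. When $M$ does not connect the fiber, different walks can cover different components, which a single walk over $M$ could not do.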

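Independence model $I_{d_1 \times \cdots \times d_k}$: the independence model on $d_1 \times d_2 \times \cdots \times d_k$ tables
Quasi-independence model $QI_{d_1 \times \cdots \times d_k}(S)$: the independence model with structural zeros $S$
No-three-factor interaction model $N3F_d$: the no-three-factor interaction model on $d \times d \times d$ tables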
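
For the two-way case, the constraint matrix of the first two models can be written down directly. The sketch below is an assumption-laden illustration: the function name `independence_matrix` is hypothetical, only two-way tables are covered, and structural zeros are handled by dropping the corresponding columns, which is one common convention and may differ from the paper's treatment.

```python
def independence_matrix(d1, d2, zeros=frozenset()):
    """Constraint matrix A for a d1 x d2 (quasi-)independence model.

    Columns correspond to the cells not listed in `zeros`, in row-major order.
    The first d1 rows compute row sums, the last d2 rows compute column sums.
    """
    cells = [(i, j) for i in range(d1) for j in range(d2) if (i, j) not in zeros]
    row_part = [[int(c[0] == i) for c in cells] for i in range(d1)]
    col_part = [[int(c[1] == j) for c in cells] for j in range(d2)]
    return row_part + col_part, cells

def margins(A, u):
    """b = A u for a table u given as a flat tuple in the column order of A."""
    return [sum(a * x for a, x in zip(row, u)) for row in A]
```

Calling `independence_matrix(2, 2)` yields the 4x4 matrix of row and column sums, and passing `zeros={(0, 0)}` removes the column of the forbidden cell, so the fiber then contains only tables that are zero in that position.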