2025-11-11T16:28:09.601154

SAT-sampling for statistical significance testing in sparse contingency tables

Scharpfenecker, Windisch

Exact conditional tests for contingency tables require sampling from fibers with fixed margins. Classical Markov basis MCMC is general but often impractical: computing full Markov bases that connect all fibers of a given constraint matrix can be infeasible and the resulting chains may converge slowly, especially in sparse settings or in presence of structural zeros. We introduce a SAT-based alternative that encodes fibers as Boolean circuits which allows modern SAT samplers to generate tables randomly. We analyze the sampling bias that SAT samplers may introduce, provide diagnostics, and propose practical mitigation. We propose hybrid MCMC schemes that combine SAT proposals with local moves to ensure correct stationary distributions which do not necessarily require connectivity via local moves which is particularly beneficial in presence of structural zeros. Across benchmarks, including small and involved tables with many structural zeros where pure Markov-basis methods underperform, our methods deliver reliable conditional p-values and often outperform samplers that rely on precomputed Markov bases.

academic

SAT-sampling for statistical significance testing in sparse contingency tables

Basic Information

Paper ID: 2511.05709
Title: SAT-sampling for statistical significance testing in sparse contingency tables
Authors: Patrick Scharpfenecker, Tobias Windisch (University of Applied Sciences Kempten, Germany)
Classification: stat.ME (Statistics - Methodology), math.CO (Mathematics - Combinatorics), stat.CO (Statistics - Computation)
Publication Date: November 7, 2025
Paper Link: https://arxiv.org/abs/2511.05709

Abstract

This paper proposes a novel SAT solver-based approach to replace traditional Markov basis MCMC methods for exact conditional testing in contingency tables. The conventional approach requires computing complete Markov bases connecting all fibers, which is often infeasible and converges slowly in sparse settings or when structural zeros are present. The authors encode fibers as Boolean circuits and utilize modern SAT samplers to randomly generate tables, analyze sampling bias potentially introduced by SAT samplers, and propose practical mitigation strategies. Through a hybrid MCMC scheme combining SAT proposals and local moves, the method ensures correct stationary distributions, particularly suitable for cases with structural zeros.

Research Background and Motivation

Problem Definition

Exact conditional inference for contingency tables is an important problem in statistics, particularly for independence testing. Such problems require sampling from fibers under fixed marginal constraints, i.e., finding non-negative integer tables $u$ satisfying linear constraints $Au = b$ .

Limitations of Existing Methods

Traditional Markov basis MCMC methods face two major bottlenecks:

Computational Complexity: Computing complete Markov bases for realistic models and table sizes is typically computationally prohibitive or entirely infeasible
Convergence Issues: Even when bases are available, induced moves may mix slowly, requiring substantial tuning effort
Structural Zero Problem: Structural zeros and other constraints increase Markov basis size and complicate connectivity

Research Motivation

The authors observe that modern SAT solvers perform excellently on large structured instances, particularly conflict-driven clause learning (CDCL) solvers. Recent advances in SAT sampling techniques (such as UniGen3, CMSGen, etc.) provide new possibilities for solving fiber sampling problems.

Core Contributions

SAT Encoding Method: Proposes an efficient method for encoding fiber constraints as Boolean circuits, converting them to CNF form via Tseitin transformation while maintaining sparsity and enabling strong unit propagation in CDCL solvers
Sampling Bias Analysis: Quantifies the extent and structure of sampling bias in state-of-the-art SAT samplers, develops practical mitigation techniques to improve accuracy of conditional p-values
Hybrid MCMC Scheme: Proposes two hybrid schemes $A_n(M)$ and $P_{n,k}(M)$ combining SAT proposals and local moves, ensuring correct stationary distributions
Performance Improvement: Demonstrates performance advantages over traditional Markov basis methods on benchmark tests including small complex tables with structural zeros

Methodology Details

Task Definition

Given constraint matrix $A \in \mathbb{N}^{k \times d}$ and marginal vector $b \in \mathbb{Z}^k$ , the objective is to sample from fiber $F_{A,b} = \{u \in \mathbb{N}^d : Au = b\}$ to approximate conditional p-values:

$E_\rho[f] = \sum_{u \in F_{A,b}} f(u)\rho(u)$

where $\rho(v) \sim \frac{1}{v_1! \cdots v_d!}$ , $f(v) = \mathbf{1}_{X(v) \geq X(u^{obs})}$

SAT Encoding Architecture

Boolean Circuit Encoding

Constraint Representation: Reformulates linear constraints $Au = b$ as a series of addition, multiplication, and equality checks
Bit Representation: Uses $l$ bits to represent each entry, where $l = \lceil \log_2(\max_{i,j,A_{i,j}>0} \frac{b_i}{A_{i,j}}) \rceil$
Circuit Construction: Constructs Boolean circuit $C$ of size $\text{poly}(k,d,l)$

Tseitin Transformation

Uses classical Tseitin encoding to convert circuit $C$ to CNF form $F$ , satisfying:

$C(u_1, \ldots, u_d) = 1$ if and only if there exist $y_1, \ldots, y_m$ such that $F(u_1, \ldots, u_d, y_1, \ldots, y_m) = 1$
Establishes bijection between $F_{A,b} \cap [2^l-1]^d$ and satisfying assignments of $F$

Hybrid MCMC Schemes

$A_n(M)$ Scheme

One step out of every $n$ steps uses SAT sampler, remaining steps use predefined move set $M$ :

Alternates between SAT steps and Markov basis moves
Maintains low SAT step proportion to mitigate structural bias

$P_{n,k}(M)$ Scheme

Manages $k$ random walks in parallel:

First uses SAT sampler to sample $n$ independent initial points from fiber
Then executes $k$ random walks using $M$
Randomly selects one walk to continue for $n$ steps at each iteration

Metropolis-Hastings Correction

For SAT proposals, computes acceptance probability: $p_W(u,v) = \min\left\{1, \frac{W(v,u)}{W(u,v)} \cdot \prod_{i=1}^d \frac{u_i!}{v_i!}\right\}$

Experimental Setup

Model Categories

Independence Models $I_{d_1 \times \cdots \times d_k}$ : $d_1 \times d_2 \times \cdots \times d_k$ independence models
Quasi-Independence Models $QI_{d_1 \times \cdots \times d_k}(S)$ : Independence models with structural zeros $S$
No Three-Factor Interaction Models $N3F_d$ : No three-factor interaction models on $d \times d \times d$ tables

Evaluation Scheme

Uses evaluation scheme from Algorithm 1:

Generates $T=100$ initial samples
Runs Fisher test on each sample
Measures steps required to converge to conditional p-value (rather than sample count)
Evaluates steps needed to achieve 0.005 accuracy

Implementation Details

Uses CMSGen as primary SAT sampler (faster than UniGen3)
Implements general gradient descent method for MLE computation
Uses L-BFGS optimization, monitors marginal difference $\|A(u^{obs} - \hat{u}(\theta))\|_2$

Experimental Results

Main Results

Experimental results demonstrate SAT-based method outperforms Markov basis method in multiple scenarios, particularly in:

Sparse Tables: Excels when sample size is small or structural zeros are present
Complex Structures: $A_n(M)$ scheme outperforms $P_{n,k}(M)$ on large problem instances
Structural Zero Handling: Ensures convergence to correct p-values without requiring complete Markov bases (e.g., Graver bases)

Specific Performance

N3F4 Model: Hybrid method significantly outperforms pure Markov method at sample sizes 80 and 100
QI5×5 Model: Convergence to true p-values verified through complete fiber enumeration
QI10×10 Model: Demonstrates faster convergence across various sample sizes

Sampling Bias Analysis

Figure 2 demonstrates strong structural bias in pure SAT method, rendering it unsuitable for direct p-value approximation, but hybrid schemes successfully mitigate this issue, achieving convergence to correct p-values.

Traditional Methods

Markov Basis MCMC: Seminal work by Diaconis and Sturmfels (1998)
Dynamic Markov Bases: On-the-fly move computation by Dobra (2011)
Lattice Basis Methods: Lattice basis moves by Hazelton and Karimi (2024)

Modern Developments

RUMBA Method: High-dimensional lattice point sampling by Bakenhus and Petrović (2024)
Problem-Specific Strategies: Independence testing for large sparse tables by Zhang (2019)
Heat-Bath Methods: Dynamic move length adjustment by Stanley and Windisch (2018)

SAT Technology

SAT Solvers: High-performance solvers including CryptoMiniSat5, Kissat
SAT Sampling: Sampling tools including UniGen3, CMSGen, SMT-Sampler

Conclusions and Discussion

Main Conclusions

SAT-based method provides effective alternative for fiber sampling, particularly suitable for sparse settings with structural zeros
Hybrid MCMC scheme successfully mitigates structural bias from SAT samplers
Method significantly outperforms traditional Markov basis methods in complex scenarios involving small sample sizes or abundant structural zeros

Limitations

Computational Overhead: Single SAT sampling step may require more time than single local move
Memory Requirements: Boolean encoding of large design matrices and rich constraint sets may grow rapidly
Hyperparameter Tuning: Hybrid method introduces hyperparameters requiring tuning (e.g., number of walks, steps per walk)

Future Directions

More efficient encoding methods for ultra-high-dimensional constraint systems
Adaptive hyperparameter selection strategies
Integration with other modern sampling techniques

In-Depth Evaluation

Strengths

Strong Innovation: First systematic application of SAT technology to fiber sampling problems
Solid Theory: Provides rigorous sampling bias analysis and mitigation strategies
Comprehensive Experiments: Covers multiple model categories and scenarios with thorough benchmarking
High Practical Value: Particularly applicable to sparse and structural zero scenarios where traditional methods fail

Weaknesses

Scalability Limitations: For extremely large-scale problems, memory requirements of Boolean encoding may become bottleneck
Parameter Sensitivity: Performance of hybrid scheme depends on hyperparameter selection, lacking automatic tuning guidance
Limited Comparisons: Primarily compares with traditional Markov basis methods, lacks comparison with other modern approaches

Impact

Academic Contribution: Opens new directions for cross-disciplinary research between statistical computation and combinatorial optimization
Practical Value: Provides practical tools for analyzing complex contingency tables
Reproducibility: Commits to open-source implementation, facilitating method dissemination

Applicable Scenarios

Analysis of sparse contingency tables with abundant structural zeros
High-dimensional constraint problems where traditional Markov basis computation is infeasible
Scenarios requiring rapid exploration of distant fiber regions
Exact conditional testing with small sample sizes

References

This paper cites important literature from statistics, combinatorial mathematics, and computer science, including:

Diaconis and Sturmfels (1998): Seminal work on algebraic algorithms for sampling conditional distributions
Modern SAT solver literature: CryptoMiniSat5, UniGen3, CMSGen, etc.
Statistical computation methods: Related research on Markov bases, dynamic bases, and lattice bases